
From Scratch: Implementing an Autoregressive Image Generation Pipeline with Binary Spherical Quantization

Last week, I implemented a complete autoregressive (AR) image generation pipeline from scratch — patch-based tokenization, Binary Spherical Quantization (BSQ), a causal transformer, and arithmetic coding for lossless compression. This post details the architecture, key implementation challenges, solutions that proved effective, lessons learned, and connections to ongoing research in visual generation.

What We Built

The system has four components that chain together:

  1. Patch tokenizer — compress the image into a grid of discrete integer tokens
  2. BSQ quantizer — map continuous patch embeddings to binary codes
  3. Autoregressive transformer — learn to predict the next token given all previous tokens
  4. Arithmetic coder — compress token sequences using learned probability distributions
Image → Patch Tokenizer → Integer Tokens → AR Transformer → Generated Tokens → Decoded Image
                                                    ↓
                                          Arithmetic Coder → Compressed Bytes

Each piece is independently interesting. Together they form a miniature version of what powers systems like DALL-E, VQGAN, and LlamaGen.

Part 1: Patch-Level Autoencoder

The Idea

Rather than processing pixels directly — computationally brutal at scale — we first compress the image into a grid of patch embeddings. A 150×100 image becomes a 30×20 grid of learned vectors. This is the same intuition behind Vision Transformers (ViTs) and the tokenization stage in most modern image generation systems.

Attempts and Results

Attempt 1: Linear patchification only. A strided convolution mapping each patch to a latent vector, then a transposed convolution to reconstruct. Reconstruction was passable but blurry. Without interaction between patches, the model could not learn local structure across boundaries.

Attempt 2: Add nonlinearity and inter-patch convolution. Added GELU activations and a 3×3 convolution between patches before the bottleneck. This gave the encoder the ability to reason about neighboring patches — crucial for reconstructing edges, textures, and transitions.

The architecture that worked:

# Encoder
encoder = nn.Sequential(
    nn.Conv2d(3, 128, patch_size, stride=patch_size), nn.GELU(),
    nn.Conv2d(128, 128, 3, padding=1))

# Decoder
decoder = nn.Sequential(
    nn.Conv2d(128, 128, 3, padding=1), nn.GELU(),
    nn.ConvTranspose2d(128, 3, patch_size, stride=patch_size))

The Critical Hyperparameter: Patch Size

Patch size is the most consequential decision in this pipeline. Smaller patches (e.g., 5×5) yield a 30×20 grid of 600 tokens and dramatically superior reconstruction quality compared to larger patches (e.g., 25×25 → 6×4 grid, just 24 tokens), albeit at the cost of longer sequences for the downstream AR model.
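The trade-off is simple arithmetic. A quick sketch for a 150×100 image (the helper name `grid_tokens` is mine):

```python
def grid_tokens(height, width, patch_size):
    """Tokens in the patch grid (assumes the image divides evenly)."""
    return (height // patch_size) * (width // patch_size)

for p in (5, 10, 25):
    print(p, grid_tokens(100, 150, p))   # 5 → 600, 10 → 150, 25 → 24
```

Every halving of patch size roughly quadruples the sequence length the AR model must handle, which is why this single knob dominates both quality and cost.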

Start with small patches. The reconstruction quality difference is dramatic.

Part 2: Binary Spherical Quantization (BSQ)

The Problem with Continuous Embeddings

AR models predict over discrete vocabularies — like next-word prediction in language models. Continuous patch embeddings cannot be predicted this way. We need to quantize them into integers.

The classic solution is Vector Quantization (VQ) — maintain a codebook of learned embedding vectors and map each patch to its nearest neighbor. VQ-VAE (van den Oord et al., 2017) established this pattern and it has been central to image generation ever since.

Why BSQ Is Better

VQ has a fundamental problem: codebook collapse. If the model learns to use only a small subset of codebook entries, the vocabulary becomes impoverished and generation quality suffers.

Binary Spherical Quantization (Zhao et al., 2024) sidesteps this elegantly:

  1. Project the patch embedding down to a lower-dimensional bottleneck (e.g., 10 dimensions)
  2. L2-normalize onto the unit hypersphere — uniform angular separation means every code is equally "far" from every other, naturally preventing collapse
  3. Binarize each dimension to ±1

With 10 binary dimensions, you get 2¹⁰ = 1,024 possible tokens. The binarization uses a straight-through estimator to keep gradients flowing:

def diff_sign(x):
    sign = 2 * (x >= 0).float() - 1
    return x + (sign - x).detach()  # Forward: ±1. Backward: identity.
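Putting the projection, normalization, and straight-through binarization together, here is a minimal sketch; the names (`bsq_quantize`, the `Linear` projection, the dimensions) are my assumptions, not the paper's reference code:

```python
import torch
import torch.nn.functional as F

def bsq_quantize(z, proj):
    """Minimal BSQ sketch: project to the bottleneck, L2-normalize onto the
    unit hypersphere, then binarize with a straight-through sign."""
    u = F.normalize(proj(z), dim=-1)
    sign = 2 * (u >= 0).float() - 1
    code = u + (sign - u).detach()           # forward: ±1, backward: identity
    bits = (code > 0).long()                 # map ±1 to {0, 1}
    index = (bits * 2 ** torch.arange(bits.shape[-1])).sum(-1)
    return code, index                       # index in [0, 2**num_bits)

proj = torch.nn.Linear(128, 10, bias=False)  # 10 bits → 1,024 tokens
code, index = bsq_quantize(torch.randn(4, 128), proj)
```

The `code` tensor feeds the decoder; the integer `index` is what the AR model and arithmetic coder see.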

Implementation Pitfall: Apple Silicon Bitwise Bug

On Apple Silicon (M1/M2/M3), PyTorch's bitwise shift operators (<<, >>) produce incorrect results in MPS mode. The fix: replace bit shifting with exponentiation.

# Broken on MPS:
index = (binary_code << torch.arange(codebook_bits)).sum(dim=-1)

# Correct everywhere:
index = (binary_code * (2 ** torch.arange(codebook_bits))).sum(dim=-1)

This is a known PyTorch bug (#147889).

Part 3: The Autoregressive Transformer

The Setup

With images tokenized as 30×20 grids of integers (0–1023), we train a transformer to predict the next token given all previous tokens — identical to language modeling, just with image tokens instead of words.

The architecture: a decoder-only transformer with token embeddings (1,024 vocab → 128 dims), a learned start token, positional embeddings, 6 transformer encoder layers with causal masking, and a linear projection head (128 → 1,024 logits).
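A minimal sketch of that architecture; the class name and `nhead=8` are my assumptions, since the text does not pin them down:

```python
import torch
import torch.nn as nn

class ARImageTransformer(nn.Module):
    def __init__(self, vocab=1024, dim=128, n_layers=6, max_len=600):
        super().__init__()
        self.tok_emb = nn.Embedding(vocab, dim)
        self.start = nn.Parameter(torch.zeros(1, 1, dim))      # learned start token
        self.pos_emb = nn.Parameter(torch.zeros(1, max_len + 1, dim))
        layer = nn.TransformerEncoderLayer(dim, nhead=8, batch_first=True)
        self.transformer = nn.TransformerEncoder(
            layer, num_layers=n_layers,
            enable_nested_tensor=False)     # see the causality pitfall below
        self.head = nn.Linear(dim, vocab)

    def forward(self, tokens):              # tokens: (B, T) integer grid, flattened
        B, T = tokens.shape
        emb = torch.cat([self.start.expand(B, 1, -1), self.tok_emb(tokens)], dim=1)
        emb = emb + self.pos_emb[:, : T + 1]
        mask = nn.Transformer.generate_square_subsequent_mask(T + 1)
        out = self.transformer(emb, mask=mask, is_causal=True)
        return self.head(out)   # (B, T+1, vocab); logits[:, :T] predict `tokens`
```

Training is plain cross-entropy between `logits[:, :T]` and the token grid, exactly as in a language model.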

Implementation Pitfall: The Silent Causality Failure

PyTorch's TransformerEncoder has a "fast path" optimization that silently ignores the causal attention mask under certain conditions. The model trains, the loss decreases — but the causality constraint is violated. The model cheats by looking at future tokens during training. At inference, where it cannot cheat, generation quality collapses entirely.

Two flags fix it:

self.transformer = torch.nn.TransformerEncoder(
    encoder_layer,
    num_layers=6,
    enable_nested_tensor=False,  # Disables the fast path that ignores your mask
)

# In forward():
out = self.transformer(emb, mask=mask, is_causal=True)

Never assume causality is enforced. Validate by inspecting attention patterns or testing with a known AR task — otherwise training appears successful while inference fails catastrophically.
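One cheap validation, sketched under the assumption that output position i depends only on tokens 0…i (the usual decoder alignment): perturb the tokens after some position and confirm the earlier outputs do not move.

```python
import torch

@torch.no_grad()
def check_causal(model, tokens, pos, vocab=1024):
    """True iff outputs at positions <= pos ignore tokens after pos.
    Assumes output position i depends only on tokens[0..i]."""
    model.eval()                            # dropout must be off for a fair compare
    base = model(tokens)
    corrupted = tokens.clone()
    corrupted[:, pos + 1:] = torch.randint_like(corrupted[:, pos + 1:], vocab)
    return torch.allclose(base[:, : pos + 1], model(corrupted)[:, : pos + 1])
```

A model that silently took the fast path fails this check immediately, long before you waste a training run on it.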

Training Dynamics

Cross-entropy loss over 1,024-class token prediction, measured in bits-per-token. A random model scores ~10 bits/token (log₂(1,024)). A well-trained model reaches ~6–7 bits/token, corresponding to roughly 3,600–4,200 bits per 150×100 image — a substantial improvement over naive 10-bit encoding (6,000 bits).
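The unit conversion is worth making explicit: PyTorch reports cross-entropy in nats, so a change of logarithm base gives bits per token.

```python
import math

def bits_per_token(ce_nats):
    """PyTorch cross-entropy is in nats; divide by ln 2 to get bits."""
    return ce_nats / math.log(2)

uniform = bits_per_token(math.log(1024))   # ≈ 10 bits: uniform over 1,024 tokens
image_bits = 600 * 6.5                     # 3,900 bits ≈ 488 bytes per image
```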

15 epochs on a T4 GPU took approximately 2–3 hours. Periodic checkpointing to persistent storage proved essential: Colab runtimes can disconnect without warning.

Part 4: Arithmetic Coding for Compression

Why Bit-Packing Is Not Enough

A naive approach: store each token as 10 bits. For 600 tokens: 6,000 bits = 750 bytes, a 60× reduction from the raw 24-bit RGB image (45,000 bytes).

But we have something better: a model that assigns probability distributions over each token. Arithmetic coding exploits these distributions. A token the model was 90% confident about takes far fewer than 10 bits to encode. The expected bits per token equals the model's cross-entropy loss.

Theoretical target: 600 × 6.5 bits ≈ 3,900 bits ≈ 488 bytes, roughly 1.5× better than naive bit-packing and over 90× smaller than raw RGB.

How It Works

Range coding maintains a current interval [low, high) starting at [0, 1). For each symbol, narrow the interval to the sub-range corresponding to the symbol's probability mass — using CDFs scaled to integer range (0 to 2¹⁶ − 1) for numerical stability. Output bits to disambiguate as the interval narrows.
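As a toy illustration of the interval narrowing, here is a sketch using exact rational arithmetic in place of the scaled-integer ranges and incremental bit output a real coder needs, and a fixed three-symbol distribution in place of the model's per-token predictions:

```python
from fractions import Fraction

def encode(symbols, cdf):
    """Narrow [low, high) by each symbol's probability mass; any number
    inside the final interval identifies the whole sequence."""
    low, high = Fraction(0), Fraction(1)
    for s in symbols:
        span = high - low
        low, high = low + span * cdf[s], low + span * cdf[s + 1]
    return (low + high) / 2

def decode(x, cdf, n):
    low, high = Fraction(0), Fraction(1)
    out = []
    for _ in range(n):
        span = high - low
        # pick the symbol whose sub-interval contains x
        s = max(i for i in range(len(cdf) - 1) if low + span * cdf[i] <= x)
        out.append(s)
        low, high = low + span * cdf[s], low + span * cdf[s + 1]
    return out

# probabilities [1/2, 1/4, 1/4] → cumulative distribution [0, 1/2, 3/4, 1]
cdf = [Fraction(0), Fraction(1, 2), Fraction(3, 4), Fraction(1)]
msg = [0, 2, 1, 0, 0, 1]
assert decode(encode(msg, cdf), cdf, len(msg)) == msg
```

A production coder keeps the state bounded by working in fixed-width integers and emitting leading bits as soon as low and high agree on them; the nesting logic is the same.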

The encoder uses a single teacher-forced forward pass to get all 600 distributions at once. The decoder must call the model autoregressively for each token — it cannot know the distribution for token n without first decoding tokens 0 through n−1.

Where This Research Is Now

Tokenization Is the Key Battleground

VQGAN (Esser et al., 2021) showed that perceptual and adversarial losses in the tokenizer dramatically improve generation quality. An MSE-only tokenizer produces blurry reconstructions — a GAN discriminator sharpens them considerably.

Open-MAGVIT2 (Luo et al., 2024) extended this with a codebook of 262,144 tokens (vs our 1,024) and improved training stability.

BSQ (Zhao et al., 2024) argues that binary codes with spherical normalization match or exceed VQ quality while being simpler to train and more codebook-efficient.

Autoregressive vs. Diffusion

| | Autoregressive | Diffusion |
|---|---|---|
| Examples | LlamaGen, DALL-E (v1) | Stable Diffusion, DALL-E 3 |
| Inference | Sequential — slower | Parallel — faster |
| Scaling | Matches LLM scaling laws | Different scaling behavior |
| Flexibility | Natural for mixed modalities | Excellent image quality |

LlamaGen (Sun et al., 2024) showed that AR models trained at scale match diffusion model quality and scale predictably with model size — just like large language models. This has renewed serious interest in AR approaches after diffusion dominated 2022–2024.

MAR (Li et al., 2024) combines both: discrete tokens with a per-token continuous diffusion head, achieving state-of-the-art quality with faster inference than pure AR.

What I Would Do Differently

  1. Perceptual loss in the tokenizer (highest impact) — perceptual loss combined with a GAN discriminator typically yields the largest qualitative improvement in sharpness and realism.
  2. Larger codebook — 4,096 or 8,192 tokens captures significantly more detail at the cost of a harder prediction task.
  3. Flash Attention — naive O(n²) attention over 600 tokens is the main training bottleneck. Flash Attention provides 3–5× speedup.
  4. KV-caching for decompression — reduces sequential decoding from O(n²) to O(n). Standard in production AR systems.
  5. More epochs — 15 epochs is a starting point. VQGAN-quality results typically require hundreds of epochs with careful learning rate schedules.

References

  1. van den Oord, A., Vinyals, O., & Kavukcuoglu, K. (2017). Neural Discrete Representation Learning (VQ-VAE). NeurIPS 2017. https://arxiv.org/abs/1711.00937
  2. Esser, P., Rombach, R., & Ommer, B. (2021). Taming Transformers for High-Resolution Image Synthesis (VQGAN). CVPR 2021. https://arxiv.org/abs/2012.09841
  3. Zhao, Y., Xiong, Y., & Krähenbühl, P. (2024). Image and Video Tokenization with Binary Spherical Quantization (BSQ). https://arxiv.org/abs/2406.07548
  4. Sun, P., et al. (2024). Autoregressive Model Beats Diffusion: Llama for Scalable Image Generation. https://arxiv.org/abs/2406.06525
  5. Li, T., et al. (2024). Autoregressive Image Generation without Vector Quantization (MAR). NeurIPS 2024. https://arxiv.org/abs/2406.11838
  6. Ballé, J., Laparra, V., & Simoncelli, E. (2017). End-to-end Optimized Image Compression. ICLR 2017. https://arxiv.org/abs/1611.01704
  7. Minnen, D., Ballé, J., & Toderici, G. (2018). Joint Autoregressive and Hierarchical Priors for Learned Image Compression. NeurIPS 2018. https://arxiv.org/abs/1809.02736
  8. Witten, I., Neal, R., & Cleary, J. (1987). Arithmetic Coding for Data Compression. Communications of the ACM. https://doi.org/10.1145/214762.214771
  9. Luo, Z., et al. (2024). Open-MAGVIT2: An Open-Source Project Toward Democratizing Auto-regressive Visual Generation. https://arxiv.org/abs/2409.04410

I welcome feedback, suggestions, or collaborations on extending this work.
