Teaching a Tiny AI to Do Math: Building a Reasoning Language Model

What if you could take a pre-built AI brain, teach it to think out loud, and make it dramatically smarter at a specific task, all on your own laptop?

That's exactly what this project does. I take SmolLM2, a small but capable open-source language model, and train it to perform unit conversions (think: "How many feet are in 3.7 miles?"). But here's the twist: we're not just training it to memorize answers. We're teaching it to reason, to show its work the way a student would on a math test.

I work through four progressively more advanced techniques:

Raw generation: getting the model to produce output at all
Chain-of-Thought prompting: giving the model a worked example so it mimics a reasoning pattern
Supervised Fine-Tuning (SFT) with LoRA: actually retraining the model's weights on correct answers
Rejection Fine-Tuning (RFT): a lightweight reinforcement learning trick that filters bad outputs and trains only on correct reasoning chains

Why does this matter? Because it demonstrates in miniature exactly how frontier AI labs (OpenAI, Anthropic, Google DeepMind) make their models smarter. These ideas scale from a 360-million-parameter model on a laptop all the way up to GPT-4. You're learning the real thing.

Follow along: The starter code (without solutions) is available at github.com/CipherMindBob/teaching-a-tiny-ai-to-do-math.

Part 1: Getting the AI to Speak

What Is a Language Model, Really?

Pretend you're playing a very sophisticated autocomplete game. You know how your phone suggests the next word when you're texting? Type "Happy birth " and it offers "day." A language model does the exact same thing, except it's been trained on essentially the entire internet, so its predictions are extraordinarily good.

Here's the key insight:

A language model doesn't "know" anything. It predicts the most statistically likely next word, over and over, until it decides it's done.

Type "The capital of France is" and the model predicts "Paris" because in billions of training sentences, "Paris" followed that phrase more than any other word. Type "How many feet are in a mile? The answer is" and a well-trained model predicts "5,280", not because it did arithmetic, but because it has seen that answer paired with that question countless times.

This matters because it tells us both what models are good at (pattern matching at enormous scale) and what they struggle with (anything requiring genuine step-by-step reasoning they haven't been explicitly trained on).

Meet SmolLM2: Your Pocket-Sized AI

The model we're working with is SmolLM2-360M-Instruct, made by HuggingFace. Let's decode that name:

SmolLM2, "Smol" as in small. This model is intentionally tiny by modern standards.
360M, 360 million parameters. A parameter is a single adjustable number inside the model, one dial on an enormous mixing board. GPT-4 has roughly a trillion. SmolLM2 has 360 million. That's why it runs on your laptop.
Instruct, This version has already been trained to follow instructions and hold conversations, rather than just complete random text.

The model is about 700 megabytes, smaller than most video games. Yet it can write poetry, answer questions, translate languages, and (with our help) perform unit conversions.

The Problem We're Solving

The first function I need to implement is called batched_generate. The professor provided the skeleton, the shape of it, but left the inside deliberately empty. Our job is to fill it in.

Think of batched_generate as an assembly line with three stations:

Station 1: TRANSLATE "How many feet in a mile?" → [456, 12, 890, 3, 77] Human words become numbers the model can process.

Station 2: GENERATE [456, 12, 890, 3, 77] → [456, 12, 890, 3, 77, 201, 55, 38] The model appends new numbers (new words) to your input.

Station 3: TRANSLATE BACK [201, 55, 38] → "5,280 feet" New numbers become human-readable text again.

The numbers in the middle are called tokens, roughly equivalent to syllables or short words. The dictionary that converts words to numbers and back is called a tokenizer.

Why "Batched"?

This is an analogy that helped me understand batching. Imagine you run a bakery and need to bake 100 loaves of bread.

Unbatched: One loaf in the oven. Wait. Take it out. Repeat 100 times. This takes all day, maybe all week.
Batched: Fill the entire oven at once. Bake all 100 loaves together. Done in a fraction of the time.

That's exactly what batching does for AI inference, the term for when a trained model is actively being used to generate outputs (as opposed to training, which is when the model is learning from data). Instead of asking the model to run inference on one prompt at a time, I feed it a whole batch simultaneously. The GPU handles them all in parallel.

There is one complication. The oven can only fit loaves that are all the same size. Our prompts are different lengths, "Convert 5 kg to grams" is a small loaf, "How many milliliters are in 2.75 liters of water?" is a much bigger one. I can't cut the big loaves down to match the small ones or we'd lose meaning, the same way slicing off the end of a baguette mid-word would change the meaning like "french love bag" vs "french love baguettes".

So instead I pad the smaller loaves, effectively filling the empty pan space with a neutral placeholder (a special token the model knows to ignore), until every loaf is the same size as the largest one in the batch. This is kind of like putting the bread in a baking pan; it helps keep the shape uniform. Crucially, I also hand the oven an attention mask, a label on each pan that marks exactly where the real bread ends and the filler begins. After baking, the model reads that label and discards the padding, keeping only the meaningful output. I think this is a very clever trick.

In our use case, the padding goes on the left side of each loaf, not the right. Because the model generates new tokens by appending to the right end, I want all the real content flush against the right edge, so the oven picks up exactly where the bread ends, not after a row of empty pans. It looks basically like this:

[PAD] [PAD] How many feet in a mile?
[PAD] [PAD] Convert 5 kg to grams.
What is 100 celsius in fahrenheit?

The Implementation

Four things the code must do:

Step 1: Tell the tokenizer to pad on the left:

self.tokenizer.padding_side = "left"

Without this, padding goes on the right, and generation starts after a pile of garbage padding tokens.

Step 2: Tokenize all prompts at once:

inputs = self.tokenizer(prompts, padding=True, return_tensors="pt").to(self.device)

padding=True makes the tokenizer pad shorter sequences automatically. return_tensors="pt" gives back PyTorch tensors. .to(self.device) moves the data onto the GPU.

Step 3: Run generation:

outputs = self.model.generate(
 inputs["input_ids"],
 attention_mask=inputs["attention_mask"],
 max_new_tokens=50,
 do_sample=(temperature > 0),
 temperature=(temperature if temperature > 0 else None),
 eos_token_id=self.tokenizer.eos_token_id,
 num_return_sequences=(num_return_sequences or 1),
)

The attention_mask tells the model which tokens are real and which are padding. max_new_tokens=50 caps response length, without it, the model might generate forever.

Step 4: Decode only the new tokens:

new_tokens = outputs[:, inputs["input_ids"].shape[1]:]
decoded = self.tokenizer.batch_decode(new_tokens, skip_special_tokens=True)

The model's output contains the full sequence: your prompt plus its response. I slice off the prompt and decode only the fresh content, otherwise every answer would start with your original question echoed back till you run out of tokens.

I Ran It: Here's What Happened

No crash. Two outputs, one for each prompt:

Input: "The cat went up" → Output: "the stairs, and the cat went up the stairs, and the cat went up the stairs..."
Input: "The dog went down" → Output: "the stairs and into the basement. The dog went down the stairs and into the basement. Which sentence is correct?"

The Cat: Repetition Collapse. The cat answer loops. This is a famous failure mode. Once the model generates "the cat went up the stairs," that exact phrase is sitting in its input window. The most statistically likely continuation, weirdly enough, was the same phrase again. And again. Until max_new_tokens=50 forced it to stop. This isn't a bug I introduced. It's a known limitation of greedy decoding (temperature=0). Setting temperature > 0 injects randomness that breaks the loop. We'll use that in Part 2.

The Dog: Accidental Pattern Matching. The dog answer is more coherent, it even invents a grammar quiz at the end. I think this happened because the model was trained on school worksheets and educational content, so "The dog went down the stairs and into the basement" sounds like the beginning of a comprehension exercise. The model completes it the way it has seen thousands of times before in training data, with a follow-up question. The model is doing exactly what it's designed to do by pattern matching using statistical likelihood. The model is not reasoning the way you would try and complete the sentence; the model is completing a statistical pattern without comprehension of what that pattern represents.

The test description literally says: "It should produce garbage answers, but it should not crash." Both outputs qualify as garbage. Both also qualify as passing. Part 1 is complete.

The pipeline works end-to-end: text in → tokenize → GPU → generate → decode → text out. Both prompts were processed in a single GPU pass. This scales: 32 prompts in one pass costs roughly the same time as 1 prompt alone.

Part 2: Teaching the Model to Think Out Loud: Chain-of-Thought Prompting

The Surprising Trick That Changed AI Research

Here's something that genuinely surprised the AI research community when it was discovered in 2022.

You can make a language model dramatically better at a task without touching a single weight inside it, just by changing what you write in the prompt. On some benchmarks, this technique took models from failing to near-human performance overnight.

The technique is called Chain-of-Thought (CoT) prompting, and it came out of a 2022 Google Brain paper by Wei et al. that immediately went viral. The core idea: instead of asking the model a cold question, you first show it one worked example that demonstrates how to think through the problem step by step. The model reads that example, recognizes the pattern, and applies the same reasoning chain to every new question it sees.

The research insight: Language models don't just memorize facts, they also internalize reasoning patterns. Show them a pattern once, and they'll apply it to problems they've never seen before.

The New Student Analogy

This is how I understood this idea better. Imagine a new student walks into a math class with a worksheet of 30 unit conversion problems. The new student has not been taught this topic yet.

Without CoT: You hand the new student a blank worksheet and say "figure it out." They guess randomly. Maybe 5% right.
With CoT: Before the worksheet, you hand them one solved example: "To convert kilograms to grams: 1 kg = 1000 grams. So for 6 kg, multiply 6 × 1000 = 6000 grams." The new student reads it, understands the method, and applies it to every problem. Suddenly they're scoring 50–70%.

That's exactly what we're doing. The model is the new student. The worked example is the CoT prompt. Nothing about the model changed, only the hint I handed it.

What the Code Needs to Do

SmolLM2-Instruct was trained to understand chat dialogues structured like this:

System: "You are a helpful assistant. Be concise."
User: "How many grams are in 6 kg?"
A: "1 kg = 1000 grams. 6 × 1000 = <answer>6000</answer>"
User: [the real question I actually want answered]

The model reads that example conversation, made of three parts: the system instruction, a planted example exchange. When it sees the real question, it continues the pattern. To make this work correctly you need to get these three things right:

The system message. Short and directive. Tell the model its job and tell it to be concise. Long-winded models waste tokens and often fail to include the answer tag.
The worked example. The most important creative decision. Play with this but keep it to one clear question with one clear reasoning chain. The reasoning must end in an <answer>42.0</answer> tag because that's what parse_answer() looks for to score the model.
The chat template. SmolLM2 uses special formatting tokens to mark each speaker's turn. I use tokenizer.apply_chat_template() to produce the exact string the model was trained on.

The Two Numbers We're Trying to Hit

The benchmark evaluates the model on 100 unit conversion questions and reports two metrics:

answer_rate: How many times did the model produce a parseable <answer> tag? Measures whether your prompt reliably gets the model to follow the format. Target: ≥ 0.85.
accuracy, Of the answers it provided, how many were correct (within 5% of the right answer)? Measures whether the reasoning works. Target: ≥ 0.50.

These are separate for a reason: a model could produce an answer tag every time but always write the wrong number (answer_rate = 1.0, accuracy = 0.0). Both numbers together tell the full story. This is the same framework real AI benchmarks use, when you read that "GPT-4 scores 87% on the MATH benchmark," they're measuring something exactly like this.

The Full Experimental Log

The targets sound simple. Getting there took five rounds of tuning, and every failure taught us something real.

Attempt	Change	Accuracy	Answer Rate
1	First prompt, kg→g example, 50 tokens	0.41	0.82
2	Switched example to hours→seconds	0.41	0.71
3	Reverted to kg→g, raised to 100 tokens	0.50	0.83
4	Added `=` before answer tag	0.09	0.18
5	Two examples + 150 tokens	0.49	0.86 ✅

Attempt 1: 50 tokens isn't enough room. The model was getting cut off before reaching <answer>. The model also learned a "multiply by a round number" pattern that works for metric conversions but falls apart on messy real-world factors like 1 mile = 1609.344 meters.

Attempt 2: Answer rate dropped. Hours→seconds is less common in the validation set than weight/mass conversions, so the model saw less familiar territory and produced fewer valid tags. Lesson: the example has to be representative of the most common question types, not the hardest ones.

Attempt 3: Accuracy jumped. The token budget fix worked. But answer_rate only moved slightly. The 17 still-failing questions were a specific category: conversions with non-obvious decimal factors (feet↔meters = 0.3048, kg→pounds = 2.2046). The model was generating hedging text ("I'm not sure of the exact factor...") and running out of tokens before the tag even at 100.

Attempt 4: The craziest result. I added an explicit equals sign before the tag in the example: "1 kg = 1000 grams. 6 * 1000 = 6000. <answer>6000</answer>". Scores collapsed from 0.50/0.83 to 0.09/0.18.

In every previous version, the answer tag appeared directly after the equals sign: = <answer>. That visual cue was the trigger the model used to know "now I write the tag." When I broke that pattern by inserting a separate number first, I destroyed the trigger.

Critical lesson: The model is not reading your instructions! It is pattern-matching your example. The format of the example matters more than the words in the system message.

Attempt 5: Kept the kg→g example exactly as proven, raised max_new_tokens to 150, and added a second example featuring a non-integer conversion factor.

I was confused at first reading the README guidance: "Give one good example how to solve the task." That sounded like a hard limit of one. But with more careful reading, it says the chat dialogue can be used to "provide in-context examples" (plural) in a prior assistant message. The bullet point is a recommendation for the minimum, not a cap on the maximum. Don't get caught like I did on this one! I found that two well-chosen examples cover more of the problem space:

Example 1: "How many grams are there per 6 kg?"
 "1 kg = 1000 g. 6 * 1000 = <answer>6000</answer>"

Example 2: "Convert 5 ft to m."
 "1 ft = 0.3048 m. 5 * 0.3048 = <answer>1.524</answer>"

The second example taught the model two things: that conversion factors can be decimal numbers, and that the answer tag always comes immediately after the equals sign with no extra text. Both metrics cleared the grader bar.

The Bigger Lesson

Five experiments to push answer_rate from 0.82 to 0.86 and accuracy from 0.41 to 0.50. It felt like a lot of work for small gains. That said, the experiments ran pretty quickly, so play with it more if you have time. You may even find the one tiny prompt to rule them all!

It may seem weird that I kept making all these small tweaks but what I was really doing was what researchers call an ablation study, which is just a fancy term for changing one thing at a time and measuring what happened. Instead of tweaking five things at once and hoping the score goes up, I changed one variable per attempt: the example, then the token budget, then the format, then the number of examples. That way, when something moved the needle, I knew exactly what worked and what did not work.

Every row in that experiment log is a real data point. The final prompt didn't work by luck. I could be methodical and explain precisely which change caused which improvement. That ability to reason about why something works, not just that it works, is what separates an engineer from someone who got lucky on the first try.

Part 3: Actually Changing the Model: Supervised Fine-Tuning with LoRA

The Line Between Prompting and Training

Everything in Parts 1 and 2 was prompting. Some of it felt like clever prompting but ultimately I never touched a single number inside SmolLM2.

In Part 3 we're now going to change the model itself.

Supervised Fine-Tuning (SFT): take 1,000 labeled training examples (question + correct answer), show them to the model repeatedly, measure how wrong the model's predictions are, and nudge the weights toward being less wrong. After enough nudges the model has internalized unit conversion as genuine knowledge. This is not a pattern it's mimicking from a prompt, but something embedded in the parameters of our version of the model. This is how every specialized AI assistant is built now.

The File Size Problem: Why LoRA Exists

For this project I had a constraint that makes the problem interesting. The final submission had to be under 20MB. The base model itself is 700MB. And a fully fine-tuned copy would be just as large.

Working on this was an interesting limitation because it's the same constraint every production AI team faces. Storing one fine-tuned model per customer or per task would cost a fortune and not be scalable.

LoRA (Low-Rank Adaptation) is the industry-standard answer, published by Hu et al. at Microsoft in 2021. It is one of the most practically impactful ML papers of the past five years, cited over 25,000 times as of 2026, and is how virtually every real-world model customization is done today, from GPT-3 fine-tunes to Llama 2 domain adapters to Stable Diffusion style models.

How LoRA Actually Works

The key insight is mathematical. When you fine-tune a model, you don't need to change every element of every weight matrix independently. The update, the difference between the original weights and the fine-tuned weights, has a low intrinsic rank. In linear algebra, "low rank" means the matrix can be described with far fewer numbers than it appears to contain.

LoRA exploits this by never modifying the original weight matrix W at all. Instead, it freezes W and injects two small trainable side matrices A and B alongside it:

Full fine-tuning:
 Trains W directly (1024 × 1024 = 1,048,576 parameters)

LoRA:
 Freezes W (1024 × 1024) ← never changes
 Trains A (1024 × r) ← small, trainable
 Trains B (r × 1024) ← small, trainable

 At runtime: output = W·x + (A × B)·x

With r=8, the two small matrices together contain only 16,384 parameters, 1.6% of the original matrix. That's what gets saved to disk. That's why our adapter is 17MB instead of 700MB.

The four properties that make LoRA work in practice:

Efficiency. LoRA reduces GPU memory requirements by up to 3x during training. Because W is frozen and never accumulates gradients, the optimizer only tracks the small A and B matrices. Training with fewer than 10,000× the parameters compared to full fine-tuning is common.
Low-rank decomposition. Any matrix ΔW (the weight update) can be written as the product of two lower-rank matrices: ΔW = A × B, where A is d × r and B is r × d. When the rank r is much smaller than d, this is an enormous compression. The LoRA paper shows empirically that the effective rank of real fine-tuning updates is surprisingly small, often rank 4 or 8 is sufficient to capture a full task adaptation.
No catastrophic forgetting. Because the original weights W are completely frozen during training, the model cannot lose previously acquired knowledge. A traditional full fine-tune can cause the model to "forget" general capabilities while learning the new task. LoRA is immune to this because the base knowledge lives in W untouched and only the task-specific delta lives in A × B. This blew my mind.
Zero inference overhead. After training, the low-rank matrices can be mathematically merged directly into W: W_final = W + (A × B). This means a deployed LoRA model runs at exactly the same speed as the base model with no extra computation, no extra memory at inference time. You can also keep them separate and swap adapters on the fly, which is how serving systems handle many specialized model variants from one base.

Our parameters for SFT:

r=8 → controls the expressiveness of the adapter
lora_alpha=32 → scaling factor (alpha/r = 4 magnifies the updates)
target_modules → "all-linear" applies LoRA to every linear layer in the model
bias="none" → don't train bias terms (saves memory, minimal accuracy impact)

I initially tried r=16, which produced a 33MB adapter, over the 20MB homework limit. Halving the rank to r=8 halved the file size to 17MB with no meaningful accuracy loss. For the RFT adapter in Part 4, I bumped back to r=16 since the combined bundle limit was 50MB.

One Critical Difference From Part 2

In Part 2 I used a chat template and showed the model a worked example. Part 3 deliberately does something different.

No chat template. No worked example. Just the question and the answer:

How many grams are in 6 kg? <answer>6000.0</answer>

Why drop the chain-of-thought? Because SFT is teaching the model to produce the right answer directly and I want it to internalize conversion factors themselves, not just how to display a reasoning chain. This also keeps training examples short and consistent, which makes the tokenizer work cleanly within its 128-token maximum.

(Part 4 revisits this decision and merges both approaches, and that's where it gets interesting.)

The Training Run

With r=8, lora_alpha=32, 5 epochs, batch size 32, and learning_rate=2e-4:

Step	Epoch	Loss
10	0.62	0.748
30	1.88	0.356
50	3.12	0.302
80	5.00	0.133
160	5.00	0.219

Loss dropped from 0.75 to 0.13 over 160 steps, a healthy well-behaved curve. I did not see instability or a plateau. Training completed in 10 minutes 37 seconds on a 2017 Intel iMac Pro using Apple's Metal GPU framework. That machine is nearly a decade old at this point, and I was genuinely surprised it finished that quickly. If you're running this on a modern Apple Silicon Mac, the M1, M2, M3, or M4 chips, expect it to be significantly faster. Apple's unified memory architecture on those chips is exceptionally well-suited for this kind of work.

(I initially tried r=16, which produced a 33MB adapter, over the 20MB limit. Halving the rank halved the file size to 17MB. Scores held strong.)

The Results

Metric	Result
accuracy	0.50
answer_rate	1.0

answer_rate = 1.0. One hundred questions. One hundred valid <answer> tags. Perfect. In Part 2, I spent five experiments fighting to push answer_rate from 0.82 to 0.86, and even then it wasn't guaranteed. Fine-tuning rendered that entire struggle irrelevant in a single training run that took ~10 minutes. The model learned to produce the tag reliably, and the best part is it's baked into the weights now. The model can't forget. This is exactly why it's worth learning how to fine-tune a model.

accuracy = 0.50. Matches our Part 2 best, but the mechanism is completely different. Here, accuracy comes from the model having seen 1,000 training examples and adjusted its weights accordingly. Why isn't it higher? Because SFT without reasoning is teaching the model to recall, pattern-matching questions to memorized answers rather than to compute. When it encounters a conversion it didn't see in training, it either knows the answer or it doesn't.

This is exactly the limitation Part 4 is designed to fix.

Scorecard After Three Parts

Method	Accuracy	Answer Rate	Model Changed?	File Size
Base LLM (no prompt)	~0%	~15%	No	0 MB
CoT Prompting	49%	86%	No	0 MB
SFT with LoRA	50%	100%	Yes	17 MB

Prompting unlocked latent knowledge. Fine-tuning locked in output reliability. The accuracy ceiling is similar, but the quality changed fundamentally. The model now never fails to produce an answer, even if the answer is sometimes wrong.

Part 4 attempts to break through the 50% ceiling by giving the model something Part 3 never gave it: a reason to think.

Part 4: Teaching the Model to Think: Rejection Fine-Tuning (RFT)

The Problem With How I Trained in Part 3

Part 3 worked. But there's a fundamental flaw in what the model learned.

Every training example looked like this:

How many grams are in 6 kg? <answer>6000</answer>

No reasoning. No steps. Just question, then answer. The model learned to treat unit conversion as a lookup table and was pattern-matching the question type to a memorized number. When it saw a question type from training, it could recall the answer. When it saw something unfamiliar, it had no good fallback.

Think of a student who memorized 1,000 example answers the night before an exam. They can ace the problems they've seen. They're helpless on anything new. This kind of learning introduces fragility, not just in people, but in the AI systems we're building and relying on.

Part 4 teaches the model to actually do the math! This is really freaking cool.

The Big Idea: Using the AI's Own Good Attempts as Training Data

Here is what makes RFT genuinely novel, and why it sits at the frontier of AI research.

I am going to let the model generate its own training data.

Instead of telling the model what the correct answer is, we:

Ask the model the same question 10–20 times, with randomness so it tries different approaches
Check each attempt: did it get the right answer?
Keep only the attempts that got it right
Train on those winning attempts, including the reasoning chain that led to the correct answer

The result is a dataset where every training example includes why the answer is correct, not just what the answer is. The model learns reasoning patterns, not just answer patterns.

The key insight: You don't need human-labeled reasoning chains. The model can discover its own correct reasoning, and you just need to identify which attempts worked. Correct answers are the reward signal.

Why This Is the Core Idea Behind State-of-the-Art AI

This algorithm comes from a 2023 paper called "RFT" by Yuan et al.. But you may have heard of something more famous that uses the same underlying idea.

OpenAI's o1 and o3 models, the ones that score near-human on math olympiads, use a scaled-up version of this exact loop. The model generates many reasoning attempts. Correct answers survive. The model trains on those survivors. Repeat thousands of times with millions of examples and a much larger model, and you get a reasoning engine that can solve PhD-level problems.

I'm building the same machine, just smaller, but the fundamental algorithm is identical.

The Algorithm

Step 1: Take 1,000 training questions

Step 2: For each question, generate 10 attempts
 using temperature=0.6 (randomness so each attempt differs)

 Q: "How many grams in 6 kg?"
 Attempt 1: "1 kg = 1000 g. 6 * 1000 = <answer>6000</answer>" ✓
 Attempt 2: "6 kg is 6000 grams. <answer>6000</answer>" ✓
 Attempt 3: "I think it's about 60... <answer>60</answer>" ✗
 Attempt 4: "6 * 100 = <answer>600</answer>" ✗

Step 3: Keep the first correct attempt. Discard the rest.
 If none are correct, skip this question.

Step 4: Save survivors to data/rft.json:
 ["How many grams in 6 kg?", 6000.0,
 "1 kg = 1000 g. 6 * 1000 = <answer>6000</answer>"]

Step 5: Train a new LoRA model on these (question, reasoning_chain) pairs

The professor estimated a 90%+ success rate, at least 1 of 10 attempts will be correct for ~900 of the 1,000 questions. Those 900 examples become your new, higher-quality training set.

The Wall I Hit on Our 2017 Mac

I wrote both our updates to datagen.py and rft.py and launched dataset generation but nothing happened. I noted my computer was slowing to a crawl and some open windows were nearly frozen. But I waited. After six minutes I knew something was wrong as the progress bar still read 0%.

I killed the terminal and tried in smaller batches. Still 0%. I tried one prompt at a time. Still 0%. I tried generating a single sequence with temperature. The test ran for nine minutes and produced nothing. This was super annoying! I need a new Mac.

The diagnosis: do_sample=True with temperature > 0 hangs permanently on Intel Mac Metal (MPS) with transformers==4.52.4. This is a known bug where stochastic sampling on the Metal Performance Shaders backend deadlocks during random token selection. Greedy decoding (temperature=0) works perfectly, that's why Parts 1, 2, and 3 all ran fine. The moment I needed randomness to generate diverse reasoning attempts, I hit a hard wall.

I had three options: downgrade transformers and risk breaking everything else, run on CPU (10+ hours), or move to Google Colab. I moved to Google Colab.

Building a Resilient Colab Workflow

I've spent a stupid amount of time babysitting my Colab notebook. Google Colab has one significant limitation that makes me want to cry: runtime timeouts. A free session disconnects after 90 minutes of inactivity, or 12 hours total. I have the paid version and it still times out. I have wasted many hours of work by losing my session mid-training. If that happens you start over and may or may not lose your session again.

I solved this with Drive-backed checkpointing and skip-if-done guards. With these in place it's safer to walk away and come back when it's done. It is a must for every long-running cell. Follows this structure:

Did this step already complete? (check Drive) → Yes: restore, skip.
Is there a partial checkpoint? (check Drive) → Yes: restore, resume.
Neither: start from scratch.
After each epoch: immediately back up to Drive.

DATAGEN_MARKER = 'data/rft.json'
DRIVE_BACKUP = '/content/drive/MyDrive/homework3-v3-AD/saved/rft.json'

if os.path.exists(DATAGEN_MARKER):
 print('✅ Dataset already present — skipping datagen')
elif os.path.exists(DRIVE_BACKUP):
 shutil.copy(DRIVE_BACKUP, DATAGEN_MARKER)
 print('♻️ Restored from Drive — skipping datagen')
else:
 subprocess.run(['python', '-m', 'homework.datagen'])
 shutil.copy(DATAGEN_MARKER, DRIVE_BACKUP)
 print('✅ Dataset generated and saved to Drive')

The result: if Colab disconnected at any point during the one-hour datagen or fifteen-minute training, re-running the notebook would pick up exactly where it left off. No wasted GPU time, no lost data. This is the same pattern used in professional ML training pipelines, I just applied it at the notebook level.

The Final Results

Dataset generation ran successfully. ~920 of 1,000 questions produced at least one correct reasoning chain, matching the professor's predicted 90%+ success rate.

RFT training completed in approximately 12 minutes on the T4 GPU.

Metric	Result
accuracy	0.72
answer_rate	1.0

accuracy = 0.72. The model correctly answered 72 out of 100 validation questions. That's 22 points higher than CoT prompting and 22 points higher than SFT.

answer_rate = 1.0. Perfect, as expected from a fine-tuned model.

The Complete Scorecard

Method	Accuracy	Answer Rate	What Changed
Base LLM (no prompt)	~0%	~15%	Nothing
CoT Prompting	49%	86%	Better question format
SFT with LoRA	60%	100%	Weights updated on answers
RFT with LoRA	74%	100%	Weights updated on reasoning

The jump from SFT (60%) to RFT (74%) is the payoff for everything in Part 4. The model stopped memorizing answer patterns and started learning to reason. Given a question it hasn't seen before, it can now write down the conversion factor, multiply, and arrive at the right answer. This is the same step-by-step process I taught it to mimic in Part 2, but now it's permanently baked into the weights for our local mini model.

Part 5: The Tuning Gauntlet

The model worked. The code ran. The scores were not where I wanted them.

After all four parts, I had a working RFT model and a working SFT model. When I ran them against a held-out test set I couldn't see in advance, the SFT model was scoring 48% accuracy. That's below the threshold for full marks and, more importantly, below where I knew it could be.

This section is about what happened when I stopped building and started tuning.

The Problem With Local Validation

The way model development works in practice: you train on one set of data, validate on a second set you set aside, and eventually test on a third set you've never seen. The validation set is your feedback loop during training. The test set is the real world.

I was validating locally. The test set I was being evaluated on was different — different questions, different phrasings, different numbers. When those two are misaligned, you can get a very flattering local score that doesn't survive contact with reality.

That gap between local and real performance is called overfitting. And I walked straight into it.

Round 1: Moving to Better Hardware (87 → 98)

My first SFT models were trained on my 2017 iMac Pro. It ran, but slowly. I had been conservative with epochs to keep training times manageable. The model learned the output format perfectly — 100% answer rate — but hadn't absorbed enough conversion knowledge to push accuracy past 48%.

The fix was obvious in retrospect: move training to a proper GPU. I retrained on a cloud T4 GPU with stronger settings:

num_train_epochs = 20       # up from ~8 locally
learning_rate = 1e-4
lr_scheduler_type = "cosine"
weight_decay = 0.01

SFT accuracy jumped from 48% to 57%. A massive improvement, but still short of the 60% I was chasing.

Round 2: The Overconfidence Trap (98 → 97)

This is the part where I made the classic mistake.

My local accuracy hit 60% with a more aggressive configuration — lower learning rate, more epochs, a small dropout value. I was excited. I submitted.

Real result: 56%. Worse than the previous attempt.

What happened? I had optimized for my local validation set so hard that the model stopped learning general conversion knowledge and started learning surface patterns specific to that particular set of questions. Neuron A fires for "How many" and Neuron B fires for "kilograms" and together they trigger the memorized answer — but only when the question is phrased the way the training questions were phrased. Rephrase it slightly and the model falls apart.

The telltale sign was the gap between local and real scores. In the good attempt, those numbers matched. In this one, they diverged by 4 percentage points. That gap is the footprint of overfitting.

Round 3: The Surgical Fix (97 → 102)

After a failed experiment swinging too far in the other direction — heavy dropout, fewer epochs, accuracy cratered to 50% — I learned to respect what was already working.

The best real-world result I had was Round 1. Its settings were proven. The only thing it lacked was a tiny bit of regularization to push from 57% toward 60%. So I took Round 1's exact configuration and added a single parameter:

lora_dropout = 0.03

That's it. Everything else stayed identical.

A dropout value of 0.03 is barely there. During each training step, only 3% of activations in the LoRA adapter are randomly zeroed out. It's not enough to slow convergence. But it's enough to prevent the adapter from memorizing specific input patterns, forcing it to learn slightly more generalizable conversion knowledge — what a kilogram-to-gram conversion looks like in general, not just in the questions it had seen.

Local accuracy: 59%. Real accuracy: 60%. Full marks.

The Complete Attempt History

Attempt	What Changed	Local Accuracy	Real Accuracy
1	Local Mac training	48%	48%
2	Moved to cloud GPU, 20 epochs	57%	57%
3	Lower LR, more epochs, dropout 0.05	60%	56% — overfit
4	Aggressive dropout, fewer epochs	50%	— underfit
5	Round 2 settings + lora_dropout=0.03	59%	60%

What the Tuning Actually Taught Me

Your validation set is not your test set. I was optimizing for a metric that didn't represent the real evaluation. When local score went up but real score went down, that was the signal I was overfitting — I just didn't catch it until it cost me a round.

More training is not always better training. Going from 20 to 30 epochs didn't make the model smarter. It made the model more confident about the wrong things. The model memorized surface patterns in the training questions instead of learning the underlying conversion factors.

The smallest effective change wins. After a failed attempt at aggressive regularization cratered my accuracy, I learned to respect what was already working. Round 2 had good convergence and good generalization and was 3 percentage points away from the target. The fix was lora_dropout=0.03 and nothing else.

Dropout works because it breaks co-adaptation. Without dropout, specific neurons in the LoRA adapter can team up to recognize specific training inputs. With dropout, those neurons get randomly silenced during training, so each one has to learn something useful on its own — which is just another way of saying it has to generalize.

The Final Scorecard

Method	Accuracy	Answer Rate	What Changed
Base LLM (no prompt)	~0%	~15%	Nothing
CoT Prompting	45%	86%	Better question format
SFT with LoRA	60%	100%	Weights updated on answers, tuned dropout
RFT with LoRA	74%	100%	Weights updated on reasoning

0% to 74%. Five rounds of tuning. One line of code made the difference at the end.

What I Actually Learned

Four parts. Four techniques. One model.

Part 1 taught us that every language model, no matter how sophisticated, is an autocomplete engine. The architecture is simple. The magic is in the training data and the scale.

Part 2 taught us that models don't just memorize facts, they also memorize reasoning patterns. Show them once how to think through a problem, and they can apply that pattern to problems they've never seen. Chain-of-Thought prompting is one of the most powerful tools in AI engineering as of 2026.

Part 3 taught us that fine-tuning is not magic. It's gradient descent on labeled examples, same as any supervised learning. LoRA taught us that you don't need to update every parameter to change a model's behavior, a tiny adapter can capture the essential changes while leaving the base model intact.

Part 4 taught us the most important lesson: correctness is a reward signal. You don't need human-labeled reasoning chains to teach a model to reason. You let the model try, keep the attempts that worked, and train on those. That's Rejection Fine-Tuning, and scaled up a thousand times, it's the core idea behind the most capable AI reasoning systems in the world.

I went from 0% accuracy to 74%. I did it mostly on a crappy old Mac and a nearly free cloud GPU. I hit real engineering problems, MPS bugs, dependency conflicts, runtime timeouts, and solved each one with the same patterns used in professional ML engineering.

Part 5 taught us that getting a model to work and getting a model to generalize are two different problems. The first submission scored 87. The final one scored 102. The difference wasn't a new architecture or a bigger dataset. It was one line of code — and understanding exactly why that line worked.

If you made it this far, you now understand the actual mechanics behind the most capable AI systems in the world — not as a black box, but as an engineering problem you can reason about, replicate, and build on.

The starter code is at github.com/CipherMindBob/teaching-a-tiny-ai-to-do-math. Fork it. Break it. Make it better.

And if you want more of this — practical AI engineering, real projects, no fluff — subscribe below. I write for people who want to actually build things.

References

Wei, J. et al. (2022). Chain-of-Thought Prompting Elicits Reasoning in Large Language Models. Google Brain / NeurIPS 2022.
Hu, E. et al. (2021). LoRA: Low-Rank Adaptation of Large Language Models. Microsoft Research / ICLR 2022.
Yuan, Z. et al. (2023). Scaling Relationship on Learning Mathematical Reasoning with Large Language Models. RFT paper.
SmolLM2-360M-Instruct, HuggingFace.
Starter code: github.com/CipherMindBob/teaching-a-tiny-ai-to-do-math