Building a Production Medical AI: What Nature Medicine Didn't Publish

The EAGLE paper is live in Nature Medicine. The model is in production at MSK Cancer Center. It detects EGFR mutations in lung adenocarcinoma from whole-slide pathology images with state-of-the-art accuracy. I built on that research — adding uncertainty quantification and standing up the production inference pipeline.

What the paper doesn't describe is everything that happened between the first commit and the deployment on 24 H100 GPUs. That's what I want to write about.

The Infrastructure Was the Hard Part

GigaPath is a 1.3-billion-parameter pathology foundation model. Fine-tuning it requires serious compute. Our training runs happened on a high-performance computing cluster running Slurm—a job scheduler used in scientific computing that most software engineers have never touched.

The first thing I learned: HPC is not cloud. There's no auto-scaling. You submit a job, it sits in a queue, it runs, it terminates. If your job crashes at hour 47 of a 48-hour run, you lose 47 hours of compute unless you've built checkpointing.
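A submission on such a cluster looks roughly like this (a hypothetical `sbatch` script; the partition name, GPU syntax, and `--resume-from-latest` flag are illustrative and vary by cluster and codebase):

```bash
#!/bin/bash
#SBATCH --job-name=eagle-finetune
#SBATCH --partition=gpu           # partition names are cluster-specific
#SBATCH --gres=gpu:4              # request 4 GPUs on one node
#SBATCH --time=48:00:00           # hard wall-clock limit: the job is killed at 48h
#SBATCH --output=logs/%x-%j.out   # %x = job name, %j = job id

# Resubmit-friendly: the training script resumes from the latest
# checkpoint if one exists, so a requeued job doesn't start from scratch.
srun python train.py --resume-from-latest
```

That `--time` line is the whole story: when the limit hits, Slurm kills the job whether you're done or not.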

We lost a full training run before I built proper checkpoint management. That mistake cost three days.

The lesson that carried forward: in any system where a failure is expensive, you need to know your recovery point. Where does the job restart from? What state is recoverable? This applies equally to AI agents—a long-running agent that fails at step 37 of 40 needs to recover from step 37, not step 1.
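The pattern itself is small. A minimal sketch of crash-safe checkpointing, framework-agnostic (the path and the shape of `state` are illustrative; in training code the state would hold model and optimizer weights):

```python
import os
import pickle

CKPT_PATH = "latest.ckpt"  # hypothetical location

def save_checkpoint(state: dict, step: int, path: str = CKPT_PATH) -> None:
    # Write to a temp file, then atomically rename. A crash mid-write
    # can never leave a half-written file as your only recovery point.
    tmp = path + ".tmp"
    with open(tmp, "wb") as f:
        pickle.dump({"state": state, "step": step}, f)
    os.replace(tmp, path)  # atomic on POSIX and Windows

def load_checkpoint(path: str = CKPT_PATH):
    # Returns (state, step) to resume from, or (None, 0) when starting fresh.
    if not os.path.exists(path):
        return None, 0
    with open(path, "rb") as f:
        ckpt = pickle.load(f)
    return ckpt["state"], ckpt["step"]
```

The atomic rename is the detail people skip: saving directly to the checkpoint path means a crash during the save destroys the very thing you were saving.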

The Thing I Built That Changed Everything

Standard neural networks output a prediction. They don't tell you how confident they are.

For cancer biomarker detection, that's a problem. A false positive means a patient undergoes unnecessary treatment. A false negative delays lifesaving intervention. A model that says "EGFR positive" without surfacing its uncertainty is genuinely dangerous.

I implemented Monte Carlo Dropout: run the same slide through the model 50 times with dropout active, measure the variance across runs. High variance = uncertain prediction = escalate to a pathologist for review.

What this created wasn't just a safety mechanism. It created a triage system. The model handles high-confidence cases autonomously. Uncertain cases go to human review. The pathologist's attention is directed where it's most needed.

This changed my entire philosophy about AI systems. The goal isn't to automate everything. The goal is to automate the right things—and to be honest about where automation runs out.

What Failed That Isn't in the Paper

Whole-slide images are enormous. A single slide can be 2–4 GB. Loading, preprocessing, and running inference on these at scale required a data pipeline that took longer to build than the model did.

The first pipeline was sequential. It processed one slide at a time. That worked for experiments. In production, with hundreds of slides queuing, it was unusable.

The second pipeline used multiprocessing to parallelize preprocessing across CPU cores while inference ran on GPUs. Better. But we had memory leaks in the preprocessing workers that would crash the pipeline after about 200 slides.

The third version was the one that shipped: asynchronous, bounded queues, worker restart logic, explicit memory cleanup between slides. None of this is in the paper. All of it is what makes the paper's results reproducible at scale.

What I'd Build First If I Did It Again

Observability.

I built logging, monitoring, and alerting last. I should have built them first.

In production AI, you need to know: what inputs is the model seeing? What's the distribution of confidence scores over time? Is performance degrading? Are there patterns in the cases that get escalated?

Without that visibility, you're flying blind. You don't know if the system is healthy, if it's drifting, or if there's a class of inputs it's systematically failing on.

Every AI system I build now starts with an observability layer before it starts with a model. I know what I want to measure before I build what I'm measuring.

The Principle I Carry Forward

The best AI systems are honest about what they don't know. They have explicit confidence bounds. They fail loudly when they should. They escalate to humans when they're uncertain. They maintain audit trails. They're designed to be understood and corrected, not just to produce outputs.

This sounds obvious when you say it out loud. It's shockingly rare in practice.

That's the principle behind everything I build and everything I teach. Not "make the AI smarter"—make the AI system more honest, more observable, and more correctable. That's what actually makes it useful.
