From 6 Hours to 2: Fixing the Memory Wall in Batch AI Pipelines
Why throwing more GPUs at batch inference doesn't work — and the three infrastructure optimizations that actually cut our pipeline runtime by 66%.
At Fractal.ai I'm building the harness around production agents — tool registry, evals, the plumbing around the prompt.
Before that, ML infra at Amazon — pipelines, serving, and deployment for inspection models running on millions of packages a day.
From demo to production
A demo is one prompt that works. Production is the session, harness, sandbox, eval, and guardrails that make it work again tomorrow, after the model has changed.
Append-only event log + memory
The loop: plan, call, retry, exit
Typed, discoverable, permissioned tools
CI for model swaps
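The first three pieces above can be sketched together: an append-only event log that the session owns, a registry of typed, permissioned tools, and a loop that plans, calls, retries, and exits. This is a minimal illustration, not the actual Fractal.ai harness; names like `run_agent`, `Session`, and `Tool` are mine, and the plan is just a callable that yields tool steps.

```python
from dataclasses import dataclass, field
from typing import Callable

@dataclass
class Tool:
    """A registered tool: typed entry point, discoverable description, permission tag."""
    name: str
    description: str
    fn: Callable[[dict], str]
    allowed_roles: set = field(default_factory=lambda: {"user"})

@dataclass
class Session:
    """Append-only event log; memory is derived from the log, never mutated in place."""
    events: list = field(default_factory=list)

    def append(self, kind: str, payload: dict) -> None:
        self.events.append({"kind": kind, **payload})

def run_agent(task: str, session: Session, tools: dict, plan: Callable,
              max_retries: int = 2) -> str:
    """The loop: plan, call, retry, exit."""
    session.append("task", {"text": task})
    for step in plan(task):                          # plan: yields {"tool": ..., "args": ...}
        tool = tools[step["tool"]]
        for _attempt in range(max_retries + 1):
            try:
                result = tool.fn(step["args"])       # call
                session.append("tool_result", {"tool": tool.name, "result": result})
                break
            except Exception as exc:                 # retry: failures go in the log too
                session.append("tool_error", {"tool": tool.name, "error": str(exc)})
        else:
            return "failed"                          # exit: retries exhausted
    return "done"                                    # exit: plan complete
```

The point of the shape, not the specifics: because every call, result, and error lands in one append-only log, replaying a session, debugging a retry storm, or rebuilding memory is a fold over events rather than archaeology on mutable state.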
Agentic infrastructure is the layer between the model and the user — session, harness, tool registry, evals. It's where most of the engineering cost of an AI product now lives.
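"CI for model swaps" deserves one concrete shape: run a fixed eval suite against the candidate model and block the swap on regression, exactly like a test gate on a code change. A minimal sketch, assuming a hypothetical `call_model` client stubbed out here; the suite, names, and threshold logic are illustrative, not a real harness.

```python
# A tiny eval suite: prompt in, expected substring out.
EVAL_SUITE = [
    {"prompt": "2 + 2 =", "expect": "4"},
    {"prompt": "Capital of France?", "expect": "Paris"},
]

def call_model(model: str, prompt: str) -> str:
    # Hypothetical stub; a real harness would call the model API here.
    canned = {"2 + 2 =": "4", "Capital of France?": "Paris"}
    return canned.get(prompt, "")

def pass_rate(model: str) -> float:
    """Fraction of eval cases whose expected answer appears in the output."""
    passed = sum(
        case["expect"] in call_model(model, case["prompt"])
        for case in EVAL_SUITE
    )
    return passed / len(EVAL_SUITE)

def gate_model_swap(candidate: str, baseline_rate: float,
                    tolerance: float = 0.0) -> bool:
    """CI check: allow the swap only if the candidate doesn't regress the suite."""
    return pass_rate(candidate) >= baseline_rate - tolerance
```

Wired into CI, the gate turns "we upgraded the model and things got weird" into a failing check with a diffable pass rate, the same way a unit-test suite catches a bad dependency bump.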
Academic benchmarks show 90%+ accuracy. Production systems hit 10-31%. The dangerous gap is semantic errors that look correct but aren't.
Vector RAG fails at multi-hop reasoning. GraphRAG fixes it — until entity resolution breaks, and errors compound exponentially across every query.
We migrated a B2B UI library from React to Svelte 5. The bundle dropped 60%, FCP improved 400ms, and the real trade-offs were not what we expected.
Compound components, render props, and custom hooks: the advanced React patterns that keep large applications maintainable.