How we cut our LLM infra costs to zero
Practical Strategies for Eliminating Unnecessary LLM Spend
Sarah Padovani
Jan 29, 2026
Running AI models usually means one thing: expensive server fees. 💸
We wanted to run translation models without the "cloud tax." Our solution? Transformers.js with ONNX Runtime, running the models directly in the user's browser.
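For context, here's a minimal sketch of what in-browser inference looks like with Transformers.js. The model name and language codes are illustrative placeholders, not necessarily our exact production setup:

```js
// Minimal in-browser translation with Transformers.js (model and language
// codes below are illustrative, not necessarily our production choices).
import { pipeline } from '@huggingface/transformers';

// The first call downloads the ONNX weights, which the browser caches;
// inference then runs entirely on the client via ONNX Runtime Web.
const translator = await pipeline('translation', 'Xenova/nllb-200-distilled-600M');

const [result] = await translator('Running models in the browser is free.', {
  src_lang: 'eng_Latn',
  tgt_lang: 'fra_Latn',
});

console.log(result.translation_text);
```

By default, those weights are pulled straight from the Hugging Face Hub, which is where our story starts.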
But the real challenge was hosting the model. Here's what we tried:
❌ Approach 1: Direct from Hugging Face
Result: Slow load times + downtime during HF maintenance. Frontend performance took a massive hit.
❌ Approach 2: Local Repo Storage
Result: Pushing heavy LLM weights into a git repo is a recipe for a bloated, unmanageable codebase.
✅ The Winner: S3 Bucket Hosting
By moving the models to our own S3 bucket, we hit the sweet spot:
Speed: Faster, more reliable delivery than public hubs.
Uptime: No more worrying about third-party maintenance.
Efficiency: We already used S3 for assets (like our translation CSVs), so it fit perfectly into our stack.
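Pointing Transformers.js at the bucket is a small configuration change. Roughly, it looks like the sketch below; the bucket URL is a placeholder, and it assumes the bucket mirrors each model repo's layout, using the env settings Transformers.js exposes for overriding the download host:

```js
// Sketch: point Transformers.js at our own S3 bucket instead of the Hub.
// The bucket URL is a placeholder, and this assumes the bucket stores files
// under a simple {model-id}/{file} layout.
import { env, pipeline } from '@huggingface/transformers';

// Fetch model files from our bucket (or a CDN in front of it) ...
env.remoteHost = 'https://my-models-bucket.s3.amazonaws.com/';
// ... and drop the Hub-specific "resolve/{revision}" segment from the path.
env.remotePathTemplate = '{model}/';

// Every pipeline created after this resolves its files against the bucket.
const translator = await pipeline('translation', 'Xenova/nllb-200-distilled-600M');
```

Nothing else in the app changes; the pipeline code stays identical to the Hub-backed version.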
The takeaway: You don't always need a massive GPU server. Sometimes, the most scalable "server" is the one the user already owns. 💻