How we cut our LLM infra costs to zero πŸš€

Practical Strategies for Eliminating Unnecessary LLM Spend

Sarah Padovani
Jan 29, 2026

Running AI models usually means one thing: expensive server fees. πŸ’Έ

We wanted to run translation models without the "cloud tax." Our solution? Transformers.js with ONNX Runtime, running models directly in the user's browser.
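For context, here's roughly what client-side inference looks like with Transformers.js. This is a minimal sketch, not our exact setup; the model name is illustrative, and any ONNX-converted translation model from the Hub works the same way:

```typescript
import { pipeline } from "@huggingface/transformers";

// Load a translation pipeline once. The weights are downloaded and cached by the
// browser, so everything after this line runs fully client-side.
// "Xenova/opus-mt-en-fr" is just an example model.
const translator = await pipeline("translation", "Xenova/opus-mt-en-fr");

// No server round-trip, no per-request inference bill.
const output = await translator("Hello, how are you?");
console.log(output); // e.g. [{ translation_text: "Bonjour, comment allez-vous ?" }]
```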

But the real challenge was hosting the model files. Here's what we tried:

❌ Approach 1: Direct from Hugging Face

Result: Slow load times + downtime during HF maintenance. Frontend performance took a massive hit.

❌ Approach 2: Local Repo Storage

Result: Pushing heavy LLM weights into a git repo is a recipe for a bloated, unmanageable codebase.

βœ… The Winner: S3 Bucket Hosting

By moving the models to our own S3 bucket (see the config sketch after this list), we hit the sweet spot:

  • Speed: Faster, more reliable delivery than public hubs.

  • Uptime: No more worrying about third-party maintenance.

  • Efficiency: We already used S3 for assets (like our translation CSVs), so it fit perfectly into our stack.
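Pointing the library at the bucket is a small config change. A minimal sketch, assuming Transformers.js's `env` overrides (`remoteHost` / `remotePathTemplate`); the bucket URL, path layout, and model name below are placeholders, not our real values:

```typescript
import { env, pipeline } from "@huggingface/transformers";

// Skip the default check for locally served models and resolve everything remotely.
env.allowLocalModels = false;

// Point model resolution at our bucket instead of the Hugging Face Hub.
// The loader appends the repo's file names (config.json, tokenizer files, the ONNX
// weights) to this base, so the bucket just mirrors the model repo's folder layout.
env.remoteHost = "https://my-models-bucket.s3.amazonaws.com/";
env.remotePathTemplate = "{model}/";

// From here on, model loads hit the bucket, not huggingface.co.
const translator = await pipeline("translation", "opus-mt-en-fr");
```

One practical note: because the browser fetches the files directly, the bucket needs a CORS policy that allows GET requests from your app's origin, or the downloads will be blocked.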

The takeaway: You don't always need a massive GPU server. Sometimes, the most scalable "server" is the one the user already owns. πŸ’»
