Interactive Analysis: The Economics of AI Infrastructure

How Should We Buy LLM Compute?

As GPU prices fall and open-source models mature, the buy-vs-rent calculus for AI infrastructure is shifting. For a growing number of organizations, owning hardware is becoming a serious option worth pricing out.

By Josh Harris
Data updated 2026-01-18

Every team building with LLMs faces the question of how to buy compute. For most workloads, the answer has long been straightforward—rent from the cloud. But LLM inference changed the calculus. It's expensive, in high demand, and often involves sensitive data. Meanwhile, consumer hardware like Apple's M-series chips and NVIDIA's DGX Spark has become surprisingly capable—easy to run, easy to secure, and sometimes more convenient than renting compute.

At AnswerLayer, we deploy on customer infrastructure—which means we think constantly about how to maximize value from whatever compute power our customers own or plan to buy. The calculus isn't simple. Cloud providers offer raw speed—an H100 cluster can process tokens 2-4× faster than consumer hardware. API providers abstract infrastructure entirely. The question isn't which is "best," but which makes sense for a company's specific workload and time horizon.

I built this calculator to answer a simplified question: how many tokens does it take to make buying hardware worthwhile? The answer depends on your workload, your time horizon, and your assumptions about future pricing—which is where supply chain dynamics come in.
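A back-of-the-envelope version of that question, in code. This is a minimal sketch with illustrative numbers, not the calculator's implementation:

```python
# Rough break-even sketch: how many tokens must you run locally before
# the hardware pays for itself versus the cheapest per-token API price?
# Numbers below are illustrative placeholders, not the calculator's data.

hardware_cost_usd = 9_899        # one-time purchase price
api_price_per_1m = 0.12          # blended $ per 1M tokens at the cheapest API

tokens_to_break_even = hardware_cost_usd / api_price_per_1m * 1_000_000
print(f"{tokens_to_break_even / 1e9:.1f}B tokens to break even")  # ~82.5B tokens

# At a given daily volume, that translates into a payoff period:
tokens_per_day = 36_300_000
payoff_days = tokens_to_break_even / tokens_per_day
print(f"{payoff_days / 365:.1f} years")  # ~6.2 years at this utilization
```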

Note: This calculator presents a heuristic for thinking about the cost of compute—one important factor in the buying decision, but not the only one. Other considerations include latency requirements, data privacy, operational complexity, and organizational capacity to manage infrastructure.

The GPU Supply Chain

Risk factors that could affect future compute pricing

The calculator below assumes today's pricing holds steady. The consensus view is that compute costs will continue falling as they have historically. What if that's not the case? Understanding the risks to that view helps frame the buy-vs-rent decision.

Demand still accelerating: The IEA projects data center electricity consumption will double by 2026, driven primarily by AI workloads. Enterprise AI adoption is still in its early innings.

Supply can't keep up: TSMC's CEO says capacity is "three times short" of demand. HBM memory faces 6-12 month lead times.

Geopolitical concentration: TSMC commands roughly 70% of the global foundry market and produces virtually all leading-edge chips. A disruption would be felt first at the margins of highest performance and new capacity, pushing more demand onto existing supply.

Years to expand: New fabs take 2-4 years to build; capacity announced in 2025 won't come online until 2027 or later.

The key question: At what utilization level does owning hardware make sense?

The calculator below shows the break-even point at today's prices. If cloud/API costs rise, that threshold drops; if they fall, it rises. Your view on these supply chain dynamics shapes how you interpret the numbers.
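Here is a minimal sketch of that sensitivity, assuming a fixed payoff horizon. The three-year window and the prices below are illustrative assumptions, not the calculator's data:

```python
# Sketch: how the break-even utilization threshold moves with API pricing.
# If you want the hardware to pay off within a fixed horizon, the required
# daily token volume scales inversely with the per-token price.

hardware_cost_usd = 9_899
horizon_days = 3 * 365           # e.g. a three-year depreciation window

def tokens_per_day_to_break_even(price_per_1m: float) -> float:
    return hardware_cost_usd / (price_per_1m / 1_000_000) / horizon_days

for price in (0.12, 0.24, 0.06):  # today's price, a 2x rise, a 2x fall
    print(f"${price}/1M -> {tokens_per_day_to_break_even(price) / 1e6:.1f}M tokens/day")
# Doubling the API price halves the daily volume needed to justify buying;
# halving the price doubles it.
```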

Configure Your Scenario

Select your hardware configuration, define your workload, and see how the costs compare across providers. The numbers update in real time as you adjust parameters.

[Interactive calculator. Example configuration: 1× Mac Studio M3 Ultra with 512GB unified memory (memory options: 96/128/192/256/512GB) at $9,899 USD. Models loaded: gpt-oss-120b at 320GB (×4) and gpt-oss-20b at 128GB (×8), using 448GB of the 512GB available; sizing covers model weights + KV cache only. Runtime: 12h/day (options: 1h/12h/24h). Input:output token ratio: adjustable from 1:99 to 99:1. The ratio affects API costs: code generation uses more output, while RAG and summarization use more input.]
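The first constraint on the local option is simply whether the selected models fit in memory. A minimal sketch of that check, using the example footprints above; real sizing also depends on quantization, batch size, and context length:

```python
# Will the selected models fit in the machine's unified memory?
# Footprints are the example values shown in the calculator, not measurements.

unified_memory_gb = 512
model_footprints_gb = {
    "gpt-oss-120b": 320,   # weights + KV cache, as configured
    "gpt-oss-20b": 128,
}

used_gb = sum(model_footprints_gb.values())
print(f"{used_gb}GB / {unified_memory_gb}GB used")   # 448GB / 512GB
assert used_gb <= unified_memory_gb, "Configuration does not fit in memory"
```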

Your Configuration at a Glance

Based on your selected hardware and workload

Local Hardware: $9,899 (Mac Studio M3 Ultra, 512GB)
Throughput: 840 tok/s (36.3M tokens/day)
Cheapest API: $4/day (Together @ $0.12/1M)
Payoff Time: 6.4 years (break-even vs the cheapest API)
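The at-a-glance figures follow from a few lines of arithmetic. A sketch that mirrors them, using rounded inputs, so the results differ slightly from the calculator's unrounded values:

```python
# Throughput -> daily tokens -> daily API cost for the same workload.

throughput_tok_s = 840           # sustained local throughput
hours_per_day = 12

tokens_per_day = throughput_tok_s * hours_per_day * 3600
print(f"{tokens_per_day / 1e6:.1f}M tokens/day")     # ~36.3M

cheapest_api_per_1m = 0.12       # Together, blended $/1M (rounded)
api_cost_per_day = tokens_per_day / 1e6 * cheapest_api_per_1m
print(f"${api_cost_per_day:.2f}/day")  # ~$4.35 (calculator shows $4.29 from unrounded rates)
```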

Cloud GPU Rental

Rent raw compute by the hour. Faster than local hardware, but you pay for every minute of runtime.

Provider Comparison

Hours adjusted to produce 36.3M tokens/day — same as local

Provider | GPUs | Cloud tok/s | Speedup | Hrs needed | Cluster $/hr | $/day | $/mo | Payoff
Denvr ($2.10/GPU/hr) | 6× H100 | 1600 | 1.9× | 6.3h | $12.60 | $79.38 | $2381 | 4.2 months
RunPod ($2.49/GPU/hr) | 6× H100 | 1600 | 1.9× | 6.3h | $14.94 | $94.12 | $2824 | 3.5 months
Lambda ($2.99/GPU/hr) | 6× H100 | 1600 | 1.9× | 6.3h | $17.94 | $113.02 | $3391 | 2.9 months
GCP ($3.00/GPU/hr) | 6× H100 | 1600 | 1.9× | 6.3h | $18.00 | $113.40 | $3402 | 2.9 months
AWS ($3.90/GPU/hr) | 6× H100 | 1600 | 1.9× | 6.3h | $23.40 | $147.42 | $4423 | 2.2 months
Hours needed: Cloud is 1.9× faster, so you only need 6.3h to match 12h of local output.
Daily cost: 6.3h × $12.60/hr = $79.38/day for equivalent work.
Payoff period: $9,899 ÷ $79.38/day = 125 days to break even.
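Each rental row applies the same arithmetic. Here is a sketch using the Denvr figures; other providers differ only in the per-GPU rate:

```python
# Cloud GPU rental: rent only the hours needed to match local daily output.
# Figures mirror the Denvr row above.

local_hours = 12
cloud_speedup = 1600 / 840                    # cloud tok/s vs local tok/s (~1.9x)
hours_needed = local_hours / cloud_speedup    # ~6.3h for the same token count

gpus = 6
rate_per_gpu_hr = 2.10
cost_per_day = hours_needed * gpus * rate_per_gpu_hr
cost_per_month = cost_per_day * 30

hardware_cost_usd = 9_899
payoff_days = hardware_cost_usd / cost_per_day

print(f"{hours_needed:.1f}h/day, ${cost_per_day:.2f}/day, "
      f"${cost_per_month:.0f}/mo, payoff in {payoff_days:.0f} days")
# -> ~6.3h/day, ~$79.38/day, ~$2381/mo, payoff in ~125 days (~4.2 months)
```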

Open Source Model APIs

Run open-weight models via inference providers like Together, Groq, and DeepInfra

Pay per token with zero infrastructure. The simplest path to production—but costs scale directly with usage.

Provider Comparison

Pay-per-token pricing for the workload — 36.3M tokens/day. Note: only providers offering both models in your workload are shown.

Provider | Input $/1M | Output $/1M | Blended $/1M (weighted) | $/day | $/mo | Payoff
Together (gpt-oss-120b $0.24/1M, gpt-oss-20b $0.08/1M) | $0.07 | $0.30 | $0.12 | $4.29 | $129 | 6.4 years
OpenAI (gpt-oss-120b $4.00/1M, gpt-oss-20b $1.76/1M) | $1.43 | $5.73 | $2.29 | $83.22 | $2497 | 4.0 months
API vs GPU Rental vs Local
Cheapest API: Together @ $4.29/day (6.4 years payoff) — about 18.5× cheaper than the cheapest GPU rental at $79.38/day
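The blended rate is a weighted average of input and output pricing at the selected input:output ratio, which in this example configuration appears to be 4:1 (80% input tokens). A sketch using the Together row's effective rates:

```python
# Blended $/1M = weighted average of input and output rates at the chosen
# input:output token ratio (assumed 4:1 here). Rates mirror the Together row.

input_rate, output_rate = 0.07, 0.30     # effective $/1M across both models
input_share = 4 / (4 + 1)                # 4:1 input:output ratio -> 80% input

blended = input_share * input_rate + (1 - input_share) * output_rate
print(f"${blended:.2f}/1M blended")      # ~$0.12

tokens_per_day_m = 36.3
cost_per_day = tokens_per_day_m * blended
print(f"${cost_per_day:.2f}/day")        # ~$4.21 (table shows $4.29 before rounding)

payoff_years = 9_899 / cost_per_day / 365
print(f"{payoff_years:.1f} years payoff")  # ~6.4
```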

Proprietary Model APIs

Closed-weight models from OpenAI, Anthropic, Google, and other labs

Provider | Input $/1M | Output $/1M | Blended $/1M (weighted) | $/day | $/mo | Payoff
Google (Gemini 3 Flash $1.00/1M, Gemini 2.0 Flash $0.16/1M) | $0.20 | $1.02 | $0.36 | $13.06 | $392 | 2.1 years
OpenAI (GPT-5 $3.00/1M, GPT-5 Nano $0.12/1M) | $0.34 | $2.69 | $0.81 | $29.24 | $877 | 11.3 months
Anthropic (Claude Sonnet 4.5 $5.40/1M, Claude 3 Haiku $0.45/1M) | $0.90 | $4.52 | $1.63 | $59.10 | $1773 | 5.6 months
Amazon Bedrock (Claude Sonnet 4.5 $5.40/1M, Claude 3 Haiku $0.45/1M) | $0.90 | $4.52 | $1.63 | $59.10 | $1773 | 5.6 months
How this works: Each row shows what it would cost to run your exact workload using that provider's comparable models. Your workload spans two tiers, so each provider uses multiple models.

Bottom Line

Running 2 models for 12h/day on Mac Studio M3 Ultra (512GB):

Local Hardware: 12h/day @ 840 tok/s, one-time cost $9,899
GPU Rental (Denvr): 6.3h/day @ 1600 tok/s, $79.38/day = $2381/mo
API (Together): $0.12/1M tokens, $4.29/day = $129/mo

Hardware takes 6.4 years to pay off vs API (Together) at this utilization level.

Why Local Matters

Beyond the economics: security, sovereignty, and control

For a product like AnswerLayer, customers in healthcare, finance, legal, and the public sector can't send their data to third-party APIs. Data sovereignty requirements are also tightening: governments worldwide are mandating local storage and restricting cross-border transfers. Europe has issued over €5.65 billion in GDPR fines since 2018, and the EU AI Act adds new obligations in 2026.

Local inference keeps sensitive data entirely on-premises. No API logs, no third-party training pipelines, no policy changes from providers. For regulated industries, this isn't optimization—it's compliance.

When Cloud Wins

Convenience, guarantees, and the value of not managing infrastructure

Local hardware isn't always the answer. Cloud and API providers offer real advantages: no hardware procurement, no maintenance, no capacity planning. For teams without dedicated infrastructure expertise, managed services eliminate entire categories of operational burden.

SLAs matter too. Cloud providers guarantee uptime, offer redundancy across regions, and handle failover automatically. Local hardware is a single point of failure unless you invest in redundancy—which multiplies the cost.

A note on pricing: The cloud GPU rates above are spot/on-demand prices. Reserved instances and usage commitments can reduce costs 30-60%, which would extend the payoff period. The calculator shows the baseline case—your actual economics may differ.
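To see how much a commitment discount can shift the picture, here is a rough illustration using the Denvr on-demand figure from above and the 30-60% range just mentioned (assumed values, for intuition only):

```python
# Illustration: a usage-commitment discount lowers rental cost, which in turn
# lengthens the payoff period for buying hardware instead of renting.
on_demand_per_day = 79.38          # Denvr row, spot/on-demand
hardware_cost_usd = 9_899

for discount in (0.0, 0.30, 0.60):
    daily = on_demand_per_day * (1 - discount)
    months = hardware_cost_usd / daily / 30
    print(f"{discount:.0%} discount -> ${daily:.2f}/day, payoff ~{months:.1f} months")
# 0%  -> $79.38/day, ~4.2 months
# 30% -> $55.57/day, ~5.9 months
# 60% -> $31.75/day, ~10.4 months
```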

The Open Source Option

When open-weight models match proprietary performance

The API providers above offer access to both proprietary and open-source models. What's changed is that open models now compete on quality. According to the Artificial Analysis leaderboard, open models like DeepSeek R1, Qwen 2.5, and Llama 4 now rival closed models on key benchmarks. Qwen has overtaken Llama as the most-downloaded base model for fine-tuning. For many applications—summarization, extraction, code generation—the gap with proprietary models has effectively closed.

There's another factor to consider: today's API prices are heavily subsidized by VC funding. OpenAI has raised over $78B and Anthropic over $33B—much of it deployed to capture market share at below-cost pricing. When these subsidies end (and they will), API costs could rise significantly. Google's pricing is an exception—it isn't VC-dependent—but its low prices are still the product of competitive pressure that may not last.

For applications where data stays local anyway, the choice isn't about quality anymore. It's about cost predictability, latency, and control.

Right-Sizing Models

When smaller models outperform giants

One insight worth highlighting: bigger isn't always better. Research shows that fine-tuned smaller models significantly outperform zero-shot generative AI models like ChatGPT and Claude on text classification tasks. Together AI documented a case where a fine-tuned Gemma 3 27B outperformed Claude Sonnet by 60% on healthcare scribing—at 10-100× lower inference cost.

With techniques like LoRA, teams can fine-tune models for specific domains with modest compute. The pattern is consistent: task-specific training data often beats raw parameter count—and these specialized models can run on hardware that fits under a desk.
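As a minimal illustration of how lightweight this can be, here is a sketch using the Hugging Face peft library. The base model name and hyperparameters are placeholders, not a recommendation:

```python
# Minimal LoRA fine-tuning setup using Hugging Face transformers + peft.
# Model name and hyperparameters below are illustrative placeholders.
from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

base = "meta-llama/Llama-3.1-8B"                 # any open-weight base model
model = AutoModelForCausalLM.from_pretrained(base)

lora_config = LoraConfig(
    r=16,                                        # low-rank adapter dimension
    lora_alpha=32,
    target_modules=["q_proj", "v_proj"],         # attention projections only
    lora_dropout=0.05,
    task_type="CAUSAL_LM",
)

model = get_peft_model(model, lora_config)
model.print_trainable_parameters()               # typically <1% of base weights
# Train the adapter on your task-specific examples (e.g. with transformers'
# Trainer); the resulting model can run on desk-side hardware.
```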

The market signal: NVIDIA research shows that with ~100 labeled examples, a well-tuned small model reaches parity with large LLMs on specialized tasks.

Key Assumptions

This calculator makes several simplifying assumptions. It assumes consistent daily usage—real workloads are often bursty. Hardware depreciation and electricity costs are excluded, which favors local hardware. Cloud providers may impose minimum commitments not reflected here. We focus on the Mac Studio and DGX Spark because they're readily available—traditional GPU workstations (RTX 4090/5090) face ongoing supply constraints that make them difficult to source at reasonable prices. The calculator shows throughput (tokens/second) but not time-to-first-token latency, which matters for interactive applications.

This calculator is open source. Found an error in the data? Have a suggestion? Contributions welcome.