Most advice you'll read online boils down to "just use GPT" or "always go open". Both are wrong defaults. The real question is what you're optimising for: cost, quality, privacy, or vendor risk. Answer that first, and the model picks itself.
The honest cost picture (March 2026)
Rough figures in US dollars per million tokens, listed as input / output. Treat them as ballpark; providers shuffle prices every quarter.
Claude Sonnet 4.6 — around $3 in / $15 out
Claude Haiku — around $0.80 in / $4 out
GPT-5 Mini — around $0.50 in / $2 out
Qwen 3 72B (hosted) — around $0.60 in / $0.90 out
Self-hosted Llama 3.3 70B — zero per-token, ~$2 per GPU-hour + ops time
Here's the part people skip: inference cost is almost never your dominant cost. At the volumes most teams actually run, engineering time and eval time dwarf the sticker price, and how well you use prompt caching moves the bill more than the per-token rate does. Optimising for cheap tokens before you've built the feature is premature.
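To put numbers on that, here is a minimal sketch that turns the ballpark prices above into a monthly bill. The workload (10,000 requests a day, 2,000 input and 500 output tokens each) and the model keys are made up for illustration; plug in your own.

```python
# Ballpark prices from the list above, in USD per million tokens (input, output).
PRICES = {
    "claude-sonnet-4.6": (3.00, 15.00),
    "claude-haiku": (0.80, 4.00),
    "gpt-5-mini": (0.50, 2.00),
    "qwen-3-72b-hosted": (0.60, 0.90),
}

def monthly_cost(model, requests_per_day, input_tokens, output_tokens, days=30):
    """Estimated monthly spend in USD for a fixed per-request token profile."""
    price_in, price_out = PRICES[model]
    per_request = (input_tokens * price_in + output_tokens * price_out) / 1_000_000
    return per_request * requests_per_day * days

# Hypothetical workload: 10,000 requests/day, 2,000 input and 500 output tokens each.
for model in PRICES:
    print(f"{model}: ${monthly_cost(model, 10_000, 2_000, 500):,.2f}/month")
```

On this assumed workload Sonnet comes out at about $4,050 a month and Haiku at about $1,080. Compare that gap with what a week of engineering time costs you before deciding which one to optimise.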
If you want quality at scale → Claude Sonnet or Opus
Sonnet is the current default for most builders we know. Strong tool use, strong reasoning, strong instruction-following across long context. When we say "this agent does the right thing 9 times out of 10", we usually mean Sonnet behind it.
Opus is the upgrade when Sonnet can't hit the bar — complex multi-step reasoning, nuanced policy questions, or tasks where a single wrong answer is expensive. Twice the cost, sometimes worth it.
If you want cheapest-that-works → Haiku or Qwen
For classification, simple summaries, template filling, and data extraction, Haiku is more than enough and costs a fraction of Sonnet. Qwen 3 72B on a managed host is in the same bracket and very capable for non-English content.
Where they struggle: multi-step tool use, long-running agents, and anything that requires holding a lot of subtle instructions at once. If you need those, pay up.
If privacy matters → self-host
Self-hosted is the right call when your data genuinely cannot leave your VPC — regulated fintech, health data, or corporate customers with hard contract terms. The real cost isn't the GPUs; it's the person who has to keep vLLM running at 3am.
A quantised Llama 3.3 70B on a single H100 is a perfectly workable setup for internal tools (the full 16-bit weights alone are about 140 GB, more than one 80 GB card holds). Beyond that you're into the same ops burden as running your own database: doable, not free.
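A rough way to see "doable, not free" is to treat self-hosting as a fixed monthly bill and ask what request volume a hosted API would need to reach to cost the same. The sketch below uses the ~$2 per GPU-hour figure from the price list; the ops hours, the hourly rate, and the per-request token profile are hypothetical. It also ignores whether one GPU can actually serve that volume.

```python
GPU_HOURLY_USD = 2.00          # ballpark from the price list above
HOURS_PER_MONTH = 24 * 30      # one GPU left running all month

def self_host_monthly(gpus: int = 1, ops_hours: float = 0.0, ops_hourly_usd: float = 0.0) -> float:
    """Fixed monthly cost: GPU rental plus the human time spent keeping it up."""
    return gpus * GPU_HOURLY_USD * HOURS_PER_MONTH + ops_hours * ops_hourly_usd

def breakeven_requests_per_day(fixed_monthly_usd: float, api_usd_per_request: float, days: int = 30) -> float:
    """Daily request volume at which a hosted API would cost the same as the fixed bill."""
    return fixed_monthly_usd / (api_usd_per_request * days)

# Hosted comparison point: Haiku at $0.80 in / $4 out, 2,000 input + 500 output tokens per request.
haiku_per_request = (2_000 * 0.80 + 500 * 4.00) / 1_000_000

gpu_only = self_host_monthly()
with_ops = self_host_monthly(ops_hours=20, ops_hourly_usd=100)  # hypothetical ops load

for label, fixed in [("GPU only", gpu_only), ("With ops", with_ops)]:
    volume = breakeven_requests_per_day(fixed, haiku_per_request)
    print(f"{label}: ${fixed:,.0f}/month, break-even at {volume:,.0f} requests/day")
```

Under these assumptions the GPU alone breaks even at roughly 13,300 requests a day, and adding 20 hours of ops time pushes that to roughly 31,900. For most internal tools that is a lot of traffic, which is why privacy, not price, is the reason to self-host.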
If you want vendor flexibility → route through the AI Gateway
One config line switches providers. This is underrated. It lets you A/B-test Claude against GPT on real traffic without a code change, fail over automatically during a provider outage, and walk away from a vendor that jacks up prices.
We default to the Vercel AI Gateway for this reason alone. Even if you only ever call Claude, the option value of being able to move is worth it.
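To make the failover idea concrete, here is a conceptual sketch of what a gateway does for you. This is not the Vercel AI Gateway's API: the function names and the stand-in providers are hypothetical, and a real gateway handles this in configuration rather than in your application code.

```python
from typing import Callable, Sequence

# A provider is modelled as any callable taking a prompt and returning text.
Provider = Callable[[str], str]

def complete_with_failover(
    prompt: str, providers: Sequence[tuple[str, Provider]]
) -> tuple[str, str]:
    """Try each provider in order; return (name, response) from the first that succeeds."""
    errors = []
    for name, call in providers:
        try:
            return name, call(prompt)
        except Exception as exc:  # real code should catch the client's specific error types
            errors.append(f"{name}: {exc}")
    raise RuntimeError("all providers failed: " + "; ".join(errors))

# Stand-in providers for illustration only; real ones would wrap an SDK or HTTP call.
def flaky_primary(prompt: str) -> str:
    raise TimeoutError("simulated outage")

def healthy_fallback(prompt: str) -> str:
    return f"echo: {prompt}"

name, text = complete_with_failover(
    "hello", [("primary", flaky_primary), ("fallback", healthy_fallback)]
)
print(name, text)  # fallback echo: hello
```

The point of pushing this into a gateway is that the provider order becomes configuration: changing it is a deploy-free switch rather than a code change.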
The matrix
If you're building...
Internal tools → start with Haiku. Cheap, fast, usually enough.
Customer-facing chat → Sonnet. Quality and safety matter more than cost when your brand is on the line.
Agents and tool use → Sonnet or Opus. Tool-use quality is still the differentiator.
Heavy document processing → Sonnet with prompt caching. The cache savings are huge on repetitive long contexts.
Privacy-critical → self-host Qwen 3 or Llama 3.3. Worth the ops pain only if the data truly can't leave.
Prototyping → whatever you already have an API key for. Ship first, optimise later.
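If you want the matrix somewhere your code can read it, a plain lookup table is enough. The keys below are hypothetical labels for the rows above; the values are starting points, not rules.

```python
# Starting-point model per use case, copied from the matrix above.
MATRIX = {
    "internal-tools": "Haiku",
    "customer-chat": "Sonnet",
    "agents": "Sonnet or Opus",
    "document-processing": "Sonnet with prompt caching",
    "privacy-critical": "self-hosted Qwen 3 or Llama 3.3",
}

def starting_model(use_case: str) -> str:
    """Return the suggested default; anything unlisted is treated as prototyping."""
    return MATRIX.get(use_case, "whatever you already have an API key for")

print(starting_model("agents"))       # Sonnet or Opus
print(starting_model("prototyping"))  # whatever you already have an API key for
```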
The real answer
Pick one. Ship something. Measure. Switch if it doesn't work. The Gateway makes switching cheap, so don't spend weeks agonising about the decision. A shipped product on the wrong model beats a perfectly modelled product that never shipped.
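"Measure" can start very small. The sketch below scores two candidate models against a handful of labelled cases using exact-match accuracy. The models here are stand-in functions and the test cases are invented; in practice each would wrap a real API call and your own examples, and exact match is only a reasonable metric for tasks like classification or extraction.

```python
from typing import Callable

def accuracy(model: Callable[[str], str], cases: list[tuple[str, str]]) -> float:
    """Fraction of cases where the model's answer matches the expected label (case-insensitive)."""
    hits = sum(
        1 for prompt, expected in cases
        if model(prompt).strip().lower() == expected.lower()
    )
    return hits / len(cases)

# Invented evaluation set: (prompt, expected answer).
CASES = [
    ("Sentiment of 'love it'?", "positive"),
    ("Sentiment of 'hate it'?", "negative"),
    ("Sentiment of 'it is fine'?", "neutral"),
    ("Sentiment of 'best ever'?", "positive"),
]

# Stand-in models: model_a answers every case correctly, model_b always says "positive".
ANSWERS = dict(CASES)

def model_a(prompt: str) -> str:
    return ANSWERS[prompt]

def model_b(prompt: str) -> str:
    return "positive"

for name, model in [("model_a", model_a), ("model_b", model_b)]:
    print(f"{name}: {accuracy(model, CASES):.0%}")
```

Run the same cases against whichever model you shipped and against the one you are considering switching to; if the numbers don't move, the switch isn't worth agonising over.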