DEEPSEEK V4

DeepSeek V4 is live. Here is when to route to it — and when to skip it.

V4-Pro and V4-Flash shipped MIT-licensed with a 1M-token context on 24 April 2026. We have been routing production AI-employee traffic to both for three weeks. This is what we found.


Two model variants, one MIT licence, a million-token context window, and a price per million tokens that is roughly one-seventh of GPT-5.5. DeepSeek V4 is the first 2026 release that genuinely changes the math on AI-employee routing.

What shipped

DeepSeek shipped two production models on 24 April 2026: V4-Pro (1.6 trillion total parameters, 49 billion active per token via Mixture-of-Experts) and V4-Flash (284 billion total, 13 billion active). Both ship under MIT — unrestricted commercial use — with weights live on Hugging Face. Pre-training ran on 32 trillion+ tokens. Both models default to a 1M-token context window with up to 384K tokens of output.

The architecture stacks several named pieces: Compressed Sparse Attention layered with Heavily Compressed Attention (DeepSeek's own DSA lineage), plus Manifold-Constrained Hyper-Connections and the new Muon optimizer. The headline number is the cost of long-context inference. At a 1M-token prompt, V4-Pro uses roughly 27% of the per-token FLOPs and 10% of the KV-cache of V3.2. That is what makes the 1M context window a practical primitive for agents instead of a marketing slide.
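To make those figures concrete, here is a back-of-envelope sketch of what the 27% FLOPs and 10% KV-cache numbers imply for serving a 1M-token prompt. The 50/50 compute-versus-memory weighting is our own illustrative assumption, not a DeepSeek figure.

```python
# Back-of-envelope: V4-Pro's relative cost for a 1M-token prompt vs V3.2,
# using the headline figures above. The compute/memory split is an
# assumption for illustration only.
flops_ratio = 0.27     # V4-Pro per-token FLOPs as a fraction of V3.2
kv_cache_ratio = 0.10  # V4-Pro KV-cache footprint as a fraction of V3.2

COMPUTE_WEIGHT = 0.5   # hypothetical: half the marginal serving cost is compute-bound
MEMORY_WEIGHT = 0.5    # hypothetical: half is KV-cache / memory-bandwidth-bound

relative_cost = COMPUTE_WEIGHT * flops_ratio + MEMORY_WEIGHT * kv_cache_ratio
print(f"V4-Pro long-context serving cost ≈ {relative_cost:.1%} of V3.2")  # 18.5% under this split
```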

Both models support three reasoning modes — Non-Think, Think High, and Think Max. Think Max requires at least 384K tokens of context and burns proportionally more output tokens. The legacy deepseek-chat and deepseek-reasoner endpoints are deprecated and fully retire on 24 July 2026; teams already calling those endpoints should migrate now.
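For teams on the deprecated endpoints, the migration is mostly a model-string change if the V4 API keeps the current OpenAI-compatible shape. A minimal sketch, with the caveat that the V4 model id and the reasoning-mode field below are placeholders we have not confirmed against the official docs:

```python
# Minimal migration sketch off the deprecated `deepseek-chat` endpoint.
# Assumes the V4 API stays OpenAI-compatible like the current DeepSeek API;
# "deepseek-v4-pro" and the reasoning_mode field are placeholders -- check
# the official docs for the real names.
from openai import OpenAI

client = OpenAI(
    api_key="YOUR_DEEPSEEK_API_KEY",
    base_url="https://api.deepseek.com",
)

resp = client.chat.completions.create(
    model="deepseek-v4-pro",   # was: "deepseek-chat" (retires 24 July 2026)
    messages=[{"role": "user", "content": "Refactor this module for testability."}],
    extra_body={"reasoning_mode": "think_high"},  # hypothetical field for Non-Think / Think High / Think Max
)
print(resp.choices[0].message.content)
```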

Where V4-Pro wins, where it loses

We re-routed a sample of production agent traffic to V4-Pro Think Max for three weeks before publishing this. The benchmark posture matches our observed behaviour:

  • Coding wins. 93.5 on LiveCodeBench (SOTA among open models), Codeforces rating 3,206, SWE-bench Verified 80.6 — within 0.2 points of Claude Opus 4.6 on the agentic-software task that matters most.
  • Knowledge loses. MMLU-Pro 87.5 (vs Gemini 3.1 Pro at 91.0), GPQA Diamond 90.1 (vs 94.3), MRCR 1M long-context recall 83.5 (vs Opus 4.6 at 92.9). Anything that depends on rare-fact retrieval over very long context is still better served by Gemini.
  • Math is split. IMOAnswerBench 89.8 is strong but trails GPT-5.4 xHigh at 91.4. The gap matters for symbolic-math tasks and closes for everything else.

Coverage from TechCrunch and MIT Tech Review framed V4 as “3–6 months behind frontier.” That framing is right on the average, wrong in the cells. On code, V4-Pro is at the frontier. On long-context recall, it is a year behind Gemini.

Pricing reality (and the arbitrage)

The official DeepSeek API lists V4-Pro at $1.74 input / $3.48 output per million tokens, with cached-input at ~$0.14. V4-Flash sits at $0.14 / $0.28, roughly twelve times cheaper for a one-to-two-point benchmark loss. A launch promo through 31 May 2026 takes 75% off the official rates.
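A quick way to sanity-check the delta per task, using the list rates above. The token counts are illustrative; substitute your own traffic profile.

```python
# Per-request cost at $/1M-token list rates, with optional promo discount.
def task_cost(input_tok, output_tok, in_rate, out_rate, discount=0.0):
    cost = (input_tok / 1e6) * in_rate + (output_tok / 1e6) * out_rate
    return cost * (1 - discount)

# Official list rates ($ per 1M tokens), from the post.
V4_PRO = (1.74, 3.48)
V4_FLASH = (0.14, 0.28)

# Example: a 120K-token repo-context prompt with an 8K-token answer.
print(task_cost(120_000, 8_000, *V4_PRO))                 # ≈ $0.237
print(task_cost(120_000, 8_000, *V4_PRO, discount=0.75))  # promo through 31 May 2026 ≈ $0.059
print(task_cost(120_000, 8_000, *V4_FLASH))               # ≈ $0.019
```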

But the same MIT-licensed weights are available across multiple inference providers, and the price-per-1M varies up to fivefold:

  • OpenRouter: $0.435 / $0.87 — cheapest, but check the underlying provider per request.
  • DeepInfra and Fireworks AI: $1.74 / $3.48 — list price; Fireworks supports function calling but not vision or JSON mode at time of writing.
  • Together AI: $2.10 / $4.40 — most expensive, and the context window is capped at 512K, not 1M. Easy to miss in a copy-paste integration.
  • NVIDIA NIM: available via build.nvidia.com for evaluation.

A managed routing layer captures this arbitrage automatically. A direct API integration pays whatever sticker the chosen vendor charges, and re-discovers the cheaper path every quarter when finance asks why the bill grew.
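A minimal sketch of that arbitrage logic, using the rates and context caps listed above; the selection function is illustrative, not our production router.

```python
# Pick the cheapest listed provider that can actually fit the request.
PROVIDERS = [
    # (name, $/1M input, $/1M output, max context in tokens)
    ("openrouter", 0.435, 0.87, 1_000_000),
    ("deepinfra",  1.74,  3.48, 1_000_000),
    ("fireworks",  1.74,  3.48, 1_000_000),
    ("together",   2.10,  4.40,   512_000),  # 512K cap, easy to miss
]

def pick_provider(prompt_tokens: int, expected_output_tokens: int):
    """Cheapest provider whose context window fits the whole request."""
    total = prompt_tokens + expected_output_tokens
    eligible = [p for p in PROVIDERS if total <= p[3]]
    if not eligible:
        raise ValueError("request exceeds every provider's context cap")
    return min(eligible, key=lambda p: prompt_tokens / 1e6 * p[1]
                                       + expected_output_tokens / 1e6 * p[2])

print(pick_provider(800_000, 20_000)[0])  # -> "openrouter"; Together is over its 512K cap
```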

The risk side nobody is writing about

NIST’s CAISI evaluation of recent DeepSeek models found a 94% jailbreak compliance rate and observed the model was roughly 12× more likely to follow malicious instructions than current US frontier models. Cisco, CSIS and other groups have flagged systematic censorship of CCP-sensitive topics (~85% suppression in earlier evaluations) and code-quality regressions for prompts referencing certain groups. These are real findings, and the V4 model card does not address them directly.

That does not disqualify V4 from production use. It changes the architecture of the call: a routing layer with policy guards, an output-side classifier, and a human approval gate for sensitive actions makes V4 deployable. A direct API call from an internal tool to the official DeepSeek endpoint, with no policy layer, does not.
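The shape of that guard layer, sketched below; the policy check, output classifier, and approval gate are stand-ins for whatever moderation and review tooling you already run.

```python
# Skeleton of the guarded-call pattern described above. `policy_ok`,
# `classify_output`, and `needs_human_approval` are stand-ins you would
# wire to your own policy engine, output classifier, and review queue.
def guarded_v4_call(task, prompt, call_model, policy_ok, classify_output, needs_human_approval):
    if not policy_ok(task, prompt):              # input-side policy guard
        return {"status": "blocked", "reason": "input_policy"}

    draft = call_model(prompt)                   # the actual V4 call

    verdict = classify_output(draft)             # output-side classifier
    if verdict != "allow":
        return {"status": "blocked", "reason": f"classifier:{verdict}"}

    if needs_human_approval(task):               # human gate for sensitive actions
        return {"status": "pending_review", "draft": draft}

    return {"status": "ok", "output": draft}
```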

Where it goes in our routing matrix

Three weeks of production traffic later, this is how V4 sits in our routing rules:

  • V4-Pro Think Max — agentic coding tasks, repo-scale refactors, and long-context document review where the cost-per-task with Opus 4.7 had been the operating constraint. Roughly 30% of code-generation traffic re-routed.
  • V4-Flash — high-volume background classification, summarisation and triage where the per-token cost of GPT-5 had been adding up. The 12× price delta vs V4-Pro funds the routing experiment.
  • Reserved for Opus 4.7 / GPT-5.5 / Gemini 3.1 Pro: long-form customer-facing writing, contract review, knowledge-heavy retrieval where MMLU-Pro and MRCR matter, and any sensitive-topic surface where the safety findings haven’t been mitigated by an output classifier yet.
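In config terms, the matrix above boils down to something like this; the task keys and model ids are illustrative, not our production schema.

```python
# The routing matrix as a task-type config. The point is that the decision
# is keyed on task type and made once, not per procurement cycle.
ROUTING_RULES = {
    "agentic_coding":          {"model": "v4-pro",  "mode": "think_max"},
    "repo_refactor":           {"model": "v4-pro",  "mode": "think_max"},
    "long_doc_review":         {"model": "v4-pro",  "mode": "think_max"},
    "bulk_classification":     {"model": "v4-flash"},
    "triage_summarisation":    {"model": "v4-flash"},
    "customer_facing_writing": {"model": "opus-4.7"},
    "contract_review":         {"model": "opus-4.7"},
    "knowledge_retrieval":     {"model": "gemini-3.1-pro"},
    "sensitive_topics":        {"model": "gpt-5.5"},  # until the output classifier lands
}

def route(task_type: str) -> dict:
    # Conservative default: unknown task types go to a frontier model, not V4.
    return ROUTING_RULES.get(task_type, {"model": "gpt-5.5"})
```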

Net: V4 is in the stack on day one. It is not the new default for everything. The point of running a managed routing layer is that this kind of decision happens once, per task type, by us — and stops being a procurement cycle every time a new model ships.

If you want to see how V4 fits alongside the other 31 vendors and surfaces we route across, the AI stack page is the directory. If you want the version of this for the dev-tooling layer instead, the Claude Code vs Cursor vs Codex comparison covers what we actually run.

— The Rebotify team

Related: The AI stack behind your AI employee

48-HOUR START

One workflow. One named AI employee. Routed across the right model, every time.

Email Mia

Mia is our AI employee. Email her — she’ll book your 15-minute call. That’s the demo.