DeepSeek-V3 arrived in December 2024 with a pricing model that made every LLM budget spreadsheet obsolete overnight. At a fraction of the cost of GPT-4o or Claude Sonnet, it performs at near-parity on many practical tasks. This post examines what you actually get in production, not just the marketing numbers.
Architecture: Why It Is So Cheap
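DeepSeek-V3 uses a Mixture of Experts (MoE) architecture with 671 billion total parameters, of which only 37 billion (about 5.5%) are activated per token. This is the core reason for its cost efficiency: each token costs roughly the compute of a 37B-parameter model, while the network as a whole keeps the representational capacity of 671B parameters. The full set of weights still has to be held in memory, so the saving is in compute per token rather than in memory footprint.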
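To make the arithmetic concrete, here is a minimal sketch that works out the active fraction from the two parameter counts quoted above. The function names are hypothetical, and the "2 FLOPs per active parameter per token" figure is a common rule of thumb for a forward pass, assumed here for illustration rather than taken from DeepSeek.

```python
# Parameter counts quoted in the text, in billions.
TOTAL_PARAMS_B = 671
ACTIVE_PARAMS_B = 37


def active_fraction(active_b: float, total_b: float) -> float:
    """Share of the model's parameters that take part in one token's forward pass."""
    return active_b / total_b


def approx_forward_flops_per_token(active_params: float) -> float:
    """Rule-of-thumb estimate (an assumption): ~2 FLOPs per active parameter per token."""
    return 2.0 * active_params


fraction = active_fraction(ACTIVE_PARAMS_B, TOTAL_PARAMS_B)
moe_flops = approx_forward_flops_per_token(ACTIVE_PARAMS_B * 1e9)
dense_flops = approx_forward_flops_per_token(TOTAL_PARAMS_B * 1e9)
compute_ratio = dense_flops / moe_flops

print(f"active fraction: {fraction:.1%}")                    # active fraction: 5.5%
print(f"dense vs MoE compute per token: {compute_ratio:.1f}x")  # 18.1x
```

Under these assumptions, a hypothetical dense model with all 671B parameters active would need about 18 times the compute per token. Real serving cost also depends on memory, batching, and communication between experts, which this estimate ignores.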
The key architectural innovations:
- Multi-head Latent Attention (MLA): Reduces KV cache memory during inference by compressing keys and values into a compact latent vector, which is cached in place of the full per-head keys and values. This enables higher throughput at lower memory cost.
- DeepSeekMoE: Uses finer-grained expert routing than standard MoE models. Experts are smaller and more numerous, which improves specialisation without increasing inference cost proportionally.
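- FP8 mixed-precision training: Trained on 14.8 trillion tokens using 8-bit floating point for most compute-heavy operations, with pre-training reported at about 2.664 million H800 GPU hours. This dramatically reduces training cost while maintaining model quality.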
The result is a model that DeepSeek can train and serve at roughly one tenth of the infrastructure cost of a comparable dense model, as a rough estimate, and the company passes most of that saving on to customers.
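The routing idea behind the MoE layers described above can be sketched in a few lines. This is a generic softmax top-k router written for illustration, not DeepSeek's actual implementation: the names `route_token` and `moe_forward`, the 16 toy experts, and `TOP_K = 2` are all assumptions, and the real model's gating function, shared experts, and load balancing are not modelled here.

```python
import math
import random


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def route_token(router_scores, top_k):
    """Pick the top_k experts for one token and renormalise their gate weights."""
    probs = softmax(router_scores)
    chosen = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:top_k]
    norm = sum(probs[i] for i in chosen)
    return {i: probs[i] / norm for i in chosen}


def moe_forward(x, experts, router_scores, top_k):
    """Run only the selected experts and mix their outputs by gate weight."""
    gates = route_token(router_scores, top_k)
    return sum(weight * experts[i](x) for i, weight in gates.items())


# Toy setup (hypothetical sizes): 16 tiny "experts", 2 active per token.
NUM_EXPERTS = 16
TOP_K = 2
experts = [lambda x, scale=i + 1: scale * x for i in range(NUM_EXPERTS)]

rng = random.Random(0)
router_scores = [rng.gauss(0.0, 1.0) for _ in range(NUM_EXPERTS)]

gates = route_token(router_scores, TOP_K)
output = moe_forward(1.0, experts, router_scores, TOP_K)
print(f"experts used: {sorted(gates)} of {NUM_EXPERTS}")
print(f"mixed output: {output:.3f}")
```

Only `TOP_K` of the 16 expert functions are ever called for a given token, which is the property that keeps per-token compute low even as the number of experts grows. Making experts smaller and more numerous, as the finer-grained routing above does, gives the router more combinations to choose from at the same compute budget.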



