Real pricing data for GPT-4.1, Claude, Gemini, DeepSeek, and Llama. Tiered model strategy, prompt caching, and Router block implementation in AACFlow. A real 10x cost reduction case study.
Running AI workflows at scale has a cost structure that surprises most teams at the moment of their first significant production bill. A workflow that costs $0.02 per run during development can easily cost $2.00 per run in production once you add larger inputs, longer chains, and real-world variability. At 10,000 runs per day, that is $20,000 daily instead of $200.
The fix is not to switch to a cheaper model universally. It is to build a tiered strategy that matches model capability to task complexity — and to use caching aggressively. AACFlow gives you the Router block and the caching configuration to do this on the visual canvas without custom code.
The price spread between the cheapest viable model (Gemini 2.5 Flash at $0.15/$0.60) and Claude Opus 4 ($15.00/$75.00) is 100x on input and 125x on output. Using Opus for every task is the equivalent of hiring a senior partner to file expense reports.
Anthropic's prompt cache stores the computed key-value representation of your prompt prefix on Anthropic's servers. If the next request uses the same prefix, Anthropic returns the cached computation instead of recomputing it. Cached tokens cost $0.30/M instead of $3.00/M for Claude Sonnet 4 — a 90% reduction on the cached portion.
This is highly valuable for workflows where a large, stable system prompt accompanies every request. A document processing workflow that sends a 2,000-token system prompt with every document pays full price without caching. With caching enabled, the system prompt costs $0.30/M instead of $3.00/M.
In AACFlow: enable prompt caching in the AI Agent block configuration panel under Advanced Settings → Caching. AACFlow automatically structures the API request to include the cache_control header at the system prompt boundary. No manual API implementation required.
The principle: use the cheapest model capable of the task at each step. Reserve expensive models for steps where quality difference is measurable and matters to the business outcome.
Gemini 2.5 Flash or Llama 4 Scout. Tasks: sentiment classification, topic labelling, intent detection, document type identification, language detection, spam filtering. These are binary or small-N categorical decisions. A cheap model gets them right 95%+ of the time, which is good enough for routing.
GPT-4.1 mini, DeepSeek V3, Claude Haiku. Tasks: entity extraction, JSON generation from unstructured text, summarisation, translation, field population. The output is constrained (a schema), so quality is easier to validate. DeepSeek V3 is particularly strong here at a fraction of the cost of Claude Haiku.
GPT-4.1, Claude Sonnet, Gemini 2.5 Pro. Tasks: long-form content generation, complex analysis, code writing, customer-facing responses that must be polished. Use these only when the output is customer-facing or drives a consequential decision.
Claude Opus 4. Use for: legal analysis, medical triage support, high-stakes architectural decisions, rare cases where quality justifies the cost. If you are running more than 100 Opus calls per hour, audit whether they are all necessary.
The Router block in AACFlow evaluates a condition and sends the workflow to one of N branches. Use it to select the model tier based on task properties detected in an earlier step.
A typical document processing workflow:
Classifier block (Gemini 2.5 Flash): reads document metadata and first 500 tokens, outputs complexity: low | medium | high and type: invoice | contract | report | other
medium → Tier 2 path (GPT-4.1 mini with structured output)
high + contract → Tier 3 path (Claude Sonnet with caching)
high + report → Tier 3 path (Gemini 2.5 Pro)
Extraction/generation block: uses the model selected by the router
Validator block (Claude Haiku): checks output quality, routes to a Tier 3 fallback if confidence is below threshold
The router configuration in AACFlow uses a simple condition builder — no code required. Each route is a labelled edge to a different AI Agent block configured with the appropriate provider and model.
A team running an invoice processing workflow came to AACFlow with a single-model architecture: every invoice went through Claude Sonnet 4. Average cost: $0.12 per invoice. At 50,000 invoices per month, that is $6,000/month.
After optimisation with AACFlow:
70% of invoices (simple, standard format): Gemini 2.5 Flash classification + DeepSeek V3 extraction. Cost: $0.004 per invoice.
25% (moderate complexity): GPT-4.1 mini with structured output schema. Cost: $0.018 per invoice.
5% (complex, multi-page, unusual formats): Claude Sonnet with prompt caching. Cost: $0.08 per invoice (caching reduces system prompt cost by 85%).
That is a reduction from $6,000 to $565 per month — a 10.6x improvement — with no degradation in accuracy for the high-volume simple cases and the same quality for complex cases.
Budget caps: set a per-execution USD limit in the workflow settings. If a single run exceeds the cap, AACFlow halts the run and can route to a human reviewer or send a Slack alert.
Token limits: configure max input and output tokens at the AI Agent block level. This prevents runaway costs from unexpectedly large inputs.
Execution analytics: AACFlow's built-in cost analytics show per-run cost, per-model cost breakdown, and trends over time. Identify which workflow steps are consuming disproportionate budget.
Self-hosted models: for very high volume Tier 1 and Tier 2 tasks, AACFlow supports Ollama for local model execution and vLLM for self-hosted inference. At sufficient volume, the compute cost of self-hosted Llama 4 Scout can be lower than the API cost of Gemini 2.5 Flash.
The core principle holds across all of these controls: match model capability to task requirements, cache where inputs are stable, and make cost visible. AACFlow gives you the tooling to implement and monitor this without rebuilding your workflow infrastructure.