Alexandr Chibilyaev on the unified AI API layer that gives AACFlow agents access to 15+ LLM providers through a single interface — with intelligent routing, dynamic credits, and cost optimization built in.
An AI agent is only as smart as the model driving it. But which model? GPT-4o for complex reasoning? Claude Haiku for speed? Gemini Flash for cost efficiency? DeepSeek V3 for code generation? The answer changes based on the task, the budget, and the moment.
Most platforms force you to pick one provider and stick with it. We built something different: a unified AI API layer that gives every AACFlow agent access to 15+ LLM providers through a single interface. Here's how it works.
Each provider has its own API format, authentication scheme, error codes, and pricing model. Building an agent that can use "the best model for the job" would otherwise mean writing and maintaining 15 or more separate API clients. Nobody should have to do that.
The routing logic is the brain of the layer. Given a chat request, the router decides which provider and model should handle it, weighing the following:
Explicit preference. If the workflow specifies "use Claude Opus," the router obeys. Full user control when you need it.
Capability matching. If the request includes tool calls, the router filters to providers that support native tool calling. If the request is a simple text completion, any provider qualifies.
Cost optimization. For bulk, low-stakes tasks (summarize 1,000 support tickets), the router picks the cheapest capable model. For a single high-stakes task (draft a legal contract), it picks the most capable model.
Load balancing. If the primary provider is rate-limited or slow, the router shifts traffic to an equivalent alternative provider automatically.
Fallback chains. Configurable per workspace: "Try Claude Sonnet first. If unavailable or too slow, fall back to GPT-4o. If that fails, use Gemini Pro. If everything fails, queue for retry."
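The rules above could be combined roughly as follows. This is a minimal TypeScript sketch, not AACFlow's actual router: the `ModelInfo` and `ChatRequest` shapes, the `quality` score, and the `healthy` flag are all hypothetical, and the ranked list stands in for the per-workspace fallback chain, which in the real system is configurable.

```typescript
// Hypothetical model catalogue entry; field names and values are illustrative only.
interface ModelInfo {
  id: string;
  supportsTools: boolean;
  costPer1kTokens: number; // made-up relative price
  quality: number;         // made-up capability score, higher is better
  healthy: boolean;        // false when rate-limited, slow, or down
}

interface ChatRequest {
  preferredModel?: string;   // explicit preference from the workflow
  usesTools: boolean;        // request contains tool calls
  priority: "bulk" | "high"; // bulk = cheapest capable, high = most capable
}

// Returns candidate model ids in the order they should be tried.
// An empty array means nothing is usable and the request should be queued for retry.
function route(req: ChatRequest, catalogue: ModelInfo[]): string[] {
  // Capability matching and load balancing: drop models that cannot serve the request now.
  const capable = catalogue.filter(
    (m) => m.healthy && (!req.usesTools || m.supportsTools)
  );

  // Cost optimization: cheapest first for bulk work, most capable first otherwise.
  const ranked = capable.slice().sort((a, b) =>
    req.priority === "bulk"
      ? a.costPer1kTokens - b.costPer1kTokens
      : b.quality - a.quality
  );

  // Explicit preference: move the requested model to the front if it is usable.
  const preferred = ranked.filter((m) => m.id === req.preferredModel);
  const rest = ranked.filter((m) => m.id !== req.preferredModel);

  // The remaining order doubles as the fallback chain.
  return preferred.concat(rest).map((m) => m.id);
}
```

The caller would try the first id and move down the list on failure. Returning the whole ordered list, rather than a single winner, keeps the fallback decision in one place.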
Adding a new provider means writing a ~200-line adapter. The router, credit system, error handling, and observability all work automatically with the new provider once the adapter is registered.
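To make the adapter idea concrete, here is a heavily simplified sketch of what such a contract might look like. Everything in it is an assumption for illustration: the `ProviderAdapter` interface, the `buildRequest` and `parseResponse` method names, the `registerAdapter` helper, and the imaginary "echo" provider with its invented wire format.

```typescript
// Hypothetical normalized shapes shared by every adapter.
interface ChatMessage { role: "system" | "user" | "assistant"; content: string }
interface NormalizedRequest { model: string; messages: ChatMessage[] }
interface NormalizedResponse { content: string; inputTokens: number; outputTokens: number }

// Hypothetical adapter contract: translate to and from one provider's wire format.
interface ProviderAdapter {
  name: string;
  buildRequest(req: NormalizedRequest): { url: string; body: unknown };
  parseResponse(raw: unknown): NormalizedResponse;
}

// Registry the rest of the layer would look adapters up in.
const adapters: Record<string, ProviderAdapter> = {};

function registerAdapter(adapter: ProviderAdapter): void {
  adapters[adapter.name] = adapter;
}

// A toy adapter for an imaginary provider with an invented wire format.
registerAdapter({
  name: "echo",
  buildRequest: (req) => ({
    url: "https://echo.invalid/v1/chat", // placeholder URL, not a real endpoint
    body: {
      model: req.model,
      prompt: req.messages.map((m) => m.content).join("\n"),
    },
  }),
  parseResponse: (raw) => {
    const r = raw as { text: string; usage: { tokensIn: number; tokensOut: number } };
    return { content: r.text, inputTokens: r.usage.tokensIn, outputTokens: r.usage.tokensOut };
  },
});
```

Because routing, billing, and tracing only ever see `NormalizedRequest` and `NormalizedResponse`, none of them need to change when a new adapter is registered.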
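// Assumes `stream` is an async iterable of chunks returned by the API layer (the name is illustrative)
for await (const chunk of stream) {
  // Each chunk is a few tokens, so display it immediately
  console.log(chunk.content)
}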
Not all providers support streaming, and those that do implement it differently. The adapter pattern handles this: providers that support streaming implement buildStreamRequest and parseStreamChunk. For providers that don't, the layer wraps the complete response as a single-chunk stream. The agent code doesn't change.
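The single-chunk fallback can be sketched with an async generator. For brevity this sketch collapses the two adapter methods named above into one hypothetical optional `stream` method; the `StreamCapableAdapter` interface and the `chatStream` and `collect` helpers are likewise assumptions, not the platform's API.

```typescript
interface Chunk { content: string }

// Hypothetical adapter surface: `stream` exists only when the provider streams natively.
interface StreamCapableAdapter {
  complete(prompt: string): Promise<string>;
  stream?(prompt: string): AsyncIterable<Chunk>;
}

// Always yields chunks, so calling code has a single code path.
async function* chatStream(
  adapter: StreamCapableAdapter,
  prompt: string
): AsyncGenerator<Chunk> {
  if (adapter.stream) {
    // Native streaming: pass the provider's chunks straight through.
    yield* adapter.stream(prompt);
  } else {
    // No native streaming: emit the whole response as one chunk.
    yield { content: await adapter.complete(prompt) };
  }
}

// Small helper that drains a stream into an array of chunk contents.
async function collect(stream: AsyncIterable<Chunk>): Promise<string[]> {
  const out: string[] = [];
  for await (const chunk of stream) out.push(chunk.content);
  return out;
}
```

A consumer written as a `for await` loop over `chatStream(...)` behaves the same for both kinds of provider; the only visible difference is how many chunks arrive.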
Some users — especially enterprises — want to use their own API keys. They have negotiated rates with OpenAI, or compliance requirements that prohibit third-party key usage.
Our BYOK (bring your own key) model supports this. Users can:
Bring their own keys — the platform uses the user's keys for API calls
Use AACFlow credits — the platform's keys, managed credits
Mix and match — some providers via BYOK, others via AACFlow credits
When BYOK is active, the routing layer injects the user's key instead of the platform key. Credit deductions are skipped for BYOK calls. But all other platform features — routing, fallback, observability, error handling — still work.
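A rough sketch of that key-selection step, under stated assumptions: the `Workspace` shape, the `planCall` and `settle` helpers, and the placeholder platform keys are all hypothetical and only illustrate the rule that BYOK calls use the user's key and skip credit deduction.

```typescript
interface Workspace {
  byokKeys: Record<string, string | undefined>; // user-supplied keys, per provider
  credits: number;                              // remaining platform credits
}

interface CallPlan {
  apiKey: string;
  billedToCredits: boolean;
}

// Placeholder platform keys; a real system would read these from a secret store.
const PLATFORM_KEYS: Record<string, string | undefined> = {
  openai: "platform-openai-key",
  anthropic: "platform-anthropic-key",
};

// Decide which key a call uses and whether it is charged against credits.
function planCall(ws: Workspace, provider: string): CallPlan {
  const userKey = ws.byokKeys[provider];
  if (userKey) return { apiKey: userKey, billedToCredits: false };

  const platformKey = PLATFORM_KEYS[provider];
  if (!platformKey) throw new Error(`No key available for provider "${provider}"`);
  return { apiKey: platformKey, billedToCredits: true };
}

// Returns the remaining credits after a call; BYOK calls are not charged.
function settle(ws: Workspace, plan: CallPlan, cost: number): number {
  return plan.billedToCredits ? ws.credits - cost : ws.credits;
}
```

Because the decision is made per provider, the mix-and-match case needs no special handling: a workspace with an OpenAI key but no Anthropic key is simply billed for the latter.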
The business model works because we provide value beyond token resale: the visual editor, the DAG executor, the collaboration, the observability. We win by being the best place to run agents, not just by being a middleman for tokens.
What the request and response were (for debugging)
What routing decision was made and why
How many credits were deducted
This data powers the execution traces in the workflow viewer. When an agent makes a wrong decision, you can trace back to the exact LLM call, inspect the prompt and the response, and determine whether the cause was a bad model choice, a bad prompt, or a routing mistake.
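One way to picture such a trace record is the sketch below. The `LlmCallTrace` shape and its field names are hypothetical; they cover only the items listed above (request and response, routing decision and reason, credits deducted).

```typescript
// Hypothetical per-call trace record; field names are illustrative.
interface LlmCallTrace {
  callId: string;
  request: { model: string; prompt: string };
  response: { content: string };
  routing: { chosen: string; reason: string };
  creditsDeducted: number;
}

// Look up the trace for one call within a workflow execution.
function findCall(traces: LlmCallTrace[], callId: string): LlmCallTrace | undefined {
  return traces.find((t) => t.callId === callId);
}

// One-line summary of the kind a trace viewer might show next to a step.
function describeRouting(t: LlmCallTrace): string {
  return `${t.callId}: routed to ${t.routing.chosen} (${t.routing.reason}), ${t.creditsDeducted} credits`;
}
```

Storing the routing reason next to the prompt and response is what lets a reader separate a bad prompt from a bad routing choice after the fact.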
Provider APIs change constantly. New models, deprecated endpoints, pricing updates, rate limit adjustments. We run automated tests against every provider every hour to catch breaking changes before users do.
Latency varies wildly. The same model from the same provider can respond in 200ms or 20 seconds. Our routing layer tracks latency percentiles per provider and factors them into routing decisions.
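Per-provider latency percentiles can be tracked with a small sliding window. The sketch below is illustrative only: the `LatencyTracker` class, the window size of 100, the nearest-rank percentile method, and the `pickFastest` helper are assumptions, not a description of the production router.

```typescript
// Keeps the most recent latency samples per provider.
class LatencyTracker {
  private samples: Record<string, number[] | undefined> = {};
  private windowSize: number;

  constructor(windowSize: number = 100) {
    this.windowSize = windowSize; // arbitrary default for illustration
  }

  record(provider: string, ms: number): void {
    let list = this.samples[provider];
    if (!list) {
      list = [];
      this.samples[provider] = list;
    }
    list.push(ms);
    if (list.length > this.windowSize) list.shift(); // drop the oldest sample
  }

  // Nearest-rank percentile: the smallest sample with at least p% of samples at or below it.
  percentile(provider: string, p: number): number | undefined {
    const list = this.samples[provider];
    if (!list || list.length === 0) return undefined;
    const sorted = list.slice().sort((a, b) => a - b);
    const rank = Math.max(1, Math.ceil((p / 100) * sorted.length));
    return sorted[rank - 1];
  }
}

// Pick the provider with the lowest latency at the given percentile.
function pickFastest(
  tracker: LatencyTracker,
  providers: string[],
  p: number = 95
): string | undefined {
  let best: string | undefined;
  let bestMs = Infinity;
  for (const name of providers) {
    const ms = tracker.percentile(name, p);
    if (ms !== undefined && ms < bestMs) {
      best = name;
      bestMs = ms;
    }
  }
  return best;
}
```

Ranking on a tail percentile rather than the median matters here: a provider that is usually fast but occasionally takes 20 seconds can win on the median and still lose on p95.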
Token counting is imprecise. Every provider counts tokens differently. Our credit system uses provider-reported token counts when available, with fallback to our own tokenizer estimates. The difference is small enough that it averages out over thousands of calls.
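The "provider-reported first, estimate otherwise" rule might look like the sketch below. All names are hypothetical, and the four-characters-per-token heuristic is a crude stand-in for a real tokenizer, used here only so the example is self-contained.

```typescript
interface Usage { inputTokens: number; outputTokens: number }
interface ResolvedUsage extends Usage { estimated: boolean }

// Crude stand-in for a tokenizer: roughly 4 characters per token (an assumption).
function estimateTokens(text: string): number {
  return Math.ceil(text.length / 4);
}

// Prefer the provider's own counts; fall back to a local estimate when they are missing.
function resolveUsage(prompt: string, completion: string, reported?: Usage): ResolvedUsage {
  if (reported) {
    return {
      inputTokens: reported.inputTokens,
      outputTokens: reported.outputTokens,
      estimated: false,
    };
  }
  return {
    inputTokens: estimateTokens(prompt),
    outputTokens: estimateTokens(completion),
    estimated: true,
  };
}

// Convert token usage into credits at a flat, made-up rate per 1,000 tokens.
function creditsFor(usage: Usage, creditsPer1kTokens: number): number {
  return ((usage.inputTokens + usage.outputTokens) / 1000) * creditsPer1kTokens;
}
```

Keeping the `estimated` flag on the result makes it possible to measure later how far the estimates drift from provider-reported counts.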
Fallback is a superpower. During the March 2025 OpenAI outage, AACFlow agents continued working because the router automatically shifted to Anthropic and Google. Users didn't notice. That's the level of reliability production AI infrastructure requires.
Model performance benchmarking — automated quality evaluation so the router can pick not just the cheapest model, but the best-performing one for each task type
Fine-tuned model hosting — support for user-provided fine-tuned models alongside the standard providers
Multi-model ensembles — send the same request to multiple providers and select the best response
Cost forecasting — predict monthly spend based on usage patterns before the bill arrives
The AI API layer is the nervous system of every AACFlow agent. It's invisible when it works — and that's the point.