Alexandr Chibilyaev on the testing infrastructure of AACFlow — unit tests for blocks and connectors, integration tests with mocked API responses, Playwright E2E tests, automated health checks, and the philosophy that keeps 60,000+ developers running reliably.
Every month, AACFlow executes over 10 million agent workflows. Each workflow chains together blocks — AI calls, API integrations, logic gates, data transformations — across 170+ connectors to systems ranging from OpenAI to 1C to Chestny Znak. If a single block behaves incorrectly, the agent fails. If a connector's authentication breaks, the knowledge base goes stale. If the visual editor renders a node in the wrong position, the user's workflow becomes unreadable.
Testing this isn't optional. It's existential.
Here's the testing infrastructure that keeps 60,000+ developers running without losing their minds — or ours.
First, let's define what needs testing. AACFlow's testing surface is large:
213+ blocks. Each block is a self-contained unit of execution: an LLM chat call, an HTTP request, a database query, a string transformation, a conditional branch. Each block has configuration parameters, input/output schemas, error handling, and retry logic. Each must work in isolation and in composition with any other block.
170+ connectors. Each connector integrates with an external API: AmoCRM, Wildberries, 1C, PostgreSQL, Gmail, Telegram, and 165 others. Each connector handles authentication (OAuth 2.0, API keys, cryptographic signatures), data extraction, pagination, rate limiting, error recovery, and tag mapping from external data formats to our unified knowledge base schema.
The DAG executor. The execution engine that compiles a visual graph into a directed acyclic graph, schedules nodes for parallel execution, handles state persistence, resume-from-failure, and idempotency. This is the hardest component to test — its behavior depends on timing, network conditions, and the complex interactions of hundreds of blocks.
The visual editor. ReactFlow-based canvas with drag-and-drop, real-time multiplayer via Socket.IO, node configuration panels, connection validation, and sub-workflow nesting. UI testing here means testing not just rendering but user interaction sequences.
The AI API layer. Provider routing, token counting, credit deduction, streaming, fallback chains, and the 15+ provider adapters that translate between our unified format and each provider's native API. Every provider update is a potential breaking change.
Authentication and authorization. Better Auth with SSO (SAML/OIDC), SCIM provisioning, RBAC with custom roles, workspace isolation, and API key management. Security testing isn't optional — a broken permission check means data leakage.
Internationalization. 1,000+ translation keys across three locales. Missing keys, broken ICU syntax, and locale-specific formatting errors must be caught before they reach users.
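To make the internationalization item concrete, here is a minimal sketch of a key parity check. It assumes each locale is a flat map of translation keys to messages; the type names, locale codes, and keys are hypothetical and are not taken from the AACFlow codebase.

```typescript
// Hypothetical shape: each locale is a flat map of translation key -> message.
type LocaleMessages = Record<string, string>;

interface ParityReport {
  locale: string;
  missing: string[]; // keys in the reference locale that this locale lacks
  extra: string[]; // keys in this locale that the reference locale lacks
}

// Compare every locale against a reference locale and report key differences.
function checkKeyParity(
  reference: LocaleMessages,
  locales: Record<string, LocaleMessages>,
): ParityReport[] {
  const referenceKeys = new Set(Object.keys(reference));
  return Object.keys(locales).map((locale) => {
    const localeKeys = new Set(Object.keys(locales[locale]));
    return {
      locale,
      missing: Array.from(referenceKeys).filter((key) => !localeKeys.has(key)).sort(),
      extra: Array.from(localeKeys).filter((key) => !referenceKeys.has(key)).sort(),
    };
  });
}

// Usage with made-up keys: "ru" lacks one key, "de" carries a stale one.
const englishMessages: LocaleMessages = { "editor.save": "Save", "editor.run": "Run" };
const parityReports = checkKeyParity(englishMessages, {
  ru: { "editor.save": "Сохранить" },
  de: { "editor.save": "Speichern", "editor.run": "Ausführen", "editor.old": "Alt" },
});
```

A check like this catches missing and extra keys only. Broken ICU syntax needs a separate step that actually parses each message.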
Testing all of this comprehensively isn't about writing "enough" tests. It's about testing the right things, at the right level, with the right tools.
Unit tests verify individual functions, classes, and modules in isolation. They're fast (milliseconds), deterministic, and run on every push. They're the first line of defense.
Block unit tests. Every block has a test file that verifies:
Input validation. Does the block reject malformed input with a clear error message? Does it accept valid input and execute?
Output schema. When given known input, does the block produce output matching its declared schema?
Error handling. When the underlying operation fails (network error, rate limit, invalid API response), does the block throw the correct error type with context?
Retry logic. When configured with retries, does the block retry the correct number of times with the correct backoff strategy?
Parameter resolution. When a parameter references a variable from a previous node ({{node_45.output.email}}), does the resolver correctly evaluate the reference?
A single block test is compact — often 30-50 lines. But with 213 blocks, the total is significant. Each block test runs in under 10ms. The full block test suite completes in under 5 seconds.
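The parameter resolution check is the easiest of these to picture in code. The sketch below is an assumption about how a resolver for references like {{node_45.output.email}} could work, not the AACFlow implementation; `lookupPath`, `resolveParameter`, and the shape of the outputs object are hypothetical.

```typescript
// Hypothetical context: outputs of already-executed nodes, keyed by node id.
type NodeOutputs = Record<string, unknown>;

// Walk a dotted path such as "node_45.output.email" through nested objects.
function lookupPath(root: unknown, path: string): unknown {
  let current: unknown = root;
  for (const segment of path.split(".")) {
    const isObject = typeof current === "object" && current !== null;
    if (!isObject || !Object.prototype.hasOwnProperty.call(current, segment)) {
      throw new Error(`Unresolved reference {{${path}}}: missing "${segment}"`);
    }
    current = (current as Record<string, unknown>)[segment];
  }
  return current;
}

// Replace every {{path}} placeholder in a template with the referenced value.
function resolveParameter(template: string, outputs: NodeOutputs): string {
  return template.replace(/\{\{\s*([\w.]+)\s*\}\}/g, (_match: string, path: string) =>
    String(lookupPath(outputs, path)),
  );
}

// Usage: one upstream node has produced an email address.
const sampleOutputs: NodeOutputs = {
  node_45: { output: { email: "user@example.com" } },
};
const resolvedParameter = resolveParameter("Send to {{node_45.output.email}}", sampleOutputs);
// resolvedParameter === "Send to user@example.com"
```

This version always produces a string. A resolver that must pass numbers or objects between blocks would need to preserve types when the whole parameter is a single reference.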
Connector tag mapping tests. Every connector extracts structured data from an external API and maps it to standardized tags in the knowledge base. The mapping logic is where most connector bugs live: a field that's present in the API's "v2" response but missing in "v1," a date format that differs between sandbox and production, a nested object that needs flattening.
    expect(tags).toHaveLength(1) // Only the required fields
  })

  it('handles missing fields without throwing', () => {
    const mockContact = { id: 12345 } // Minimal payload

    const tags = mapTags(mockContact, 'contact')

    expect(tags).toBeDefined()
    expect(Array.isArray(tags)).toBe(true)
  })
})
The mapTags.test.ts pattern is standardized across all 170+ connectors. The test file structure, mock generation, and assertions follow the same template. This consistency means any engineer can jump into any connector's tests and understand them immediately. It also means we can generate the test scaffold for new connectors automatically.
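For readers who have not seen a tag mapper, the following is a rough sketch of the kind of function these tests exercise. The tag shape follows the { id, displayName, value } objects that appear in the tests; the field table, `readPath`, and `mapContactTags` are hypothetical and only illustrate flattening a nested object while tolerating missing fields.

```typescript
// Assumed tag shape, modeled on the { id, displayName, value } objects in the tests.
interface KnowledgeTag {
  id: string;
  displayName: string;
  value: string;
}

// Hypothetical field table for a contact: source path -> tag metadata.
const CONTACT_FIELDS: Array<{ path: string[]; id: string; displayName: string }> = [
  { path: ["id"], id: "contact_id", displayName: "Contact ID" },
  { path: ["name"], id: "name", displayName: "Name" },
  { path: ["company", "name"], id: "company", displayName: "Company" },
];

// Read a nested value, returning undefined instead of throwing on a missing level.
function readPath(source: unknown, path: string[]): unknown {
  let current: unknown = source;
  for (const key of path) {
    if (typeof current !== "object" || current === null) return undefined;
    current = (current as Record<string, unknown>)[key];
  }
  return current;
}

// Map a raw API payload to tags, skipping fields that are absent or empty.
function mapContactTags(payload: unknown): KnowledgeTag[] {
  const tags: KnowledgeTag[] = [];
  for (const field of CONTACT_FIELDS) {
    const raw = readPath(payload, field.path);
    if (raw === undefined || raw === null || raw === "") continue;
    tags.push({ id: field.id, displayName: field.displayName, value: String(raw) });
  }
  return tags;
}

// Usage: a minimal payload yields one tag, a nested one is flattened.
const minimalContactTags = mapContactTags({ id: 12345 });
const nestedContactTags = mapContactTags({ id: 7, name: "", company: { name: "Acme" } });
```

Driving the mapping from a table keeps each connector's differences in data rather than in control flow, which is one way a shared test template can stay identical across connectors.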
Integration tests verify that components work together correctly. A block calling a real (or realistically mocked) external API. A connector authenticating against a sandbox environment. The executor running a multi-node workflow end-to-end.
Mocked API responses. We don't call live third-party APIs in our CI pipeline. Doing so would be unreliable (APIs go down), expensive (some APIs charge per request), and slow (network latency adds up). Instead, we use nock (HTTP interception) and factory-generated mock responses:
  const result = await gmailConnector.fetch({ query: 'from:orders@example.com' })

  expect(result.documents).toHaveLength(1)
  expect(result.documents[0].tags).toContainEqual({
    id: 'subject',
    displayName: 'Subject',
    value: 'Order Confirmation #12345',
  })
})
The nock interceptor captures the HTTP request. If the connector makes a request that doesn't match any interceptor, the test fails — catching both incorrect URLs and unexpected API calls.
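The fail-on-unmatched behavior described above can be illustrated without any library. The class below is a dependency-free stand-in and is not nock's API; `StrictInterceptor`, its methods, and the example URL are hypothetical.

```typescript
// A stand-in for an HTTP interception layer. This is NOT nock's API.
interface StubbedReply {
  statusCode: number;
  body: unknown;
}

class StrictInterceptor {
  private replies = new Map<string, StubbedReply>();
  private used = new Set<string>();

  // Register the reply for one method + URL pair.
  stub(method: string, url: string, reply: StubbedReply): void {
    this.replies.set(`${method.toUpperCase()} ${url}`, reply);
  }

  // Called by the code under test instead of the real network.
  request(method: string, url: string): StubbedReply {
    const key = `${method.toUpperCase()} ${url}`;
    const reply = this.replies.get(key);
    if (reply === undefined) {
      // An unexpected call is a test failure, not a silent passthrough.
      throw new Error(`Unexpected request: ${key}`);
    }
    this.used.add(key);
    return reply;
  }

  // Stubs that were registered but never requested.
  pending(): string[] {
    return Array.from(this.replies.keys()).filter((key) => !this.used.has(key));
  }
}

// Usage: one stubbed endpoint, requested once.
const strictInterceptor = new StrictInterceptor();
strictInterceptor.stub("get", "https://api.example.com/messages", {
  statusCode: 200,
  body: { messages: [] },
});
const stubbedReply = strictInterceptor.request("GET", "https://api.example.com/messages");
```

Checking `pending()` at the end of a test covers the opposite failure: a connector that silently stops making a call it is supposed to make.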
Sandbox environments. For critical connectors (Stripe, AmoCRM, 1C), we maintain sandbox environment credentials and run a nightly integration suite against the actual sandbox APIs. These tests catch provider-side changes — a deprecated endpoint, a renamed field, a new required parameter — before they affect production users.
Nightly sandbox tests are slower (5-10 minutes) and occasionally flaky (sandbox environments have their own reliability issues), so they don't block PR merges. They run separately and report failures to the engineering channel for triage.
E2E tests simulate real user behavior: opening the browser, navigating to the dashboard, creating a workflow, connecting blocks, executing, and verifying the result. We use Playwright for E2E testing.
  await expect(page.locator('[data-testid="node-output"]')).toContainText('Hello from E2E test')
})
E2E tests are the slowest (1-3 minutes each) and the most expensive to maintain (UI changes break selectors). We keep the E2E suite focused on critical user paths: workflow creation and execution, authentication flows (login, SSO, password reset), billing (upgrade, downgrade, invoice download), and workspace management (create, invite, remove member).
For everything else — block configuration variants, connector parameter combinations, edge cases — we rely on unit and integration tests. The E2E suite answers one question: "Can a user accomplish the core tasks?" Everything else is tested faster and more reliably at lower levels.
   ├── Next.js build: verifies no compilation errors
   └── Bundle size analysis: warns on regressions

5. E2E Tests (8m)
   ├── Playwright: critical user paths
   └── Screenshot diff on failure for debugging

6. i18n Parity Check (30s)
   └── sync-i18n.py: no missing or extra keys

Total pipeline: ~15 minutes
Turborepo caches unchanged packages. If you only modify a connector, only that connector's tests run — not the full 170-connector suite. This keeps pipeline times manageable despite the large test surface.
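The selection idea behind this can be sketched as a walk over the package dependency graph. This is only an illustration of which packages need to re-run after a change; it is not how Turborepo is implemented (Turborepo decides through hashed task inputs), and the package names are made up.

```typescript
// Hypothetical dependency map: package -> internal packages it depends on.
type DependencyGraph = Record<string, string[]>;

// Return the changed packages plus everything that depends on them,
// directly or transitively. Everything else can be served from cache.
function affectedPackages(graph: DependencyGraph, changed: string[]): string[] {
  const affected = new Set<string>(changed);
  let grew = true;
  while (grew) {
    grew = false;
    for (const pkg of Object.keys(graph)) {
      if (!affected.has(pkg) && graph[pkg].some((dep) => affected.has(dep))) {
        affected.add(pkg);
        grew = true;
      }
    }
  }
  return Array.from(affected).sort();
}

// Usage: two connectors share a base package, and the app depends on both.
const monorepoGraph: DependencyGraph = {
  "connector-base": [],
  "connector-gmail": ["connector-base"],
  "connector-stripe": ["connector-base"],
  app: ["connector-gmail", "connector-stripe"],
};
const afterGmailChange = affectedPackages(monorepoGraph, ["connector-gmail"]);
// afterGmailChange is ["app", "connector-gmail"]; connector-stripe is untouched.
```

Under this model a change to a single connector leaves the other connectors cached, while a change to a shared base package invalidates all of them.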
The pipeline is not optional. No branch merges without passing CI. No exceptions for "trivial changes." A one-line CSS fix can break the build if it introduces a TypeScript error. The pipeline is the gatekeeper — and the gatekeeper doesn't make exceptions.
The AI API layer integrates with 15+ LLM providers: OpenAI, Anthropic, Google, DeepSeek, Groq, Together, Fireworks, Mistral, GigaChat, YandexGPT, and others. These providers change their APIs constantly — new models, deprecated endpoints, modified response formats, adjusted rate limits.
We cannot wait for a user to report "my agent stopped working" to discover that OpenAI changed their error response format.
So we run automated health checks every hour against every provider:
for (const provider of configuredProviders) {
  try {
    const response = await provider.chat({
      messages: [{ role: 'user', content: 'Reply with only the word: healthy' }],
    // ...
    alertEngineering(`Provider ${provider.id} is down: ${error.message}`)
  }
}
The health check sends a minimal prompt to each provider and verifies the response. If a provider is down, degraded, or returning unexpected responses, the engineering team is alerted within minutes of the failing check, not hours later when users start filing tickets.
We also track provider latency and error rates over time. When a provider's p95 latency doubles, we investigate. When their error rate crosses 1%, we route traffic away. The health check data feeds into the routing layer's provider selection logic.
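The two thresholds in the paragraph above translate into a small decision function. The sketch below is an assumption about how such a rule could be written, using a nearest-rank p95; the sample shape, the decision names, and the idea of passing in a baseline are hypothetical.

```typescript
interface HealthSample {
  latencyMs: number;
  ok: boolean;
}

// Nearest-rank percentile: the smallest value with at least p% of samples at or below it.
function percentile(values: number[], p: number): number {
  if (values.length === 0) throw new Error("percentile of an empty sample set");
  const sorted = [...values].sort((a, b) => a - b);
  const rank = Math.max(1, Math.ceil((p * sorted.length) / 100));
  return sorted[rank - 1];
}

type RoutingDecision = "healthy" | "investigate" | "route_away";

// Thresholds follow the text: error rate above 1% routes traffic away,
// p95 latency at twice the baseline triggers an investigation.
function evaluateProvider(samples: HealthSample[], baselineP95Ms: number): RoutingDecision {
  if (samples.length === 0) throw new Error("no health samples to evaluate");
  const errorRate = samples.filter((sample) => !sample.ok).length / samples.length;
  if (errorRate > 0.01) return "route_away";
  const p95 = percentile(samples.map((sample) => sample.latencyMs), 95);
  if (p95 >= 2 * baselineP95Ms) return "investigate";
  return "healthy";
}
```

Error rate is checked first on the assumption that failed requests matter more than slow ones; a real routing layer would likely also smooth these numbers over a time window.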
Reusable test infrastructure lives in the @aacflow/testing package — a shared library of factories, builders, mocks, and assertions used across the entire codebase:
Factories. createMockBlock(), createMockConnector(), createMockWorkflow(), createMockExecution(), createMockUser(), createMockWorkspace(). Each factory generates a realistic, valid object with sensible defaults. Tests override only the fields they care about.
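A factory with overrides usually comes down to defaults plus an object spread. The sketch below shows the pattern for createMockUser(); the user fields are hypothetical, since the real shapes in @aacflow/testing are not shown here.

```typescript
// Hypothetical user shape; the real fields in @aacflow/testing may differ.
interface MockUser {
  id: string;
  email: string;
  role: "admin" | "member" | "viewer";
  workspaceId: string;
}

let mockUserCounter = 0;

// Build a valid user with defaults; callers override only what the test cares about.
function createMockUser(overrides: Partial<MockUser> = {}): MockUser {
  mockUserCounter += 1;
  return {
    id: `user_${mockUserCounter}`,
    email: `user${mockUserCounter}@example.com`,
    role: "member",
    workspaceId: "workspace_1",
    ...overrides, // later keys win, so overrides replace the defaults
  };
}

// A permission test only needs to state the role.
const viewerUser = createMockUser({ role: "viewer" });
```

Because every other field is filled in, the test reads as a statement of what matters ("this user is a viewer") rather than a wall of setup.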
Mocks. Pre-built nock interceptors for every connector: mockGmailApi(), mockAmoCrmApi(), mockStripeApi(), mockWildberriesApi(). Each mock returns realistic data in the connector's expected format, with configurable overrides for testing edge cases.
Assertions. Custom Vitest matchers for common verification patterns: expect(block).toHaveValidOutputSchema(), expect(connector).toMapTagsCorrectly(), expect(execution).toHaveTraceEvent(eventType).
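A custom matcher is a function that returns a pass flag and a message. The sketch below shows what the body of toHaveValidOutputSchema() might look like under a deliberately simple assumption: the schema maps field names to primitive type names. The block shape is hypothetical, and the function is written without importing Vitest so it can be read on its own.

```typescript
// Result shape that a matcher returns to the test runner.
interface MatcherResult {
  pass: boolean;
  message: () => string;
}

// Hypothetical block shape: a declared schema (field -> typeof name) and an actual output.
interface BlockUnderTest {
  outputSchema: Record<string, "string" | "number" | "boolean">;
  output: Record<string, unknown>;
}

// Passes when every declared field exists with the declared primitive type.
function toHaveValidOutputSchema(block: BlockUnderTest): MatcherResult {
  const problems: string[] = [];
  for (const field of Object.keys(block.outputSchema)) {
    const expected = block.outputSchema[field];
    const actual = typeof block.output[field];
    if (actual !== expected) {
      problems.push(`${field}: expected ${expected}, got ${actual}`);
    }
  }
  return {
    pass: problems.length === 0,
    message: () =>
      problems.length === 0
        ? "expected output not to match its declared schema"
        : `output does not match declared schema (${problems.join("; ")})`,
  };
}

// Usage: one field has the wrong type.
const schemaCheck = toHaveValidOutputSchema({
  outputSchema: { email: "string", attempts: "number" },
  output: { email: "user@example.com", attempts: "3" },
});
```

In a Vitest setup file, a function of this shape would typically be registered with expect.extend({ toHaveValidOutputSchema }), after which expect(block).toHaveValidOutputSchema() becomes available in tests.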
The testing package is the backbone that makes writing new tests fast. When adding a new connector, the developer imports the base connector test template, wires in the mock API responses, and writes 5-10 specific assertions for that connector's unique behavior. The rest is handled by shared infrastructure.
Our codebase has one hard rule about test mocks: never use vi.doMock with dynamic imports.
vi.doMock + dynamic import() creates a dependency on module evaluation order. Tests pass or fail depending on whether the mock was registered before or after the module was imported — a behavior that varies between test runners, test file ordering, and even individual test runs.
Instead, we use static mocks at the top of the test file:
// CORRECT: Static mock at module level
vi.mock('@/lib/auth', () => ({
  getCurrentUser: vi.fn(),
  requireAuth: vi.fn(),
}))

// WRONG: Dynamic mock inside a test
it('does something', async () => {
  vi.doMock('@/lib/auth', () => ({...}))
  const { doSomething } = await import('./module')
})
Static mocks are hoisted by Vitest to the top of the file, ensuring they're registered before any imports execute. They're predictable, debuggable, and never produce "it passes on my machine" bugs.
This rule is enforced by linting. Any vi.doMock in the codebase triggers a CI failure with a message linking to the testing AGENTS.md that explains the rule and the correct pattern.
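The article does not show the lint rule itself, so the following is only a sketch of the simplest possible check: scan test sources for vi.doMock( and report each occurrence. The function and type names are hypothetical, and the scan is deliberately naive (it would also flag the call inside a comment or string).

```typescript
interface DoMockViolation {
  file: string;
  line: number; // 1-based line number
  text: string;
}

// Scan source text for vi.doMock( calls, one line at a time.
function findDoMockCalls(file: string, source: string): DoMockViolation[] {
  const violations: DoMockViolation[] = [];
  source.split("\n").forEach((text, index) => {
    if (/\bvi\s*\.\s*doMock\s*\(/.test(text)) {
      violations.push({ file, line: index + 1, text: text.trim() });
    }
  });
  return violations;
}

// Usage: the static mock on line 1 is fine, the dynamic one on line 4 is reported.
const sampleTestFile = [
  "vi.mock('@/lib/auth', () => ({ getCurrentUser: vi.fn() }))",
  "",
  "it('does something', async () => {",
  "  vi.doMock('@/lib/auth', () => ({}))",
  "})",
].join("\n");
const doMockViolations = findDoMockCalls("auth.test.ts", sampleTestFile);
```

In an ESLint-based setup, the built-in no-restricted-properties rule can express the same restriction on the vi object's doMock property and attach a custom message, which avoids the false positives of a text scan.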
Coverage percentage is a vanity metric. 100% line coverage with tests that assert "the function was called" tells you nothing about whether the system works correctly.
Our coverage philosophy:
Test observable behavior. Does the block produce the correct output for a given input? Does the connector handle a 429 rate limit response by retrying with backoff? Does the executor resume a paused workflow from the correct node? These are behavioral assertions — they verify that the system does what it's supposed to do.
Don't test internal implementation. Testing that a function calls fetch() with specific arguments is brittle. If we change from fetch() to axios, the test breaks even though the behavior is identical. Instead, we mock at the network boundary (with nock) and test that the connector produces correct output from a given HTTP response. The implementation details of how it makes the HTTP call are irrelevant.
Test error paths as thoroughly as success paths. Most bugs live in error handling. When an API returns a 500 with an HTML error page instead of JSON, does the connector handle it gracefully? When the database connection pool is exhausted, does the executor retry with backoff or crash? When a translation key is missing for a locale, does the UI show a clear error or render [object Object]?
Coverage threshold is a floor, not a ceiling. We enforce 80% line coverage. Below that, the build fails. But the threshold is a safety net, not a target. The goal isn't 80% — it's "every behavior that matters has a test that would catch a regression."
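The HTML-error-page case is a good example of an error path worth pinning down with a behavioral test. Below is a minimal sketch of a response parser that never throws on a non-JSON body; `parseConnectorReply`, the result type, and the retry rule (429 and 5xx are retryable) are assumptions for illustration.

```typescript
type ParsedReply =
  | { kind: "ok"; data: unknown }
  | { kind: "error"; statusCode: number; retryable: boolean; detail: string };

// Turn a raw HTTP reply into a typed result without throwing on a non-JSON body.
function parseConnectorReply(statusCode: number, body: string): ParsedReply {
  let data: unknown = undefined;
  let isJson = true;
  try {
    data = JSON.parse(body);
  } catch {
    isJson = false; // for example, an HTML error page from a proxy
  }
  if (statusCode >= 200 && statusCode < 300 && isJson) {
    return { kind: "ok", data };
  }
  return {
    kind: "error",
    statusCode,
    // Assumed policy: rate limits and server errors are worth retrying.
    retryable: statusCode === 429 || statusCode >= 500,
    detail: isJson ? JSON.stringify(data) : body.slice(0, 80),
  };
}

// Usage: a 500 with an HTML body becomes a typed, retryable error.
const htmlErrorReply = parseConnectorReply(500, "<html><body>Bad Gateway</body></html>");
```

A test against this function asserts on the returned value for a given status and body, so it keeps passing if the HTTP client underneath is swapped out.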
The 60,000+ developers using AACFlow rely on the platform to be stable. But the platform is also evolving rapidly — new blocks, new connectors, new features every month. How do we move fast without breaking things?
The answer is the test suite.
When we add a new connector — say, a new Russian marketplace like KazanExpress — the test infrastructure makes it safe. The connector inherits the standard test template. The mock API responses are generated by the testing package. The tag mapping is verified by the standard mapTags.test.ts pattern. If the new connector accidentally breaks an existing one (unlikely, since connectors are isolated), the full connector test suite catches it within 2 minutes in CI.
When we refactor the DAG executor — changing the scheduling algorithm from depth-first to priority-queue — the executor test suite runs 200+ workflow scenarios and verifies that execution order, parallelism, error handling, and state persistence all work correctly. Without that test suite, the refactor would be terrifying. With it, it's a confident change.
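One invariant such scenarios can check, whatever the scheduling algorithm, is that no node runs before its dependencies. The sketch below groups a workflow into parallel waves with a level-by-level variant of Kahn's algorithm. It is an illustration under assumed names (`WorkflowGraph`, `planWaves`), not the AACFlow executor.

```typescript
// Hypothetical graph encoding: node id -> ids of the nodes it depends on.
type WorkflowGraph = Record<string, string[]>;

// Group nodes into waves: every node in a wave has all of its dependencies
// in earlier waves, so the nodes of one wave can run in parallel.
function planWaves(graph: WorkflowGraph): string[][] {
  const pending = new Map<string, Set<string>>();
  for (const node of Object.keys(graph)) {
    for (const dep of graph[node]) {
      if (!Object.prototype.hasOwnProperty.call(graph, dep)) {
        throw new Error(`Unknown dependency "${dep}" of "${node}"`);
      }
    }
    pending.set(node, new Set(graph[node]));
  }
  const waves: string[][] = [];
  while (pending.size > 0) {
    const ready: string[] = [];
    pending.forEach((deps, node) => {
      if (deps.size === 0) ready.push(node);
    });
    if (ready.length === 0) {
      throw new Error("Cycle detected: the workflow is not a DAG");
    }
    ready.sort(); // deterministic order inside a wave
    for (const node of ready) pending.delete(node);
    pending.forEach((deps) => {
      for (const node of ready) deps.delete(node);
    });
    waves.push(ready);
  }
  return waves;
}

// Usage: two branches fan out from "fetch" and join at "notify".
const sampleWaves = planWaves({
  fetch: [],
  summarize: ["fetch"],
  classify: ["fetch"],
  notify: ["summarize", "classify"],
});
// sampleWaves is [["fetch"], ["classify", "summarize"], ["notify"]]
```

A scenario test can compare a plan like this against the order the executor actually recorded, which stays valid when the scheduler changes from depth-first to priority-queue.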
When we upgrade a dependency — Next.js 15 to 16, or React 19 to 20 — the full pipeline confirms that the build succeeds, the type checker is happy, the unit tests pass, and the E2E tests verify that users can still log in, create workflows, and execute them. A dependency upgrade that would take days of manual QA takes 15 minutes of CI.
Testing is not overhead. Testing is what makes velocity sustainable at scale.
Testing 170+ connectors and 213+ blocks isn't about writing more tests. It's about building the infrastructure — the testing package, the mock generators, the standardized test patterns, the CI pipeline, the automated health checks — that makes writing good tests fast and writing bad tests impossible.
The key decisions that made this work:
Standardize test patterns. Every connector has the same test structure. Every block has the same test template. Consistency reduces cognitive load and enables automation.
Mock at the right boundary. nock intercepts HTTP. vi.mock is static at the module level. Dynamic mocking is forbidden. The mocking strategy is opinionated and enforced, because ambiguity in mocking leads to flaky tests.
Test behavior, not implementation. Tests that verify "the function calls X" break on every refactor. Tests that verify "given input A, the output is B" survive refactors and catch real regressions.
Automate the boring parts. Provider health checks run every hour. i18n parity checks run on every push. Test coverage is enforced in CI. Humans are bad at remembering to do repetitive checks — automation never forgets.
Make testing fast. The unit test suite completes in under 5 seconds. The full CI pipeline completes in 15 minutes. Fast tests mean developers actually run them — and catch issues before they reach code review.
Reliability at scale doesn't come from careful manual QA. It comes from an automated testing infrastructure that catches regressions before they ship — and from a culture that treats testing as a first-class engineering discipline, not an afterthought.