Alexandr Chibilyaev reveals the sync engine powering AACFlow knowledge bases: incremental sync, content hashing, cursor pagination, adaptive rate limiting, parallel execution, and bulletproof retry logic.
A knowledge base with stale data is worse than no knowledge base at all. An AI agent that acts on last week's CRM records will make decisions that cost real money: wrong follow-ups, missed deals, incorrect inventory counts.
The problem isn't "getting data in." It's keeping it fresh across 170+ connectors, each with different APIs, pagination schemes, rate limits, and quirks, without melting your infrastructure.
This is the story of AACFlow's sync engine. It processes millions of documents daily, detects changes with surgical precision, and handles failure with the discipline of a database replication system.
The most important optimization in the engine is incremental sync. Without it, every sync run would re-fetch, re-hash, and re-embed every document, even if 99.9% of them haven't changed.
Connectors that declare supportsIncrementalSync: true receive a lastSyncAt timestamp when listDocuments is called.
The connector filters at the API level, meaning only changed documents travel over the network. A knowledge base with 50,000 Confluence pages where 12 were edited today? Only 12 documents get fetched.
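To make the idea concrete, here is a minimal sketch of an incremental-aware connector. Only supportsIncrementalSync, lastSyncAt, and listDocuments come from the engine described here; the `SourcePage`, `ListOptions`, `FakeSourceApi`, and `ExampleConnector` names are invented for illustration, and the "remote API" is an in-memory stand-in.

```typescript
// Hypothetical shapes: everything except supportsIncrementalSync,
// lastSyncAt and listDocuments is illustrative.
interface SourcePage {
  id: string;
  title: string;
  modifiedAt: Date;
}

interface ListOptions {
  lastSyncAt?: Date; // undefined on the first (full) sync
}

// In-memory stand-in for a remote API that can filter by modification time.
class FakeSourceApi {
  private pages: SourcePage[];

  constructor(pages: SourcePage[]) {
    this.pages = pages;
  }

  search(modifiedAfter?: Date): SourcePage[] {
    if (modifiedAfter === undefined) return this.pages.slice();
    const cutoff = modifiedAfter.getTime();
    return this.pages.filter((p) => p.modifiedAt.getTime() > cutoff);
  }
}

class ExampleConnector {
  readonly supportsIncrementalSync = true;
  private api: FakeSourceApi;

  constructor(api: FakeSourceApi) {
    this.api = api;
  }

  listDocuments(options: ListOptions): SourcePage[] {
    // Push the filter down to the source so unchanged pages never leave it.
    return this.api.search(options.lastSyncAt);
  }
}

const api = new FakeSourceApi([
  { id: "a", title: "Old page", modifiedAt: new Date("2024-01-01T00:00:00Z") },
  { id: "b", title: "Edited today", modifiedAt: new Date("2024-06-02T09:00:00Z") },
]);
const connector = new ExampleConnector(api);
const changed = connector.listDocuments({ lastSyncAt: new Date("2024-06-01T00:00:00Z") });
```

With a lastSyncAt of June 1, only the page edited on June 2 comes back; omitting lastSyncAt produces a full listing.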
For connectors that don't support server-side filtering (and many don't, or implement it unreliably), the engine falls back to content hashing.
Every document stored in the knowledge base carries a contentHash. The sync engine computes it on first sync and stores it alongside the document. On every subsequent sync:
1. Connector returns a document (whether from incremental filter or full listing)
2. Engine computes the hash of the incoming content
3. Engine compares against the stored hash
4. If incomingHash === storedHash, the document is unchanged: skip processing
5. If incomingHash !== storedHash, the document was modified: re-embed and update
The hash function is deliberately fast and collision-resistant. We use SHA-256 on the normalized text content, not the raw API response, which might contain timestamps or metadata that change without the actual content changing.
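The comparison can be sketched in a few lines. This uses Node's built-in crypto module for SHA-256. The normalization rules below (Unicode NFC, unified line endings, collapsed whitespace) are assumptions, since the only hard requirement is that the hash covers normalized text; `normalizeText`, `computeContentHash`, and `classify` are illustrative names.

```typescript
import { createHash } from "node:crypto";

// Assumed normalization rules; the real engine's rules may differ.
function normalizeText(text: string): string {
  return text
    .normalize("NFC")
    .replace(/\r\n?/g, "\n") // unify line endings
    .replace(/[ \t]+/g, " ") // collapse runs of spaces and tabs
    .replace(/ ?\n ?/g, "\n") // drop spaces around newlines
    .trim();
}

function computeContentHash(text: string): string {
  return createHash("sha256").update(normalizeText(text), "utf8").digest("hex");
}

type ChangeStatus = "added" | "updated" | "unchanged";

// storedHash is undefined when the document has never been synced.
function classify(storedHash: string | undefined, incomingText: string): ChangeStatus {
  if (storedHash === undefined) return "added";
  const incomingHash = computeContentHash(incomingText);
  return incomingHash === storedHash ? "unchanged" : "updated";
}
```

Because the hash is taken after normalization, a response that differs only in trailing whitespace or line endings is classified as unchanged and never reaches the embedding step.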
Some sources have millions of documents. Fetching them all in one API call is impossible, and would violate every rate limit on the planet. Every connector's listDocuments uses cursor-based pagination.
The engine runs the pagination loop automatically. The connector just returns { documents, nextCursor, hasMore }. The engine handles:
- Rate limit pauses: if the connector hits a 429, the engine waits and retries
- Progress tracking: the UI shows "Syncing page 12 of ~50..."
- Partial failure recovery: if page 23 fails with a transient error, the engine retries from page 23, not from page 1
The cursor is opaque to the engine. It's an implementation detail of each connector: it could be a page token, an offset, a timestamp, or whatever the source API uses.
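A simplified version of that loop might look like the sketch below. The { documents, nextCursor, hasMore } shape is the connector contract described above; `drainPages`, `fetchPageWithRetry`, and the fake three-page source are illustrative, and the sketch retries immediately rather than waiting between attempts as a real engine would.

```typescript
interface Page<T> {
  documents: T[];
  nextCursor?: string;
  hasMore: boolean;
}

type ListPage<T> = (cursor?: string) => Promise<Page<T>>;

// Retries a single page with the same cursor; earlier pages are untouched.
async function fetchPageWithRetry<T>(
  listPage: ListPage<T>,
  cursor: string | undefined,
  maxAttempts: number,
): Promise<Page<T>> {
  let lastError: unknown;
  for (let attempt = 1; attempt <= maxAttempts; attempt++) {
    try {
      return await listPage(cursor);
    } catch (err) {
      lastError = err; // a real engine would wait (backoff) before retrying
    }
  }
  throw lastError;
}

// Drains every page, passing the opaque cursor back to the connector as-is.
async function drainPages<T>(listPage: ListPage<T>, maxAttempts = 3): Promise<T[]> {
  const all: T[] = [];
  let cursor: string | undefined = undefined;
  let hasMore = true;
  while (hasMore) {
    const page: Page<T> = await fetchPageWithRetry(listPage, cursor, maxAttempts);
    all.push(...page.documents);
    cursor = page.nextCursor;
    hasMore = page.hasMore && cursor !== undefined;
  }
  return all;
}

// Demo source: three pages, with one transient failure on the second page.
const calls: Array<string | undefined> = [];
let failedOnce = false;
const fakeListPage: ListPage<string> = async (cursor) => {
  calls.push(cursor);
  if (cursor === "p2" && !failedOnce) {
    failedOnce = true;
    throw new Error("transient 503");
  }
  if (cursor === undefined) return { documents: ["a", "b"], nextCursor: "p2", hasMore: true };
  if (cursor === "p2") return { documents: ["c"], nextCursor: "p3", hasMore: true };
  return { documents: ["d"], hasMore: false };
};
```

Draining `fakeListPage` requests the first page once, the second page twice (the failure plus the retry), and the third page once. The first page is never re-fetched.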
Here's a problem that bit us early: expensive lookups repeated on every page.
Take the Confluence connector. To extract clean text from Confluence pages, the connector needs to know the schema of labels, spaces, and content types. Fetching that schema on every page of a 500-page sync is 499 redundant API calls.
The solution: syncContext. A mutable object the engine creates at the start of a sync run and passes to every listDocuments and getDocument call. Connectors can stash expensive lookups there:
// Airtable connector using syncContext to cache field name lookups
syncContext lives for the duration of a single sync run and is discarded afterward. No memory leaks. No stale caches between syncs. No global mutable state. Just clean, scoped caching.
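As a rough illustration of the pattern, the sketch below caches a field-name lookup in a per-run context object. The `SyncContext` type, the cache key format, and the `fetchFieldNames` stand-in are assumptions; the only thing taken from the engine is that one mutable object is shared by every call in a run and thrown away afterward.

```typescript
// Hypothetical shape: a plain mutable bag scoped to one sync run.
type SyncContext = Record<string, unknown>;

let schemaFetches = 0;

// In-memory stand-in for the remote schema endpoint.
const remoteSchemas: Record<string, Array<[string, string]>> = {
  tbl1: [["fld1", "Name"], ["fld2", "Status"]],
};

// Stand-in for an expensive schema lookup against the source API.
function fetchFieldNames(tableId: string): Map<string, string> {
  schemaFetches++;
  return new Map(remoteSchemas[tableId] ?? []);
}

// Looks in the context first and only hits the "API" on a cache miss.
function getFieldNames(ctx: SyncContext, tableId: string): Map<string, string> {
  const key = `fieldNames:${tableId}`;
  let cached = ctx[key] as Map<string, string> | undefined;
  if (cached === undefined) {
    cached = fetchFieldNames(tableId);
    ctx[key] = cached;
  }
  return cached;
}

function listDocuments(ctx: SyncContext, tableId: string, pageNumber: number): string[] {
  const names = getFieldNames(ctx, tableId);
  return [`page ${pageNumber}: ${Array.from(names.values()).join(", ")}`];
}

// One sync run: 500 pages share a single context, so the schema is fetched once.
const syncContext: SyncContext = {};
for (let page = 1; page <= 500; page++) {
  listDocuments(syncContext, "tbl1", page);
}
```

The next run starts with a fresh, empty object, so nothing cached here can go stale between syncs.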
Some connectors deal with content that's expensive to fetch in bulk. A CRM might return 10,000 deals in a list endpoint, but each deal's full description, comments, and attachments require a separate API call. Fetching all 10,000 full documents on every sync would be prohibitively slow.
The contentDeferred flag solves this:
    // In listDocuments: return minimal documents with contentDeferred=true
    {
      externalId: 'deal-12345',
      title: 'Enterprise Deal - Company X',
      content: '', // Empty content
      contentHash: '', // Will be computed after fetch
      mimeType: 'text/plain',
      contentDeferred: true, // Engine will fetch full content lazily
      metadata: { price: 500000, status: 'negotiation' }
    }
When the engine sees contentDeferred: true:
1. It stores the document skeleton immediately (title, metadata, tags)
2. For new documents only (not previously stored), it calls getDocument(externalId) to fetch the full content
3. For existing documents, it compares a lightweight hash (often computed from lastModified + key metadata fields), only calling getDocument if the lightweight hash differs
4. Once full content is fetched, it's embedded and the contentDeferred flag is cleared
This is a massive optimization. On an initial sync of 50,000 CRM deals, the engine fetches 50,000 full documents, which is unavoidable. But if only 200 deals have changed by the next incremental sync, the engine makes only 200 getDocument calls. The other 49,800 are skipped based on hash comparison.
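The decision of which documents need a getDocument call can be sketched as a pure function. The exact recipe for the lightweight hash is an assumption (here, lastModified joined with sorted metadata fields), and `ListedDoc`, `lightweightHash`, and `selectDocsToFetch` are illustrative names.

```typescript
// Hypothetical listing entry; lightweightHash is assumed to be derived from
// lastModified plus a few key metadata fields.
interface ListedDoc {
  externalId: string;
  contentDeferred: boolean;
  lightweightHash: string;
}

// Assumed recipe: a stable string built from lastModified and sorted fields.
function lightweightHash(lastModified: string, metadata: Record<string, string | number>): string {
  const fields = Object.keys(metadata)
    .sort()
    .map((key) => `${key}=${metadata[key]}`);
  return [lastModified, ...fields].join("|");
}

// Returns the externalIds whose full content must be fetched via getDocument:
// deferred documents that are new, or whose lightweight hash has changed.
function selectDocsToFetch(listed: ListedDoc[], storedHashes: Map<string, string>): string[] {
  return listed
    .filter((doc) => doc.contentDeferred)
    .filter((doc) => storedHashes.get(doc.externalId) !== doc.lightweightHash)
    .map((doc) => doc.externalId);
}

const stored = new Map<string, string>([
  ["deal-1", lightweightHash("2024-06-01", { status: "negotiation", price: 500000 })],
  ["deal-2", lightweightHash("2024-06-01", { status: "negotiation", price: 1000 })],
]);

const listed: ListedDoc[] = [
  // Unchanged since the last sync.
  { externalId: "deal-1", contentDeferred: true, lightweightHash: lightweightHash("2024-06-01", { status: "negotiation", price: 500000 }) },
  // Status changed, so the lightweight hash differs.
  { externalId: "deal-2", contentDeferred: true, lightweightHash: lightweightHash("2024-06-03", { status: "won", price: 1000 }) },
  // Never seen before.
  { externalId: "deal-3", contentDeferred: true, lightweightHash: lightweightHash("2024-06-03", { status: "new", price: 50 }) },
];

const toFetch = selectDocsToFetch(listed, stored);
```

Only deal-2 (changed) and deal-3 (new) end up in `toFetch`; deal-1 is skipped without a single extra API call.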
Every API has rate limits, and they're almost always lower than documented. The engine uses adaptive rate limiting: it doesn't trust docs, it learns by doing.
- Start conservative: 1 request per second for unknown APIs
- Ramp up: increase throughput by 50% every 10 successful requests
- Back off on 429: when a rate limit is hit, drop to 25% of current rate
- Stabilize: converge on the maximum sustainable throughput for that specific API
Different connectors stabilize at different rates. Gmail settles at ~20 req/s. Confluence at ~5 req/s. Some Russian government APIs stabilize at 0.5 req/s. The engine learns each one independently and persists the learned rate for future syncs (with a slow decay to re-test periodically).
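Those four rules fit in a small class. This is a sketch of the arithmetic only: the starting rate, the 50% ramp every 10 successes, and the drop to 25% on a 429 come from the rules above, while the `AdaptiveRate` name and the floor and ceiling values are assumptions. Persistence and slow decay are left out.

```typescript
class AdaptiveRate {
  requestsPerSecond: number;
  private successesSinceRamp = 0;
  private readonly minRate: number; // assumed floor
  private readonly maxRate: number; // assumed ceiling

  constructor(initialRate = 1, minRate = 0.1, maxRate = 100) {
    this.requestsPerSecond = initialRate;
    this.minRate = minRate;
    this.maxRate = maxRate;
  }

  // Ramp up: +50% after every 10 consecutive successful requests.
  recordSuccess(): void {
    this.successesSinceRamp++;
    if (this.successesSinceRamp >= 10) {
      this.successesSinceRamp = 0;
      this.requestsPerSecond = Math.min(this.requestsPerSecond * 1.5, this.maxRate);
    }
  }

  // Back off: drop to 25% of the current rate when the API answers 429.
  recordRateLimited(): void {
    this.successesSinceRamp = 0;
    this.requestsPerSecond = Math.max(this.requestsPerSecond * 0.25, this.minRate);
  }

  // Pause to insert between requests at the current rate.
  delayMs(): number {
    return 1000 / this.requestsPerSecond;
  }
}
```

Repeated ramp-ups followed by sharp back-offs make the rate oscillate just under the limit the API actually enforces, which is the "stabilize" behavior described above.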
Users don't sync one connector at a time. They connect 10, 20, 50 knowledge sources and expect them all to stay fresh.
The engine runs parallel syncs, up to a configurable concurrency limit per workspace. Each connector sync runs in its own async context with its own rate limiter, its own retry state, and its own syncContext:
Workspace "Acme Corp": 12 connectors syncing
├─ Confluence     [=======>]       67%  Page 34/50
├─ Gmail          [============>]  89%  Page 412/460
├─ AmoCRM         [==>]            18%  Page 3/15
├─ Notion         [==========>]    78%  Page 89/112
├─ 1C:Enterprise  [======>]        52%  Page 7/13
└─ ...7 more      [various progress]
Resource isolation is per-connector. A rate-limited government API crawling at 0.5 req/s doesn't slow down the Gmail sync running at 20 req/s. Each connector's rate limiter is independent.
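One common way to cap concurrency is a small worker pool, sketched below. This is an illustration of the technique, not the engine's scheduler: `runWithConcurrency` and `fakeSync` are invented names, and the timers stand in for real connector syncs.

```typescript
// Runs tasks with at most `limit` in flight at once; results keep task order.
async function runWithConcurrency<T>(tasks: Array<() => Promise<T>>, limit: number): Promise<T[]> {
  const results: T[] = new Array(tasks.length);
  let next = 0;

  // Each worker pulls the next unstarted task until none are left.
  async function worker(): Promise<void> {
    while (next < tasks.length) {
      const index = next++;
      results[index] = await tasks[index]();
    }
  }

  const workerCount = Math.min(limit, tasks.length);
  const workers: Array<Promise<void>> = [];
  for (let i = 0; i < workerCount; i++) workers.push(worker());
  await Promise.all(workers);
  return results;
}

// Demo: fake connector syncs that record how many run at the same time.
let inFlight = 0;
let peak = 0;

function fakeSync(name: string, ms: number): () => Promise<string> {
  return async () => {
    inFlight++;
    peak = Math.max(peak, inFlight);
    await new Promise<void>((resolve) => setTimeout(resolve, ms));
    inFlight--;
    return `${name}: done`;
  };
}

const connectorSyncs = [
  fakeSync("Confluence", 20),
  fakeSync("Gmail", 10),
  fakeSync("Notion", 5),
  fakeSync("AmoCRM", 5),
];
```

Calling `runWithConcurrency(connectorSyncs, 2)` never has more than two syncs in flight, and a slow connector only occupies its own slot while the other worker keeps pulling tasks.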
Every sync run ends with a result summary:
    {
      docsAdded: number      // New documents discovered and embedded
      docsUpdated: number    // Existing documents with changed content
      docsDeleted: number    // Documents no longer present in the source
      docsUnchanged: number  // Documents that haven't changed (skipped)
      docsFailed: number     // Documents that failed after all retries
      error?: string         // Top-level error if the entire sync failed
    }
This isn't just a log line. It's the contract between the sync engine and the user. A sync that reports { docsAdded: 0, docsUpdated: 5, docsDeleted: 0, docsUnchanged: 49995, docsFailed: 0 } on a 50,000-document knowledge base is a triumph of efficiency. Five embeddings computed, 49,995 skipped, zero failures.
When docsFailed > 0, the engine preserves the error details per-document so the user can investigate. A failed document isn't silently dropped; it's flagged in the UI with the specific error.
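A summary like this can be built by tallying per-document outcomes. In the sketch below the counter names match the fields above, while the `Outcome` type, the `SyncSummary` name, and the `failures` list holding per-document errors are assumptions made for illustration.

```typescript
// Hypothetical per-document outcome emitted while a sync runs.
type Outcome =
  | { externalId: string; status: "added" | "updated" | "deleted" | "unchanged" }
  | { externalId: string; status: "failed"; error: string };

// `failures` is an assumed field that keeps the per-document error details.
interface SyncSummary {
  docsAdded: number;
  docsUpdated: number;
  docsDeleted: number;
  docsUnchanged: number;
  docsFailed: number;
  failures: Array<{ externalId: string; error: string }>;
}

function summarize(outcomes: Outcome[]): SyncSummary {
  const summary: SyncSummary = {
    docsAdded: 0,
    docsUpdated: 0,
    docsDeleted: 0,
    docsUnchanged: 0,
    docsFailed: 0,
    failures: [],
  };
  for (const outcome of outcomes) {
    switch (outcome.status) {
      case "added":
        summary.docsAdded++;
        break;
      case "updated":
        summary.docsUpdated++;
        break;
      case "deleted":
        summary.docsDeleted++;
        break;
      case "unchanged":
        summary.docsUnchanged++;
        break;
      case "failed":
        // Keep the error so the UI can show it next to the document.
        summary.docsFailed++;
        summary.failures.push({ externalId: outcome.externalId, error: outcome.error });
        break;
    }
  }
  return summary;
}

const summary = summarize([
  { externalId: "doc-1", status: "unchanged" },
  { externalId: "doc-2", status: "updated" },
  { externalId: "doc-3", status: "unchanged" },
  { externalId: "doc-4", status: "failed", error: "413 payload too large" },
]);
```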
A document that exists in the knowledge base but is absent from the source API has been deleted. The engine detects this after a full listing completes:
1. After processing all pages, the engine queries all stored externalId values for the connector
2. It compares against the set of externalId values returned by the source
3. Documents in the DB but not in the source are marked as deleted (a soft delete: flagged, not removed, so the vector index stays consistent for existing embeddings)
4. Deleted documents are excluded from search results but preserved for audit trails
Incremental sync complicates this. If you last synced at 2:00 PM and only asked for documents modified after 2:00 PM, documents deleted at 2:30 PM won't appear. The engine handles this by periodically running a "full reconciliation" pass (a complete listing that detects deletions) on a configurable schedule: daily by default, hourly for frequently-changing sources.
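Both halves of this, the set difference and the decision to run a full pass, are easy to sketch. `findDeleted` and `needsFullReconciliation` are illustrative names, and tracking a "last full sync" timestamp per connector is an assumption about how the schedule could be implemented.

```typescript
// Stored IDs that the source no longer returned in a complete listing.
function findDeleted(storedIds: string[], listedIds: string[]): string[] {
  const seen = new Set(listedIds);
  return storedIds.filter((id) => !seen.has(id));
}

// Assumed scheduling rule: run a full listing if none has ever run, or if
// the last one is older than the configured interval.
function needsFullReconciliation(lastFullSyncAt: Date | undefined, now: Date, intervalMs: number): boolean {
  if (lastFullSyncAt === undefined) return true;
  return now.getTime() - lastFullSyncAt.getTime() >= intervalMs;
}

const DAY_MS = 24 * 60 * 60 * 1000;

const softDeleted = findDeleted(["a", "b", "c"], ["a", "c"]);
const dueAfter25Hours = needsFullReconciliation(
  new Date("2024-06-01T00:00:00Z"),
  new Date("2024-06-02T01:00:00Z"),
  DAY_MS,
);
```

The difference is only meaningful after a complete listing. Running `findDeleted` against an incremental result would flag every unchanged document as deleted, which is why deletion detection is tied to the reconciliation pass.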
The knowledge base is a shared resource. What happens when User A triggers a manual sync while the scheduled sync is still running? The distributed Redis lock prevents duplicate concurrent syncs. Only one sync runs at a time per connector.
But what about metadata edits? User A might edit a document's tags while the sync engine is updating the document's content. The engine uses optimistic concurrency: it reads the document's updatedAt before writing and compares it. If the document was modified by a user between read and write, the engine merges: user-edited metadata is preserved, engine-updated content is stored.
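The merge rule can be sketched against an in-memory map standing in for the document table. `StoredDoc`, `compareAndSet`, and `writeEngineContent` are illustrative names, and a real store would perform the check and the write as one atomic conditional update rather than two steps.

```typescript
interface StoredDoc {
  externalId: string;
  content: string;
  contentHash: string;
  tags: string[];
  updatedAt: number; // epoch ms, bumped on every write
}

// In-memory stand-in for the document table.
const store = new Map<string, StoredDoc>();

// Writes only if the row still has the updatedAt we read earlier.
function compareAndSet(expectedUpdatedAt: number, next: StoredDoc): boolean {
  const current = store.get(next.externalId);
  if (current !== undefined && current.updatedAt !== expectedUpdatedAt) return false;
  store.set(next.externalId, next);
  return true;
}

// Engine-side write: the engine owns content, the user owns metadata.
function writeEngineContent(snapshot: StoredDoc, content: string, contentHash: string, now: number): StoredDoc {
  const candidate: StoredDoc = { ...snapshot, content, contentHash, updatedAt: now };
  if (compareAndSet(snapshot.updatedAt, candidate)) return candidate;

  // Conflict: a user changed the row after we read it. Keep their metadata
  // and apply only the engine's content fields on top of the latest row.
  const latest = store.get(snapshot.externalId)!;
  const merged: StoredDoc = { ...latest, content, contentHash, updatedAt: now };
  store.set(merged.externalId, merged);
  return merged;
}
```

If a user adds a tag between the engine's read and its write, the stored row ends up with the user's tags and the engine's new content, so neither side's change is lost.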
APIs lie about their rate limits. Every single one. The only reliable source of truth is empirical testing. Our adaptive rate limiter was born from the painful experience of trusting docs.
Incremental sync is a minefield of edge cases. What if the source clock is skewed? What if modifiedAfter returns documents modified exactly at lastSyncAt (duplicate) or one second after (missed)? What if the source deletes and re-creates a document with the same ID? We handle all of these through a combination of content hashing, grace periods, and periodic full reconciliation.
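A grace period can be as simple as querying a little further back than the last sync. The sketch below is one possible reading of that idea; the `incrementalWindowStart` name and the five-minute overlap are assumptions, not the engine's actual values.

```typescript
// Assumed approach: re-query a small overlap before lastSyncAt so that clock
// skew and boundary-second edits are not missed.
function incrementalWindowStart(lastSyncAt: Date, graceMs: number): Date {
  return new Date(lastSyncAt.getTime() - graceMs);
}

const lastSyncAt = new Date("2024-06-01T14:00:00Z");
const modifiedAfter = incrementalWindowStart(lastSyncAt, 5 * 60 * 1000);
```

The overlap means some documents are returned twice across consecutive syncs. Content hashing absorbs that: the duplicates hash identically to what is stored and are skipped.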
Progress visibility is critical for trust. A sync that runs for 45 minutes with no visible progress feels broken. We invested heavily in real-time progress reporting: page numbers, document counts, estimated time remaining. Users trust what they can see.
Parallelism is a force multiplier. A workspace with 20 connectors syncs in the time it takes the slowest connector, not the sum of all connectors. This alone makes the difference between a usable platform and one where "sync" means "go make coffee."
The sync engine is invisible infrastructure. When it works (and it works 99.9% of the time across 10M+ monthly agent executions), nobody thinks about it. And that's exactly the point.