How to build a high-performance RAG pipeline using pgvector 0.7 and AACFlow Knowledge Base: automatic chunking, hybrid search, HNSW indexes, and sub-200ms median responses at 10M document scale.
Most RAG implementations never make it to production. They work fine with 10,000 documents in a demo, then collapse under a real corpus: slow queries, irrelevant results, and the operational overhead of a separate vector database. This post explains how AACFlow's Knowledge Base, backed by pgvector 0.7, addresses these problems, and walks through building a production-grade RAG pipeline with a median response time under 200ms on 10 million documents.
The instinct when building RAG is to reach for a dedicated vector database — Pinecone, Qdrant, Weaviate. These are strong products, but they add infrastructure complexity that most teams do not need and cannot maintain well.
pgvector 0.7 changes the calculus. It ships two mature index types: IVFFlat and HNSW. HNSW (Hierarchical Navigable Small World) is a graph-based approximate nearest-neighbor index whose query cost grows roughly logarithmically, not linearly, with collection size; a 10M-vector search completes in under 5ms at the median on commodity hardware. The trade-off between accuracy and speed is configurable via the hnsw.ef_search parameter.
The decisive advantage of pgvector is that embeddings live in the same database as the rest of your application data. This means:
- SQL joins work: filter by document owner, workspace, tag, or any metadata field with a standard WHERE clause.
- Transactions work: embed and store in a single atomic operation, with no risk of the embedding succeeding while document storage fails.
- EXPLAIN ANALYZE works: you can profile and optimize vector queries with the same tools you use for everything else.
- One less service to operate: no separate vector database cluster, backup policy, authentication layer, or cost line.
For most production deployments, the combination of pgvector + PostgreSQL outperforms a separate vector database when total system complexity is factored in.
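To make the "SQL joins work" point concrete, here is a minimal sketch of a metadata-filtered similarity query. The table and column names (`chunks`, `documents`, `workspace_id`, `owner_id`, `embedding`) are hypothetical and are not AACFlow's actual schema. `<=>` is pgvector's cosine-distance operator, and the `%s` placeholders assume a psycopg-style driver.

```python
def build_filtered_search(query_embedding, workspace_id, owner_id, limit=20):
    """Return (sql, params) for a vector search restricted by ordinary SQL predicates."""
    # pgvector accepts a text literal of the form '[0.1,0.2,...]' cast to vector.
    vector_literal = "[" + ",".join(repr(float(x)) for x in query_embedding) + "]"
    sql = (
        "SELECT c.id, c.content, c.embedding <=> %s::vector AS distance\n"
        "FROM chunks AS c\n"
        "JOIN documents AS d ON d.id = c.document_id\n"  # hypothetical schema
        "WHERE d.workspace_id = %s AND d.owner_id = %s\n"
        "ORDER BY c.embedding <=> %s::vector\n"
        "LIMIT %s"
    )
    # The literal is passed twice: once for the SELECT list, once for ORDER BY.
    params = (vector_literal, workspace_id, owner_id, vector_literal, limit)
    return sql, params
```

The returned pair can be handed to a cursor's `execute(sql, params)`, and the same statement can be prefixed with EXPLAIN ANALYZE to see whether the planner uses the vector index.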
AACFlow's Knowledge Base is a first-class product feature built on pgvector. When you add documents — via file upload, Google Drive connector, Confluence sync, or the API — AACFlow handles the full ingestion pipeline automatically.
Automatic chunking splits documents into semantically coherent pieces. The default chunk size is 512 tokens with a 64-token overlap. For structured documents (code files, tables, legal contracts), AACFlow uses structure-aware chunking that respects paragraph and section boundaries rather than cutting arbitrarily at token limits.
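As a rough illustration of the default settings, the sketch below implements plain fixed-size chunking with overlap. It is a baseline only, not AACFlow's structure-aware chunker, and it assumes the document has already been tokenized into a list (for example by the embedding model's tokenizer).

```python
def chunk_tokens(tokens, size=512, overlap=64):
    """Split a token list into windows of `size` tokens that share `overlap` tokens."""
    if not 0 <= overlap < size:
        raise ValueError("overlap must be non-negative and smaller than size")
    step = size - overlap  # each new window starts 448 tokens after the previous one
    chunks = []
    start = 0
    while start < len(tokens):
        chunks.append(tokens[start:start + size])
        if start + size >= len(tokens):
            break  # this window already reaches the end of the document
        start += step
    return chunks


chunks = chunk_tokens(list(range(1000)))
print([len(c) for c in chunks])
```

With the defaults, a 1,000-token document yields three chunks, and the last 64 tokens of each chunk reappear as the first 64 tokens of the next.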
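Embedding is configurable per Knowledge Base. AACFlow supports OpenAI text-embedding-3-large (3072 dimensions, highest quality), text-embedding-3-small (1536 dimensions, 5x cheaper), and Gemini text-embedding-004 (768 dimensions). The choice of embedding model is a trade-off between retrieval quality and cost; for most business document use cases, text-embedding-3-small hits the sweet spot. Dimension count also matters for indexing: pgvector indexes the vector type only up to 2,000 dimensions, so 3072-dimension embeddings need the half-precision halfvec type (indexable up to 4,000 dimensions, added in 0.7) or a reduced output dimension.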
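Dimension count also drives storage. The arithmetic below is a back-of-the-envelope sketch based on pgvector's documented size for the single-precision vector type (4 bytes per dimension plus 8 bytes). The figure of 10 million vectors is illustrative only, and the model keys are just labels.

```python
MODELS = {
    "text-embedding-3-large": 3072,
    "text-embedding-3-small": 1536,
    "text-embedding-004": 768,
}


def vector_bytes(dimensions):
    # pgvector's vector type: 4 bytes per dimension plus 8 bytes of overhead.
    return 4 * dimensions + 8


def corpus_gigabytes(dimensions, num_vectors):
    # Raw vector storage only; index and row overhead are not included.
    return vector_bytes(dimensions) * num_vectors / 1e9


for name, dims in MODELS.items():
    print(f"{name}: {corpus_gigabytes(dims, 10_000_000):.1f} GB for 10M vectors")
```

Halving the dimension count roughly halves raw vector storage, which is part of why the smaller models are attractive at scale.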
Tag-based filtering lets you scope retrieval to subsets of your knowledge base. Tag a set of documents as legal or product-specs and your agent will only search within that subset. This is implemented as a pgvector prefilter using a GIN index on the tags array — it does not degrade vector search performance.
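A tag filter of this kind might look like the following sketch. The `chunks` table, its `tags` text[] column, and the index name are assumptions made for illustration; `@>` is PostgreSQL's array "contains" operator, which a GIN index can serve.

```python
# Hypothetical DDL: a GIN index over a text[] column named tags.
TAG_INDEX_SQL = "CREATE INDEX chunks_tags_gin ON chunks USING gin (tags)"


def tag_filter(tags):
    """Return (predicate, params) matching rows whose tags array contains every given tag."""
    if not tags:
        return "TRUE", ()  # no tags requested: do not restrict the search
    # psycopg-style drivers adapt a Python list to a PostgreSQL array.
    return "tags @> %s::text[]", (list(tags),)


predicate, params = tag_filter(["legal"])
```

The predicate can be added to the WHERE clause of the similarity query so that only chunks tagged legal are considered.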
A production RAG workflow in AACFlow has four stages:
Stage 1 — Embed the user question. The incoming query is embedded using the same model as the knowledge base. AACFlow's Agent block handles this transparently when you connect a Knowledge Base tool — you do not write embedding code.
Stage 2 — Hybrid retrieval: vector search + BM25. AACFlow performs parallel retrieval: HNSW vector similarity search for semantic matches, and BM25 full-text search for exact keyword matches. The results are merged using Reciprocal Rank Fusion (RRF). Hybrid search consistently outperforms pure vector search by 8–15% on NDCG@10 in enterprise document benchmarks, particularly for queries that contain specific product names, SKUs, or technical identifiers that embeddings handle poorly.
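Reciprocal Rank Fusion is simple enough to sketch in a few lines: each document scores 1 / (k + rank) in every list where it appears, and the scores are summed. The constant k = 60 is the value commonly used in the literature; whether AACFlow uses the same value is an assumption.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings, k=60):
    """Fuse ranked lists of document ids; each list is ordered best-first."""
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first; ties are broken by id so the output is deterministic.
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))


vector_hits = ["a", "b", "c"]  # semantic matches, best first
bm25_hits = ["c", "a", "d"]    # keyword matches, best first
fused = reciprocal_rank_fusion([vector_hits, bm25_hits])
print(fused)  # ['a', 'c', 'b', 'd']
```

Documents that appear in both lists ("a" and "c") rise above those found by only one retriever, which is the behavior that helps with exact identifiers such as SKUs.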
Stage 3 — Re-rank. The top-20 candidates from hybrid retrieval pass through a cross-encoder re-ranker. AACFlow integrates Cohere Rerank and supports custom re-ranking with any model via the HTTP tool. Re-ranking reduces the candidate set to the top-5 most relevant chunks before the LLM generation step.
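The shape of the re-ranking step can be sketched independently of any particular model. In the sketch below, `score_fn` stands in for a cross-encoder or a hosted re-ranker; the `term_overlap` scorer is a toy placeholder used only to make the example runnable, not a substitute for a real re-ranking model.

```python
def rerank(query, candidates, score_fn, top_n=5):
    """Keep the `top_n` candidates with the highest relevance score for `query`."""
    scored = [(score_fn(query, text), text) for text in candidates]
    # Stable sort: candidates with equal scores keep their retrieval order.
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [text for _, text in scored[:top_n]]


def term_overlap(query, text):
    # Toy scorer: fraction of query terms that appear in the candidate.
    terms = set(query.lower().split())
    return len(terms & set(text.lower().split())) / max(len(terms), 1)


candidates = [
    "reset your password here",
    "billing overview",
    "password reset steps for admins",
]
top = rerank("password reset", candidates, term_overlap, top_n=2)
```

In a real pipeline the 20 hybrid-retrieval candidates would go in and the 5 best would come out, with the scorer making a network call or running a model.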
Stage 4 — Claude generates the answer. The re-ranked chunks are passed to an Agent block configured with Claude Sonnet 4.6. The system prompt instructs the model to cite sources and indicate confidence. The response is streamed back to the user in real time.
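One way to set up citations is to number the chunks in the prompt so the model can refer to them as [1], [2], and so on. The sketch below assumes each chunk is a dict with `source` and `text` keys; the prompt wording is illustrative and is not AACFlow's built-in system prompt.

```python
def build_grounded_prompt(question, chunks):
    """Format retrieved chunks as numbered sources the model can cite as [n]."""
    sources = "\n".join(
        f"[{i}] ({chunk['source']}) {chunk['text']}"
        for i, chunk in enumerate(chunks, start=1)
    )
    return (
        "Answer the question using only the sources below. "
        "Cite sources as [n] and state how confident you are.\n\n"
        f"Sources:\n{sources}\n\nQuestion: {question}"
    )


prompt = build_grounded_prompt(
    "What is the refund window?",
    [{"source": "policy.pdf", "text": "Refunds within 30 days."}],
)
```

Numbered sources make it straightforward to map citations in the answer back to the chunks that were retrieved.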
Here is the workflow structure in AACFlow:
1. Webhook Trigger: receives the user's question from your application
2. Knowledge Base Search block: hybrid retrieval with tag filter and top-20 candidates
3. HTTP block: calls Cohere Rerank API with the query and candidates
4. Agent block (Claude Sonnet 4.6): generates the grounded answer with citations
5. Response block: returns structured JSON with answer, sources, and confidence
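For readers who think in code, the workflow above maps onto a small orchestration function. This is only a sketch of the data flow: the `search`, `rerank`, and `generate` callables are hypothetical stand-ins for the corresponding blocks, and the response keys mirror the fields named in the Response block.

```python
def answer_question(question, search, rerank, generate, tags=None):
    """Run the workflow stages in order; each stage is an injected callable."""
    candidates = search(question, tags=tags, limit=20)   # Knowledge Base Search block
    top_chunks = rerank(question, candidates, top_n=5)   # re-ranking step
    answer, confidence = generate(question, top_chunks)  # Agent block
    return {                                             # Response block payload
        "answer": answer,
        "sources": [chunk["source"] for chunk in top_chunks],
        "confidence": confidence,
    }
```

Injecting the stages as callables keeps the sketch testable with stubs and makes it clear where each block's output feeds the next.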
pgvector 0.7 supports two index types. Choosing correctly makes a significant difference at scale.
IVFFlat partitions vectors into lists (clusters) and searches only the lists closest to the query vector. It is faster to build and uses less memory than HNSW, but recall depends on how many lists are probed per query and can fall below 95% when the index has many lists and only a few are probed. Use IVFFlat when your collection is under 1M vectors and you rebuild the index frequently (e.g., re-indexing nightly).
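The pgvector README suggests starting points for IVFFlat: lists = rows / 1000 for up to 1M rows, sqrt(rows) above that, and probes = sqrt(lists) at query time. The sketch below turns that guidance into code; the table and column names passed in are placeholders, and cosine distance (`vector_cosine_ops`) is an assumption.

```python
import math


def ivfflat_params(row_count):
    """Starting values for `lists` and `ivfflat.probes`, following the pgvector README."""
    if row_count <= 1_000_000:
        lists = max(1, row_count // 1000)
    else:
        lists = round(math.sqrt(row_count))
    probes = max(1, round(math.sqrt(lists)))
    return lists, probes


def ivfflat_sql(table, column, row_count):
    lists, probes = ivfflat_params(row_count)
    return [
        f"CREATE INDEX ON {table} USING ivfflat ({column} vector_cosine_ops) "
        f"WITH (lists = {lists})",
        f"SET ivfflat.probes = {probes}",  # more probes: higher recall, slower queries
    ]


print(ivfflat_sql("chunks", "embedding", 500_000))
```

These are starting points to tune against measured recall, not fixed rules.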
HNSW builds a multi-layer graph structure. Query latency grows only slowly (roughly logarithmically) with collection size, so a 10M-vector HNSW index queries almost as fast as a 100K-vector one. Build time is 3–5x longer than IVFFlat and memory usage is higher (~200 bytes per vector), but for production collections that are updated incrementally (not rebuilt from scratch), HNSW is the clear choice.
AACFlow's Knowledge Base uses HNSW by default for all collections over 50K documents, with the following index parameters tuned for balance between speed and recall:
At query time, AACFlow runs SET hnsw.ef_search = 40 for the session, which yields 98.5% recall on standard benchmarks while keeping median query latency under 5ms.
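For orientation, an HNSW setup in pgvector generally looks like the sketch below. The values m = 16 and ef_construction = 64 are pgvector's documented defaults and are assumptions here, not AACFlow's tuned parameters; the table and column names and the choice of cosine distance are likewise placeholders. An ef_search of 40 matches the session setting described above and is also pgvector's default.

```python
def hnsw_sql(table, column, m=16, ef_construction=64, ef_search=40):
    """Return statements that build an HNSW index and set the query-time search width."""
    return [
        f"CREATE INDEX ON {table} USING hnsw ({column} vector_cosine_ops) "
        f"WITH (m = {m}, ef_construction = {ef_construction})",
        # Larger ef_search raises recall at the cost of query latency.
        f"SET hnsw.ef_search = {ef_search}",
    ]


for statement in hnsw_sql("chunks", "embedding"):
    print(statement)
```

Raising `m` or `ef_construction` improves recall at the cost of build time and memory, while `ef_search` can be adjusted per session without rebuilding the index.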
The optimal chunk size depends on your document type and query pattern.
Short, factual queries (support FAQs, product specs) perform best with smaller chunks — 256–512 tokens. The retrieved chunks are dense with relevant information and the LLM wastes less context window on irrelevant surrounding content.
Long, analytical queries (legal research, technical documentation) benefit from larger chunks — 512–1024 tokens — because the answer often requires understanding multi-sentence arguments or sequential steps.
AACFlow allows per-Knowledge-Base chunk size configuration. A practical starting point is 512 tokens for mixed document types. Evaluate retrieval quality with your actual queries using the built-in Knowledge Base evaluation view before tuning.
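If you want to compare chunk sizes offline as well, NDCG@k is a reasonable metric to track. The sketch below is a generic implementation with linear gains, not the AACFlow evaluation view; it assumes you have graded relevance labels (for example 0 to 3) for the chunks returned for each test query, listed in ranked order.

```python
import math


def ndcg_at_k(relevances, k=10):
    """NDCG@k for one query; `relevances` are graded labels in ranked order."""
    def dcg(values):
        # Gain is discounted by log2(rank + 1), with ranks starting at 1.
        return sum(rel / math.log2(rank + 1) for rank, rel in enumerate(values, start=1))

    ideal = dcg(sorted(relevances, reverse=True)[:k])
    return dcg(relevances[:k]) / ideal if ideal > 0 else 0.0


print(ndcg_at_k([3, 2, 1]))  # 1.0, because the ranking is already ideal
```

Averaging this score over a fixed set of real queries for each candidate chunk size gives a like-for-like comparison before you change the Knowledge Base configuration.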
AACFlow was benchmarked on a 10M document knowledge base hosted on a standard production PostgreSQL instance (8 vCPU, 32GB RAM):
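| Stage                        | Latency (p50) | Latency (p99) |
|------------------------------|---------------|---------------|
| Query embedding              | 18ms          | 28ms          |
| HNSW vector search           | 4ms           | 9ms           |
| BM25 full-text search        | 3ms           | 7ms           |
| RRF merge                    | 1ms           | 2ms           |
| Cohere Rerank (top-20→5)     | 45ms          | 80ms          |
| Claude Sonnet 4.6 generation | 95ms          | 180ms         |
| Total end-to-end             | 166ms         | 306ms         |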
The p50 end-to-end latency of 166ms is well within the 200ms target. The p99 of 306ms exceeds it, driven mostly by LLM generation (180ms) and re-ranking (80ms); the retrieval steps themselves (vector search, BM25, and RRF merge) total under 30ms at p99.
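The totals can be checked by summing the per-stage figures from the benchmark. The sketch below does that arithmetic; the dictionary keys are shorthand labels. Treating the end-to-end p99 as a sum of per-stage p99 values follows the way the totals are reported here and is an approximation, and because vector and BM25 search run in parallel, a plain sum is slightly conservative.

```python
STAGES = {  # stage: (p50_ms, p99_ms), taken from the benchmark figures
    "query_embedding": (18, 28),
    "hnsw_search": (4, 9),
    "bm25_search": (3, 7),
    "rrf_merge": (1, 2),
    "rerank": (45, 80),
    "generation": (95, 180),
}
RETRIEVAL = ("hnsw_search", "bm25_search", "rrf_merge")


def total(stage_names, index):
    """Sum one latency column (0 = p50, 1 = p99) over the given stages."""
    return sum(STAGES[name][index] for name in stage_names)


print("end-to-end p50/p99:", total(STAGES, 0), total(STAGES, 1))        # 166 306
print("retrieval  p50/p99:", total(RETRIEVAL, 0), total(RETRIEVAL, 1))  # 8 18
```

Generation and re-ranking together account for 260ms of the 306ms p99, so those two stages are where latency work pays off first.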
Setting up a production RAG pipeline in AACFlow takes under 30 minutes:
1. Create a Knowledge Base under Workspace → Knowledge. Select your embedding model and configure chunk size.
2. Connect your documents via the built-in connectors (Google Drive, Confluence, Notion, S3) or upload directly.
3. Add the Knowledge Base tool to your Agent block and select the Knowledge Base you created.
4. Configure tag filters to scope retrieval if you have multiple document categories.
5. Enable hybrid search in the Knowledge Base settings. It is off by default to minimize setup friction.
The complexity that would otherwise require building and operating a separate embedding service, vector database, and re-ranking layer is handled by AACFlow. You focus on the workflow logic and the quality of your prompts.
pgvector 0.7 is production-ready. Combined with AACFlow's Knowledge Base, it eliminates the need for a separate vector database in the vast majority of enterprise RAG use cases.