RAG System Architecture Guide

What is a RAG system?

Retrieval-Augmented Generation (RAG) grounds an LLM's answers in a specific corpus of documents. The system retrieves relevant passages at query time, and the model writes its answer from them, citing where each piece came from.

That description works just as well for a weekend prototype as it does for something running in production. The gap between the two is in the parts most tutorials skip. Real documents are scanned badly, formatted inconsistently, and full of structure specific to whatever domain they came from. A good number of them also contain data you can't afford to mishandle. A system built for production has to parse through that mess, chunk it without losing meaning, retrieve with more than one similarity score, keep sensitive fields out of the retrieval layer entirely, and log enough about its own behavior to be debugged and tested, not just trusted.

This guide lays out a reference architecture for exactly that kind of system. It's organized around four layers:

Ingestion and preprocessing
Embedding and hybrid retrieval
An agentic orchestration layer
Observability

Two concerns cut across all four: protecting sensitive data, and making the whole pipeline traceable and testable enough that "it seems to work" is never the bar. The architecture also carries across verticals. I've used the same pipeline shape for legal contracts, clinical notes, financial filings, and internal operational data. What changes between verticals is the chunking rules and the extraction schema, not the architecture itself.

Most RAG failures I've run into trace back to retrieval, not the LLM. If the wrong chunks make it to the model, no amount of prompt tweaking gets you the right answer. Everything below is in service of getting retrieval right first, and keeping it safe and observable once it does.

Architecture Overview

In implementation terms, those four layers break down into five components. Three do their heavy work once, at ingestion or startup, and cache the output. Two run on every single request:

Component	Responsibility	Runs
Parser	Reads raw source files (OCR JSON, PDFs, scans), reconstructs text and per-block confidence, classifies document type, chunks by structure.	Ingestion
Extractor	One LLM call per document, producing a structured JSON profile (parties, amounts, dates, status flags) per document type. Cached to disk.	Ingestion
Retrieval engine	Encodes chunks into embeddings, builds a lexical (BM25) index, loads a cross-encoder for reranking. Exposes a single hybrid search function.	Startup, cached
Agent (planner + tools + solver)	Turns a natural-language query into a bounded tool plan, executes deterministic tools, synthesizes a cited answer.	Per request
Metrics & tracing	Records latency, token cost, and retrieval relevance per request; traces every LLM and tool call for debugging.	Per request

The dividing line between "ingestion" and "per request" matters more than it looks. Parsing, chunking, and profile extraction are expensive: LLM calls, embedding computation, real wall-clock time. But the corpus rarely changes mid-session, so all of that runs once and gets cached to disk. Only retrieval over the cached index and the agent's two LLM calls run per user query. Get that split right and per-request latency stays in the low seconds. Get it wrong, and every query pays for work that should have happened once.

I. Ingestion & Preprocessing

Parsing Raw Documents

Real document corpora rarely arrive as clean text. Scanned paperwork comes back as OCR JSON (Google Vision, AWS Textract, Azure Document Intelligence) with text reconstructed from pages → blocks → paragraphs → words → symbols, each carrying its own confidence score. The parser's job is to walk that structure, reconstruct readable text per block, and compute a mean confidence per block. That confidence score becomes a first-class field that follows the chunk all the way to the final answer, not a debug log line nobody reads.
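
As a rough illustration of that walk, the sketch below assumes a Google Vision style response already loaded as a Python dict. The field names (`pages`, `blocks`, `paragraphs`, `words`, `symbols`, `text`, `confidence`) follow that layout but are assumptions here, since other OCR providers name and nest things differently. `reconstruct_blocks` is a hypothetical helper, not part of any SDK.

```python
from statistics import mean


def reconstruct_blocks(ocr: dict) -> list[dict]:
    """Rebuild readable text and a mean symbol confidence for each OCR block."""
    blocks = []
    for page_no, page in enumerate(ocr.get("pages", []), start=1):
        for block in page.get("blocks", []):
            words, confidences = [], []
            for paragraph in block.get("paragraphs", []):
                for word in paragraph.get("words", []):
                    symbols = word.get("symbols", [])
                    words.append("".join(s.get("text", "") for s in symbols))
                    confidences.extend(
                        s["confidence"] for s in symbols if "confidence" in s
                    )
            blocks.append({
                "page": page_no,
                "text": " ".join(w for w in words if w),
                # None (not 0.0) when the OCR engine reported no confidence at all.
                "confidence": round(mean(confidences), 3) if confidences else None,
            })
    return blocks


# Tiny hand-made sample in the assumed layout.
sample = {"pages": [{"blocks": [{"paragraphs": [{"words": [
    {"symbols": [{"text": "I", "confidence": 0.9}, {"text": "D", "confidence": 0.7}]},
    {"symbols": [{"text": "7", "confidence": 0.5}]},
]}]}]}]}
blocks = reconstruct_blocks(sample)
```

Keeping `confidence` on the block record, rather than logging it, is what lets it travel with the chunk later on.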

Document type classification runs before chunking, since chunking strategy depends on it. A two-pass heuristic works well in practice: check filename keywords first, then fall back to content keywords in the first few hundred characters. It has to tolerate OCR corruption too. A national ID scanned badly enough to read "CARTE MATIONALE" instead of "CARTE NATIONALE" should still classify correctly through fuzzy keyword matching, not an exact string match that quietly fails on every third scan.
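
One way to sketch the two-pass heuristic with only the standard library is `difflib.SequenceMatcher` over a sliding window. The keyword table, the 0.85 threshold, and the 300-character head are illustrative assumptions, not tuned values.

```python
from difflib import SequenceMatcher

# Hypothetical keyword table; a real one would be maintained per vertical.
DOC_TYPE_KEYWORDS = {
    "national_id": ["carte nationale", "identity card"],
    "invoice": ["invoice", "facture"],
    "contract": ["contract", "agreement"],
}


def fuzzy_contains(text: str, keyword: str, threshold: float = 0.85) -> bool:
    """True if some window of len(keyword) in text is similar enough to keyword."""
    text, keyword = text.lower(), keyword.lower()
    n = len(keyword)
    if len(text) < n:
        return SequenceMatcher(None, text, keyword).ratio() >= threshold
    return any(
        SequenceMatcher(None, text[i:i + n], keyword).ratio() >= threshold
        for i in range(len(text) - n + 1)
    )


def classify(filename: str, content: str, head_chars: int = 300) -> str:
    """Pass 1: filename keywords. Pass 2: keywords in the first few hundred characters."""
    for source in (filename, content[:head_chars]):
        for doc_type, keywords in DOC_TYPE_KEYWORDS.items():
            if any(fuzzy_contains(source, kw) for kw in keywords):
                return doc_type
    return "unknown"


# The corrupted scan still classifies, where an exact substring test would miss it.
doc_type = classify("scan_0042.json", "REPUBLIQUE FRANCAISE CARTE MATIONALE D'IDENTITE")
```

The sliding window is quadratic in the worst case, which is acceptable here only because it runs over a filename and a short head of the document.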

Chunking Strategy by Document Class

The biggest mistake in RAG preprocessing is applying one uniform chunk size to every document. A fixed-size splitter (e.g., 500 tokens with overlap) is a reasonable default for unstructured prose. But it cuts through the middle of clauses, line items, and structured fields, which is exactly the content people ask precise questions about. Chunk by document structure instead of by a fixed size, and the citation unit stays intact:

Document class	Examples across verticals	Chunking strategy	Result
Long structured / clause-based	Contracts, insurance policies, loan agreements, terms of service	Regex split on numbered articles / section headers (e.g. "ARTICLE 4 —", "SECTION 12.")	~10-20 chunks/doc
Semi-structured technical report	Lab results, energy/compliance certificates, inspection reports, clinical notes	Split on ALL-CAPS or bolded section headers ≥ 5 characters	~4-6 chunks/doc
Short single-topic	ID cards, licenses, utility bills, certificates of insurance	Always one chunk per document	1 chunk/doc
Tabular / line-item	Invoices, claims line items, account statements	Table-aware parsing, one chunk per table or row group	Varies by table size

Fallback cascade: header-based regex chunking fails whenever a document doesn't match the expected template — a different contract format, a report missing its usual headers, that sort of thing. The pipeline needs a graceful degradation path. If header regex matches nothing, fall back to OCR block boundaries from the source layout. If that also comes up empty, fall back to treating the whole document as one chunk. A chunking failure should degrade retrieval quality. It should never crash ingestion.

Citation-stable IDs: every chunk carries a stable, human-readable ID encoding its exact location, like invoice_2024_0143#line_items or contract_A118#section_12. This ID is the unit the retrieval layer returns and the unit the solver is instructed to cite. Without it, "the answer came from somewhere in this 40-page document" is the best a system can do; with it, the answer points to an exact, verifiable section.
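
The fallback cascade and the stable IDs can be sketched together. The header pattern below is an assumption modeled on the "ARTICLE 4" and "SECTION 12." examples, and the ID suffixes (`#article_01`, `#block_01`, `#full`, `#preamble`) are hypothetical naming choices.

```python
import re

# Assumed header pattern; real corpora need one pattern set per document class.
HEADER_RE = re.compile(r"^(ARTICLE|SECTION)\s+(\d+)\b.*$", re.MULTILINE)


def chunk_document(doc_id: str, text: str, ocr_blocks: list[str]) -> list[dict]:
    """Header regex first, then OCR block boundaries, then the whole document."""
    matches = list(HEADER_RE.finditer(text))
    if matches:
        # Keep any text before the first header instead of silently dropping it.
        preamble = text[:matches[0].start()].strip()
        chunks = [{"id": f"{doc_id}#preamble", "text": preamble}] if preamble else []
        for i, m in enumerate(matches):
            end = matches[i + 1].start() if i + 1 < len(matches) else len(text)
            label = f"{m.group(1).lower()}_{int(m.group(2)):02d}"
            chunks.append({"id": f"{doc_id}#{label}", "text": text[m.start():end].strip()})
        return chunks
    blocks = [b.strip() for b in ocr_blocks if b.strip()]
    if blocks:
        return [{"id": f"{doc_id}#block_{i:02d}", "text": b} for i, b in enumerate(blocks, 1)]
    # Last resort: degrade retrieval quality rather than crash ingestion.
    return [{"id": f"{doc_id}#full", "text": text.strip()}]


contract = "ARTICLE 1 - Parties\nJane Doe and Acme.\nARTICLE 2 - Price\nThe sum of $285,000."
ids = [c["id"] for c in chunk_document("contract_A118", contract, [])]
```

A document that repeats a section number would produce duplicate IDs in this sketch, so a real implementation would need a disambiguating suffix.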

Structured Extraction

Chunking solves retrieval; it doesn't solve structured questions like "which documents are missing" or "is this certificate expired." Those need a second pass: one LLM call per document that reconstructs the full document text from its chunks and extracts a structured JSON profile, using a prompt tailored to that document type's expected fields (parties, amounts, dates, expiry flags, status).

Cache aggressively. Profiles are stable between corpus updates. Re-extracting on every request costs one LLM call per document, per query, which doesn't scale. Extract once at ingestion, write the result to disk keyed by document ID, and skip any document that already has a cached profile on subsequent runs. The upgrade path, once the corpus becomes dynamic (new documents arriving continuously), is incremental extraction triggered on document arrival rather than a full batch re-run.
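
A minimal sketch of that cache, assuming one JSON file per document ID. The `extract` argument stands in for the per-document LLM call, and `get_profile` and the file layout are hypothetical.

```python
import json
from pathlib import Path
from typing import Callable


def get_profile(doc_id: str, text: str, cache_dir: Path,
                extract: Callable[[str], dict]) -> dict:
    """Return the cached profile for doc_id, calling extract() only on a cache miss."""
    cache_dir.mkdir(parents=True, exist_ok=True)
    path = cache_dir / f"{doc_id}.json"
    if path.exists():
        return json.loads(path.read_text(encoding="utf-8"))
    profile = extract(text)  # the one LLM call per document
    # Write to a temp file first so an interrupted run never leaves a half-written profile.
    tmp = path.with_suffix(".json.tmp")
    tmp.write_text(json.dumps(profile, ensure_ascii=False), encoding="utf-8")
    tmp.replace(path)
    return profile
```

Incremental extraction on document arrival can reuse the same function: calling it for a new document ID is a cache miss, and everything already profiled is skipped.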

Handling Sensitive Data

Most write-ups on RAG skip this step entirely, which is odd given how much of a real corpus is exactly the kind of data you don't want sitting in a vector index unprotected. Full account numbers, payment card numbers, national ID numbers, medical record numbers. None of that belongs anywhere near a retrieval layer that a poorly-scoped prompt, an overly broad tool call, or a compromised vector store could expose.

The fix is a redaction step between parsing and everything downstream of it:

Run a PII detector over every document before chunking, not after. Sensitive spans that get split across chunk boundaries are much harder to catch reliably.
Redact or tokenize sensitive fields (full PANs, account numbers, SSNs), replacing each with a stable placeholder token rather than a blackout mark, so downstream extraction can still reason about "an SSN was present here" without ever seeing the value.
Store the token → real value mapping in an encrypted vault, kept entirely separate from the retrieval layer. If the vector store or an LLM's context window is ever exposed, the mapping isn't sitting next to it.
Only clean, tokenized text should ever reach the vector DB or an LLM prompt. A chunk that fails tokenization for any reason should fail closed: held for manual review, never indexed as-is.
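
The steps above can be sketched as follows. The two regexes are deliberately crude stand-ins for a real PII detector, the in-memory `vault` dict stands in for an encrypted store, and the token format is a hypothetical choice. An HMAC keeps placeholders stable for the same value without making them reversible.

```python
import hashlib
import hmac
import re

# Illustrative patterns only; a production system would use a dedicated PII detector.
PII_PATTERNS = {
    "SSN": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
    "PAN": re.compile(r"\b\d(?:[ -]?\d){12,18}\b"),
}


def tokenize(text: str, key: bytes, vault: dict[str, str]) -> str:
    """Replace each sensitive span with a stable placeholder; record the mapping in vault."""
    def substitute(kind: str):
        def _sub(match: re.Match) -> str:
            digest = hmac.new(key, match.group().encode(), hashlib.sha256).hexdigest()[:10]
            token = f"[{kind}_{digest}]"
            vault[token] = match.group()  # stand-in for a write to the encrypted vault
            return token
        return _sub

    for kind, pattern in PII_PATTERNS.items():
        text = pattern.sub(substitute(kind), text)
    return text


def safe_to_index(text: str) -> bool:
    """Fail closed: anything that still matches a PII pattern is held for review."""
    return not any(p.search(text) for p in PII_PATTERNS.values())


vault: dict[str, str] = {}
raw = "SSN 123-45-6789, card 4111 1111 1111 1111."
clean = tokenize(raw, key=b"demo-key-not-for-production", vault=vault)
```

The `safe_to_index` check runs after tokenization as a second gate, so a chunk that slipped past the first pass is held back instead of indexed.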

This is the one part of the architecture I'd fight hardest to keep a hard requirement instead of a config flag someone quietly leaves off in staging. A mediocre retrieval score is a Tuesday-afternoon problem. A real SSN sitting in a vector index because someone skipped the redaction step is the kind of mistake that ends the project. Or worse.

Why Embeddings Alone Aren't Enough

Embeddings give the system semantic search: matching a query to a chunk by meaning rather than exact words. That matters for three concrete reasons that show up in every document-heavy vertical:

Users don't phrase questions in the document's words. A user asks "what's the sale price?" but the contract says "for the principal sum of $285,000." There's no shared keyword — but in embedding space, both land in the same semantic neighborhood.
Source text is unreliable. Scanned documents contain OCR corruption. Keyword matching breaks on garbled tokens; embeddings degrade more gracefully because surrounding context still places the chunk roughly correctly.
Domain synonyms and paraphrase. "Buyer", "purchaser", "the acquiring party" — a good embedding model maps these near each other without anyone maintaining a synonym list by hand.

But embeddings are weak at exactly the things lexical search is strong at: account numbers, article references, product SKUs, proper names, legal codes. A query for "clause 14.3" should retrieve the chunk containing that literal string with certainty. An embedding model has no obligation to rank it first. This is the core justification for hybrid retrieval, covered next.

II. Embedding & Hybrid Retrieval

Embedding Model & Vector Storage

For a self-hosted stack, a local sentence-transformers model (e.g. a multilingual MiniLM variant) producing normalized vectors is a solid default: it's free per-call, runs on CPU, and caches its weights locally after first load. A hosted embedding API (OpenAI, Cohere, Voyage) trades a per-call fee for simpler ops and, in some cases, stronger retrieval quality. It's worth revisiting if precision becomes the bottleneck at scale.

Vector storage follows the same right-sizing logic. A numpy array with a brute-force dot product (cosine similarity, since vectors are normalized) comfortably serves corpora up to the low tens of thousands of chunks, in-memory, with zero service dependency. A dedicated vector database (Pinecone, Qdrant, pgvector) earns its operational cost once the corpus grows past that point, or once filtered approximate nearest-neighbor search at consistent low latency becomes a hard requirement.

Three-Stage Hybrid Search

Cosine-only retrieval is one of the most common shortcomings in production RAG systems: it misses exact lexical matches. The fix is a three-stage pipeline that fuses semantic and lexical retrieval, then reranks the result with a model that actually looks at query and document together:

Stage	What it does	Cost
1. Parallel retrieval	Cosine similarity over normalized embeddings and BM25 over tokenized chunk text independently rank the same (optionally filtered) chunk set.	Milliseconds at <100k chunks
2. RRF fusion	Top-N from each ranked list is unioned and scored by Reciprocal Rank Fusion: `1/(60 + rank_cosine) + 1/(60 + rank_bm25)`. The constant 60 is the standard value from the original RRF paper (Cormack, Clarke & Buettcher, 2009).	Negligible
3. Cross-encoder rerank	A cross-encoder scores every fused candidate as a `(query, chunk_text)` pair jointly — seeing both at once, unlike a bi-encoder — and re-sorts to the final top-k.	~50-150 ms on CPU at small candidate counts
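
Stage 2 is small enough to show in full. This sketch assumes each retriever hands back an ordered list of chunk IDs, best first. `rrf_fuse` and its `top_n` default are hypothetical names and values; only the constant 60 comes from the table above.

```python
def rrf_fuse(rank_lists: list[list[str]], k: int = 60,
             top_n: int = 50) -> list[tuple[str, float]]:
    """Fuse ranked ID lists: each list adds 1 / (k + rank) for the IDs it contains."""
    scores: dict[str, float] = {}
    for ranked in rank_lists:
        for rank, chunk_id in enumerate(ranked[:top_n], start=1):
            scores[chunk_id] = scores.get(chunk_id, 0.0) + 1.0 / (k + rank)
    # Sort by fused score, breaking ties by ID so the output is deterministic.
    return sorted(scores.items(), key=lambda item: (-item[1], item[0]))


cosine_ranked = ["c3", "c1", "c7"]   # semantic order, best first
bm25_ranked = ["c1", "c9", "c3"]     # lexical order, best first
fused = rrf_fuse([cosine_ranked, bm25_ranked])
# c1 ranks first: it is near the top of both lists (1/62 + 1/61).
```

A chunk that appears in only one list still gets a score from that list, which is how the union of both shortlists reaches the reranker.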

The cross-encoder's output logit, passed through a sigmoid, is a better-calibrated relevance score (0-1) than raw cosine similarity, and doubles as a useful per-request metric. The highest score across a request's retrieval calls is a solid proxy for retrieval_relevance. Track it over time; a drop signals a corpus or query-drift problem before users start complaining.
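
A short sketch of that metric, assuming the reranker exposes raw logits per candidate. The function names are hypothetical, and the input lists stand in for real cross-encoder output.

```python
import math


def sigmoid(logit: float) -> float:
    """Map a raw cross-encoder logit to a 0-1 score, without overflow on large inputs."""
    if logit >= 0:
        return 1.0 / (1.0 + math.exp(-logit))
    z = math.exp(logit)
    return z / (1.0 + z)


def retrieval_relevance(logits_per_call: list[list[float]]) -> float:
    """Highest sigmoid score across all retrieval calls in one request (0.0 if none)."""
    scores = [sigmoid(x) for call in logits_per_call for x in call]
    return max(scores, default=0.0)


# Two retrieval calls in one request; the best candidate overall sets the metric.
relevance = retrieval_relevance([[-2.0, 0.0], [2.0]])
```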

Metadata filters (document type, collection, date range) should apply before ranking, not after. Filtering a top-k result set after the fact just means some queries silently return fewer than k results. Both the cosine and BM25 stages should accept the same filter and apply it to the candidate set up front.
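
The difference is easy to see on a toy candidate set. The chunk records, the precomputed `score` field, and both function names are illustrative assumptions; a real ranker would compute scores per query.

```python
def search(chunks, score_fn, k, doc_type=None):
    """Filter the candidate set first, then rank only what survives."""
    candidates = [c for c in chunks if doc_type is None or c["doc_type"] == doc_type]
    return sorted(candidates, key=score_fn, reverse=True)[:k]


def search_post_filter(chunks, score_fn, k, doc_type):
    """Anti-pattern for comparison: rank everything, cut to k, then filter."""
    top_k = sorted(chunks, key=score_fn, reverse=True)[:k]
    return [c for c in top_k if c["doc_type"] == doc_type]


def by_score(chunk):
    return chunk["score"]


chunks = [
    {"id": "contract_A#s1", "doc_type": "contract", "score": 0.9},
    {"id": "contract_B#s4", "doc_type": "contract", "score": 0.8},
    {"id": "invoice_1#lines", "doc_type": "invoice", "score": 0.6},
    {"id": "invoice_2#lines", "doc_type": "invoice", "score": 0.5},
]
pre = search(chunks, by_score, k=2, doc_type="invoice")
post = search_post_filter(chunks, by_score, k=2, doc_type="invoice")
# pre holds both invoices; post is empty because the top 2 overall were contracts.
```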

III. Agentic Design: ReWoo vs. ReAct

The Loop-Depth Problem

The default agent pattern most people reach for is ReAct: interleaved reasoning and tool calls, where each step can adapt based on what the previous step found:

Thought: I need to check all documents in this collection
Action: get_collection_documents(collection_id)
Observation: [results]
Thought: I see a low-confidence extraction, let me verify against another document
Action: search_documents("policy holder name", collection_id)
Observation: [results]
Final answer: ...

ReAct is genuinely adaptive, but it's a true loop. Cost and latency scale with however many steps the model decides it needs, and it requires a hard max_iterations guard to prevent runaway depth on an ambiguous query.
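
A bare-bones version of that loop shows where the guard sits. The `step` callable stands in for the model call, and the decision format (`{"tool": ..., "args": ...}` or `{"final": ...}`) is an assumption made for this sketch.

```python
from typing import Callable


def react_loop(query: str, step: Callable[[str, list], dict],
               tools: dict[str, Callable], max_iterations: int = 6) -> dict:
    """Run thought/action/observation steps until a final answer or the iteration cap."""
    history: list[tuple[str, object]] = []
    for i in range(1, max_iterations + 1):
        decision = step(query, history)  # one LLM call per iteration
        if "final" in decision:
            return {"answer": decision["final"], "llm_calls": i, "truncated": False}
        observation = tools[decision["tool"]](**decision.get("args", {}))
        history.append((decision["tool"], observation))
    # Cap reached: stop and say so instead of looping on an ambiguous query.
    return {"answer": None, "llm_calls": max_iterations, "truncated": True}


def never_done(query, history):
    """Stub model that always asks for another search, to exercise the guard."""
    return {"tool": "search_documents", "args": {"q": query}}


result = react_loop("ambiguous", never_done,
                    {"search_documents": lambda q: []}, max_iterations=3)
```

Returning a `truncated` flag, rather than raising, lets the caller decide whether a partial trace is still worth showing to the user.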

ReWoo: Plan Once, Execute, Solve

ReWoo (Reasoning Without Observation) produces the full tool plan upfront in a single pass, executes every tool (in parallel where possible), then synthesizes:

Plan: [get_document_inventory(), get_collection_documents(collection_id)]
Execute: both tools run in parallel
Solve: synthesize final answer from all results, with citations

This is always exactly two LLM calls, one to plan and one to solve, regardless of query complexity. Tool execution happens in between, and it's plain deterministic code, not a model call. The loop-depth problem is solved architecturally: there is no loop. The only guard needed is a cap on how many tool calls a single plan may contain (e.g. max_tools_per_plan = 5), which prevents an overly ambitious plan from burning excess tokens. That's a simpler and more principled constraint than a runtime iteration counter.
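
In code, the whole pattern fits in one function. This is a sketch under stated assumptions: `plan` and `solve` stand in for the two LLM calls, the plan format is a list of `{"tool": ..., "args": ...}` dicts, and an oversized plan is rejected rather than truncated, which is one possible policy among several.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable

MAX_TOOLS_PER_PLAN = 5


def rewoo(query: str, plan: Callable[[str], list[dict]],
          solve: Callable[[str, list], str], tools: dict[str, Callable]) -> str:
    """LLM call 1 plans, plain code executes the tools, LLM call 2 solves."""
    steps = plan(query)
    if len(steps) > MAX_TOOLS_PER_PLAN:
        raise ValueError(f"plan has {len(steps)} tool calls, cap is {MAX_TOOLS_PER_PLAN}")
    if not steps:
        return solve(query, [])
    with ThreadPoolExecutor(max_workers=len(steps)) as pool:
        futures = [pool.submit(tools[s["tool"]], **s.get("args", {})) for s in steps]
        results = [f.result() for f in futures]  # same order as the plan
    return solve(query, list(zip(steps, results)))


# Stub tools, planner, and solver so the control flow can run without a model.
tools = {
    "get_document_inventory": lambda: {"missing_types": ["id_scan"]},
    "get_collection_documents": lambda collection_id: [f"{collection_id}/contract"],
}


def plan(query):
    return [{"tool": "get_document_inventory"},
            {"tool": "get_collection_documents", "args": {"collection_id": "case_7"}}]


def solve(query, evidence):
    return f"{len(evidence)} tool results"


answer = rewoo("what's missing?", plan, solve, tools)
```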

Why this fits most document-QA workloads: query types in a bounded document corpus are more predictable than they first appear. "Are there inconsistencies in this collection?" almost always maps to fetching every document in that collection; "what's missing?" almost always maps to the inventory tool. The planner can determine the right tools upfront without needing to see intermediate results first.

ReWoo vs. ReAct: Trade-offs

	ReWoo	ReAct
LLM calls	Always 2 (plan, then solve)	Variable (1 per tool + 1 final)
Latency	Lower (parallel tool execution)	Higher (sequential)
Cost	Predictable	Variable
Adaptive discovery	No — plan is fixed	Yes — each step informs the next
Loop risk	None	Requires a max-iterations guard

ReAct remains the right upgrade path once query patterns become genuinely open-ended, where an intermediate tool result needs to change the retrieval strategy mid-query rather than just supply data to a fixed plan.

Tools: Keep Them Deterministic

Every tool the planner can call should be pure, deterministic Python with no LLM calls inside it. The planner decides what to run, the solver decides what it means, and the tool layer just executes in memory. Plain lookups return in well under a millisecond; search_documents costs whatever the hybrid search and rerank cost:

Tool	Returns	When the planner uses it
`search_documents`	Top-k chunks from hybrid search, each with ID, text, confidence, relevance score	Targeted lookups — a specific price, date, name, or clause
`get_collection_documents`	All chunks in a given collection/case/file, unranked	Coherence checks, cross-document inconsistency detection
`get_document_inventory`	Per-collection map of document types present, `missing_types`, `complete` flag	Completeness audits — "what's missing"

The get_document_inventory tool is worth calling out: it returns a completeness checklist (which document types are required per collection, and which are absent) computed directly from the extracted profiles, so the solver can answer "what's missing" without inferring absence from a list. That's a task LLMs are surprisingly unreliable at.
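
A minimal sketch of such an inventory tool, assuming each cached profile carries a `collection_id` and a `doc_type`. The required-types checklist is hypothetical, and the extra `present_types` field is an addition for readability.

```python
# Hypothetical checklist: which document types every collection must contain.
REQUIRED_TYPES = {"contract", "national_id", "proof_of_address"}


def get_document_inventory(profiles: list[dict]) -> dict[str, dict]:
    """Per-collection completeness, computed from cached profiles with no LLM call."""
    present: dict[str, set[str]] = {}
    for p in profiles:
        present.setdefault(p["collection_id"], set()).add(p["doc_type"])
    return {
        cid: {
            "present_types": sorted(types),
            "missing_types": sorted(REQUIRED_TYPES - types),
            "complete": REQUIRED_TYPES <= types,
        }
        for cid, types in present.items()
    }


profiles = [
    {"collection_id": "case_7", "doc_type": "contract"},
    {"collection_id": "case_7", "doc_type": "national_id"},
]
inventory = get_document_inventory(profiles)
# case_7 is reported as missing proof_of_address, computed by set difference.
```

Because absence is computed by set difference, the solver only has to report `missing_types`, not deduce them.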

Source Citation & Confidence Flagging

Every chunk's stable ID travels with it from retrieval through to the final answer. The solver's system prompt should require citing exactly which chunk IDs it used. That turns every answer into something a human can verify against the source in seconds:

"The buyer on file is Jane Doe.
Sources: contract_A118#section_02, id_scan_004#identity"

For chunks below an extraction-confidence threshold, the agent should flag this explicitly rather than presenting uncertain data as fact. That's non-negotiable in regulated verticals:

"The scanned ID for this file has low OCR confidence (0.41) — the extracted name may be inaccurate."

Applying This Architecture Across Verticals

The pipeline shape stays fixed: parse -> classify -> chunk by structure -> extract -> embed -> hybrid-retrieve -> plan -> execute -> solve. What changes per vertical is the chunking rules, the extraction schema, and the completeness checklist the agent reasons against:

[Figure: the RAG pipeline grouped into three phases: pre-processing (parse, classify, chunk by structure, extract), processing (embed, hybrid-retrieve), and agentic design (plan, execute, solve).]

Vertical	Typical documents	What the agent is asked
Legal	Contracts, amendments, correspondence, filings	Locate a specific clause, compare terms across contract versions, flag missing signature pages
Healthcare	Clinical notes, lab results, discharge summaries, referral letters	Summarize a patient history, surface a specific lab value, flag missing consent forms
Financial services	Loan agreements, KYC documents, statements, filings	Verify a borrower's stated income against supporting documents, check KYC completeness
Insurance	Policies, claims forms, adjuster reports, photos with OCR'd annotations	Check a claim against policy coverage terms, flag inconsistent damage descriptions
Customer support	Knowledge base articles, past tickets, product manuals	Answer a product question with citations, escalate when no matching article exists

What stays constant across all five: structure-aware chunking instead of fixed-size splitting, hybrid retrieval instead of embeddings alone, a bounded agent plan instead of an open-ended loop, and citation down to the exact section. None of that depends on what the documents are actually about.

IV. Making It Production-Grade: Observability, Testing & Iteration

The decision that mattered most

The most important realization in building a RAG system isn't a specific technical choice. It's treating retrieval quality as something you measure, not something you assume. The instinct is to tune the prompt when an answer looks wrong. That's usually the wrong layer to touch.

Instrument every request with three numbers: latency (broken down per stage: embedding, lexical search, rerank, planner LLM, solver LLM), token cost, and a retrieval-relevance score (the calibrated cross-encoder score is a good proxy). When an answer is wrong, check retrieval relevance first. A low score means the wrong chunks reached the model, and no prompt fix recovers that. A high score with a wrong answer means retrieval did its job and the problem is in synthesis, which is where prompt iteration actually helps.
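
One lightweight way to collect those numbers is a per-request object with a timing context manager. `RequestMetrics`, its field names, and the per-1k-token prices in the usage lines are all assumptions for illustration.

```python
import time
from contextlib import contextmanager


class RequestMetrics:
    """Collects per-stage latency, token counts, and retrieval relevance for one request."""

    def __init__(self) -> None:
        self.breakdown: dict[str, int] = {}
        self.input_tokens = 0
        self.output_tokens = 0
        self.retrieval_relevance = 0.0
        self._start = time.perf_counter()

    @contextmanager
    def stage(self, name: str):
        """Time one stage; repeated stages with the same name accumulate."""
        t0 = time.perf_counter()
        try:
            yield
        finally:
            elapsed_ms = round((time.perf_counter() - t0) * 1000)
            self.breakdown[name] = self.breakdown.get(name, 0) + elapsed_ms

    def to_payload(self, usd_per_1k_in: float, usd_per_1k_out: float) -> dict:
        cost = (self.input_tokens / 1000 * usd_per_1k_in
                + self.output_tokens / 1000 * usd_per_1k_out)
        return {
            "latency_ms": round((time.perf_counter() - self._start) * 1000),
            "cost_usd": round(cost, 4),
            "input_tokens": self.input_tokens,
            "output_tokens": self.output_tokens,
            "retrieval_relevance": self.retrieval_relevance,
            "breakdown": self.breakdown,
        }


m = RequestMetrics()
with m.stage("planning"):
    pass  # the planner LLM call would go here
m.input_tokens, m.output_tokens, m.retrieval_relevance = 2145, 312, 0.94
payload = m.to_payload(usd_per_1k_in=0.003, usd_per_1k_out=0.015)  # made-up prices
```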

Tracing every LLM and tool call end to end, not just logging the final answer, is what makes that diagnosis possible after the fact instead of only in a live debugging session. A tool like Langfuse works fine for this. Make it optional at the infrastructure level: if trace keys aren't configured, skip tracing silently rather than failing the request.
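
The "optional at the infrastructure level" part can be a small factory that returns a no-op tracer when keys are absent. Everything here is hypothetical: the environment variable names are invented, and `ListTracer` is an in-memory stand-in for a real tracing client, not Langfuse's API.

```python
import os
from contextlib import contextmanager


class NoOpTracer:
    """Used when tracing is not configured: same interface, does nothing."""
    enabled = False

    @contextmanager
    def span(self, name: str, **attrs):
        yield


class ListTracer:
    """Stand-in for a real tracing client; it only records span names in memory."""
    enabled = True

    def __init__(self) -> None:
        self.spans: list[str] = []

    @contextmanager
    def span(self, name: str, **attrs):
        self.spans.append(name)
        yield


def make_tracer(env=os.environ):
    # TRACE_PUBLIC_KEY / TRACE_SECRET_KEY are hypothetical variable names.
    if env.get("TRACE_PUBLIC_KEY") and env.get("TRACE_SECRET_KEY"):
        return ListTracer()
    return NoOpTracer()


tracer = make_tracer({})          # no keys configured
with tracer.span("planner_llm"):  # still safe to call; nothing is recorded
    pass
```

Request code calls `tracer.span(...)` unconditionally, so a missing key never becomes a branch in the request path.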

What Good Observability Actually Looks Like

In practice this means every response carries its own receipt. A response payload from a system I've run in production looks roughly like this:

{
  "latency_ms": 1840,
  "cost_usd": 0.0094,
  "input_tokens": 2145,
  "output_tokens": 312,
  "retrieval_relevance": 0.94,
  "breakdown": { "planning": 620, "tools": 12, "solving": 890 }
}

That single object answers "is this response trustworthy" and "what did it cost" without anyone having to go dig through logs. Quality itself splits into three separate things worth tracking separately, because they fail for different reasons and get fixed in different layers:

Retrieval relevance. Always on, free to compute, comes straight out of the reranker. Are the retrieved chunks actually close to the query?
Faithfulness. Is the answer grounded in what was retrieved, or is the model filling gaps on its own? This is the one that matters most anywhere a wrong answer has real consequences, which is most document-heavy verticals.
Answer relevance. Usually scored offline against a held-out set of queries. Does the answer actually address what was asked, independent of whether it's grounded?

Keep evaluation logic decoupled from the live request path. Scoring approaches change as you learn more about where the system fails; the pipeline serving user traffic shouldn't have to change every time the eval rubric does.

Testing What You Can't Eyeball

Retrieval bugs don't throw exceptions. A hybrid search that quietly starts returning the wrong chunks still returns a response, and it can look fine right up until someone notices the answers have gotten worse. That makes automated testing of the retrieval layer non-negotiable, not a nice-to-have.

Mock the LLM calls in tests. Run the deterministic pieces (parsing, chunking, hybrid search, tool execution) against real fixture data instead. A parser change or a ranking tweak should never need a live model call to verify. On top of that, keep a small set of representative test queries per corpus, with the chunk IDs each one is expected to retrieve, and run that set before every deploy. It's a handful of test cases, not a research benchmark, but it catches the regression that code review alone won't.
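
A pre-deploy retrieval check can be as small as the sketch below. The case format, `missing_expected`, and the stub index are assumptions; in a real suite `search` would be the actual hybrid search run against fixture data.

```python
def missing_expected(search, cases: list[dict], k: int = 5) -> dict[str, list[str]]:
    """For each test query, list expected chunk IDs absent from the top-k results."""
    failures = {}
    for case in cases:
        retrieved = {hit["id"] for hit in search(case["query"], k)}
        missing = [cid for cid in case["expected_ids"] if cid not in retrieved]
        if missing:
            failures[case["query"]] = missing
    return failures


CASES = [
    {"query": "sale price", "expected_ids": ["contract_A118#section_04"]},
    {"query": "buyer name",
     "expected_ids": ["contract_A118#section_02", "id_scan_004#identity"]},
]

# Stub standing in for the real search: a fixed mapping from query to ranked IDs.
FIXTURE_INDEX = {
    "sale price": ["contract_A118#section_04", "contract_A118#section_01"],
    "buyer name": ["contract_A118#section_02"],
}


def stub_search(query, k):
    return [{"id": cid} for cid in FIXTURE_INDEX.get(query, [])[:k]]


failures = missing_expected(stub_search, CASES)
# The second case fails here on purpose: the ID scan chunk is not retrieved.
```

In CI the assertion is simply that `failures` is empty. A non-empty dict names the query and the chunk that went missing.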

One failure mode worth testing for specifically: a fact that spans two adjacent chunks and gets missed because retrieval only surfaces one of them. Section-level chunking mostly avoids this, but not always. The workaround at query time is prompting the solver to consider adjacent sections when a claim looks incomplete. The more durable fix is a small sliding-window overlap between adjacent chunks, so the boundary itself stops being a place where information gets lost.
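
The overlap fix can be applied as a post-processing pass over already-chunked sections. This sketch measures overlap in characters for simplicity, and the function name and default are assumptions. Chunk IDs are left unchanged so citations keep pointing at the section the chunk belongs to.

```python
def add_overlap(chunks: list[dict], overlap_chars: int = 200) -> list[dict]:
    """Prefix each chunk with the tail of the previous one; IDs stay unchanged."""
    out = []
    for i, chunk in enumerate(chunks):
        # Guard overlap_chars > 0: text[-0:] would copy the whole previous chunk.
        prefix = chunks[i - 1]["text"][-overlap_chars:] if i > 0 and overlap_chars > 0 else ""
        text = f"{prefix}\n{chunk['text']}" if prefix else chunk["text"]
        out.append({**chunk, "text": text})
    return out


sections = [
    {"id": "contract_A118#section_03", "text": "The price is"},
    {"id": "contract_A118#section_04", "text": "$285,000."},
]
overlapped = add_overlap(sections, overlap_chars=8)
# section_04 now starts with the last 8 characters of section_03: "price is".
```

The trade-off is that the borrowed prefix technically belongs to the previous section, so a strict citation policy may want to index the overlapped text while displaying only the original.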

Manual scoring against a handful of test questions is a fine starting point and works for a while. Once query volume and corpus size outgrow what a person can spot-check, a framework like RAGAS for automated faithfulness and relevance scoring becomes worth the setup cost. Bringing in that kind of framework on day one is usually premature; not having a path to it by the time you need it is the more common mistake.

Key Architecture Trade-offs

Decision	Chosen	Alternative	Revisit when
Agent pattern	ReWoo (always 2 LLM calls)	ReAct (interleaved, variable)	Queries become open-ended enough that intermediate results must change strategy
Embeddings	Local sentence-transformers model	Hosted embedding API	Retrieval precision becomes the bottleneck — a one-file swap
Vector store	In-memory dot product	Dedicated vector database	Corpus outgrows the low tens of thousands of chunks, or filtered ANN latency matters
Chunk granularity	Structure-aware, per document class	Fixed-size splitting with overlap	Cross-section context is lost — add overlap between adjacent sections
Profile extraction timing	Once at ingestion, cached	At query time, per request	Corpus becomes dynamic — switch to incremental extraction on arrival
Retrieval pipeline	Hybrid: cosine + BM25 → RRF → cross-encoder rerank	Cosine-only	Latency becomes a bottleneck — swap in a lighter reranker or batch async execution
Multi-turn history	Client-managed history array	Server-side session store	Multi-device session resumption or audit logging of conversation history is needed

Frequently Asked Questions

What makes a RAG system "production-grade" instead of a demo?

A production RAG system handles messy source documents (scans, inconsistent formatting, OCR errors), chunks by document structure rather than fixed character counts, retrieves with a hybrid of lexical and semantic search rather than embeddings alone, cites its sources at the exact section level, and exposes latency, cost, and retrieval-quality metrics per request. A demo usually skips all five.

Why not just use vector similarity search on its own?

Embeddings are excellent at matching meaning but weak at exact lexical matches: account numbers, article references, product codes, proper names, and anything garbled by OCR. Combining embeddings with a lexical method like BM25 through Reciprocal Rank Fusion, then reranking the fused shortlist with a cross-encoder, consistently outperforms either method alone on real-world document corpora.

When do I need a vector database instead of an in-memory index?

For corpora in the thousands of chunks, an in-memory numpy array with a brute-force dot product is faster to build, has zero infrastructure cost, and is easier to debug than a managed vector database. The switch to Pinecone, Qdrant, or pgvector becomes worthwhile once you cross roughly 50,000-100,000 chunks, need filtered approximate nearest neighbor search at low latency, or require the index to persist and update independently of the application process.

Should the agent use ReAct or ReWoo?

ReWoo (plan once, execute all tools, then solve) is the better default when the space of query types is predictable and tool selection doesn't depend on intermediate results. It's always exactly two LLM calls, plan and solve, with deterministic tool execution in between, fully parallelizable, and has no runaway-loop risk. ReAct (interleaved reasoning and tool calls) is worth the added latency and cost only when queries are genuinely open-ended and each tool result should change the next action.

How should the system handle low-confidence extractions from scanned documents?

Carry a per-chunk confidence score (OCR symbol confidence, or an extraction model's own confidence) all the way through to the final answer. Instruct the solver to flag any chunk below a threshold explicitly in its response, rather than silently presenting uncertain data as fact. That's a hard requirement in regulated verticals like healthcare, insurance, and financial services.

How is this architecture different across verticals like legal, healthcare, or finance?

The pipeline shape stays constant: parse, classify, chunk by structure, extract structured fields, embed, hybrid-retrieve, plan, execute, solve. What changes per vertical is the chunking rules (contract articles vs. clinical note sections vs. invoice line items), the extraction schema per document type, and the completeness checklist the agent reasons against. The retrieval and agent layers are almost entirely vertical-agnostic.
