Joe Archondis

June 24, 2026 · 10 min read

AI Agents & RAG

RAG Pipeline Architecture: How to Build It Without a Framework

Q: What is the difference between RAG and fine-tuning?

Fine-tuning changes the model weights, requires training data, and does not handle frequently updated information well. RAG keeps the model fixed and injects knowledge at query time. For most business use cases, RAG is faster to build, cheaper to maintain, and handles dynamic data better.

Q: What embedding model should I use for RAG?

text-embedding-3-small from OpenAI is a good default for most production workloads. It costs $0.02 per million tokens and performs well on typical knowledge base retrieval. Run evaluation on a sample of your actual queries before upgrading to a premium model.

The first RAG system I built for production used LangChain. Three days wiring up abstractions, another day debugging why retrieval returned the wrong chunks, and an afternoon figuring out why a patch upgrade broke the pipeline silently. I rewrote it from scratch over a weekend. 120 lines of Python. It ran faster, debugged in minutes, and I understood every decision it was making.

Here's what a production RAG pipeline actually looks like under the hood, and why a framework might be the last thing you need.

What RAG Is Actually Solving

Your LLM has a training cutoff and no access to your data. RAG bridges that gap without fine-tuning the model.

The mechanism: before sending a question to the model, search your knowledge base for relevant content, then inject the most relevant pieces into the prompt. The model answers using its pretrained reasoning plus the specific material you retrieved.

Three jobs. One pipeline: encode your documents into a searchable format (offline), retrieve the most relevant chunks when a query arrives (online), generate an answer using retrieved context plus the query (generation). Frameworks haven't invented anything beyond this structure. They've wrapped it in abstractions.

The Six Components

A RAG pipeline has exactly six pieces. Every framework is an opinionated way to compose these.

Document loader reads your source data: a PDF parser, a web scraper, a database query, a Notion API call. Output is raw text.

Chunker splits that text into segments. This is where most systems go wrong.

Embedding model converts each chunk into a dense vector that encodes semantic meaning. Similar chunks produce vectors close in high-dimensional space. I use text-embedding-3-small from OpenAI for most projects. $0.02 per million tokens, and it performs well enough for production without touching the premium tiers.

Vector store indexes those vectors for fast similarity search. I use pgvector on PostgreSQL. No separate service to manage, no additional billing surface, and it supports hybrid search natively.

Retriever takes a query, embeds it, searches for nearest vectors, returns the top-k chunks. How you score and filter here determines retrieval quality more than any other single decision.

Generator assembles retrieved chunks into a context block, builds the prompt, calls the LLM. I use Claude for most of this work. The 200k context window means context overflow is rarely a limiting factor.

Why I Stopped Using LangChain

LangChain is useful for prototyping fast. For production systems that need to be debugged and tuned over months, it gets in the way.

Abstractions hide state at debugging time. When retrieval returns garbage, you need to see the actual similarity scores, the actual chunks being assembled, the actual prompt being sent. LangChain's abstractions sit between you and all of that. You end up adding verbose=True and reading through 40 lines of framework logging to find one broken value.

Silent breaking changes. LangChain releases multiple times a week. Patch upgrades change chunk assembly behavior, metadata handling, retriever scoring, often without obvious changelog entries. I've had pipelines degrade after a dependency update with no clear signal about what changed.

The core logic is small. Embedding a query is one API call. Vector search is one SQL statement. Assembling a context block is string operations with token counting. That doesn't need 15 layers of abstraction.

None of this means LangChain is wrong everywhere. If you're validating an idea in 48 hours, use it. If you're building something that runs for 12 months, own your retrieval stack.

The Production Architecture

Stack I've shipped in production:

FastAPI: HTTP layer, /query and /ingest endpoints
PostgreSQL + pgvector: vector storage with HNSW indexing
OpenAI text-embedding-3-small: 1536 dimensions
Claude via Anthropic SDK: generation with streaming enabled
Google Cloud Run: scales to zero, deploys in one command

The retrieval function:

import asyncpg
from openai import AsyncOpenAI

oai = AsyncOpenAI()

async def retrieve(query: str, pool: asyncpg.Pool, top_k: int = 6) -> list[dict]:
    result = await oai.embeddings.create(
        model="text-embedding-3-small",
        input=query
    )
    query_vector = result.data[0].embedding

    rows = await pool.fetch(
        """
        SELECT chunk_text, source, 1 - (embedding <=> $1::vector) AS score
        FROM documents
        WHERE 1 - (embedding <=> $1::vector) > 0.72
        ORDER BY embedding <=> $1::vector
        LIMIT $2
        """,
        query_vector, top_k
    )
    return [dict(r) for r in rows]

Two decisions worth explaining:

The 0.72 score threshold. Low-relevance context is worse than no context. The model tries to use whatever you give it, and marginally relevant material actively degrades answer quality. The right threshold depends on your data. I start at 0.72 and tune from there.

top_k = 6. Not a universal answer. For factual Q&A over a focused knowledge base, 3-4 chunks is usually enough. For complex questions, 8-10. Track how much of the retrieved context the model actually uses and adjust.

Streaming matters. The LLM call dominates total latency, anywhere from 500ms to 3s depending on output length. Start sending bytes to the client before the full response is ready. First-token time under 500ms feels responsive. Over 1.5 seconds feels broken, even if total generation time is identical.

Chunking Strategy

Chunking determines more of retrieval quality than your choice of embedding model or vector store. Bad chunks produce irrelevant retrievals, and no amount of downstream tuning fixes that.

Fixed-size chunking splits on character count with overlap. Fast and predictable, but it breaks sentences mid-thought. Works well for structured content where meaning is localized: log entries, database records, tabular data.

Semantic chunking splits on sentence or paragraph boundaries, then groups by semantic coherence. Better for prose. Worth the extra compute for knowledge bases built from written documents, reports, or documentation.

Sliding window adds 20-30% overlap between adjacent chunks. Context spanning a boundary appears in at least one chunk. This is the single easiest improvement if retrieval keeps missing information you know is in the document.

What I use: paragraph-boundary chunking, 512-token target size, 100-token overlap. Chunks under 100 tokens rarely carry enough context. Chunks over 1000 tokens dilute the semantic signal — the embedding starts representing too many ideas at once.

Chunk metadata is as important as chunk content. Every chunk should store its source document ID, section heading, position in the document, and creation timestamp. Without this, you can't filter by recency or source, and you lose the ability to cite answers. I store it in a metadata JSONB column alongside the vector.

What Production Actually Looks Like

Embedding freshness. When source documents change, stale vectors return outdated context. Handle this with content hashing: on ingest, compute a SHA-256 of the chunk text and check it against the stored hash. Only re-embed if the content changed. On a 10,000-document knowledge base, this cuts re-index time by over 80% on a typical update run.

Context window budgeting. Claude has a 200k token context window. That doesn't mean you should fill it. Quality degrades as you inject noise. I cap injected context at 8,000 tokens regardless of what the retriever returns. Beyond that, you're usually adding marginally relevant material that confuses more than it helps.

Latency breakdown. Embedding the query: 50-100ms. pgvector HNSW search: 5-20ms. LLM generation: 500ms to 3s. The first two are fast enough that optimizing them is premature. Stream the generation response and you address perceived latency better than any architectural change.

Evaluating retrieval quality. Most teams don't measure this, which is why most RAG systems quietly degrade. I track three signals: average cosine similarity of returned chunks, what fraction of retrieved context the model actually references, and explicit user feedback where the interface supports it. A sustained drop in average similarity usually means the knowledge base needs re-chunking or the query distribution has shifted.

Framework vs. Bare-Metal

Concern	Framework (LangChain)	Bare-Metal
Time to first prototype	Hours	Half a day
Production debugging	Hard: abstractions hide state	Easy: full visibility
Upgrade risk	High: frequent breaking changes	None: you own the code
Performance tuning	Limited by framework constraints	Full control
Lines of code	Fewer, until abstractions leak	~200 for core pipeline
Vendor lock-in	Medium: tied to LangChain versions	None
Custom retrieval logic	Hard to extend cleanly	Straightforward

Frequently Asked Questions

What is the difference between RAG and fine-tuning?

Fine-tuning changes the model's weights, requires training data, and handles frequently updated information poorly. RAG keeps the model fixed and injects knowledge at query time. For most business use cases, RAG is faster to build, cheaper to maintain, and handles dynamic data better than fine-tuning.

Do I need a vector database like Pinecone for RAG?

No. pgvector on PostgreSQL handles millions of vectors at production scale without issue. A dedicated vector database makes sense above 100 million vectors or at very high query rates. Below that threshold, pgvector with HNSW indexing is simpler, cheaper, and one fewer service to manage.

How do I handle multi-turn conversations in a RAG system?

Include recent conversation turns when building the retrieval query. Follow-up questions often reference earlier answers ("what about that last metric?"), so the retriever needs full conversation context to resolve those references. Appending the last 2-3 turns to the query before embedding handles this well in practice.

What embedding model should I use for RAG?

text-embedding-3-small from OpenAI is a good default for most production workloads. It costs $0.02 per million tokens and performs well on typical knowledge base retrieval tasks. Run evaluation on a sample of your actual queries before upgrading to a premium model.

When is RAG the wrong architecture?

When the answer requires reasoning across the entire document set rather than retrieving a specific passage. RAG retrieves; it doesn't synthesize. For analytical questions that need information synthesized from dozens of documents, pass more context directly or use a different architecture entirely.

Working on something similar?

I build AI agents and low-latency systems. If you're trying to solve a version of this, let's talk.

Get in touch

Author: Joe Archondis — AI systems engineer and HFT infrastructure builder.

Last updated: 2026-06-24