Joe Archondis

June 29, 2026 · 9 min read

AI Agents & RAG

Building a RAG Agent Without LangChain

I built a RAG pipeline for a client using LangChain's RetrievalQA chain. Six weeks in, a patch upgrade silently changed the default text splitter behavior. Chunk sizes shifted. Retrieval scores dropped. The agent started pulling wrong context and the answers degraded — nothing threw an error. Four days passed before I caught it.

I rewrote the pipeline in about 90 lines of Python. No LangChain, no LlamaIndex, nothing between me and the retrieval logic. That was eight months ago. The code hasn't needed a change.

Here's the complete implementation: chunking, embedding, vector storage with pgvector, retrieval with a similarity threshold, and generation with Claude.

What LangChain Gets Wrong in Production

LangChain is excellent for getting to a demo fast. The chain abstractions map cleanly to the concept of retrieval-then-generation, there's a built-in component for every step, and you can have something runnable in two hours. For prototyping, it delivers.

Production is different. Tuning a RAG system means controlling chunk size, similarity thresholds, context injection format, and retrieval scoring. Seeing what's happening at each step is how you actually tune those things. LangChain's abstractions hide exactly those layers. When retrieval quality drops, you're debugging a stack of chain wrappers instead of the retrieval logic itself.

Update churn compounds this. LangChain ships fast. Interfaces change between minor versions. A document loader or chain you wrote six weeks ago may silently behave differently after an upgrade — internally changed, no breaking error thrown, retrieval quality shifted. That's the exact failure mode that hit my client's system. The library did something different, nothing told me, and the output degraded quietly.

Removing the abstraction is the fix. Write the retrieval loop yourself, in plain code, with nothing hiding the behavior you need to control. The core RAG pipeline is five steps. None of them require a framework.

The Five Steps of RAG

Every RAG pipeline does the same five things, regardless of what framework (if any) is wrapping them:

Chunk — split source documents into text segments
Embed — convert each chunk into a vector using an embedding model
Store — persist the vectors in a database that supports similarity search
Retrieve — find the top-k chunks most similar to a user's query
Generate — produce a response using the model with retrieved chunks injected as context

That's the whole system. Implemented directly, it runs in 90–150 lines using the OpenAI embeddings API, PostgreSQL with pgvector, and the Anthropic SDK. The next sections cover each step with working code.

Chunking Without a Framework

Chunking is where most RAG quality problems originate. Too small and the embedding doesn't carry enough context for accurate retrieval. Too large and you're averaging over too many ideas at once — the vector starts representing a whole section rather than a single concept, and retrieval accuracy falls.

My default: 512 words per chunk, 100-word overlap. The overlap handles content that spans a chunk boundary: any passage shorter than the overlap that straddles two adjacent chunks appears in full in at least one of them.

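def chunk_text(text: str, chunk_size: int = 512, overlap: int = 100) -> list[str]:
    # overlap must be smaller than chunk_size, or the window never advances
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be >= 0 and smaller than chunk_size")
    words = text.split()
    chunks = []
    i = 0
    while i < len(words):
        chunks.append(" ".join(words[i:i + chunk_size]))
        if i + chunk_size >= len(words):
            break  # this chunk reached the end; another step would only repeat its tail
        i += chunk_size - overlap
    return chunks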

Word-based chunking is close enough to token-based for most document types. If you're chunking dense technical content where token precision matters, swap in tiktoken — one import, the same structure. The chunk size matters more than the unit of measurement.
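To make the overlap arithmetic concrete, here is a small sketch that computes only the word offsets each chunk would cover. The helper name chunk_windows is hypothetical and not part of the pipeline; it assumes the same 512/100 defaults and stops once a window reaches the end of the document.

```python
def chunk_windows(
    n_words: int, chunk_size: int = 512, overlap: int = 100
) -> list[tuple[int, int]]:
    """Return (start, end) word offsets for each chunk; end is exclusive.

    Hypothetical helper for illustration only.
    """
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be >= 0 and smaller than chunk_size")
    step = chunk_size - overlap  # how far each new window advances
    windows = []
    start = 0
    while start < n_words:
        end = min(start + chunk_size, n_words)
        windows.append((start, end))
        if end == n_words:
            break  # the last window already reaches the end of the document
        start += step
    return windows


# A 1,200-word document with the defaults:
print(chunk_windows(1200))
# [(0, 512), (412, 924), (824, 1200)]
```

Adjacent windows share exactly 100 words (412 to 511, then 824 to 923), which is the region where a boundary-spanning passage gets a second chance to land whole.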

Chunk size affects retrieval quality more than embedding model choice. I've tested this directly: same model, same similarity threshold, half the chunk size, dramatically different recall. Start at 512, run real queries against your corpus, measure, then adjust. Don't tune the embedding model before you've locked in the chunk size.
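One way to do the "measure" step is a plain hit rate over a hand-written set of test queries. This is a minimal sketch, not the evaluation I ran: hit_rate and the shape of the results list are assumptions, and the values below are toy placeholders rather than measurements.

```python
def hit_rate(results: list[tuple[list[str], str]]) -> float:
    """Fraction of test queries whose expected source was retrieved.

    results holds one (retrieved_sources, expected_source) pair per query.
    Hypothetical helper; the pairs come from your own test queries.
    """
    if not results:
        return 0.0
    hits = sum(1 for retrieved, expected in results if expected in retrieved)
    return hits / len(results)


# Toy placeholder data: two queries, one hit and one miss.
results = [
    (["faq.pdf"], "faq.pdf"),
    (["faq.pdf"], "operations-manual.pdf"),
]
print(hit_rate(results))  # 0.5
```

Re-ingest at each candidate chunk size, run the same queries, and compare the numbers. Keeping the query set fixed is what makes the comparison meaningful.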

Embedding and Storing with pgvector

For embeddings I use OpenAI's text-embedding-3-small: 1,536 dimensions, low cost per token, consistent retrieval quality across domains. Any embedding model that returns a fixed-size vector works — the interface doesn't change.

Vectors go into PostgreSQL with the pgvector extension. No external vector database needed. If you're already running Postgres, pgvector adds similarity search in one extension install with no new service to manage.

-- Run once to set up storage
CREATE EXTENSION IF NOT EXISTS vector;

CREATE TABLE document_chunks (
    id         SERIAL PRIMARY KEY,
    source     TEXT,
    content    TEXT,
    embedding  vector(1536),
    created_at TIMESTAMP DEFAULT NOW()
);

CREATE INDEX ON document_chunks
    USING hnsw (embedding vector_cosine_ops);

The HNSW index makes approximate nearest-neighbor search fast: 5–20ms on 100,000 vectors. Skip the index and you're doing a full table scan. Fine at 500 chunks, unusable at 50,000.

import openai
import psycopg2
import json

embed_client = openai.OpenAI()

def embed_text(text: str) -> list[float]:
    response = embed_client.embeddings.create(
        input=text,
        model="text-embedding-3-small"
    )
    return response.data[0].embedding

def store_chunk(conn, source: str, content: str, embedding: list[float]):
    with conn.cursor() as cur:
        cur.execute(
            """INSERT INTO document_chunks (source, content, embedding)
               VALUES (%s, %s, %s)""",
            (source, content, json.dumps(embedding))
        )
    conn.commit()

Each function does one job. embed_text knows nothing about storage. store_chunk knows nothing about embedding. Replace either side without touching the other.

Retrieval and the Similarity Threshold

Standard retrieval: embed the query, find the top-k chunks by cosine similarity. That works. What it misses: top-k always returns something, even when nothing in the index is actually relevant to the question.

The fix is a minimum similarity score. Below 0.72, a chunk probably isn't relevant. The model will use whatever context you send it — marginally relevant chunks produce confident-sounding wrong answers. Returning nothing is better than returning wrong context.

def retrieve_chunks(
    conn,
    query: str,
    top_k: int = 5,
    min_score: float = 0.72
) -> list[dict]:
    query_embedding = embed_text(query)

    with conn.cursor() as cur:
        cur.execute("""
            SELECT
                content,
                source,
                1 - (embedding <=> %s::vector) AS score
            FROM document_chunks
            WHERE 1 - (embedding <=> %s::vector) >= %s
            ORDER BY embedding <=> %s::vector
            LIMIT %s
        """, (
            json.dumps(query_embedding),
            json.dumps(query_embedding),
            min_score,
            json.dumps(query_embedding),
            top_k
        ))
        rows = cur.fetchall()

    return [
        {"content": row[0], "source": row[1], "score": float(row[2])}
        for row in rows
    ]

The <=> operator is pgvector's cosine distance, and similarity is 1 - distance. The threshold depends on both the corpus and the embedding model, since different models produce different score distributions, so 0.72 is a starting point, not a constant. Raise it if irrelevant context is getting through. Lower it if you're missing relevant chunks.
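For intuition about what the query computes, here is an in-memory sketch of the same logic in pure Python: cosine similarity, a minimum score filter, then top-k. The function names are hypothetical, and returning 0.0 for a zero-length vector is my own choice for the sketch, not pgvector's behavior.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine similarity; pgvector's cosine distance is 1 minus this value."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # assumption for this sketch: treat zero vectors as unrelated
    return dot / (norm_a * norm_b)


def top_k_above(
    query_vec: list[float],
    chunks: list[tuple[str, list[float]]],
    top_k: int = 5,
    min_score: float = 0.72,
) -> list[tuple[float, str]]:
    """Score every chunk, drop those under min_score, return the best top_k."""
    scored = [(cosine_similarity(query_vec, vec), text) for text, vec in chunks]
    kept = [pair for pair in scored if pair[0] >= min_score]
    kept.sort(key=lambda pair: pair[0], reverse=True)
    return kept[:top_k]


chunks = [("a", [1.0, 0.0]), ("b", [0.0, 1.0]), ("c", [1.0, 1.0])]
print(top_k_above([1.0, 0.0], chunks))   # only "a" passes; "c" scores about 0.707
print(top_k_above([-1.0, 0.0], chunks))  # nothing passes, so the result is []
```

The second call is the case plain top-k gets wrong: without the filter it would still hand back its "best" matches even though none of them is relevant.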

One thing I do in production: log the scores. Every retrieval call writes the query, the matched chunks, and their scores to a table. After a week of real usage, you can see exactly where the threshold is wrong. Silent retrieval failures are how RAG systems slowly degrade without anyone noticing.
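Once the scores are logged and a sample has been reviewed by hand, a threshold sweep shows what each cutoff would have let through. This is a sketch under assumptions: sweep_thresholds is a hypothetical helper, the (score, is_relevant) labels are something you produce yourself, and the numbers below are toy values.

```python
def sweep_thresholds(
    labeled: list[tuple[float, bool]], thresholds: list[float]
) -> list[dict]:
    """Precision and recall of the score cutoff at each candidate threshold.

    labeled holds (score, is_relevant) pairs from hand-reviewed retrieval logs.
    """
    total_relevant = sum(1 for _, relevant in labeled if relevant)
    rows = []
    for t in thresholds:
        passed = [relevant for score, relevant in labeled if score >= t]
        kept_relevant = sum(passed)  # True counts as 1
        rows.append({
            "threshold": t,
            # None means the ratio is undefined (nothing passed / nothing relevant)
            "precision": kept_relevant / len(passed) if passed else None,
            "recall": kept_relevant / total_relevant if total_relevant else None,
        })
    return rows


# Toy labels for illustration only.
labeled = [(0.90, True), (0.80, True), (0.75, False), (0.60, True), (0.50, False)]
for row in sweep_thresholds(labeled, [0.70, 0.85]):
    print(row)
```

Low precision at the current threshold means irrelevant chunks are getting through. Low recall means relevant chunks are being cut.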

Generation with Claude

Retrieved chunks go into the system prompt as explicit context. The injection is deliberate — you can see exactly what the model is working from. No hidden template formatting, no framework touching the prompt before it reaches the model.

import anthropic

claude = anthropic.Anthropic()

def generate_answer(query: str, chunks: list[dict]) -> str:
    if not chunks:
        return "I don't have enough information to answer that accurately."

    context = "\n\n---\n\n".join(
        f"[Source: {c['source']} | Score: {c['score']:.3f}]\n{c['content']}"
        for c in chunks
    )

    system = f"""Answer the user's question using only the context below.
If the context doesn't contain enough information to answer, say so directly.

Context:
{context}"""

    response = claude.messages.create(
        model="claude-opus-4-8",
        max_tokens=1024,
        system=system,
        messages=[{"role": "user", "content": query}]
    )

    return response.content[0].text

Including the source and score in each context block is optional, but useful. When an answer goes wrong, you can see exactly what the model was given and whether the problem was retrieval (wrong chunks) or generation (right chunks, wrong response). Most RAG failures are retrieval failures. The scores tell you which.

The Full Agent Loop

Two functions tie the system together: one for ingesting documents at setup time, one for answering queries at runtime.

def ingest_documents(conn, documents: list[dict]):
    """documents: list of {"source": str, "text": str}"""
    for doc in documents:
        chunks = chunk_text(doc["text"])
        for chunk in chunks:
            embedding = embed_text(chunk)
            store_chunk(conn, doc["source"], chunk, embedding)


def rag_agent(conn, query: str) -> str:
    chunks = retrieve_chunks(conn, query)
    return generate_answer(query, chunks)


# Usage
if __name__ == "__main__":
    conn = psycopg2.connect("postgresql://localhost/ragdb")

    # Ingest once, or re-run when documents update
    ingest_documents(conn, [
        {"source": "operations-manual.pdf", "text": "...full document text..."},
        {"source": "faq.pdf",               "text": "...full document text..."},
    ])

    # Query at runtime
    answer = rag_agent(conn, "What is the process for handling late deliveries?")
    print(answer)

That's the full system. Ingestion runs once, or on a schedule when your documents update. The agent loop runs on every query. Nothing is hidden inside a wrapper you didn't write. Every behavior is something you control.
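The ingestion loop above makes one embedding request per chunk, which is simple but slow on large corpora. The OpenAI embeddings endpoint also accepts a list of strings as input, so a batched variant is a small change. The sketch below is an optional extension, not part of the original pipeline: batched and embed_in_batches are hypothetical names, embed_many stands in for whatever function wraps the API call, and the batch size of 64 is an arbitrary assumption.

```python
from typing import Callable, Iterator


def batched(items: list[str], size: int) -> Iterator[list[str]]:
    """Yield consecutive slices of at most `size` items."""
    if size < 1:
        raise ValueError("size must be at least 1")
    for start in range(0, len(items), size):
        yield items[start:start + size]


def embed_in_batches(
    texts: list[str],
    embed_many: Callable[[list[str]], list[list[float]]],
    batch_size: int = 64,
) -> list[list[float]]:
    """Embed texts in batches, returning one vector per text in input order.

    embed_many is assumed to return vectors in the same order as its input.
    """
    vectors: list[list[float]] = []
    for batch in batched(texts, batch_size):
        batch_vectors = embed_many(batch)
        if len(batch_vectors) != len(batch):
            raise ValueError("embed_many returned a different number of vectors")
        vectors.extend(batch_vectors)
    return vectors


# Stand-in embedder so the sketch runs without an API key.
def fake_embed_many(batch: list[str]) -> list[list[float]]:
    return [[float(len(text))] for text in batch]


print(embed_in_batches(["a", "bb", "ccc"], fake_embed_many, batch_size=2))
# [[1.0], [2.0], [3.0]]
```

Passing the embedder in as a function keeps the same separation as embed_text and store_chunk: the batching logic knows nothing about which provider sits behind it.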

Total lines across all functions: 92. No framework dependencies, no version-churn risk, no abstraction hiding the layer you need to tune. When something breaks, you know exactly where to look.

When a Framework Actually Makes Sense

There are two cases where I'll reach for LangChain or LlamaIndex.

Prototyping speed. If you need a working demo in two hours for a client call, a framework is the right shortcut. The abstraction is a liability in production, but it's a real advantage when you haven't committed to the architecture yet. Ship the prototype in LangChain, then rewrite before you deploy.

Complex agentic chains. When retrieval is one step inside a larger multi-step loop — with tool use, conditional branching, parallel execution, and memory — a framework's orchestration layer reduces boilerplate. LangGraph handles this well. Even then, I write the core retrieval logic myself and use the framework only for orchestration. The tradeoff changes when the pipeline has ten steps and three parallel paths.

For standard document Q&A, customer support bots, or knowledge base search: write it yourself. Under 150 lines, full control, easy to debug. Whatever the framework saves you up front, a week of maintenance will cost you back.

Frequently Asked Questions

Is LangChain required to build a RAG system in Python?

No. The core RAG loop is chunking, embedding, storage, vector search, and generation, and none of those steps requires a framework. The full pipeline runs in under 150 lines using the OpenAI embeddings API, pgvector, and the Anthropic SDK. Frameworks add abstraction but remove control over the exact behavior you need to tune in production.

What embedding model should I use for a production RAG system?

OpenAI's text-embedding-3-small is a strong default: 1,536 dimensions, low cost, and consistent retrieval quality. For fully local or open-source setups, nomic-embed-text via Ollama is a solid alternative. The embedding model matters less than chunk size and the similarity threshold you apply before surfacing chunks to the model.

What cosine similarity threshold should I use for retrieval?

0.72 is where I start, calibrated against real queries on each corpus. Below that, chunks are marginally relevant at best — and the model will try to answer from them anyway, producing confident wrong answers. Start at 0.72, run a sample of actual queries, and measure. Returning nothing is better than returning wrong context.

Why use pgvector instead of Pinecone or Weaviate?

If you're already running PostgreSQL, pgvector adds vector similarity search with one extension install. No extra service, no API key, no additional monthly cost. HNSW indexing keeps search at 5–20ms on 100,000 vectors. For most production RAG systems, pgvector handles the load and removes one more dependency to manage and monitor.

When does it make sense to use LangChain or LlamaIndex?

Prototyping speed and complex agentic chains. If you need a running demo in two hours or you're building a system with tool use, conditional branching, and memory layers, the framework abstractions reduce boilerplate. For standard document Q&A or knowledge base search, write the retrieval loop yourself. It's under 150 lines and you'll understand every line of it when something breaks.

Author: Joe Archondis — AI systems engineer and HFT infrastructure builder.

Last updated: 2026-06-29