Joe Archondis

June 28, 2026 · 9 min read

AI Agents & RAG

Tool Use vs. RAG: Choosing the Right Pattern for Your Agent


I built a RAG pipeline for the ShawaMama ops bot over three months of POS records and menu data. It took four days. Three weeks in, I ripped it out and replaced it with two direct tool calls. The owner's most-asked question was "what sold best yesterday?", and I had been answering it with a vector search over documents that were 24 hours stale. Tool use returned fresh data in 800ms. The RAG pipeline had been technically working while returning wrong answers.

The mismatch wasn't a bug. It was an architecture decision made too fast. Here's the decision framework I now run before writing a single line of retrieval code.

What These Patterns Are Actually Solving

RAG and tool use both solve the same surface problem: the model needs information it doesn't have. They solve it in completely different ways.

RAG retrieves semantically relevant content from a corpus. You chunk documents, embed them, store vectors in a database, then at query time embed the query, find the nearest vectors, and inject those chunks into the model's context. The model reads what you retrieved and answers from it. The knowledge lives in the corpus; retrieval surfaces it.
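To make the index-time and query-time halves concrete, here is a minimal sketch of that flow. It assumes a toy bag-of-words `embed` function in place of a real embedding model and an in-memory list in place of a vector database; the function names and sample chunks are made up for illustration.

```python
import math
import re
from collections import Counter

def embed(text: str) -> Counter:
    """Toy stand-in for an embedding model: a bag-of-words count vector."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))

def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def build_index(chunks: list[str]) -> list[tuple[str, Counter]]:
    # Index time: embed every chunk once and store the vectors.
    return [(chunk, embed(chunk)) for chunk in chunks]

def retrieve(index: list[tuple[str, Counter]], query: str, k: int = 2) -> list[str]:
    # Query time: embed the query, rank chunks by similarity, keep the top k.
    q = embed(query)
    ranked = sorted(index, key=lambda item: cosine(q, item[1]), reverse=True)
    return [chunk for chunk, _ in ranked[:k]]

def build_prompt(index: list[tuple[str, Counter]], question: str) -> str:
    # Inject the retrieved chunks into the context the model will read.
    context = "\n".join(retrieve(index, question))
    return f"Answer using only this context:\n{context}\n\nQuestion: {question}"

chunks = [
    "Refunds for late deliveries are issued within five business days.",
    "To reset the router, hold the power button for ten seconds.",
    "Invoices are sent on the first day of each month.",
]
index = build_index(chunks)
top = retrieve(index, "How do I get a refund for a late delivery?", k=1)
```

Note that the index is built once, up front. Anything that changes in the source after `build_index` runs is invisible to `retrieve` until you rebuild it.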

Tool use lets the model call functions at runtime. You define tools as JSON schemas — name, description, parameters — and the model decides when to call them and what arguments to pass. Your code executes the function, returns the result, and the model generates a response using that live data. The knowledge lives in your systems; the tool exposes it on demand.

The critical difference: RAG answers from a fixed snapshot. Tool use answers from current state. Use RAG for data that changes hourly and you'll have an agent that confidently reports yesterday's inventory as today's. Use tool use for a 200-page manual where any section might be relevant and you'll spend weeks writing functions for every possible question type — which isn't feasible and misses the point.

Get this boundary right and retrieval quality improves immediately. Get it wrong and the system produces confident-sounding nonsense. Most production agents I've debugged fail not because the retrieval logic is broken, but because the wrong retrieval pattern was chosen at the start.

When RAG Is the Right Call

RAG wins when your source material is document-heavy, changes infrequently, and can't be bounded by a clean schema.

Knowledge base Q&A. A customer support bot over 200 pages of product documentation. Users ask anything — setup instructions, billing policies, troubleshooting steps. The relevant answer could be in any section. Writing a separate tool for every possible question type isn't feasible. RAG handles the open-ended question-to-document mapping without you predicting the queries in advance.

Internal operations manuals. Standard operating procedures, training guides, policy documents. Content that lives as prose, doesn't have a clean database schema, and updates quarterly at most. RAG ingests it as text and the model can answer questions like "what's the return policy for late deliveries?" without you writing a function for that specific query.

Research over large corpora. "Find everything we know about supplier X's delivery issues." The question spans multiple documents and requires synthesis. Tool use would need you to know in advance which records to query. RAG handles the search over the whole corpus implicitly.

One threshold that matters in practice: I use a 0.72 cosine similarity score before surfacing any retrieved chunk. Below that cutoff, the model receives marginally relevant content — and then tries to use it anyway. Marginally relevant context produces confident wrong answers. The right threshold depends on your corpus, but skipping it entirely is a mistake I made once and won't repeat.
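A minimal sketch of that cutoff, assuming your vector store already returns `(chunk_text, cosine_score)` pairs. The function name, the sample scores, and the fallback string are illustrative, not from a particular library.

```python
SIMILARITY_THRESHOLD = 0.72  # the cutoff from the text; tune it for your corpus

def filter_chunks(
    scored_chunks: list[tuple[str, float]],
    threshold: float = SIMILARITY_THRESHOLD,
) -> list[str]:
    """Keep chunks at or above the threshold, best match first."""
    kept = [(text, score) for text, score in scored_chunks if score >= threshold]
    kept.sort(key=lambda pair: pair[1], reverse=True)
    return [text for text, _ in kept]

# Made-up scores for illustration.
scored = [
    ("Return policy for late deliveries", 0.81),
    ("Holiday opening hours", 0.55),
    ("Refund timelines", 0.74),
]
relevant = filter_chunks(scored)

# If nothing clears the bar, say so instead of passing weak context to the model.
context = "\n".join(relevant) if relevant else "No relevant documentation found."
```

The empty-result branch matters as much as the filter: it gives the model an explicit "nothing found" signal rather than a marginal chunk to improvise from.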

Chunk size determines more of retrieval quality than your embedding model choice. I target 512 tokens with 100-token overlap. Chunks under 100 tokens don't carry enough context for the model to reason from. Chunks over 1,000 tokens dilute the semantic signal: the embedding starts representing too many ideas at once and retrieval accuracy drops. The overlap handles content that spans a chunk boundary: a passage no longer than the overlap that straddles two adjacent chunks appears whole in at least one of them.
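A sliding-window chunker with those numbers can be sketched as follows. It assumes the text is already tokenized into a list; in practice you would count tokens with your embedding model's tokenizer, and `chunk_tokens` is a hypothetical helper name.

```python
def chunk_tokens(tokens: list[str], size: int = 512, overlap: int = 100) -> list[list[str]]:
    """Split tokens into windows of `size`, each sharing `overlap` tokens with the previous one."""
    if not 0 <= overlap < size:
        raise ValueError("overlap must be non-negative and smaller than size")
    step = size - overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + size])
        if start + size >= len(tokens):
            break  # this window already reached the end of the document
    return chunks

# Placeholder tokens standing in for a 1,200-token document.
tokens = [f"t{i}" for i in range(1200)]
chunks = chunk_tokens(tokens)
```

With these defaults each window starts 412 tokens after the previous one, so the last 100 tokens of one chunk are the first 100 of the next.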

When Tool Use Beats RAG

Tool use wins when the data is structured, live, or bounded by a clear query schema.

Live databases. "How many orders came in this morning?" is not a retrieval problem. It's a SQL query. One tool call, one result, roughly 50ms with a properly indexed table. RAG can't answer this correctly because no embedding of yesterday's order data tells you what arrived this morning. This is the exact case that killed my ShawaMama RAG pipeline.
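As a sketch of what sits behind a tool like this, here is a parameterized query against an in-memory SQLite database. The table layout, column names, and sample rows are assumptions for illustration, not the real ShawaMama schema.

```python
import sqlite3

def get_daily_sales(conn: sqlite3.Connection, date: str, location_id: str) -> dict:
    """Return revenue, order count, and top items for one date and location."""
    revenue, order_count = conn.execute(
        "SELECT COALESCE(SUM(total), 0), COUNT(*) FROM orders "
        "WHERE order_date = ? AND location_id = ?",
        (date, location_id),
    ).fetchone()
    top_items = conn.execute(
        """SELECT item, SUM(quantity) AS qty
           FROM order_items JOIN orders ON orders.id = order_items.order_id
           WHERE order_date = ? AND location_id = ?
           GROUP BY item ORDER BY qty DESC, item LIMIT 3""",
        (date, location_id),
    ).fetchall()
    return {
        "revenue": revenue,
        "order_count": order_count,
        "top_items": [item for item, _ in top_items],
    }

# Hypothetical schema and sample rows, for illustration only.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE orders (id INTEGER PRIMARY KEY, order_date TEXT, location_id TEXT, total REAL);
    CREATE TABLE order_items (order_id INTEGER, item TEXT, quantity INTEGER);
    INSERT INTO orders VALUES
        (1, '2026-06-27', 'oberkampf', 24.0),
        (2, '2026-06-27', 'oberkampf', 11.5),
        (3, '2026-06-26', 'oberkampf', 9.0);
    INSERT INTO order_items VALUES
        (1, 'chicken shawarma', 2), (1, 'fries', 1),
        (2, 'chicken shawarma', 1), (3, 'falafel', 1);
""")
sales = get_daily_sales(conn, "2026-06-27", "oberkampf")
```

The model only ever supplies `date` and `location_id`. The SQL is fixed and parameterized, so the result reflects whatever is in the table at call time.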

External APIs. Exchange rates, shipping status, payment state, weather conditions — anything with a REST endpoint. These change continuously. There's no document to embed. Define the API call as a tool, let the model supply the right parameters, get the live answer in one round trip.

Bounded structured questions. The ShawaMama ops bot handles questions like "what was my best-selling item last week?" and "compare Saturday at Oberkampf to the previous Saturday." These have a clear structure. I know which tables to query and what parameters the model needs to supply. Each call completes in under 300ms with data updated seconds ago. A RAG pipeline over POS exports would answer the same queries with data that's hours stale — and would need re-embedding every time the database changed.

Actions, not just information. RAG informs the model. Tools give it agency. Posting a review response, updating a database record, triggering a webhook, sending a Telegram message — these aren't retrieval tasks. Tool use handles them. RAG can't.

Here's the minimal pattern for tool use with Claude. Define the schema, handle the tool_use stop reason, return the result, let the model finish:

import anthropic
import json

client = anthropic.Anthropic()

tools = [
    {
        "name": "get_daily_sales",
        "description": "Fetch total revenue, order count, and top items for a date and location",
        "input_schema": {
            "type": "object",
            "properties": {
                "date": {"type": "string", "description": "YYYY-MM-DD"},
                "location_id": {"type": "string"}
            },
            "required": ["date", "location_id"]
        }
    }
]

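def execute_tool(name: str, tool_input: dict) -> dict:
    # Placeholder dispatcher: route each tool name to your own implementation.
    raise NotImplementedError(f"No implementation wired up for tool: {name}")

def query_agent(user_message: str) -> str:
    messages = [{"role": "user", "content": user_message}]
    response = client.messages.create(
        model="claude-opus-4-8",
        max_tokens=1024,
        tools=tools,
        messages=messages
    )

    # Keep going while the model asks for tools; it may ask more than once.
    while response.stop_reason == "tool_use":
        # Every tool_use block in the turn needs a matching tool_result.
        tool_results = [
            {"type": "tool_result", "tool_use_id": block.id,
             "content": json.dumps(execute_tool(block.name, block.input))}
            for block in response.content if block.type == "tool_use"
        ]

        # Send back the full history, including the assistant turn that requested the tools.
        messages.append({"role": "assistant", "content": response.content})
        messages.append({"role": "user", "content": tool_results})
        response = client.messages.create(
            model="claude-opus-4-8",
            max_tokens=1024,
            tools=tools,
            messages=messages
        )

    # The final turn can hold several content blocks; return only the text.
    return "".join(block.text for block in response.content if block.type == "text")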

Two things worth noting here. The model decides when to call the tool — you don't force it. And the second API call includes the full conversation history including the first assistant response, so Claude has context for why it called the tool when it generates the final answer.

The Hybrid Pattern

Most production agents I've built use both, but for different jobs.

The ShawaMama ops bot runs on tool use for all data retrieval. But it has a 2,000-token system prompt containing static business context: how the owner wants performance framed, which locations he monitors most closely, what format the morning digest should follow. That context doesn't change often. Small enough to include directly in every prompt without building a retrieval layer around it.

For systems where the static context outgrows a system prompt, the hybrid splits cleanly: RAG over the document corpus for background knowledge, tool use for current state and actions. A SaaS support agent might use RAG over the help documentation and a tool call to fetch the specific customer's account data. The model combines both in the final response — one source tells it how the product works, the other tells it what's true for this specific user.

One rule for hybrid systems: don't let the two paths overlap. If RAG is pulling embeddings of database exports and a tool is querying the live database, the model will blend stale and current information in the same response. You'll get answers that mix this morning's revenue with a two-week-old inventory snapshot. The boundary needs to be explicit: RAG for documents, tools for live data. Never both over the same underlying source.
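One way to keep that boundary explicit is to declare which path owns each source and fail fast at startup if any source appears on both sides. This is a sketch with hypothetical source names, not a pattern from a specific framework.

```python
# Hypothetical source names for illustration.
RAG_SOURCES = {"help_docs", "sop_manual", "training_guides"}
TOOL_SOURCES = {"orders_db", "inventory_db", "accounts_api"}

def check_no_overlap(rag_sources: set[str], tool_sources: set[str]) -> None:
    """Raise if any underlying source is served by both RAG and tools."""
    shared = rag_sources & tool_sources
    if shared:
        raise ValueError(f"Sources served by both RAG and tools: {sorted(shared)}")

# Run once at startup, before the agent handles any query.
check_no_overlap(RAG_SOURCES, TOOL_SOURCES)
```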

Cost compounds in hybrid systems too. Each tool call is an extra API round trip. A complex query requiring three tool calls pays for one planning call, three execution calls, and one synthesis call — five total interactions. RAG is one embed call, one vector search, one LLM generation. At high volume, the hybrid cost structure matters. Profile it before committing to the architecture.
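The arithmetic is easy to sketch. The five-interaction count assumes the model requests all three tools in a single turn. If each tool call depends on the previous result, every result needs its own follow-up model call and the total rises. `hybrid_interactions` is a hypothetical helper for comparing the two cases.

```python
def hybrid_interactions(tool_calls: int, parallel: bool = True) -> dict:
    """Count model calls and tool executions for one query."""
    if tool_calls == 0:
        model_calls = 1  # the model answers directly
    elif parallel:
        model_calls = 2  # one planning call, one synthesis call
    else:
        model_calls = tool_calls + 1  # one call per tool request, plus the final answer
    return {
        "model_calls": model_calls,
        "tool_executions": tool_calls,
        "total": model_calls + tool_calls,
    }

# RAG, by the accounting in the text: one embed call, one vector search, one generation.
RAG_INTERACTIONS = {"embed_calls": 1, "vector_searches": 1, "model_calls": 1, "total": 3}

parallel = hybrid_interactions(3, parallel=True)     # all three tools requested in one turn
sequential = hybrid_interactions(3, parallel=False)  # each tool call waits on the previous result
```

Model calls are usually the expensive part, so the sequential case is the one to watch: four model calls for three chained tools, against one for plain RAG.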

Four Questions to Pick the Right Pattern

Before writing any retrieval code, I run through four questions.

Does the answer live in a document or a database? Text documents that exist as prose — RAG. Structured data with a clear schema — tool use. If the answer requires querying a table by specific parameters, building an embedding pipeline around that table is the wrong abstraction entirely.

How fresh does the answer need to be? If staleness is a problem — pricing, inventory, user state, live metrics — tool use is the only option. RAG operates over an index built at a specific point in time. Re-indexing on a schedule helps but doesn't solve real-time questions.

Can you enumerate what the model needs to fetch? Tool use requires defined function signatures. If the space of possible queries is "anything a user might ask about this 300-page manual," you can't enumerate that. RAG handles open-ended retrieval. Tool use handles known, bounded query patterns where you can specify inputs and outputs in advance.

Does the agent need to act, not just answer? If the agent sends messages, modifies records, calls webhooks, or triggers workflows, those require tools. RAG only brings information into the model's context. It cannot act on it.

Most real systems answer: some knowledge lives in documents, some data is live, some actions need to happen. That's when you build hybrid. Start with tool use for the structured and live parts, then add RAG only if you have a genuine document search problem that can't be solved by defining functions.
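The four questions can be written down as a small decision function. This is one possible encoding under simplifying assumptions (each question reduced to a yes/no flag, tool use as the default); the class and field names are made up for illustration.

```python
from dataclasses import dataclass

@dataclass
class Requirements:
    lives_in_documents: bool   # Q1: some answers live in prose documents
    needs_fresh_data: bool     # Q2: staleness is a problem
    queries_enumerable: bool   # Q3: you can list every query pattern in advance
    needs_actions: bool        # Q4: the agent must act, not just answer

def choose_pattern(req: Requirements) -> str:
    """Return 'rag', 'tool_use', or 'hybrid' for a set of requirements."""
    # RAG is only justified by open-ended questions over documents.
    needs_rag = req.lives_in_documents and not req.queries_enumerable
    # Live data and actions always require tools.
    needs_tools = req.needs_fresh_data or req.needs_actions
    if needs_rag and needs_tools:
        return "hybrid"
    if needs_rag:
        return "rag"
    return "tool_use"  # default: start with tool use

ops_bot = choose_pattern(Requirements(False, True, True, False))
manual_qa = choose_pattern(Requirements(True, False, False, False))
support_agent = choose_pattern(Requirements(True, True, False, False))
```

The three examples correspond to a live-data ops bot, Q&A over a static manual, and a support agent that needs both help documentation and a customer's current account data.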

RAG vs. Tool Use at a Glance

| Dimension | RAG | Tool Use |
| --- | --- | --- |
| Source data type | Unstructured text, documents | Structured data, APIs, databases |
| Data freshness | Snapshot at index time | Always current |
| Query flexibility | Any natural language question over the corpus | Requires pre-defined function signatures |
| Latency profile | 50–100ms embed + 5–20ms search + LLM | 50–300ms per call + LLM (extra round trip) |
| Infrastructure | Vector store, embedding pipeline | API integrations, database connections |
| Can take actions | No (retrieval only) | Yes (POST requests, DB writes, triggers) |
| Fails when | Data changes faster than re-indexing | Source is large unstructured text with open-ended queries |
| Best default for | Knowledge base Q&A, document search | Live data queries, agent actions |

Frequently Asked Questions

Can you use RAG and tool use together in the same agent?

Yes, and most production agents end up doing this. The typical split: RAG for document search and static knowledge, tool use for live data and actions. Keep the paths strictly separate. Mixing stale embedded content with live API data in the same context block produces inconsistent responses where the model can't tell which source to trust.

How does tool use work with Claude specifically?

You define tools as JSON schemas with a name, description, and input parameters. Pass them in the API call alongside your messages. When Claude decides to call a tool, it returns a tool_use content block with the function name and arguments. You execute the function, return the result as a tool_result block, and Claude generates the final response. One extra API round trip per tool call.

When should I switch from RAG to tool use?

Switch when data changes faster than your re-indexing schedule can keep up, when queries are structured and bounded rather than open-ended, or when you measure that direct database calls return fresher and more accurate results for your actual query patterns. The ShawaMama case was all three: live POS data, predictable query structure, and the indexed version was consistently 24 hours behind.

Does vector search add significant latency to RAG?

With pgvector and HNSW indexing, similarity search on 100,000 vectors takes 5–20ms. Embedding the query takes 50–100ms. Retrieval is not the bottleneck for most RAG systems — LLM generation dominates at 500ms to 3s. Streaming the generation response improves perceived latency more than any retrieval optimization. Focus there first before tuning the vector search layer.

Is RAG or tool use better for reducing hallucination?

Both reduce hallucination by grounding the model in external information. RAG's failure mode is low-quality retrieval — wrong chunks, stale content, or marginally relevant material that slips past a weak similarity threshold. Tool use's failure mode is wrong parameters or misinterpreted results. For structured queries where the answer is a specific value, well-implemented tool use produces more reliable answers because the data is exact rather than approximate.


Author: Joe Archondis — AI systems engineer and HFT infrastructure builder.

Last updated: 2026-06-28