I Rebuilt My RAG Pipeline from Scratch — Here's What Actually Matters
February 25, 2026 · 6 min read


I rebuilt my RAG pipeline last month. Not because it was broken — because it was outdated.

The version I had running since mid-2025 was a classic simple RAG setup: chunk documents, embed them, store in a vector database, retrieve the top-k most similar chunks when the user asks a question, feed them into the LLM as context. It worked. For about six months. Then it started failing in ways that were subtle enough to be dangerous.

The failures weren't crashes or errors. They were wrong answers delivered with high confidence. The system would retrieve semantically similar chunks that were factually irrelevant. It would miss documents that were clearly relevant but phrased differently. It would blend information from multiple sources in ways that created plausible-sounding nonsense.

Sound familiar? If you've built a RAG system, you've probably seen the same thing. Here's what I learned rebuilding mine from the ground up.

The Problem with Simple RAG

Simple RAG has a seductive elegance. Embed your documents. Retrieve the relevant ones. Generate an answer. Three steps. Clean architecture. Easy to reason about.

But the simplicity hides three fundamental problems:

Retrieval accuracy degrades with scale. When you have 100 documents, top-k retrieval works beautifully. When you have 10,000, it starts returning noise. When you have 100,000, the signal-to-noise ratio becomes unacceptable. Semantic similarity is a blunt instrument — it finds chunks that sound like the query, not chunks that answer the query.

Single-hop retrieval can't handle complex questions. "What's our revenue policy?" is a single-hop question — one retrieval pass finds the answer. But "How does our revenue policy compare to our competitor's, and what are the implications for our Q3 strategy?" requires multiple retrieval passes, cross-referencing, and synthesis. Simple RAG can't do this.

Context windows are not infinite. Even with 128K or 200K context models, you can't just dump everything in. More context doesn't mean better answers — it means more noise, higher latency, and higher cost. The art is retrieving precisely what's needed and nothing else.

What Changed: The Agentic RAG Architecture

The rebuild didn't add more retrieval. It added intelligence to the retrieval process itself.

The new architecture has three layers:

Layer 1: Hybrid Retrieval. Instead of pure vector search, I combined dense vector search (for semantic similarity) with sparse BM25 search (for keyword matching). The results are merged using reciprocal rank fusion. This catches documents that are semantically relevant AND documents that contain exact terms the vector search might miss.
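
If you haven't implemented reciprocal rank fusion before, the core of it is small. Here's a simplified Python sketch, not my exact merge code; the ranked chunk-ID lists stand in for whatever your dense and BM25 retrievers return, and k=60 is the conventional smoothing constant:

```python
from collections import defaultdict

def reciprocal_rank_fusion(result_lists, k=60):
    """Merge several ranked lists of chunk IDs into one fused ranking.

    result_lists: rankings to fuse, each an ordered list of chunk IDs
    (e.g. one from dense vector search, one from BM25).
    k: smoothing constant; 60 is the value from the original RRF paper.
    """
    scores = defaultdict(float)
    for ranking in result_lists:
        for rank, chunk_id in enumerate(ranking, start=1):
            scores[chunk_id] += 1.0 / (k + rank)
    # Highest fused score first.
    return sorted(scores, key=scores.get, reverse=True)

# Usage: fuse the top results from each retriever, keep the best ten.
# fused = reciprocal_rank_fusion([dense_ids, bm25_ids])[:10]
```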

Layer 2: Query Decomposition. Before retrieval, the system analyzes the user's question. Complex questions are decomposed into sub-queries. Each sub-query runs its own retrieval pass. The results are aggregated, deduplicated, and ranked. This is the "agentic" part — the system reasons about what it needs before it retrieves.
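
A minimal sketch of the decomposition step. The `llm` and `retriever` callables are placeholders for whatever client and hybrid-search function you already have, and the prompt is illustrative rather than the one I run in production:

```python
def retrieve_with_decomposition(question, llm, retriever, top_k=5):
    """Decompose a complex question into sub-queries, retrieve for each,
    then aggregate and deduplicate the results.

    Assumes llm(prompt) returns a string and retriever(query, top_k)
    returns chunks as dicts with an "id" key.
    """
    prompt = (
        "Break the following question into the minimal set of standalone "
        "sub-questions needed to answer it, one per line:\n" + question
    )
    sub_queries = [q.strip() for q in llm(prompt).splitlines() if q.strip()]
    if not sub_queries:
        sub_queries = [question]  # simple questions pass through unchanged

    seen, merged = set(), []
    for sq in sub_queries:
        for chunk in retriever(sq, top_k=top_k):
            if chunk["id"] not in seen:  # deduplicate across sub-queries
                seen.add(chunk["id"])
                merged.append(chunk)
    return sub_queries, merged
```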

Layer 3: Verification and Re-ranking. After retrieval, a lightweight model scores each retrieved chunk for actual relevance to the original question (not just semantic similarity). Chunks below a threshold are discarded. The remaining chunks are re-ranked by relevance. Only then are they passed to the generator.
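
The verification step is conceptually simple. A sketch, with `scorer` standing in for whatever lightweight cross-encoder or LLM grader you use, and 0.5 as an illustrative threshold:

```python
def verify_and_rerank(question, chunks, scorer, threshold=0.5):
    """Score each retrieved chunk against the *original* question,
    drop anything below the threshold, and return the rest best-first.

    Assumes scorer(question, text) returns a relevance score in [0, 1]
    and chunks are dicts with a "text" key.
    """
    scored = [(scorer(question, c["text"]), c) for c in chunks]
    kept = [(s, c) for s, c in scored if s >= threshold]
    kept.sort(key=lambda pair: pair[0], reverse=True)
    return [c for _, c in kept]
```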

The result: dramatically better answers with fewer tokens in the context window.

The Implementation Details That Matter

Chunking Strategy

I switched from fixed-size chunks (512 tokens) to semantic chunking — splitting documents at natural boundaries (paragraphs, sections, topic shifts). This preserves context within each chunk and reduces the number of chunks that contain irrelevant mixed content.

The chunk size matters less than the chunk quality. A 200-token chunk that contains one complete thought is more useful than a 512-token chunk that contains fragments of two thoughts.
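
A crude approximation of semantic chunking, just to make the idea concrete: split on paragraph boundaries and pack whole paragraphs into chunks under a rough token budget. Real topic-shift detection is more involved; this only shows the shape of it, with word count standing in for token count:

```python
def semantic_chunks(document, max_tokens=400):
    """Split on blank lines (paragraph boundaries) and pack whole
    paragraphs into chunks without crossing a rough token budget."""
    paragraphs = [p.strip() for p in document.split("\n\n") if p.strip()]
    chunks, current, current_len = [], [], 0
    for para in paragraphs:
        length = len(para.split())  # crude stand-in for a tokenizer
        if current and current_len + length > max_tokens:
            chunks.append("\n\n".join(current))
            current, current_len = [], 0
        current.append(para)
        current_len += length
    if current:
        chunks.append("\n\n".join(current))
    return chunks
```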

Embedding Model Selection

I tested three embedding models: OpenAI's text-embedding-3-large, Cohere's embed-v4, and the open-source gte-Qwen2-instruct. For my use case (technical documentation in English), gte-Qwen2 performed comparably to the commercial options at zero marginal cost.

The key insight: the embedding model matters less than the retrieval pipeline. A mediocre embedding model with hybrid retrieval and re-ranking outperforms an excellent embedding model with simple top-k retrieval.

Semantic Caching

Every query to the RAG system costs money (embedding + retrieval + generation). Semantic caching stores recent query-response pairs and checks incoming queries against them. If a new query is semantically similar enough to a cached query, the cached response is returned directly.

This reduced my LLM costs by approximately 40% for a documentation chatbot where many questions are variations of the same underlying query.
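
The mechanism is straightforward: embed the incoming query, compare it against the embeddings of cached queries, and return the stored response if the similarity clears a threshold. A toy version, where `embed` is whatever embedding function you already use and 0.92 is an illustrative cutoff:

```python
import numpy as np

class SemanticCache:
    """Return a cached answer when a new query embeds close enough
    (cosine similarity) to a previously answered one."""

    def __init__(self, embed, threshold=0.92):
        self.embed = embed
        self.threshold = threshold
        self.entries = []  # list of (normalized embedding, response)

    def lookup(self, query):
        q = np.asarray(self.embed(query), dtype=float)
        q = q / np.linalg.norm(q)
        for vec, response in self.entries:
            if float(np.dot(q, vec)) >= self.threshold:
                return response
        return None  # cache miss: run the full pipeline, then store()

    def store(self, query, response):
        v = np.asarray(self.embed(query), dtype=float)
        self.entries.append((v / np.linalg.norm(v), response))
```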

The Polystore Pattern

Here's something I didn't expect to matter as much as it did: using Postgres alongside the vector database. Not instead of — alongside.

The vector database handles semantic retrieval. But metadata, permissions, versioning, and audit logs live in Postgres. When a user asks a question, the system first checks Postgres for permission boundaries (which documents can this user access?), then queries the vector database within those boundaries.

This "polystore" architecture — relational + vector — is becoming the default for production RAG systems. It gives you semantic understanding and transactional reliability in the same pipeline.

What I'd Do Differently

Start with evaluation, not implementation. I built the v1 pipeline without a proper evaluation framework. When it started failing, I couldn't measure how badly it was failing or whether my fixes actually improved things. The v2 pipeline started with a test suite of 50 question-answer pairs that I validate against after every change.
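
The harness doesn't need to be fancy. Something like this is enough, where `grader` is whatever check you trust (exact match, keyword overlap, an LLM judge):

```python
def run_eval(qa_pairs, pipeline, grader):
    """Run every (question, expected_answer) pair through the pipeline
    and report the pass rate, returning the failures for inspection."""
    passed, failures = 0, []
    for question, expected in qa_pairs:
        answer = pipeline(question)
        if grader(answer, expected):
            passed += 1
        else:
            failures.append((question, expected, answer))
    print(f"{passed}/{len(qa_pairs)} passed")
    return failures
```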

Invest in observability early. Knowing that the system returned a bad answer is useless without knowing why. I now log every step: the original query, the decomposed sub-queries, the retrieved chunks and their scores, the re-ranking results, and the final generated answer. When something goes wrong, I can trace exactly where in the pipeline the failure occurred.
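
A simple append-only trace log is enough to start. Here's the sort of thing I mean, with illustrative stage names; each request gets an ID, and every pipeline step writes one record against it:

```python
import json
import time
import uuid

def log_stage(trace_id, stage, payload, path="rag_trace.jsonl"):
    """Append one pipeline stage (query, sub-queries, retrieved chunks,
    re-ranking, final answer) to a JSONL trace keyed by request ID."""
    record = {
        "trace_id": trace_id,
        "stage": stage,
        "timestamp": time.time(),
        "payload": payload,
    }
    with open(path, "a") as f:
        f.write(json.dumps(record) + "\n")

# Usage per request (names here are placeholders):
trace_id = str(uuid.uuid4())
log_stage(trace_id, "query", {"text": "example user question"})
log_stage(trace_id, "sub_queries", {"queries": ["sub-question 1"]})
log_stage(trace_id, "retrieval", {"chunk_ids": ["doc-42#3"]})
```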

Don't over-optimize for benchmarks. My v2 pipeline scores higher on every metric I track. But the improvement users noticed most wasn't one that showed up in benchmarks — it was the reduction in confidently wrong answers. Users tolerate "I don't know" much better than plausible misinformation.

Where RAG Is Heading

The interesting trend is "context engines" — unified platforms that handle all forms of context for AI agents: persistent memory, ephemeral context, structured data, unstructured documents, real-time data feeds. Instead of building separate pipelines for each data type, you have a single abstraction layer that the agent queries for whatever it needs.

We're not there yet. But the direction is clear: RAG is evolving from a retrieval technique into an autonomous knowledge runtime. The agent doesn't just retrieve and generate — it plans what to retrieve, verifies what it found, and governs what it shares. That's a fundamentally different architecture from "embed, retrieve, generate."

For builders, the practical advice is simple: if your RAG pipeline is more than six months old, it's probably time for a rebuild. The tooling has improved dramatically, the architectural patterns have matured, and the difference between a well-built and a poorly-built RAG system is the difference between a product people trust and a product they abandon.

Build the one people trust.