Building a Production-Ready RAG Pipeline: From Prototype to Governed AI


The gap between a RAG demo and a RAG system

I have seen the same pattern play out multiple times. An engineer builds a convincing RAG prototype in an afternoon — documents in, questions out, answers look good. Then the enterprise adoption conversation starts and the questions shift: Who asked what? How do we know it didn't hallucinate? What if a user pastes in a customer's email address? Can we explain why it said that?

The prototype cannot answer any of those questions. This post is about building RAG the other way — governance-first — so that when those questions come, you already have the answers.

Architecture overview

┌──────────────┐     ┌──────────────┐     ┌─────────────────┐
│  Documents   │────▶│    Ingest    │────▶│    ChromaDB     │
│  (txt/pdf)   │     │  chunk+embed │     │  (vector store) │
└──────────────┘     └──────────────┘     └────────┬────────┘
                                                   │
┌──────────────┐     ┌──────────────┐     ┌────────▼───────┐
│  User Query  │────▶│  Governance  │────▶│   Retriever    │
└──────────────┘     │    Layer     │     └────────┬───────┘
                     └──────┬───────┘              │
                            ▼                      ▼
                     ┌──────────────┐     ┌────────────────┐
                     │  Audit Logs  │     │  Claude (LLM)  │
                     │   (JSONL)    │     └────────┬───────┘
                     └──────────────┘              │
                                          ┌────────▼───────┐
                                          │   Response     │
                                          │  + metadata    │
                                          └────────────────┘

Two pipelines run separately:

  1. Ingestion (offline) — load documents, chunk, embed, store in ChromaDB
  2. Query (real-time) — embed question, retrieve chunks, call LLM, return governed response

Keeping them separate is important. It means you can re-index documents without touching the serving layer, and you can update the LLM without re-embedding everything.
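The separation can be sketched end to end with an in-memory list standing in for ChromaDB and a deliberately fake `embed` stub (both are illustrative, not the POC's code):

```python
# In-memory stand-in for ChromaDB; the vector store is the only shared state.
STORE: list[dict] = []

def embed(text: str) -> list[float]:
    """Fake embedding stub; the real pipeline uses sentence-transformers here."""
    return [float(len(text))]

def ingest(docs: list[str]) -> None:
    """Offline pipeline: embed and store. Re-runnable without touching serving."""
    for doc in docs:
        STORE.append({"text": doc, "vector": embed(doc)})

def retrieve(question: str, k: int = 3) -> list[str]:
    """Real-time pipeline: embed the question, return the nearest chunks."""
    qv = embed(question)
    ranked = sorted(STORE, key=lambda r: abs(r["vector"][0] - qv[0]))
    return [r["text"] for r in ranked[:k]]
```

Because `ingest` and `retrieve` touch only the store, re-indexing never disturbs the serving path; that is the property the real pipelines preserve with ChromaDB in the middle.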

Chunking strategy: the decision most people skip

Chunking is where most RAG implementations quietly go wrong. Chunks that are too large make retrieval imprecise; chunks that are too small strip away context. The right answer depends on your documents.

I use RecursiveCharacterTextSplitter from LangChain with 512-token chunks and 64-token overlap:

from langchain.text_splitter import RecursiveCharacterTextSplitter

# Note: chunk_size is measured in characters by default; supply a token-based
# length_function (or build the splitter via from_tiktoken_encoder) if you
# want true token counts.
splitter = RecursiveCharacterTextSplitter(
    chunk_size=512,
    chunk_overlap=64,
    separators=["\n\n", "\n", ". ", " ", ""],
)

The separators list matters. It tries to split on paragraph breaks first, then line breaks, then sentences. This preserves semantic units and avoids splitting mid-sentence. The 64-token overlap ensures that context spanning a chunk boundary is not lost.
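LangChain's implementation also merges adjacent pieces back up to the chunk size and applies the overlap; stripped of those refinements, the separator-fallback idea looks like this (a simplified, character-based sketch):

```python
def recursive_split(text: str, chunk_size: int, separators: list[str]) -> list[str]:
    """Split on the coarsest separator first, recursing to finer ones."""
    if len(text) <= chunk_size:
        return [text] if text else []
    if not separators or separators[0] == "":
        # Out of separators: hard-cut at chunk_size.
        return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]
    sep, rest = separators[0], separators[1:]
    chunks = []
    for piece in text.split(sep):
        if len(piece) > chunk_size:
            # Piece still too large: retry with the next, finer separator.
            chunks.extend(recursive_split(piece, chunk_size, rest))
        elif piece:
            chunks.append(piece)
    return chunks
```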

Strategy              Chunk size     Overlap      Best for
Recursive character   512 tokens     64 tokens    Mixed prose/code
Sentence-level        ~3 sentences   1 sentence   Precise Q&A
Semantic chunking     Variable       None         Domain-specific corpora

For a retail knowledge base (policies, product descriptions, FAQs), the recursive approach with 512/64 is a solid default. For clinical documents or legal contracts, sentence-level chunking usually performs better.
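Sentence-level chunking needs no library at all. A naive sketch using a regex sentence boundary (a real system should use a proper sentence tokeniser):

```python
import re

def sentence_chunks(text: str, size: int = 3, overlap: int = 1) -> list[str]:
    """Overlapping windows of whole sentences (naive boundary detection)."""
    # Split after ., ! or ? followed by whitespace; keeps the punctuation.
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
    step = size - overlap
    return [" ".join(sentences[i:i + size])
            for i in range(0, len(sentences), step)
            if sentences[i:i + size]]
```

The trailing window may be shorter than `size`; whether to keep or drop it is a corpus-specific call.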

Embeddings: local vs API

I deliberately chose sentence-transformers (all-MiniLM-L6-v2) over an API-based embedding model. There are three reasons:

  1. No data leaves the system during ingestion — critical for documents that may contain commercially sensitive information
  2. Zero per-request cost — embedding 100,000 chunks costs nothing after the initial download
  3. Sufficient quality — for semantic search over domain documents, MiniLM-L6-v2 is competitive with API models such as text-embedding-ada-002 on many retrieval benchmarks

The trade-off is that domain-specific embedding models (fine-tuned on retail or clinical text) will outperform general-purpose models. That is a Track B improvement once you have production volume.
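Whatever model produces the vectors, retrieval itself reduces to nearest-neighbour search by cosine similarity. ChromaDB does this at scale; the core operation is small enough to write out in plain Python:

```python
import math

def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def top_k(query_vec: list[float], chunk_vecs: list[list[float]], k: int = 3) -> list[int]:
    """Indices of the k chunks most similar to the query vector."""
    ranked = sorted(range(len(chunk_vecs)),
                    key=lambda i: cosine(query_vec, chunk_vecs[i]),
                    reverse=True)
    return ranked[:k]
```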

The governance layer

This is the part I care most about. Four hooks fire on every request:

1. PII detection (pre-query)

import re

_PII_PATTERNS = [
    (re.compile(r"\b[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}\b"), "<EMAIL>"),
    (re.compile(r"\b(?:\+44\s?7\d{3}|\(?07\d{3}\)?)\s?\d{3}\s?\d{3}\b"), "<UK_PHONE>"),
    (re.compile(r"\b\d{4}[- ]?\d{4}[- ]?\d{4}[- ]?\d{4}\b"), "<CARD_NUMBER>"),
]

def mask_pii(text: str) -> str:
    # Replace each PII match with its placeholder label, e.g.
    # "mail jane@example.com" -> "mail <EMAIL>"
    for pattern, label in _PII_PATTERNS:
        text = pattern.sub(label, text)
    return text

The question is masked before it ever reaches the LLM. In production, swap this for Microsoft Presidio, which provides recognisers for a far wider range of entity types and languages, including NI numbers, NHS numbers, and passport numbers.

2. Audit logging (every request)

import json
import uuid
from datetime import datetime, timezone

def audit_log(event, query, response=None, metadata=None, user_id="anonymous"):
    record = {
        "id": str(uuid.uuid4()),
        "ts": datetime.now(timezone.utc).isoformat(),
        "event": event,
        "user_id": user_id,
        "query_length": len(query),
        "response_length": len(response) if response else 0,
        **(metadata or {}),
    }
    with open(cfg.audit_log_file, "a") as f:
        f.write(json.dumps(record) + "\n")

Every query and every response gets a UUID and a timestamp. The log stores only the query's length, never its raw text; anything else worth recording goes through mask_pii first. This is intentional: GDPR's data-minimisation principle applies to audit systems as much as to application data.

In production, pipe this JSONL stream to Splunk, Datadog, or Azure Monitor. You want to be able to answer "which users queried about X last Tuesday" without touching application code.
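Even before a log shipper is in place, the JSONL file is queryable with a few lines of Python. For example, tallying events per user (field names as produced by audit_log):

```python
import json
from collections import Counter

def events_per_user(log_path: str) -> Counter:
    """Tally audit events by user_id from a JSONL audit log."""
    counts: Counter = Counter()
    with open(log_path) as f:
        for line in f:
            if line.strip():
                record = json.loads(line)
                counts[record.get("user_id", "anonymous")] += 1
    return counts
```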

3. Confidence scoring (post-generation)

# Example signal phrases; the POC's actual list may differ.
_UNCERTAINTY_SIGNALS = ("i don't know", "i cannot find", "not sure", "unclear")

def score_confidence(response: str, source_chunks: list[str]) -> float:
    response_lower = response.lower()
    uncertainty_hits = sum(1 for s in _UNCERTAINTY_SIGNALS if s in response_lower)
    base_score = max(0.0, 1.0 - (uncertainty_hits * 0.2))
    # Boost if response tokens overlap with source chunks
    response_tokens = set(response_lower.split())
    source_tokens = set(" ".join(source_chunks).lower().split())
    if source_tokens:
        overlap = len(response_tokens & source_tokens) / max(len(response_tokens), 1)
        base_score = min(1.0, base_score + overlap * 0.3)
    return round(base_score, 2)

This is a heuristic. A production confidence scorer uses semantic similarity between the response and retrieved chunks — but the heuristic catches the easy cases: when the LLM says "I don't know" or "I cannot find", the score drops and a disclaimer is shown.
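How the score is consumed matters as much as how it is computed. A sketch of the gating step, with a hypothetical threshold of 0.5:

```python
# Hypothetical cut-off; tune against labelled examples in practice.
CONFIDENCE_THRESHOLD = 0.5

def with_disclaimer(answer: str, confidence: float) -> str:
    """Attach a visible disclaimer to low-confidence answers."""
    if confidence < CONFIDENCE_THRESHOLD:
        return answer + "\n\n[Low confidence: please verify against the cited sources.]"
    return answer
```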

4. Source attribution (always)

Every response includes the source documents it was built from. This is non-negotiable in enterprise deployments. Users need to be able to verify the answer, and you need to be able to explain it if challenged.
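One way to make attribution structurally unavoidable is to bake it into the response type, so that a response without sources cannot even be constructed. A sketch (names are illustrative, not from the POC):

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class GovernedResponse:
    answer: str
    sources: tuple[str, ...]  # document IDs the answer was built from
    confidence: float
    request_id: str           # ties the response back to the audit log

    def __post_init__(self):
        if not self.sources:
            raise ValueError("a response must cite at least one source")
```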

Latency vs quality trade-offs

The main levers are:

Lever             Faster                      More accurate
Embedding model   MiniLM (local)              Large API model
Top-k retrieval   k=3                         k=10 with re-ranking
LLM               Haiku                       Opus/Sonnet
Chunk size        Larger (fewer API calls)    Smaller (more precise)

For a customer-facing feature with a < 3-second SLA, I would use: MiniLM embeddings, k=5, Claude Haiku, 512-token chunks. For an internal analyst tool where accuracy matters more than speed, I would use re-ranking and Sonnet.
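These lever settings are worth freezing into named profiles rather than scattering them as constants. A sketch, with placeholder model identifiers:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class RagProfile:
    embedding_model: str
    top_k: int
    llm: str      # model identifiers below are placeholders, not real API names
    rerank: bool

# Customer-facing feature: optimise for the < 3-second SLA.
FAST = RagProfile("all-MiniLM-L6-v2", top_k=5, llm="claude-haiku", rerank=False)

# Internal analyst tool: accuracy over speed.
ACCURATE = RagProfile("all-MiniLM-L6-v2", top_k=10, llm="claude-sonnet", rerank=True)
```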

What "well-governed AI" means in practice

Enterprise stakeholders use the term loosely. Here is what it actually requires:

  1. Traceability — every output can be traced to its inputs (sources, user, timestamp)
  2. Containment — PII and sensitive data do not leak into logs or external APIs
  3. Uncertainty communication — the system does not present low-confidence answers with the same confidence as high-confidence ones
  4. Human escalation path — when confidence is low or the question is out of scope, there is a defined path to a human expert
  5. Model cards — documented description of what the model can and cannot do, trained on what data, known failure modes

The first three are implemented in this POC. The last two are organisational decisions that no amount of code can substitute for.

What comes next

This POC demonstrates the core pipeline. A production system would add:

  • Re-ranking with a cross-encoder to improve precision at the cost of latency
  • Presidio integration for production-grade PII handling
  • MLflow model registry for tracking which embedding model and LLM version is in production
  • Drift monitoring — track retrieval quality over time as the document corpus evolves
  • A/B testing framework for comparing chunking strategies or LLM versions without a full deployment

The full source code is in my learning POCs repository under vijay_learn_pocs/genai_rag_poc/.


The core insight here is that RAG governance is not a feature you add later. By the time you are dealing with an incident — a hallucinated policy, a leaked email address, a user who cannot get a straight answer — it is too late to retrofit traceability. Build the audit log on day one, even if you never look at it.