From Data Engineer to AI Engineer — Part 4: RAG — Making AI Know Your Data

Series: From Data/Software Engineer to AI Engineer Part 4 of 7 — ← Part 3: Prompt Engineering


The Problem RAG Solves

Imagine your manager asks: "Can we build a chatbot that answers questions about our internal policies?"

You try the naive approach:

# Naive approach
response = llm.ask("What is our remote work policy?")
# Output: "I don't have access to your company's specific remote work policy..."
# Or worse: a confident, completely fabricated answer

Two problems:

  1. The model does not know your data
  2. If it tries to answer anyway, it makes things up (hallucination)

RAG (Retrieval-Augmented Generation) solves both. Instead of asking the model to recall information from its weights, you retrieve the relevant information from your own data at query time and inject it into the context.

# RAG approach
relevant_docs = search_vector_db("remote work policy") # your actual docs
response = llm.ask(f"Based on: {relevant_docs}\n\nWhat is our remote work policy?")
# Output: accurate, grounded answer from your actual policy document

This is not a trick. It is the fundamental pattern for building enterprise AI systems.


The RAG Pipeline — Every Step

Your documents (PDFs, docs, web pages)
[1. Load] — read raw text
[2. Chunk] — split into small pieces
[3. Embed] — convert each chunk to a vector
[4. Store] — save vectors to a vector database
↓ (indexing complete — do once)
User asks a question
[5. Embed query] — convert question to vector
[6. Search] — find chunks with similar vectors
[7. Augment] — inject chunks into LLM prompt
[8. Generate] — LLM answers using retrieved context
[9. Respond] — return answer + source citations

Let us build this step by step.


Step 1–4: Indexing Your Documents

pip install langchain chromadb sentence-transformers pypdf

from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.document_loaders import PyPDFLoader, TextLoader
from sentence_transformers import SentenceTransformer
import chromadb
import uuid

# ── Step 1: Load documents ────────────────────────────────
def load_documents(file_path: str):
    if file_path.endswith(".pdf"):
        loader = PyPDFLoader(file_path)
    else:
        loader = TextLoader(file_path)
    return loader.load()

# ── Step 2: Chunk ────────────────────────────────────────
# This is where most engineers go wrong — chunk size matters enormously
splitter = RecursiveCharacterTextSplitter(
    chunk_size=512,    # ~128 tokens — fits comfortably in context
    chunk_overlap=64,  # overlap prevents cutting sentences mid-thought
    separators=["\n\n", "\n", ".", " "],  # try to split on natural boundaries
)

def chunk_documents(documents):
    chunks = splitter.split_documents(documents)
    print(f"Split {len(documents)} documents into {len(chunks)} chunks")
    return chunks

# ── Step 3: Embed ────────────────────────────────────────
# Local embedding model — no API cost, works offline
# 384-dimension vectors, fast, good quality for English text
embedding_model = SentenceTransformer("all-MiniLM-L6-v2")

def embed_chunks(chunks):
    texts = [chunk.page_content for chunk in chunks]
    embeddings = embedding_model.encode(texts, batch_size=32, show_progress_bar=True)
    return texts, embeddings.tolist()

# ── Step 4: Store in vector database ─────────────────────
client = chromadb.PersistentClient(path="./chroma_db")  # saves to disk
collection = client.get_or_create_collection(
    name="company_docs",
    metadata={"hnsw:space": "cosine"}  # cosine similarity (not euclidean)
)

def index_documents(file_path: str):
    docs = load_documents(file_path)
    chunks = chunk_documents(docs)
    texts, embeddings = embed_chunks(chunks)
    collection.add(
        ids=[str(uuid.uuid4()) for _ in texts],
        documents=texts,
        embeddings=embeddings,
        metadatas=[{"source": file_path, "chunk_index": i} for i in range(len(texts))]
    )
    print(f"Indexed {len(texts)} chunks from {file_path}")

# Run once to build the index
index_documents("company_remote_policy.pdf")
index_documents("company_expenses_policy.pdf")
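
The collection above is configured for cosine similarity, which compares vector direction rather than magnitude. A minimal numpy sketch (toy 3-d vectors, not real embeddings) shows the behaviour:

```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    # cosine of the angle between the two vectors: 1.0 = same direction
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

a = np.array([1.0, 2.0, 3.0])
b = np.array([2.0, 4.0, 6.0])   # same direction, different magnitude
c = np.array([3.0, -2.0, 0.0])  # nearly orthogonal to a

print(round(cosine_similarity(a, b), 3))  # 1.0 — identical direction
print(round(cosine_similarity(a, c), 3))  # close to 0 — unrelated
```

Note that Chroma reports cosine *distance* (1 − similarity), so 0.0 means a perfect match — which is why the query code later filters out chunks with distance above 0.5.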

Steps 5–9: Querying (The RAG Chain)

import anthropic

llm_client = anthropic.Anthropic()

def rag_query(question: str, n_results: int = 4) -> dict:
    # ── Step 5: Embed the query ───────────────────────────
    query_embedding = embedding_model.encode([question]).tolist()

    # ── Step 6: Retrieve similar chunks ──────────────────
    results = collection.query(
        query_embeddings=query_embedding,
        n_results=n_results,
        include=["documents", "metadatas", "distances"]
    )
    chunks = results["documents"][0]
    sources = [m["source"] for m in results["metadatas"][0]]
    distances = results["distances"][0]

    # Filter out low-relevance chunks (cosine distance > 0.5 means a poor match)
    relevant_chunks = [
        (chunk, source) for chunk, source, dist
        in zip(chunks, sources, distances)
        if dist < 0.5
    ]
    if not relevant_chunks:
        return {
            "answer": "I don't have enough relevant information to answer this question.",
            "sources": [],
            "contexts": [],
            "confidence": 0.0
        }

    # ── Step 7: Build augmented prompt ───────────────────
    context = "\n\n---\n\n".join(
        f"[Source: {src}]\n{chunk}" for chunk, src in relevant_chunks
    )

    # ── Step 8: Generate answer ───────────────────────────
    response = llm_client.messages.create(
        model="claude-sonnet-4-6",
        max_tokens=512,
        system="""You are a helpful assistant that answers questions based ONLY on the provided context.
Rules:
- Only use information from the provided context
- If the context does not contain the answer, say "The provided documents don't cover this"
- Always cite which document your answer comes from
- Be concise — answer in 3-5 sentences unless the question requires more""",
        messages=[{
            "role": "user",
            "content": f"Context from company documents:\n{context}\n\nQuestion: {question}"
        }]
    )
    answer = response.content[0].text
    unique_sources = list(set(src for _, src in relevant_chunks))

    # ── Step 9: Return answer + sources ──────────────────
    return {
        "answer": answer,
        "sources": unique_sources,
        "contexts": [chunk for chunk, _ in relevant_chunks],  # keep the raw chunks for evaluation
        "chunks_used": len(relevant_chunks),
        "confidence": 1 - distances[0]  # simple heuristic: distance of the best chunk
    }

# Usage
result = rag_query("How many days can I work from home per week?")
print(result["answer"])
# "According to the Remote Work Policy, employees can work from home up to 3 days per week..."
print(f"Sources: {result['sources']}")
# Sources: ['company_remote_policy.pdf']

Chunking: Where RAG Lives or Dies

The most common reason RAG systems fail is bad chunking. Here are the strategies:

Strategy 1: Fixed-Size (Baseline)

RecursiveCharacterTextSplitter(chunk_size=512, chunk_overlap=64)
  • Simple, predictable
  • Can cut sentences mid-thought
  • Good starting point for homogeneous text
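
The mechanics of fixed-size chunking with overlap can be seen without langchain. This toy chunker (an illustration only, not how RecursiveCharacterTextSplitter is implemented) slides a window over the text, stepping forward by chunk_size minus overlap:

```python
def chunk_fixed(text: str, chunk_size: int = 100, overlap: int = 20) -> list[str]:
    # advance by (chunk_size - overlap) so consecutive chunks share `overlap` characters
    step = chunk_size - overlap
    return [text[i:i + chunk_size] for i in range(0, len(text), step)]

chunks = chunk_fixed("x" * 250)
print(len(chunks))      # 4 — windows start at 0, 80, 160, 240
print(len(chunks[0]))   # 100
print(len(chunks[-1]))  # 10 — the short tail
```

The overlap means a sentence cut at one chunk boundary usually survives intact at the start of the next chunk, which is what the real splitter's chunk_overlap parameter buys you.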

Strategy 2: Sentence-Aware

from langchain.text_splitter import NLTKTextSplitter
splitter = NLTKTextSplitter(chunk_size=500)
  • Respects sentence boundaries
  • Better for Q&A on factual documents
  • Slightly more setup

Strategy 3: Semantic Chunking (Best Quality)

from langchain_experimental.text_splitter import SemanticChunker
from langchain_openai.embeddings import OpenAIEmbeddings

splitter = SemanticChunker(
    OpenAIEmbeddings(),
    breakpoint_threshold_type="percentile"
)
  • Splits on topic shifts, not arbitrary character counts
  • Best retrieval quality
  • Slower and more expensive

Which to use?

Document type                   | Recommended strategy
Policy documents, FAQs          | Fixed-size with overlap
Legal documents, contracts      | Sentence-aware
Mixed content (reports, books)  | Semantic chunking
Code documentation              | Fixed-size, larger chunks (1024)
Chat history                    | Message-by-message

Making Retrieval Better: The Three Upgrades

Once your baseline RAG works, these three upgrades give the biggest quality improvements:

Upgrade 1: Hybrid Search (Dense + Sparse)

Pure vector search misses exact keyword matches. Hybrid search combines:

  • Dense search (vector/semantic) — finds conceptually similar chunks
  • Sparse search (BM25/keyword) — finds exact term matches
# Using Azure AI Search hybrid search
from azure.search.documents import SearchClient
from azure.search.documents.models import VectorizedQuery

# assumes search_client = SearchClient(endpoint, index_name, credential) is already set up
results = search_client.search(
    search_text=question,  # BM25 keyword search
    vector_queries=[VectorizedQuery(
        vector=query_embedding,  # dense semantic search
        k_nearest_neighbors=4,
        fields="content_vector"
    )],
    top=4
)

When to use it: Always, if your infrastructure supports it. Hybrid search consistently outperforms pure vector search.
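
If your store does not merge the two result lists for you, a common way to fuse them is reciprocal rank fusion (RRF), which rewards documents that rank highly in *both* lists. A minimal sketch with hypothetical doc ids:

```python
def reciprocal_rank_fusion(result_lists: list[list[str]], k: int = 60) -> list[str]:
    # each input list is doc ids ranked best-first;
    # a doc's fused score is the sum over lists of 1 / (k + rank)
    scores: dict[str, float] = {}
    for results in result_lists:
        for rank, doc_id in enumerate(results, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

dense  = ["doc_a", "doc_b", "doc_c"]   # vector search ranking
sparse = ["doc_c", "doc_a", "doc_d"]   # BM25 ranking
fused = reciprocal_rank_fusion([dense, sparse])
print(fused)  # ['doc_a', 'doc_c', 'doc_b', 'doc_d']
```

doc_a and doc_c appear in both lists, so they fuse to the top; the constant k damps the advantage of a single #1 position.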

Upgrade 2: Reranker

After retrieving top-20 chunks by vector similarity, run a cross-encoder reranker to rescore and reorder them.

from sentence_transformers import CrossEncoder

reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")

def rerank(query: str, chunks: list[str], top_k: int = 4) -> list[str]:
    pairs = [(query, chunk) for chunk in chunks]
    scores = reranker.predict(pairs)
    ranked = sorted(zip(chunks, scores), key=lambda x: x[1], reverse=True)
    return [chunk for chunk, _ in ranked[:top_k]]

# Usage: retrieve 20, rerank to top 4
initial_results = retrieve(query, n=20)
final_results = rerank(query, initial_results, top_k=4)

Why it helps: Vector similarity finds broadly similar chunks. The reranker asks "is this specific chunk actually useful for answering this specific question?" — a harder, more accurate question.

Upgrade 3: Query Rewriting

Sometimes the user's raw question is not the best search query.

def rewrite_query(user_question: str) -> str:
    response = llm_client.messages.create(
        model="claude-sonnet-4-6",
        max_tokens=100,
        messages=[{
            "role": "user",
            "content": f"""
Rewrite this user question as a search query optimised for finding
relevant documents in a corporate knowledge base.
Return only the rewritten query, nothing else.

Original question: "{user_question}"
"""
        }]
    )
    return response.content[0].text.strip()
# Example
original = "How many days WFH do I get?"
rewritten = rewrite_query(original)
# "remote work policy work from home days per week allowance"

Evaluating Your RAG System

You cannot improve what you do not measure. These are the three metrics that matter:

# pip install ragas datasets
from datasets import Dataset
from ragas import evaluate
from ragas.metrics import faithfulness, answer_relevancy, context_precision

# Build a golden dataset — 20-50 known Q&A pairs with expected answers
question = "How many WFH days per week?"
result = rag_query(question)
golden_rows = [
    {
        "question": question,
        "ground_truth": "3 days per week",
        "answer": result["answer"],
        # the retrieved chunk texts, not the chunk count
        "contexts": result.get("contexts", []),
    },
    # ... more examples
]

results = evaluate(
    dataset=Dataset.from_list(golden_rows),
    metrics=[faithfulness, answer_relevancy, context_precision]
)
print(results)
# faithfulness: 0.94 (is the answer grounded in the retrieved chunks?)
# answer_relevancy: 0.89 (does the answer address the question?)
# context_precision: 0.87 (are the retrieved chunks actually relevant?)
Metric            | Acceptable | Good   | Excellent
Faithfulness      | > 0.80     | > 0.90 | > 0.95
Answer relevancy  | > 0.75     | > 0.85 | > 0.90
Context precision | > 0.70     | > 0.80 | > 0.90

If faithfulness is low: your LLM is ignoring the retrieved chunks and hallucinating — strengthen the system prompt constraints.

If context precision is low: your retrieval is poor — try hybrid search or a reranker.
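
Before reaching for a full RAGAS run, you can sanity-check retrieval on its own with a simple hit-rate metric: for each question, does an expected phrase appear anywhere in the retrieved chunks? This sketch uses a toy stand-in retriever and a hypothetical expected_phrase field:

```python
def retrieval_hit_rate(eval_set: list[dict], retrieve, top_k: int = 4) -> float:
    # fraction of questions where the expected phrase shows up in any retrieved chunk
    hits = sum(
        any(item["expected_phrase"].lower() in chunk.lower()
            for chunk in retrieve(item["question"], top_k))
        for item in eval_set
    )
    return hits / len(eval_set)

# toy retriever standing in for the vector search, purely for illustration
retrieve = lambda question, k: ["Employees may work from home up to 3 days per week."]
eval_set = [{"question": "How many WFH days?", "expected_phrase": "3 days per week"}]
print(retrieval_hit_rate(eval_set, retrieve))  # 1.0
```

It is crude (exact substring match), but it runs in seconds, needs no LLM calls, and quickly tells you whether a low faithfulness score is a retrieval problem or a generation problem.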


When RAG Is Not Enough

RAG is not the solution to every problem. Know when it fails:

Problem                          | RAG sufficient? | What to do instead
Summarise 500-page report        | No (too large)  | Hierarchical chunking + map-reduce
Answer from internal docs        | Yes             | Standard RAG
Answer from real-time data       | Partial         | RAG + live data tool call
Write in our brand voice         | No              | Fine-tune on brand examples
Complex multi-doc reasoning      | Partial         | Agents with RAG tools
Personal question ("my orders")  | No              | API tool call with user context
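
The map-reduce pattern from the first row is worth sketching: summarise each chunk independently (map), then summarise batches of summaries until one remains (reduce). Here summarise is a stand-in for an LLM call — the toy version below just keeps the first sentence, purely so the control flow is runnable:

```python
def map_reduce_summarise(chunks: list[str], summarise, batch: int = 10) -> str:
    # map: summarise each chunk independently (trivially parallelisable)
    partials = [summarise(chunk) for chunk in chunks]
    # reduce: keep summarising batches of summaries until one remains
    while len(partials) > 1:
        partials = [summarise("\n".join(partials[i:i + batch]))
                    for i in range(0, len(partials), batch)]
    return partials[0]

# toy "summariser" that keeps only the first sentence, for illustration
summarise = lambda text: text.split(".")[0] + "."
print(map_reduce_summarise(["One. Two.", "Three. Four."], summarise))  # "One."
```

Because each map call sees only one chunk, the document's total length never has to fit in a single context window — which is exactly what plain RAG cannot handle.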

The Complete RAG Stack at a Glance

Production RAG Stack:

Documents  → [Loader] → [Chunker] → [Embedder] → [Vector DB]

User query → [Query Rewriter] → [Embedder] → [Retriever] → [Reranker]
           → [Context Injector] → [LLM] → [Output Validator]
           → Answer + Sources + Confidence Score

Choosing Your Vector Database

Tool                      | Best for            | Pros                                         | Cons
ChromaDB                  | Local dev / POCs    | Zero setup, Python-native                    | Not production-scale
pgvector                  | Existing PostgreSQL | No new infra                                 | Slower ANN search
Pinecone                  | Managed production  | Simple API, scalable                         | Cost at scale
Azure AI Search           | Azure-native        | Hybrid search built-in, enterprise features  | Azure dependency
Databricks Vector Search  | Databricks shops    | Delta table-backed, auto-sync                | Databricks only
Weaviate                  | Complex schemas     | GraphQL, multi-tenancy                       | More setup

For M&S/Azure environments: Azure AI Search is the natural choice — hybrid search, private endpoints, compliance built-in.


Summary

Step     | What you do                                 | Tool
Load     | Read raw documents                          | PyPDFLoader, TextLoader
Chunk    | Split into 512-character pieces with overlap | RecursiveCharacterTextSplitter
Embed    | Convert to vectors                          | SentenceTransformer / OpenAI
Store    | Save to vector database                     | ChromaDB / pgvector / Azure AI Search
Search   | Find similar chunks to query                | Cosine similarity
Rerank   | Rescore top chunks                          | CrossEncoder
Generate | LLM answers with context                    | Claude / GPT-4o
Evaluate | Measure faithfulness + relevancy            | RAGAS

Next: Part 5 — AI Agents: Making AI Take Actions

In Part 5, we go beyond Q&A. Agents can call APIs, run code, search the web, and make multi-step decisions. You will build one from scratch.