From Data Engineer to AI Engineer — Part 4: RAG — Making AI Know Your Data

Series: From Data/Software Engineer to AI Engineer Part 4 of 7 — ← Part 3: Prompt Engineering


The Problem RAG Solves

Imagine your manager asks: "Can we build a chatbot that answers questions about our internal policies?"

You try the naive approach:

# Naive approach
response = llm.ask("What is our remote work policy?")
# Output: "I don't have access to your company's specific remote work policy..."
# Or worse: a confident, completely fabricated answer

Two problems:

  1. The model does not know your data
  2. If it tries to answer anyway, it makes things up (hallucination)

RAG (Retrieval-Augmented Generation) solves both. Instead of asking the model to recall information from its weights, you retrieve the relevant information from your own data at query time and inject it into the context.

# RAG approach
relevant_docs = search_vector_db("remote work policy") # your actual docs
response = llm.ask(f"Based on: {relevant_docs}\n\nWhat is our remote work policy?")
# Output: accurate, grounded answer from your actual policy document

This is not a trick. It is the fundamental pattern for building enterprise AI systems.


The RAG Pipeline — Every Step

Your documents (PDFs, docs, web pages)
[1. Load] — read raw text
[2. Chunk] — split into small pieces
[3. Embed] — convert each chunk to a vector
[4. Store] — save vectors to a vector database
↓ (indexing complete — do once)
User asks a question
[5. Embed query] — convert question to vector
[6. Search] — find chunks with similar vectors
[7. Augment] — inject chunks into LLM prompt
[8. Generate] — LLM answers using retrieved context
[9. Respond] — return answer + source citations

Let us build this step by step.


Step 1–4: Indexing Your Documents

pip install langchain chromadb sentence-transformers pypdf

from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.document_loaders import PyPDFLoader, TextLoader
from sentence_transformers import SentenceTransformer
import chromadb
import uuid

# ── Step 1: Load documents ────────────────────────────────
def load_documents(file_path: str):
    if file_path.endswith(".pdf"):
        loader = PyPDFLoader(file_path)
    else:
        loader = TextLoader(file_path)
    return loader.load()

# ── Step 2: Chunk ────────────────────────────────────────
# This is where most engineers go wrong — chunk size matters enormously
splitter = RecursiveCharacterTextSplitter(
    chunk_size=512,    # ~128 tokens — fits comfortably in context
    chunk_overlap=64,  # overlap prevents cutting sentences mid-thought
    separators=["\n\n", "\n", ".", " "],  # try to split on natural boundaries
)

def chunk_documents(documents):
    chunks = splitter.split_documents(documents)
    print(f"Split {len(documents)} documents into {len(chunks)} chunks")
    return chunks

# ── Step 3: Embed ────────────────────────────────────────
# Local embedding model — no API cost, works offline
# 384-dimension vectors, fast, good quality for English text
embedding_model = SentenceTransformer("all-MiniLM-L6-v2")

def embed_chunks(chunks):
    texts = [chunk.page_content for chunk in chunks]
    embeddings = embedding_model.encode(texts, batch_size=32, show_progress_bar=True)
    return texts, embeddings.tolist()

# ── Step 4: Store in vector database ─────────────────────
client = chromadb.PersistentClient(path="./chroma_db")  # saves to disk
collection = client.get_or_create_collection(
    name="company_docs",
    metadata={"hnsw:space": "cosine"}  # cosine similarity (not euclidean)
)

def index_documents(file_path: str):
    docs = load_documents(file_path)
    chunks = chunk_documents(docs)
    texts, embeddings = embed_chunks(chunks)
    collection.add(
        ids=[str(uuid.uuid4()) for _ in texts],
        documents=texts,
        embeddings=embeddings,
        metadatas=[{"source": file_path, "chunk_index": i} for i in range(len(texts))]
    )
    print(f"Indexed {len(texts)} chunks from {file_path}")

# Run once to build the index
index_documents("company_remote_policy.pdf")
index_documents("company_expenses_policy.pdf")
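
The collection above is configured for cosine similarity, which compares vector direction rather than magnitude. A minimal numpy sketch (toy 3-d vectors, not real embeddings) shows the behaviour:

```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    # cosine of the angle between the two vectors: 1.0 = same direction
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

a = np.array([1.0, 2.0, 3.0])
b = np.array([2.0, 4.0, 6.0])   # same direction, different magnitude
c = np.array([3.0, -2.0, 0.0])  # nearly orthogonal to a

print(round(cosine_similarity(a, b), 3))  # 1.0 — identical direction
print(round(cosine_similarity(a, c), 3))  # close to 0 — unrelated
```

Note that Chroma reports cosine *distance* (1 − similarity), so 0.0 means a perfect match — which is why the query code later filters out chunks with distance above 0.5.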

Steps 5–9: Querying (The RAG Chain)

import anthropic

llm_client = anthropic.Anthropic()

def rag_query(question: str, n_results: int = 4) -> dict:
    # ── Step 5: Embed the query ───────────────────────────
    query_embedding = embedding_model.encode([question]).tolist()

    # ── Step 6: Retrieve similar chunks ──────────────────
    results = collection.query(
        query_embeddings=query_embedding,
        n_results=n_results,
        include=["documents", "metadatas", "distances"]
    )
    chunks = results["documents"][0]
    sources = [m["source"] for m in results["metadatas"][0]]
    distances = results["distances"][0]

    # Filter out low-relevance chunks (cosine distance > 0.5 means a poor match)
    relevant_chunks = [
        (chunk, source) for chunk, source, dist
        in zip(chunks, sources, distances)
        if dist < 0.5
    ]
    if not relevant_chunks:
        return {
            "answer": "I don't have enough relevant information to answer this question.",
            "sources": [],
            "contexts": [],
            "confidence": 0.0
        }

    # ── Step 7: Build augmented prompt ───────────────────
    context = "\n\n---\n\n".join(
        f"[Source: {src}]\n{chunk}" for chunk, src in relevant_chunks
    )

    # ── Step 8: Generate answer ───────────────────────────
    response = llm_client.messages.create(
        model="claude-sonnet-4-6",
        max_tokens=512,
        system="""You are a helpful assistant that answers questions based ONLY on the provided context.
Rules:
- Only use information from the provided context
- If the context does not contain the answer, say "The provided documents don't cover this"
- Always cite which document your answer comes from
- Be concise — answer in 3-5 sentences unless the question requires more""",
        messages=[{
            "role": "user",
            "content": f"Context from company documents:\n{context}\n\nQuestion: {question}"
        }]
    )
    answer = response.content[0].text
    unique_sources = list(set(src for _, src in relevant_chunks))

    # ── Step 9: Return answer + sources ──────────────────
    return {
        "answer": answer,
        "sources": unique_sources,
        "contexts": [chunk for chunk, _ in relevant_chunks],  # keep the raw chunks for evaluation
        "chunks_used": len(relevant_chunks),
        "confidence": 1 - distances[0]  # simple heuristic: distance of the best chunk
    }

# Usage
result = rag_query("How many days can I work from home per week?")
print(result["answer"])
# "According to the Remote Work Policy, employees can work from home up to 3 days per week..."
print(f"Sources: {result['sources']}")
# Sources: ['company_remote_policy.pdf']

Chunking: Where RAG Lives or Dies

The most common reason RAG systems fail is bad chunking. Here are the strategies:

Strategy 1: Fixed-Size (Baseline)

RecursiveCharacterTextSplitter(chunk_size=512, chunk_overlap=64)
  • Simple, predictable
  • Can cut sentences mid-thought
  • Good starting point for homogeneous text
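
The mechanics of fixed-size chunking with overlap can be seen without langchain. This toy chunker (an illustration only, not how RecursiveCharacterTextSplitter is implemented) slides a window over the text, stepping forward by chunk_size minus overlap:

```python
def chunk_fixed(text: str, chunk_size: int = 100, overlap: int = 20) -> list[str]:
    # advance by (chunk_size - overlap) so consecutive chunks share `overlap` characters
    step = chunk_size - overlap
    return [text[i:i + chunk_size] for i in range(0, len(text), step)]

chunks = chunk_fixed("x" * 250)
print(len(chunks))      # 4 — windows start at 0, 80, 160, 240
print(len(chunks[0]))   # 100
print(len(chunks[-1]))  # 10 — the short tail
```

The overlap means a sentence cut at one chunk boundary usually survives intact at the start of the next chunk, which is what the real splitter's chunk_overlap parameter buys you.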

Strategy 2: Sentence-Aware

from langchain.text_splitter import NLTKTextSplitter
splitter = NLTKTextSplitter(chunk_size=500)
  • Respects sentence boundaries
  • Better for Q&A on factual documents
  • Slightly more setup

Strategy 3: Semantic Chunking (Best Quality)

from langchain_experimental.text_splitter import SemanticChunker
from langchain_openai.embeddings import OpenAIEmbeddings

splitter = SemanticChunker(
    OpenAIEmbeddings(),
    breakpoint_threshold_type="percentile"
)
  • Splits on topic shifts, not arbitrary character counts
  • Best retrieval quality
  • Slower and more expensive

Which to use?

Document type                   | Recommended strategy
Policy documents, FAQs          | Fixed-size with overlap
Legal documents, contracts      | Sentence-aware
Mixed content (reports, books)  | Semantic chunking
Code documentation              | Fixed-size, larger chunks (1024)
Chat history                    | Message-by-message

Making Retrieval Better: The Three Upgrades

Once your baseline RAG works, these three upgrades give the biggest quality improvements:

Upgrade 1: Hybrid Search (Dense + Sparse)

Pure vector search misses exact keyword matches. Hybrid search combines:

  • Dense search (vector/semantic) — finds conceptually similar chunks
  • Sparse search (BM25/keyword) — finds exact term matches
# Using Azure AI Search hybrid search
from azure.search.documents import SearchClient
from azure.search.documents.models import VectorizedQuery

# assumes search_client = SearchClient(endpoint, index_name, credential) is already set up
results = search_client.search(
    search_text=question,  # BM25 keyword search
    vector_queries=[VectorizedQuery(
        vector=query_embedding,  # dense semantic search
        k_nearest_neighbors=4,
        fields="content_vector"
    )],
    top=4
)

When to use it: Always, if your infrastructure supports it. Hybrid search consistently outperforms pure vector search.
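
If your store does not merge the two result lists for you, a common way to fuse them is reciprocal rank fusion (RRF), which rewards documents that rank highly in *both* lists. A minimal sketch with hypothetical doc ids:

```python
def reciprocal_rank_fusion(result_lists: list[list[str]], k: int = 60) -> list[str]:
    # each input list is doc ids ranked best-first;
    # a doc's fused score is the sum over lists of 1 / (k + rank)
    scores: dict[str, float] = {}
    for results in result_lists:
        for rank, doc_id in enumerate(results, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

dense  = ["doc_a", "doc_b", "doc_c"]   # vector search ranking
sparse = ["doc_c", "doc_a", "doc_d"]   # BM25 ranking
fused = reciprocal_rank_fusion([dense, sparse])
print(fused)  # ['doc_a', 'doc_c', 'doc_b', 'doc_d']
```

doc_a and doc_c appear in both lists, so they fuse to the top; the constant k damps the advantage of a single #1 position.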

Upgrade 2: Reranker

After retrieving top-20 chunks by vector similarity, run a cross-encoder reranker to rescore and reorder them.

from sentence_transformers import CrossEncoder

reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")

def rerank(query: str, chunks: list[str], top_k: int = 4) -> list[str]:
    pairs = [(query, chunk) for chunk in chunks]
    scores = reranker.predict(pairs)
    ranked = sorted(zip(chunks, scores), key=lambda x: x[1], reverse=True)
    return [chunk for chunk, _ in ranked[:top_k]]

# Usage: retrieve 20, rerank to top 4
initial_results = retrieve(query, n=20)
final_results = rerank(query, initial_results, top_k=4)

Why it helps: Vector similarity finds broadly similar chunks. The reranker asks "is this specific chunk actually useful for answering this specific question?" — a harder, more accurate question.

Upgrade 3: Query Rewriting

Sometimes the user's raw question is not the best search query.

def rewrite_query(user_question: str) -> str:
    response = llm_client.messages.create(
        model="claude-sonnet-4-6",
        max_tokens=100,
        messages=[{
            "role": "user",
            "content": f"""
Rewrite this user question as a search query optimised for finding
relevant documents in a corporate knowledge base.
Return only the rewritten query, nothing else.

Original question: "{user_question}"
"""
        }]
    )
    return response.content[0].text.strip()
# Example
original = "How many days WFH do I get?"
rewritten = rewrite_query(original)
# "remote work policy work from home days per week allowance"

Evaluating Your RAG System

You cannot improve what you do not measure. These are the three metrics that matter:

# pip install ragas datasets
from datasets import Dataset
from ragas import evaluate
from ragas.metrics import faithfulness, answer_relevancy, context_precision

# Build a golden dataset — 20-50 known Q&A pairs with expected answers
question = "How many WFH days per week?"
result = rag_query(question)
golden_rows = [
    {
        "question": question,
        "ground_truth": "3 days per week",
        "answer": result["answer"],
        # the retrieved chunk texts, not the chunk count
        "contexts": result.get("contexts", []),
    },
    # ... more examples
]

results = evaluate(
    dataset=Dataset.from_list(golden_rows),
    metrics=[faithfulness, answer_relevancy, context_precision]
)
print(results)
# faithfulness: 0.94 (is the answer grounded in the retrieved chunks?)
# answer_relevancy: 0.89 (does the answer address the question?)
# context_precision: 0.87 (are the retrieved chunks actually relevant?)
Metric            | Acceptable | Good   | Excellent
Faithfulness      | > 0.80     | > 0.90 | > 0.95
Answer relevancy  | > 0.75     | > 0.85 | > 0.90
Context precision | > 0.70     | > 0.80 | > 0.90

If faithfulness is low: your LLM is ignoring the retrieved chunks and hallucinating — strengthen the system prompt constraints.

If context precision is low: your retrieval is poor — try hybrid search or a reranker.
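
Before reaching for a full RAGAS run, you can sanity-check retrieval on its own with a simple hit-rate metric: for each question, does an expected phrase appear anywhere in the retrieved chunks? This sketch uses a toy stand-in retriever and a hypothetical expected_phrase field:

```python
def retrieval_hit_rate(eval_set: list[dict], retrieve, top_k: int = 4) -> float:
    # fraction of questions where the expected phrase shows up in any retrieved chunk
    hits = sum(
        any(item["expected_phrase"].lower() in chunk.lower()
            for chunk in retrieve(item["question"], top_k))
        for item in eval_set
    )
    return hits / len(eval_set)

# toy retriever standing in for the vector search, purely for illustration
retrieve = lambda question, k: ["Employees may work from home up to 3 days per week."]
eval_set = [{"question": "How many WFH days?", "expected_phrase": "3 days per week"}]
print(retrieval_hit_rate(eval_set, retrieve))  # 1.0
```

It is crude (exact substring match), but it runs in seconds, needs no LLM calls, and quickly tells you whether a low faithfulness score is a retrieval problem or a generation problem.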


When RAG Is Not Enough

RAG is not the solution to every problem. Know when it fails:

Problem                          | RAG sufficient? | What to do instead
Summarise 500-page report        | No (too large)  | Hierarchical chunking + map-reduce
Answer from internal docs        | Yes             | Standard RAG
Answer from real-time data       | Partial         | RAG + live data tool call
Write in our brand voice         | No              | Fine-tune on brand examples
Complex multi-doc reasoning      | Partial         | Agents with RAG tools
Personal question ("my orders")  | No              | API tool call with user context
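
The map-reduce pattern from the first row is worth sketching: summarise each chunk independently (map), then summarise batches of summaries until one remains (reduce). Here summarise is a stand-in for an LLM call — the toy version below just keeps the first sentence, purely so the control flow is runnable:

```python
def map_reduce_summarise(chunks: list[str], summarise, batch: int = 10) -> str:
    # map: summarise each chunk independently (trivially parallelisable)
    partials = [summarise(chunk) for chunk in chunks]
    # reduce: keep summarising batches of summaries until one remains
    while len(partials) > 1:
        partials = [summarise("\n".join(partials[i:i + batch]))
                    for i in range(0, len(partials), batch)]
    return partials[0]

# toy "summariser" that keeps only the first sentence, for illustration
summarise = lambda text: text.split(".")[0] + "."
print(map_reduce_summarise(["One. Two.", "Three. Four."], summarise))  # "One."
```

Because each map call sees only one chunk, the document's total length never has to fit in a single context window — which is exactly what plain RAG cannot handle.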

The Complete RAG Stack at a Glance

Production RAG Stack:

Documents  → [Loader] → [Chunker] → [Embedder] → [Vector DB]

User query → [Query Rewriter] → [Embedder] → [Retriever] → [Reranker]
           → [Context Injector] → [LLM] → [Output Validator]
           → Answer + Sources + Confidence Score

Choosing Your Vector Database

Tool                      | Best for            | Pros                                         | Cons
ChromaDB                  | Local dev / POCs    | Zero setup, Python-native                    | Not production-scale
pgvector                  | Existing PostgreSQL | No new infra                                 | Slower ANN search
Pinecone                  | Managed production  | Simple API, scalable                         | Cost at scale
Azure AI Search           | Azure-native        | Hybrid search built-in, enterprise features  | Azure dependency
Databricks Vector Search  | Databricks shops    | Delta table-backed, auto-sync                | Databricks only
Weaviate                  | Complex schemas     | GraphQL, multi-tenancy                       | More setup

For M&S/Azure environments: Azure AI Search is the natural choice — hybrid search, private endpoints, compliance built-in.


Summary

Step     | What you do                                 | Tool
Load     | Read raw documents                          | PyPDFLoader, TextLoader
Chunk    | Split into 512-character pieces with overlap | RecursiveCharacterTextSplitter
Embed    | Convert to vectors                          | SentenceTransformer / OpenAI
Store    | Save to vector database                     | ChromaDB / pgvector / Azure AI Search
Search   | Find similar chunks to query                | Cosine similarity
Rerank   | Rescore top chunks                          | CrossEncoder
Generate | LLM answers with context                    | Claude / GPT-4o
Evaluate | Measure faithfulness + relevancy            | RAGAS

Next: Part 5 — AI Agents: Making AI Take Actions

In Part 5, we go beyond Q&A. Agents can call APIs, run code, search the web, and make multi-step decisions. You will build one from scratch.