From Data Engineer to AI Engineer — Part 4: RAG — Making AI Know Your Data
- Author: Vijay Anand Pandian (@vijayanandrp)
- 9 min read
Series: From Data/Software Engineer to AI Engineer Part 4 of 7 — ← Part 3: Prompt Engineering
The Problem RAG Solves
Imagine your manager asks: "Can we build a chatbot that answers questions about our internal policies?"
You try the naive approach:
```python
# Naive approach
response = llm.ask("What is our remote work policy?")
# Output: "I don't have access to your company's specific remote work policy..."
# Or worse: a confident, completely fabricated answer
```

Two problems:
- The model does not know your data
- If it tries to answer anyway, it makes things up (hallucination)
RAG (Retrieval-Augmented Generation) solves both. Instead of asking the model to recall information from its weights, you retrieve the relevant information from your own data at query time and inject it into the context.
```python
# RAG approach
relevant_docs = search_vector_db("remote work policy")  # your actual docs
response = llm.ask(f"Based on: {relevant_docs}\n\nWhat is our remote work policy?")
# Output: accurate, grounded answer from your actual policy document
```

This is not a trick. It is the fundamental pattern for building enterprise AI systems.
The RAG Pipeline — Every Step
```
Your documents (PDFs, docs, web pages)
        ↓
[1. Load]        — read raw text
        ↓
[2. Chunk]       — split into small pieces
        ↓
[3. Embed]       — convert each chunk to a vector
        ↓
[4. Store]       — save vectors to a vector database
        ↓
(indexing complete — do once)

User asks a question
        ↓
[5. Embed query] — convert question to vector
        ↓
[6. Search]      — find chunks with similar vectors
        ↓
[7. Augment]     — inject chunks into LLM prompt
        ↓
[8. Generate]    — LLM answers using retrieved context
        ↓
[9. Respond]     — return answer + source citations
```

Let us build this step by step.
Steps 1–4: Indexing Your Documents
```python
# pip install langchain chromadb sentence-transformers pypdf
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.document_loaders import PyPDFLoader, TextLoader
from sentence_transformers import SentenceTransformer
import chromadb
import uuid

# ── Step 1: Load documents ────────────────────────────────
def load_documents(file_path: str):
    if file_path.endswith(".pdf"):
        loader = PyPDFLoader(file_path)
    else:
        loader = TextLoader(file_path)
    return loader.load()

# ── Step 2: Chunk ────────────────────────────────────────
# This is where most engineers go wrong — chunk size matters enormously
splitter = RecursiveCharacterTextSplitter(
    chunk_size=512,    # ~128 tokens — fits comfortably in context
    chunk_overlap=64,  # overlap prevents cutting sentences mid-thought
    separators=["\n\n", "\n", ".", " "],  # try to split on natural boundaries
)

def chunk_documents(documents):
    chunks = splitter.split_documents(documents)
    print(f"Split {len(documents)} documents into {len(chunks)} chunks")
    return chunks

# ── Step 3: Embed ────────────────────────────────────────
# Local embedding model — no API cost, works offline
embedding_model = SentenceTransformer("all-MiniLM-L6-v2")
# 384-dimension vectors, fast, good quality for English text

def embed_chunks(chunks):
    texts = [chunk.page_content for chunk in chunks]
    embeddings = embedding_model.encode(texts, batch_size=32, show_progress_bar=True)
    return texts, embeddings.tolist()

# ── Step 4: Store in vector database ─────────────────────
client = chromadb.PersistentClient(path="./chroma_db")  # saves to disk
collection = client.get_or_create_collection(
    name="company_docs",
    metadata={"hnsw:space": "cosine"}  # cosine similarity (not euclidean)
)

def index_documents(file_path: str):
    docs = load_documents(file_path)
    chunks = chunk_documents(docs)
    texts, embeddings = embed_chunks(chunks)
    collection.add(
        ids=[str(uuid.uuid4()) for _ in texts],
        documents=texts,
        embeddings=embeddings,
        metadatas=[{"source": file_path, "chunk_index": i} for i in range(len(texts))]
    )
    print(f"Indexed {len(texts)} chunks from {file_path}")

# Run once to build the index
index_documents("company_remote_policy.pdf")
index_documents("company_expenses_policy.pdf")
```

Steps 5–9: Querying (The RAG Chain)
```python
import anthropic

llm_client = anthropic.Anthropic()

def rag_query(question: str, n_results: int = 4) -> dict:
    # ── Step 5: Embed the query ───────────────────────────
    query_embedding = embedding_model.encode([question]).tolist()

    # ── Step 6: Retrieve similar chunks ──────────────────
    results = collection.query(
        query_embeddings=query_embedding,
        n_results=n_results,
        include=["documents", "metadatas", "distances"]
    )
    chunks = results["documents"][0]
    sources = [m["source"] for m in results["metadatas"][0]]
    distances = results["distances"][0]

    # Filter out low-relevance chunks (cosine distance > 0.5 means poor match)
    relevant_chunks = [
        (chunk, source)
        for chunk, source, dist in zip(chunks, sources, distances)
        if dist < 0.5
    ]
    if not relevant_chunks:
        return {
            "answer": "I don't have enough relevant information to answer this question.",
            "sources": [],
            "confidence": 0.0
        }

    # ── Step 7: Build augmented prompt ───────────────────
    context = "\n\n---\n\n".join([
        f"[Source: {src}]\n{chunk}" for chunk, src in relevant_chunks
    ])

    # ── Step 8: Generate answer ───────────────────────────
    response = llm_client.messages.create(
        model="claude-sonnet-4-6",
        max_tokens=512,
        system="""
        You are a helpful assistant that answers questions based ONLY on the provided context.
        Rules:
        - Only use information from the provided context
        - If the context does not contain the answer, say "The provided documents don't cover this"
        - Always cite which document your answer comes from
        - Be concise — answer in 3-5 sentences unless the question requires more
        """,
        messages=[{
            "role": "user",
            "content": f"""Context from company documents:

{context}

Question: {question}"""
        }]
    )
    answer = response.content[0].text
    unique_sources = list(set(src for _, src in relevant_chunks))

    # ── Step 9: Return answer + sources ──────────────────
    return {
        "answer": answer,
        "sources": unique_sources,
        "chunks_used": len(relevant_chunks),
        "confidence": 1 - (distances[0] / 1.0)  # simple heuristic
    }

# Usage
result = rag_query("How many days can I work from home per week?")
print(result["answer"])
# "According to the Remote Work Policy, employees can work from home up to 3 days per week..."
print(f"Sources: {result['sources']}")
# Sources: ['company_remote_policy.pdf']
```

Chunking: Where RAG Lives or Dies
The most common reason RAG systems fail is bad chunking. Here are the strategies:
Strategy 1: Fixed-Size (Baseline)
```python
RecursiveCharacterTextSplitter(chunk_size=512, chunk_overlap=64)
```

- Simple, predictable
- Can cut sentences mid-thought
- Good starting point for homogeneous text
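To see what fixed-size splitting with overlap actually does, here is a minimal pure-Python sketch (it ignores the natural-boundary separators that `RecursiveCharacterTextSplitter` adds, and the function name is illustrative):

```python
def fixed_size_chunks(text: str, chunk_size: int = 512, overlap: int = 64) -> list[str]:
    """Slice text into chunk_size pieces; consecutive chunks share `overlap` chars."""
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break
    return chunks

# Tiny values for illustration: chunks of 4 characters with a 2-character overlap
print(fixed_size_chunks("abcdefghij", chunk_size=4, overlap=2))
# ['abcd', 'cdef', 'efgh', 'ghij']
```

The overlap is why a sentence cut at a chunk boundary still appears whole in the neighbouring chunk.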
Strategy 2: Sentence-Aware
```python
from langchain.text_splitter import NLTKTextSplitter

splitter = NLTKTextSplitter(chunk_size=500)
```

- Respects sentence boundaries
- Better for Q&A on factual documents
- Slightly more setup
Strategy 3: Semantic Chunking (Best Quality)
```python
from langchain_experimental.text_splitter import SemanticChunker
from langchain_openai.embeddings import OpenAIEmbeddings

splitter = SemanticChunker(
    OpenAIEmbeddings(),
    breakpoint_threshold_type="percentile"
)
```

- Splits on topic shifts, not arbitrary character counts
- Best retrieval quality
- Slower and more expensive
Which to use?
| Document type | Recommended strategy |
|---|---|
| Policy documents, FAQs | Fixed-size with overlap |
| Legal documents, contracts | Sentence-aware |
| Mixed content (reports, books) | Semantic chunking |
| Code documentation | Fixed-size, larger chunks (1024) |
| Chat history | Message-by-message |
Making Retrieval Better: The Three Upgrades
Once your baseline RAG works, these three upgrades give the biggest quality improvements:
Upgrade 1: Hybrid Search (Dense + Sparse)
Pure vector search misses exact keyword matches. Hybrid search combines:
- Dense search (vector/semantic) — finds conceptually similar chunks
- Sparse search (BM25/keyword) — finds exact term matches
```python
# Using Azure AI Search hybrid search
from azure.search.documents import SearchClient
from azure.search.documents.models import VectorizedQuery

results = search_client.search(
    search_text=question,  # BM25 keyword search
    vector_queries=[VectorizedQuery(
        vector=query_embedding,  # Dense semantic search
        k_nearest_neighbors=4,
        fields="content_vector"
    )],
    top=4
)
```

When to use it: Always, if your infrastructure supports it. Hybrid search consistently outperforms pure vector search.
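If your vector store has no built-in hybrid mode, you can run the keyword and vector searches separately and fuse the two ranked lists yourself. A common technique is reciprocal rank fusion (RRF); the sketch below is a minimal illustration (the function name and the conventional `k=60` constant are my choices, not a library API):

```python
def reciprocal_rank_fusion(ranked_lists: list[list[str]], k: int = 60) -> list[str]:
    """Fuse ranked lists of doc IDs: each appearance scores 1 / (k + rank)."""
    scores: dict[str, float] = {}
    for ranking in ranked_lists:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

# doc_b ranks high in both lists, so it tops the fused ranking
bm25_hits   = ["doc_a", "doc_b", "doc_c"]
vector_hits = ["doc_b", "doc_d", "doc_a"]
print(reciprocal_rank_fusion([bm25_hits, vector_hits]))
# ['doc_b', 'doc_a', 'doc_d', 'doc_c']
```

Because RRF works on ranks rather than raw scores, it needs no score normalisation between BM25 and cosine similarity.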
Upgrade 2: Reranker
After retrieving top-20 chunks by vector similarity, run a cross-encoder reranker to rescore and reorder them.
```python
from sentence_transformers import CrossEncoder

reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")

def rerank(query: str, chunks: list[str], top_k: int = 4) -> list[str]:
    pairs = [(query, chunk) for chunk in chunks]
    scores = reranker.predict(pairs)
    ranked = sorted(zip(chunks, scores), key=lambda x: x[1], reverse=True)
    return [chunk for chunk, _ in ranked[:top_k]]

# Usage: retrieve 20, rerank to top 4
initial_results = retrieve(query, n=20)
final_results = rerank(query, initial_results, top_k=4)
```

Why it helps: Vector similarity finds broadly similar chunks. The reranker asks "is this specific chunk actually useful for answering this specific question?" — a harder, more accurate question.
Upgrade 3: Query Rewriting
Sometimes the user's raw question is not the best search query.
```python
def rewrite_query(user_question: str) -> str:
    response = llm_client.messages.create(
        model="claude-sonnet-4-6",
        max_tokens=100,
        messages=[{
            "role": "user",
            "content": f"""
            Rewrite this user question as a search query optimised for finding
            relevant documents in a corporate knowledge base.
            Return only the rewritten query, nothing else.

            Original question: "{user_question}"
            """
        }]
    )
    return response.content[0].text.strip()

# Example
original = "How many days WFH do I get?"
rewritten = rewrite_query(original)
# "remote work policy work from home days per week allowance"
```

Evaluating Your RAG System
You cannot improve what you do not measure. These are the four metrics that matter:
```python
# pip install ragas datasets
from datasets import Dataset
from ragas import evaluate
from ragas.metrics import faithfulness, answer_relevancy, context_precision

# Build a golden dataset — 20-50 known Q&A pairs with expected answers.
# Note: RAGAS scores need the retrieved chunk texts, not just a count —
# extend rag_query() to also return them (e.g. under a "contexts" key).
result = rag_query("How many WFH days per week?")  # call once, reuse the result
golden_dataset = [
    {
        "question": "How many WFH days per week?",
        "ground_truth": "3 days per week",
        "answer": result["answer"],
        "contexts": result.get("contexts", [])  # the retrieved chunk texts
    },
    # ... more examples
]

results = evaluate(
    dataset=Dataset.from_list(golden_dataset),
    metrics=[faithfulness, answer_relevancy, context_precision]
)

print(results)
# faithfulness: 0.94 (is the answer grounded in the retrieved chunks?)
# answer_relevancy: 0.89 (does the answer address the question?)
# context_precision: 0.87 (are the retrieved chunks actually relevant?)
```

| Metric | Acceptable | Good | Excellent |
|---|---|---|---|
| Faithfulness | > 0.80 | > 0.90 | > 0.95 |
| Answer relevancy | > 0.75 | > 0.85 | > 0.90 |
| Context precision | > 0.70 | > 0.80 | > 0.90 |
If faithfulness is low: your LLM is ignoring the retrieved chunks and hallucinating — strengthen the system prompt constraints.
If context precision is low: your retrieval is poor — try hybrid search or a reranker.
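Before wiring up RAGAS, you can sanity-check retrieval with a hand-rolled precision proxy over a small labeled set: for each question, a human marks which chunk IDs are actually relevant, and you measure what fraction of the retrieved chunks hit that set. A minimal sketch (function and variable names are illustrative):

```python
def context_precision_proxy(retrieved_ids: list[str], relevant_ids: set[str]) -> float:
    """Fraction of retrieved chunks that a human labeled as relevant."""
    if not retrieved_ids:
        return 0.0
    hits = sum(1 for doc_id in retrieved_ids if doc_id in relevant_ids)
    return hits / len(retrieved_ids)

# 3 of the 4 retrieved chunks were labeled relevant for this question
print(context_precision_proxy(["c1", "c2", "c3", "c9"], {"c1", "c2", "c3"}))  # 0.75
```

It is cruder than the LLM-judged RAGAS metric, but it runs instantly and catches gross retrieval regressions in CI.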
When RAG Is Not Enough
RAG is not the solution to every problem. Know when it fails:
| Problem | RAG sufficient? | What to do instead |
|---|---|---|
| Summarise 500-page report | No (too large) | Hierarchical chunking + map-reduce |
| Answer from internal docs | Yes | Standard RAG |
| Answer from real-time data | Partial | RAG + live data tool call |
| Write in our brand voice | No | Fine-tune on brand examples |
| Complex multi-doc reasoning | Partial | Agents with RAG tools |
| Personal question ("my orders") | No | API tool call with user context |
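The map-reduce pattern from the first row of the table can be sketched as follows. Here `summarize` stands in for an LLM call you supply; the function name and `group_size` parameter are illustrative, not a library API:

```python
def map_reduce_summarize(chunks: list[str], summarize, group_size: int = 5) -> str:
    """Map: summarize each chunk. Reduce: merge group summaries until one remains."""
    summaries = [summarize(c) for c in chunks]  # map step: one summary per chunk
    while len(summaries) > 1:                   # reduce steps: shrink in rounds
        groups = [
            "\n".join(summaries[i:i + group_size])
            for i in range(0, len(summaries), group_size)
        ]
        summaries = [summarize(g) for g in groups]
    return summaries[0]

# With a real LLM, for example:
# summary = map_reduce_summarize(chunks, lambda t: llm.ask(f"Summarise: {t}"))
```

Each round reduces the count by a factor of `group_size`, so even a 500-page report collapses to a single summary in a handful of rounds, and no single LLM call ever sees more than `group_size` summaries at once.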
The Complete RAG Stack at a Glance
Production RAG Stack:
```
Documents  → [Loader] → [Chunker] → [Embedder] → [Vector DB]
                                                      ↑
User Query → [Query Rewriter] → [Embedder] → [Retriever] → [Reranker]
                                                                ↓
                                          [LLM] ← [Context Injector]
                                            ↓
                                    [Output Validator]
                                            ↓
                          Answer + Sources + Confidence Score
```

Choosing Your Vector Database
| Tool | Best for | Pros | Cons |
|---|---|---|---|
| ChromaDB | Local dev / POCs | Zero setup, Python-native | Not production-scale |
| pgvector | Existing PostgreSQL | No new infra | Slower ANN search |
| Pinecone | Managed production | Simple API, scalable | Cost at scale |
| Azure AI Search | Azure-native | Hybrid search built-in, enterprise features | Azure dependency |
| Databricks Vector Search | Databricks | Delta table-backed, auto-sync | Databricks only |
| Weaviate | Complex schemas | GraphQL, multi-tenancy | More setup |
For M&S/Azure environments: Azure AI Search is the natural choice — hybrid search, private endpoints, compliance built-in.
Summary
| Step | What you do | Tool |
|---|---|---|
| Load | Read raw documents | PyPDFLoader, TextLoader |
| Chunk | Split into 512-token pieces with overlap | RecursiveCharacterTextSplitter |
| Embed | Convert to vectors | SentenceTransformer / OpenAI |
| Store | Save to vector database | ChromaDB / pgvector / Azure AI Search |
| Search | Find similar chunks to query | Cosine similarity |
| Rerank | Rescore top chunks | CrossEncoder |
| Generate | LLM answers with context | Claude / GPT-4o |
| Evaluate | Measure faithfulness + relevancy | RAGAS |
Next: Part 5 — AI Agents: Making AI Take Actions
In Part 5, we go beyond Q&A. Agents can call APIs, run code, search the web, and make multi-step decisions. You will build one from scratch.