From Data Engineer to AI Engineer — Part 2: How LLMs Actually Work
- ~9 mins read
- Author: Vijay Anand Pandian (@vijayanandrp)
Series: From Data/Software Engineer to AI Engineer, Part 2 of 7. ← Part 1: The Mindset Shift
Why You Need to Understand This
You do not need to implement a transformer from scratch. But if you treat an LLM as a magic black box, you will:
- Write bad prompts without knowing why they fail
- Not know when to use RAG vs fine-tuning
- Struggle to debug unexpected model behaviour
- Sound junior in technical interviews
This post gives you the mental model that makes everything else click. No maths. Just clear concepts.
Step 1: Tokenisation — The Model Does Not See Words
The first thing that surprises most engineers: LLMs do not read words. They read tokens.
A token is roughly 4 characters of English text. Words get split into sub-word pieces:
```
"Hello"       → ["Hello"]              (1 token)
"engineering" → ["engineer", "ing"]    (2 tokens)
"ChatGPT"     → ["Chat", "G", "PT"]    (3 tokens)
"£500K+"      → ["£", "500", "K", "+"] (4 tokens)
```
Why this matters to you:
- Pricing is per token (input + output tokens = cost)
- Long documents eat your context window fast
- Non-English text is less efficient (Japanese/Arabic ≈ 1 char per token)
- Code is tokenised differently than prose — `snake_case` might be 3 tokens
```python
# See tokenisation yourself
import tiktoken  # OpenAI's tokeniser (similar for other models)

enc = tiktoken.encoding_for_model("gpt-4")

text = "I am a data engineer building AI systems"
tokens = enc.encode(text)
print(f"Words: {len(text.split())}")   # 8 words
print(f"Tokens: {len(tokens)}")        # ~9 tokens
print(f"Token ids: {tokens}")
```
Step 2: Embeddings — Turning Words into Meaning
After tokenisation, each token is converted into an embedding — a list of numbers (a vector).
This is the key idea: similar meanings produce similar vectors.
```
"king"  → [ 0.2, 0.8, -0.1,  0.5, ...]  (1,536 numbers)
"queen" → [ 0.2, 0.7, -0.1,  0.6, ...]  (very similar)
"table" → [-0.4, 0.1,  0.9, -0.3, ...]  (very different)
```
The famous example: king - man + woman ≈ queen
This is not magic — it is a consequence of training on massive text where "king" and "queen" appear in similar contexts. The model learned that their meanings are related.
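You can see the analogy arithmetic with toy vectors. The four numbers per word below are invented purely for illustration (real embeddings have ~1,536 dimensions and are learned, not hand-written):

```python
# Toy demonstration of "king - man + woman ≈ queen".
# Each dimension loosely encodes a feature: [royal, male, female, other].
import numpy as np

def cosine(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

vectors = {
    "king":  np.array([0.9, 0.8, 0.1, 0.1]),   # royal + male
    "queen": np.array([0.9, 0.1, 0.8, 0.1]),   # royal + female
    "man":   np.array([0.1, 0.8, 0.1, 0.1]),   # male
    "woman": np.array([0.1, 0.1, 0.8, 0.1]),   # female
}

result = vectors["king"] - vectors["man"] + vectors["woman"]

# Find the word whose vector is closest to the result
best = max(vectors, key=lambda w: cosine(result, vectors[w]))
print(best)  # "queen" — the analogy falls out of the geometry
```

Subtracting "man" removes the male dimension, adding "woman" adds the female one, and what remains is closest to "queen".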
The practical implication: Embeddings let you do semantic search.
```python
# Semantic search — find meaning, not just keywords
from sentence_transformers import SentenceTransformer
import numpy as np

model = SentenceTransformer("all-MiniLM-L6-v2")

documents = [
    "How to process customer orders in bulk",
    "Batch processing techniques for high-volume systems",
    "My cat enjoys sleeping in the sun",
]

query = "large scale order handling"

# Normalise so the dot product below is a true cosine similarity
doc_embeddings = model.encode(documents, normalize_embeddings=True)
query_embedding = model.encode([query], normalize_embeddings=True)

# Cosine similarity — 1.0 = same meaning, 0.0 = unrelated
similarities = np.dot(doc_embeddings, query_embedding.T).flatten()
best = np.argmax(similarities)

print(f"Best match: {documents[best]}")
# "Batch processing techniques for high-volume systems"
# Note: no keyword overlap with the query, but semantically the same idea
```
This is how RAG works — your question becomes a vector, and you find document chunks that have similar vectors.
Step 3: The Transformer — Where the Intelligence Lives
The transformer is the architecture that made modern LLMs possible. You need to understand two components:
Attention: "Which other words matter right now?"
Imagine you are reading this sentence:
"The bank rejected the loan because it was too risky."
What does "it" refer to? The loan, not the bank. You know this because the word "risky" has a stronger relationship to "loan" than to "bank" in this financial context.
Attention does exactly this — for every word, it calculates how much each other word should influence its meaning.
Visually:
```
"The bank rejected the loan because it was too risky"
                                 ↑
        it attends strongly to "loan" (0.85)
        it attends weakly to   "bank" (0.12)
        it barely attends to   "The"  (0.01)
```
This computation happens for every token, looking at every other token, in every layer of the model. A large model (GPT-3, for example) has 96 layers. That is why inference is compute-heavy.
Multi-head attention means running this attention process multiple times in parallel, where each "head" learns to look for different types of relationships — one might focus on grammatical structure, another on semantic similarity, another on coreference (like our "it" example).
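The mechanics of a single attention head can be sketched in a few lines of numpy. This is a toy: real models apply learned projection matrices to produce Q, K, and V, whereas here the raw (hand-made) token vectors stand in for all three:

```python
# Minimal single-head scaled dot-product attention.
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attention(Q, K, V):
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)   # how relevant is each token to each other token?
    weights = softmax(scores)         # rows sum to 1
    return weights @ V, weights       # each output is a weighted mix of value vectors

# 3 tokens, 4-dimensional vectors (invented numbers)
X = np.array([
    [1.0, 0.0, 1.0, 0.0],   # "it"
    [0.9, 0.1, 0.9, 0.1],   # "loan" — similar direction to "it"
    [0.0, 1.0, 0.0, 1.0],   # "bank" — very different direction
])

out, w = attention(X, X, X)  # self-attention: Q = K = V = X
print(w[0])  # row for "it": far more weight on "loan" than on "bank"
```

Multi-head attention runs several copies of exactly this computation in parallel, each with its own learned projections.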
Feed-Forward Network: "Where the knowledge lives"
After attention, each token passes through a feed-forward neural network (a simple 2-layer MLP). This is where the model's stored knowledge lives — the facts it learned during training.
Think of it this way:
- Attention = understanding relationships and context
- Feed-forward = remembering facts
The model knows that "Paris is the capital of France" because this fact appeared millions of times in training data and got encoded into the feed-forward weights.
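Structurally, the feed-forward block is just two matrix multiplications with a non-linearity in between. A minimal sketch with random weights (so the output is meaningless here; in a trained model, these weights are where the factual associations live):

```python
# The transformer feed-forward block: expand, activate, project back.
# Applied to each token vector independently.
import numpy as np

rng = np.random.default_rng(0)
d_model, d_ff = 8, 32   # real models are far larger, e.g. 12,288 → 49,152

W1 = rng.normal(size=(d_model, d_ff))   # expand to the hidden dimension
W2 = rng.normal(size=(d_ff, d_model))   # project back down

def feed_forward(x):
    hidden = np.maximum(0, x @ W1)      # ReLU (most real models use GELU)
    return hidden @ W2

token_vector = rng.normal(size=(d_model,))
out = feed_forward(token_vector)
print(out.shape)  # (8,) — same shape in, same shape out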
Step 4: Generation — How It Produces Text
Now you understand how the model processes input. Here is how it produces output:
The autoregressive loop:
```
Input: "The capital of France is"
  ↓
Model processes all tokens with attention
  ↓
Feed-forward layers retrieve relevant knowledge
  ↓
Output layer produces a probability distribution over all 50,000+ tokens:
    "Paris"  → 94.2%
    "Paris." → 3.1%
    "Lyon"   → 0.8%
    "Berlin" → 0.3%
    ...
  ↓
Sample from this distribution → pick "Paris"
  ↓
Append "Paris" to input, repeat
  ↓
Input: "The capital of France is Paris"
  ↓
Continue until end-of-sequence token or max tokens reached
```
This is autoregressive generation — the model generates one token at a time, each token becoming part of the next input.
Temperature controls this sampling:
```python
# Temperature = 0:    always pick the highest-probability token.
#                     Deterministic, but can sound robotic.
# Temperature = 0.7:  sample from the top probabilities.
#                     Natural and varied — good for most use cases.
# Temperature = 1.5+: sample from a wider distribution.
#                     Creative but risky — more hallucinations.
```
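Under the hood, temperature simply divides the logits before the softmax. A minimal numpy sketch with made-up logits for three candidate tokens:

```python
# Temperature-scaled sampling distribution over a toy vocabulary.
import numpy as np

def sample_probs(logits, temperature):
    logits = np.asarray(logits, dtype=float)
    if temperature == 0:                  # greedy decoding: always the argmax
        p = np.zeros_like(logits)
        p[logits.argmax()] = 1.0
        return p
    scaled = logits / temperature
    e = np.exp(scaled - scaled.max())     # numerically stable softmax
    return e / e.sum()

tokens = ["Paris", "Lyon", "Berlin"]
logits = [5.0, 1.0, 0.5]                  # the model's raw preferences

for t in (0.0, 0.7, 1.5):
    p = sample_probs(logits, t)
    print(t, dict(zip(tokens, p.round(3))))
# Low temperature sharpens the distribution towards "Paris";
# high temperature flattens it, giving "Lyon" and "Berlin" a real chance.
```

This is why temperature 0 is repeatable and temperature 1.5 is adventurous: you are reshaping the same probability distribution before sampling from it.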
```python
import anthropic

client = anthropic.Anthropic()

# Conservative (factual tasks)
response = client.messages.create(
    model="claude-sonnet-4-6",
    max_tokens=100,
    temperature=0.1,  # near-deterministic
    messages=[{"role": "user", "content": "What is 2+2?"}],
)

# Creative (brainstorming tasks)
response = client.messages.create(
    model="claude-sonnet-4-6",
    max_tokens=200,
    temperature=0.9,  # more varied
    messages=[{"role": "user", "content": "Give me 5 product name ideas for an AI assistant"}],
)
```
Step 5: The Context Window — Working Memory
Every LLM has a context window — the maximum number of tokens it can "see" at once.
Think of it as the model's working memory. Whatever fits in the context window is what the model can reason about. Everything outside it is invisible.
| Model | Context window | Approx. pages of text |
|---|---|---|
| GPT-4o | 128,000 tokens | ~300 pages |
| Claude Sonnet | 200,000 tokens | ~450 pages |
| Llama 3.1 70B | 128,000 tokens | ~300 pages |
| GPT-3.5 | 16,000 tokens | ~40 pages |
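The page counts above are rough conversions. Assuming ~425 tokens per printed page (an assumption — real documents vary widely with layout and language), the arithmetic is simple division:

```python
# Back-of-envelope: context window in tokens → approximate pages.
TOKENS_PER_PAGE = 425  # assumed average; varies by document

for model, window in [
    ("GPT-4o", 128_000),
    ("Claude Sonnet", 200_000),
    ("GPT-3.5", 16_000),
]:
    print(f"{model}: ~{window // TOKENS_PER_PAGE} pages")
```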
The practical implication:
If your document is 500 pages, it does not fit in the context window. You cannot just paste it all in. This is why RAG exists — you extract the relevant 2–3 pages before calling the LLM.
```python
# Naive approach (fails for large docs)
with open("company_policy_500pages.txt") as f:
    document = f.read()

response = llm.call(f"{document}\n\nQ: What is the refund policy?")
# Error: context length exceeded

# RAG approach (correct)
relevant_chunks = vector_search("refund policy")  # finds the 3 relevant pages
response = llm.call(f"{relevant_chunks}\n\nQ: What is the refund policy?")
# Works — and the answer is grounded in actual document content
```
The Complete Picture in One Diagram
```
User types: "What is RAG?"
  ↓
[Tokeniser]    "What" "is" "R" "AG" "?"  (5 tokens)
  ↓
[Embedding]    Each token → vector (1,536 numbers)
  ↓
[Attention]    Each token looks at all others, builds context
  ↓
[FFN]          Retrieves knowledge from weights:
               "RAG = Retrieval Augmented Generation..."
  ↓
[x96 layers]   Refines understanding through 96 transformer blocks
  ↓
[Output layer] Probability over 50k tokens → sample → "Retrieval"
  ↓
  → append "Retrieval" → loop → "Augmented" → loop → "Generation"...
  ↓
Final answer: "RAG (Retrieval-Augmented Generation) is a technique that..."
```
Five Things This Explains
Now that you understand how LLMs work, several "mysteries" resolve:
1. Why LLMs hallucinate: The model always produces the most probable next token. If the correct answer was not well-represented in training data, it will still produce something plausible-sounding rather than saying "I don't know."
2. Why longer prompts are not always better: More tokens in = more computation. And if your prompt is full of irrelevant information, attention gets diluted across noise.
3. Why the same prompt gives different answers: Temperature introduces randomness in sampling. Even at temperature=0, you can get different results across model versions (and occasionally across runs, due to floating-point nondeterminism in inference).
4. Why "tell me step by step" often helps: Chain-of-thought prompting forces the model to produce intermediate reasoning tokens, which become context for the final answer — essentially extending its working memory.
5. Why models know things up to a date: The knowledge is frozen in the weights at training time. Events after the training cutoff are invisible unless you inject them via RAG.
What You Can Now Do
With this mental model:
- You understand why RAG works (semantic vector search + context injection)
- You understand temperature (sampling from probability distributions)
- You understand context window limits (and why they require workarounds)
- You can explain hallucinations accurately
- You know what "embeddings" are and why they enable semantic search
Quick Reference Card
| Concept | One-line explanation | Practical impact |
|---|---|---|
| Token | ~4 chars of text, the unit of processing | Affects cost and context limits |
| Embedding | Vector representation of meaning | Powers semantic search and RAG |
| Attention | Each token weighs relevance to all others | Enables understanding of context |
| Context window | Max tokens model can see at once | Why RAG is necessary for large docs |
| Temperature | Sampling randomness | 0 = deterministic, 1 = creative |
| Hallucination | Model produces plausible but wrong output | Managed with RAG + confidence scoring |
Next: Part 3 — Prompt Engineering That Actually Works
In Part 3, we get hands-on. You will learn the prompt engineering techniques that separate junior from senior AI engineers — with real before/after examples.