From Data Engineer to AI Engineer — Part 2: How LLMs Actually Work

Series: From Data/Software Engineer to AI Engineer Part 2 of 7 — ← Part 1: The Mindset Shift


Why You Need to Understand This

You do not need to implement a transformer from scratch. But if you treat an LLM as a magic black box, you will:

  • Write bad prompts without knowing why they fail
  • Not know when to use RAG vs fine-tuning
  • Struggle to debug unexpected model behaviour
  • Sound junior in technical interviews

This post gives you the mental model that makes everything else click. No maths. Just clear concepts.


Step 1: Tokenisation — The Model Does Not See Words

The first thing that surprises most engineers: LLMs do not read words. They read tokens.

A token is roughly 4 characters of English text. Words get split into sub-word pieces:

"Hello" → ["Hello"] (1 token)
"engineering" → ["engineer", "ing"] (2 tokens)
"ChatGPT" → ["Chat", "G", "PT"] (3 tokens)
"£500K+" → ["£", "500", "K", "+"] (4 tokens)

Why this matters to you:

  • Pricing is per token (input + output tokens = cost)
  • Long documents eat your context window fast
  • Non-English text is less efficient (Japanese/Arabic ≈ 1 char per token)
  • Code is tokenised differently than prose — snake_case might be 3 tokens
# See tokenisation yourself
import tiktoken # OpenAI's tokeniser (similar for other models)
enc = tiktoken.encoding_for_model("gpt-4")
text = "I am a data engineer building AI systems"
tokens = enc.encode(text)
print(f"Words: {len(text.split())}") # 8 words
print(f"Tokens: {len(tokens)}") # ~9 tokens
print(f"Token ids: {tokens}")
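To make the pricing point concrete, here is a rough cost calculator. The per-million-token rates below are placeholders for illustration, not any provider's real pricing:

```python
# Rough cost estimate for one LLM call.
# The default rates are invented placeholders — check your provider's pricing.
def estimate_cost(input_tokens: int, output_tokens: int,
                  input_price_per_m: float = 3.00,
                  output_price_per_m: float = 15.00) -> float:
    """Return the cost in currency units for one request."""
    return (input_tokens * input_price_per_m +
            output_tokens * output_price_per_m) / 1_000_000

# A 2,000-token prompt with a 500-token answer:
print(f"${estimate_cost(2_000, 500):.4f}")  # $0.0135
```

Note that output tokens are typically several times more expensive than input tokens, which is why verbose responses dominate the bill.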

Step 2: Embeddings — Turning Words into Meaning

After tokenisation, each token is converted into an embedding — a list of numbers (a vector).

This is the key idea: similar meanings produce similar vectors.

"king" → [0.2, 0.8, -0.1, 0.5, ...] (1,536 numbers)
"queen" → [0.2, 0.7, -0.1, 0.6, ...] (very similar)
"table" → [-0.4, 0.1, 0.9, -0.3, ...] (very different)

The famous example: king - man + woman ≈ queen

This is not magic — it is a consequence of training on massive text where "king" and "queen" appear in similar contexts. The model learned that their meanings are related.
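You can check the analogy arithmetic yourself with toy vectors. The 4-dimensional vectors below are hand-made for illustration — real embeddings have hundreds of dimensions and are learned from data, not designed:

```python
import numpy as np

# Toy 4-dimensional "embeddings", hand-made so the analogy works.
# Real embeddings are learned and have hundreds of dimensions.
vectors = {
    "king":  np.array([0.9, 0.8, 0.1, 0.2]),
    "queen": np.array([0.9, 0.1, 0.8, 0.2]),
    "man":   np.array([0.1, 0.9, 0.1, 0.1]),
    "woman": np.array([0.1, 0.1, 0.9, 0.1]),
}

def cosine(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# king - man + woman lands closest to queen
result = vectors["king"] - vectors["man"] + vectors["woman"]
best = max(vectors, key=lambda w: cosine(result, vectors[w]))
print(best)  # queen
```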

The practical implication: Embeddings let you do semantic search.

# Semantic search — find meaning, not just keywords
from sentence_transformers import SentenceTransformer
import numpy as np
model = SentenceTransformer("all-MiniLM-L6-v2")

documents = [
    "How to process customer orders in bulk",
    "Batch processing techniques for high-volume systems",
    "My cat enjoys sleeping in the sun",
]
query = "large scale order handling"

# normalize_embeddings=True makes the dot product equal cosine similarity
doc_embeddings = model.encode(documents, normalize_embeddings=True)
query_embedding = model.encode([query], normalize_embeddings=True)

# Cosine similarity — 1.0 = same meaning, 0.0 = unrelated
similarities = np.dot(doc_embeddings, query_embedding.T).flatten()
best = np.argmax(similarities)
print(f"Best match: {documents[best]}")
# "Batch processing techniques for high-volume systems"
# Note: no keyword overlap with the query, but semantically the same idea

This is how RAG works — your question becomes a vector, and you find document chunks that have similar vectors.


Step 3: The Transformer — Where the Intelligence Lives

The transformer is the architecture that made modern LLMs possible. You need to understand two components:

Attention: "Which other words matter right now?"

Imagine you are reading this sentence:

"The bank rejected the loan because it was too risky."

What does "it" refer to? The loan, not the bank. You know this because the word "risky" has a stronger relationship to "loan" than to "bank" in this financial context.

Attention does exactly this — for every word, it calculates how much each other word should influence its meaning.

Visually:

"The bank rejected the loan because it was too risky"
it attends strongly to "loan" (0.85)
it attends weakly to "bank" (0.12)
it barely attends to "The" (0.01)

This computation happens for every token, looking at every other token, in every layer of the model. A large model can have 96 or more such layers (GPT-3, for example, had 96). That is why inference is compute-heavy.

Multi-head attention means running this attention process multiple times in parallel, where each "head" learns to look for different types of relationships — one might focus on grammatical structure, another on semantic similarity, another on coreference (like our "it" example).
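The mechanics of a single attention head can be sketched in a few lines. This is a minimal scaled dot-product attention, using random matrices in place of the learned query/key/value projections:

```python
import numpy as np

# Minimal scaled dot-product attention over toy vectors.
# Q, K, V would normally come from learned projections of the token
# embeddings; here they are random, purely to show the mechanics.
rng = np.random.default_rng(0)
seq_len, d = 5, 4                 # 5 tokens, 4-dim representations
Q = rng.normal(size=(seq_len, d))
K = rng.normal(size=(seq_len, d))
V = rng.normal(size=(seq_len, d))

scores = Q @ K.T / np.sqrt(d)     # how strongly each token "matches" each other
weights = np.exp(scores)
weights /= weights.sum(axis=1, keepdims=True)  # softmax: each row sums to 1
output = weights @ V              # each token becomes a weighted mix of all values

print(weights[0].round(2))        # attention weights of token 0 over all 5 tokens
```

Each row of `weights` is exactly the kind of distribution in the "it attends to loan (0.85)" example above.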

Feed-Forward Network: "Where the knowledge lives"

After attention, each token passes through a feed-forward neural network (a simple 2-layer MLP). This is where the model's stored knowledge lives — the facts it learned during training.

Think of it this way:

  • Attention = understanding relationships and context
  • Feed-forward = remembering facts

The model knows that "Paris is the capital of France" because this fact appeared millions of times in training data and got encoded into the feed-forward weights.
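The feed-forward block itself is simple enough to sketch: expand the representation, apply a non-linearity, project back down. The dimensions and weights below are random toys, not real model parameters:

```python
import numpy as np

# The feed-forward block, applied to each token independently:
# expand, non-linearity, project back. Weights here are random toys.
d_model, d_ff = 8, 32             # real models expand ~4x (e.g. 12288 -> 49152)
rng = np.random.default_rng(1)
W1 = rng.normal(size=(d_model, d_ff))
W2 = rng.normal(size=(d_ff, d_model))

def feed_forward(x):
    hidden = np.maximum(0, x @ W1)   # ReLU-style non-linearity
    return hidden @ W2               # project back to the model dimension

token = rng.normal(size=(d_model,))
print(feed_forward(token).shape)     # (8,)
```

In a trained model, the facts live in `W1` and `W2` — that is where "Paris is the capital of France" is stored.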


Step 4: Generation — How It Produces Text

Now you understand how the model processes input. Here is how it produces output:

The autoregressive loop:

1. Input: "The capital of France is"
2. The model processes all tokens with attention
3. Feed-forward layers retrieve relevant knowledge
4. The output layer produces a probability distribution over all 50,000+ tokens:
   "Paris" → 94.2%
   "Paris." → 3.1%
   "Lyon" → 0.8%
   "Berlin" → 0.3%
   ...
5. Sample from this distribution → pick "Paris"
6. Append "Paris" to the input: "The capital of France is Paris"
7. Repeat until an end-of-sequence token is produced or the max-token limit is reached

This is autoregressive generation — the model generates one token at a time, each token becoming part of the next input.

Temperature controls this sampling:

# Temperature = 0: always pick the highest-probability token
#   Deterministic, but can sound robotic
# Temperature = 0.7: sample from top probabilities
#   Natural and varied — good for most use cases
# Temperature = 1.5+: sample from a wider distribution
#   Creative but risky — more hallucinations
import anthropic

client = anthropic.Anthropic()

# Conservative (factual tasks)
response = client.messages.create(
    model="claude-sonnet-4-6",
    max_tokens=100,
    temperature=0.1,  # near-deterministic
    messages=[{"role": "user", "content": "What is 2+2?"}],
)

# Creative (brainstorming tasks)
response = client.messages.create(
    model="claude-sonnet-4-6",
    max_tokens=200,
    temperature=0.9,  # more varied
    messages=[{"role": "user", "content": "Give me 5 product name ideas for an AI assistant"}],
)

Step 5: The Context Window — Working Memory

Every LLM has a context window — the maximum number of tokens it can "see" at once.

Think of it as the model's working memory. Whatever fits in the context window is what the model can reason about. Everything outside it is invisible.

Model            Context window     Approx. pages of text
GPT-4o           128,000 tokens     ~300 pages
Claude Sonnet    200,000 tokens     ~450 pages
Llama 3.1 70B    128,000 tokens     ~300 pages
GPT-3.5          16,000 tokens      ~40 pages

The practical implication:

If your document is 500 pages, it does not fit in the context window. You cannot just paste it all in. This is why RAG exists — you extract the relevant 2–3 pages before calling the LLM.

# Naive approach (fails for large docs)
with open("company_policy_500pages.txt") as f:
    document = f.read()
response = llm.call(f"{document}\n\nQ: What is the refund policy?")
# Error: context length exceeded

# RAG approach (correct)
relevant_chunks = vector_search("refund policy")  # finds the 3 relevant pages
response = llm.call(f"{relevant_chunks}\n\nQ: What is the refund policy?")
# Works — and the answer is grounded in actual document content
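The retrieval step assumes the document was split into chunks up front. A minimal word-based chunker shows the idea — real pipelines usually chunk by tokens (e.g. with tiktoken) and respect paragraph boundaries:

```python
# A minimal fixed-size chunker — the indexing step of a RAG pipeline.
# Word-based for simplicity; production systems chunk by tokens.
def chunk_text(text: str, chunk_size: int = 200, overlap: int = 50) -> list[str]:
    words = text.split()
    chunks = []
    step = chunk_size - overlap           # each chunk overlaps the previous one
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        if start + chunk_size >= len(words):
            break                         # last chunk reached the end of the text
    return chunks

doc = "word " * 500     # stand-in for a long policy document
chunks = chunk_text(doc)
print(len(chunks))      # 3 overlapping chunks
```

The overlap matters: without it, a sentence split across a chunk boundary would be unretrievable as a whole.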

The Complete Picture in One Diagram

User types: "What is RAG?"
[Tokeniser] "What" "is" "R" "AG" "?" (5 tokens)
[Embedding] Each token → vector (1536 numbers)
[Attention] Each token looks at all others, builds context
[FFN] Retrieves knowledge from weights: "RAG = Retrieval Augmented Generation..."
[x96 layers] Refines understanding through 96 transformer blocks
[Output layer] Probability over 50k tokens → sample → "Retrieval"
→ append "Retrieval" → loop → "Augmented" → loop → "Generation"...
Final answer: "RAG (Retrieval-Augmented Generation) is a technique that..."

Five Things This Explains

Now that you understand how LLMs work, several "mysteries" resolve:

1. Why LLMs hallucinate: The model always produces the most probable next token. If the correct answer was not well-represented in training data, it will still produce something plausible-sounding rather than saying "I don't know."

2. Why longer prompts are not always better: More tokens in = more computation. And if your prompt is full of irrelevant information, attention gets diluted across noise.

3. Why the same prompt gives different answers: Temperature introduces randomness in sampling. Even at temperature=0, you can get different results across model versions.

4. Why "tell me step by step" often helps: Chain-of-thought prompting forces the model to produce intermediate reasoning tokens, which become context for the final answer — essentially extending its working memory.

5. Why models know things up to a date: The knowledge is frozen in the weights at training time. Events after the training cutoff are invisible unless you inject them via RAG.


What You Can Now Do

With this mental model:

  • You understand why RAG works (semantic vector search + context injection)
  • You understand temperature (sampling from probability distributions)
  • You understand context window limits (and why they require workarounds)
  • You can explain hallucinations accurately
  • You know what "embeddings" are and why they enable semantic search

Quick Reference Card

Concept          One-line explanation                         Practical impact
Token            ~4 chars of text, the unit of processing     Affects cost and context limits
Embedding        Vector representation of meaning             Powers semantic search and RAG
Attention        Each token weighs relevance to all others    Enables understanding of context
Context window   Max tokens the model can see at once         Why RAG is necessary for large docs
Temperature      Sampling randomness                          0 = deterministic, 1 = creative
Hallucination    Model produces plausible but wrong output    Managed with RAG + confidence scoring

Next: Part 3 — Prompt Engineering That Actually Works

In Part 3, we get hands-on. You will learn the prompt engineering techniques that separate junior from senior AI engineers — with real before/after examples.