From Data Engineer to AI Engineer — Part 7: AI Governance — Building AI You Can Trust
By Vijay Anand Pandian (@vijayanandrp) · 14 mins read
Series: From Data/Software Engineer to AI Engineer, Part 7 of 7 — ← Part 6: Production AI
Why Governance Is an Engineering Problem
"AI governance" sounds like something for the compliance team. It is not.
Every technical decision you make is a governance decision:
- How you handle user data in prompts
- Whether you log model outputs for audit
- How you handle low-confidence responses
- Whether you let the model make autonomous decisions
The engineer who says "governance isn't my job" is the one who gets called when the model leaks a customer's personal data or gives a confident, wrong answer that costs the business money.
This is not abstract. It is concrete code you write today.
The Five Governance Pillars (With Code)
Pillar 1: PII Protection — Never Let Personal Data Into the LLM Unmasked
The problem: users ask questions that contain personal information. If you log those queries, you have a data protection issue. If you send them to a third-party LLM API, you may violate GDPR.
The solution: mask before it ever leaves your system.
```python
import re
from dataclasses import dataclass

@dataclass
class MaskingResult:
    masked_text: str
    pii_found: bool
    types_found: list[str]

# UK-specific PII patterns
PII_PATTERNS = {
    "EMAIL": r"[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}",
    "UK_PHONE": r"(\+44\s?|0)(\d\s?){9,10}",
    "UK_NI_NUMBER": r"[A-Z]{2}\s?\d{2}\s?\d{2}\s?\d{2}\s?[A-D]",
    "UK_POSTCODE": r"[A-Z]{1,2}\d[A-Z\d]?\s?\d[A-Z]{2}",
    "CARD_NUMBER": r"\b\d{4}[\s-]?\d{4}[\s-]?\d{4}[\s-]?\d{4}\b",
    "SORT_CODE": r"\d{2}-\d{2}-\d{2}",
    "DATE_OF_BIRTH": r"\b\d{1,2}[\/\-]\d{1,2}[\/\-]\d{4}\b",
}

def mask_pii(text: str) -> MaskingResult:
    masked = text
    found_types = []
    for pii_type, pattern in PII_PATTERNS.items():
        if re.search(pattern, masked, re.IGNORECASE):
            found_types.append(pii_type)
            masked = re.sub(pattern, f"[{pii_type}]", masked, flags=re.IGNORECASE)
    return MaskingResult(
        masked_text=masked,
        pii_found=len(found_types) > 0,
        types_found=found_types,
    )

# Usage — always mask before sending to LLM or logging
user_input = "My email is john.smith@company.com and NI number is AB123456C"
result = mask_pii(user_input)

print(result.masked_text)   # "My email is [EMAIL] and NI number is [UK_NI_NUMBER]"
print(result.pii_found)     # True
print(result.types_found)   # ["EMAIL", "UK_NI_NUMBER"]

# In your query pipeline:
def safe_query(user_question: str, user_id: str) -> dict:
    masking = mask_pii(user_question)
    if masking.pii_found:
        # Log the incident — someone submitted PII
        audit_log({
            "event": "pii_detected",
            "user_id": user_id,
            "pii_types": masking.types_found,
            # Note: log that PII was found, NOT the actual PII
        })
    # Always use masked version for the LLM call
    return rag_query(masking.masked_text)
```

For production, consider Microsoft Presidio — a more comprehensive, ML-backed PII detection library that handles names, addresses, and more nuanced PII.
```python
# pip install presidio-analyzer presidio-anonymizer
from presidio_analyzer import AnalyzerEngine
from presidio_anonymizer import AnonymizerEngine

analyzer = AnalyzerEngine()
anonymizer = AnonymizerEngine()

def mask_pii_presidio(text: str) -> str:
    results = analyzer.analyze(text=text, language="en")
    anonymized = anonymizer.anonymize(text=text, analyzer_results=results)
    return anonymized.text

# More accurate than regex for names and complex PII
mask_pii_presidio("John Smith from London called us on 07700 900123")
# "<PERSON> from <LOCATION> called us on <PHONE_NUMBER>"
```

Pillar 2: Audit Logging — Every Inference Is a Record
Every time your AI system produces an output, that output can be:
- Incorrect and acted upon (business risk)
- Evidence in a dispute (legal risk)
- Data for improving the system (operational value)
You need an immutable log of every inference.
```python
import json
import uuid
from datetime import datetime, timezone
from pathlib import Path

class AuditLogger:
    def __init__(self, log_file: str = "audit.jsonl"):
        self.log_file = Path(log_file)

    def log(
        self,
        user_id: str,
        session_id: str,
        raw_question: str,      # NEVER store raw if it contains PII
        masked_question: str,   # Store this instead
        answer: str,
        sources: list[str],
        confidence: float,
        model: str,
        prompt_version: str,
        input_tokens: int,
        output_tokens: int,
        latency_ms: int,
        pii_detected: bool,
    ) -> str:
        request_id = str(uuid.uuid4())
        record = {
            "request_id": request_id,
            "timestamp": datetime.now(timezone.utc).isoformat(),
            "user_id": user_id,
            "session_id": session_id,
            # PII-safe: log masked question, not raw
            "question": masked_question,
            "answer_length": len(answer),  # length, not content, for privacy
            "answer_preview": answer[:100] + "..." if len(answer) > 100 else answer,
            "sources": sources,
            "confidence": round(confidence, 3),
            "model": model,
            "prompt_version": prompt_version,
            "input_tokens": input_tokens,
            "output_tokens": output_tokens,
            "cost_usd": round((input_tokens * 0.000003) + (output_tokens * 0.000015), 6),
            "latency_ms": latency_ms,
            "pii_detected": pii_detected,
        }
        # JSONL format — one record per line, easy to parse and stream
        with open(self.log_file, "a", encoding="utf-8") as f:
            f.write(json.dumps(record) + "\n")
        return request_id

audit = AuditLogger("logs/audit.jsonl")

# After every inference:
request_id = audit.log(
    user_id="user_123",
    session_id="sess_456",
    raw_question=user_question,          # used for masking check only, not stored
    masked_question=masking.masked_text,
    answer=result["answer"],
    sources=result["sources"],
    confidence=result["confidence"],
    model="claude-sonnet-4-6",
    prompt_version="v2.3",
    input_tokens=response.usage.input_tokens,
    output_tokens=response.usage.output_tokens,
    latency_ms=latency,
    pii_detected=masking.pii_found,
)
```

Where to store audit logs:
- Development: local JSONL file
- Production: Azure Blob Storage, S3, or a dedicated audit database
- Retention: check your legal requirements — financial services may require 7+ years
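Retention also needs enforcing in code, not just policy. Here is a minimal sketch against the local JSONL file from Pillar 2; the rewrite-in-place approach and the `retention_days` cutoff are illustrative assumptions, not a compliance recommendation. In production stores (Azure Blob, S3) you would use lifecycle rules instead.

```python
import json
from datetime import datetime, timedelta, timezone
from pathlib import Path

def prune_audit_log(log_file: str, retention_days: int) -> int:
    """Rewrite a JSONL audit log, dropping records older than the
    retention window. Returns the number of records removed."""
    path = Path(log_file)
    if not path.exists():
        return 0
    cutoff = datetime.now(timezone.utc) - timedelta(days=retention_days)
    kept, removed = [], 0
    for line in path.read_text(encoding="utf-8").splitlines():
        record = json.loads(line)
        # Timestamps were written with datetime.isoformat(), so they
        # round-trip through fromisoformat() as timezone-aware values
        if datetime.fromisoformat(record["timestamp"]) >= cutoff:
            kept.append(line)
        else:
            removed += 1
    path.write_text("\n".join(kept) + ("\n" if kept else ""), encoding="utf-8")
    return removed
```

Run it on a schedule (e.g. a nightly job) so the retention period you documented is the retention period you actually have.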
Pillar 3: Confidence Scoring and Graceful Fallbacks
When your AI system is not confident enough to answer, it should say so — not guess.
```python
from dataclasses import dataclass

@dataclass
class ConfidenceResult:
    score: float            # 0.0 to 1.0
    level: str              # "high", "medium", "low"
    should_fallback: bool
    disclaimer: str | None

UNCERTAINTY_SIGNALS = [
    "i don't know", "i'm not sure", "unclear", "uncertain",
    "cannot confirm", "might be", "possibly", "i think",
    "not certain", "may not be accurate",
]

def score_confidence(
    answer: str,
    retrieved_chunks: list[str],
    retrieval_distances: list[float],
) -> ConfidenceResult:
    score = 1.0

    # Penalty 1: uncertainty language in the answer
    answer_lower = answer.lower()
    uncertainty_count = sum(1 for signal in UNCERTAINTY_SIGNALS if signal in answer_lower)
    score -= uncertainty_count * 0.15

    # Penalty 2: poor retrieval (high cosine distance = low similarity)
    if retrieval_distances:
        avg_distance = sum(retrieval_distances) / len(retrieval_distances)
        if avg_distance > 0.3:      # poor similarity
            score -= 0.25
        elif avg_distance > 0.2:
            score -= 0.10

    # Penalty 3: no sources found
    if not retrieved_chunks:
        score = 0.0

    # Bonus: answer grounded in retrieved text
    if retrieved_chunks:
        answer_words = set(answer_lower.split())
        grounding = sum(
            len(answer_words & set(chunk.lower().split()))
            for chunk in retrieved_chunks
        ) / (len(answer_words) + 1)
        score = min(1.0, score + grounding * 0.2)

    score = max(0.0, min(1.0, score))

    # Classify level and determine action
    if score >= 0.8:
        return ConfidenceResult(score=score, level="high", should_fallback=False, disclaimer=None)
    elif score >= 0.6:
        return ConfidenceResult(
            score=score,
            level="medium",
            should_fallback=False,
            disclaimer="⚠️ Moderate confidence. Please verify with the source documents.",
        )
    else:
        return ConfidenceResult(
            score=score,
            level="low",
            should_fallback=True,
            disclaimer=None,  # We won't show the low-confidence answer at all
        )

# In your query pipeline:
confidence = score_confidence(answer, chunks, distances)

if confidence.should_fallback:
    return {
        "answer": (
            "I don't have enough reliable information to answer this confidently. "
            "Please contact the relevant team directly, or check the source documents."
        ),
        "sources": [],
        "confidence": confidence.score,
        "fallback": True,
    }

response = {
    "answer": answer,
    "sources": sources,
    "confidence": confidence.score,
}

if confidence.disclaimer:
    response["disclaimer"] = confidence.disclaimer

return response
```

Pillar 4: Content Safety — Input and Output Guardrails
You need to protect against:
- Prompt injection: users trying to hijack the system prompt
- Harmful outputs: the model producing dangerous or inappropriate content
- Jailbreaks: users trying to bypass your constraints
```python
# Input guardrails
INJECTION_PATTERNS = [
    r"ignore previous instructions",
    r"forget your system prompt",
    r"you are now",
    r"act as",
    r"new instruction",
    r"override.*rules",
]

def check_prompt_injection(text: str) -> bool:
    """Returns True if injection attempt detected."""
    text_lower = text.lower()
    return any(re.search(pattern, text_lower) for pattern in INJECTION_PATTERNS)

# Output guardrails
def validate_output(answer: str) -> tuple[bool, str | None]:
    """
    Returns (is_safe, rejection_reason).
    Extend with your specific business rules.
    """
    # Too short to be a real answer
    if len(answer.strip()) < 10:
        return False, "Response too short"

    # Contains refusal markers (model declined to answer)
    refusal_markers = ["i cannot", "i'm unable to", "i won't", "as an ai"]
    if any(marker in answer.lower() for marker in refusal_markers):
        return False, "Model declined"

    # Check for PII in output (model should never output personal data)
    output_masking = mask_pii(answer)
    if output_masking.pii_found:
        return False, f"Output contained PII: {output_masking.types_found}"

    return True, None

# Usage
if check_prompt_injection(user_question):
    audit_log({"event": "injection_attempt", "user_id": user_id})
    return {"answer": "I cannot process that request.", "sources": [], "confidence": 0}

is_safe, rejection_reason = validate_output(answer)
if not is_safe:
    logger.warning(f"Output rejected: {rejection_reason}")
    return {"answer": "Unable to generate a safe response.", "sources": [], "confidence": 0}
```

Pillar 5: The Model Card — Documenting What You Built
A model card is a one-page document that says: here is what this system does, what it is good at, what it gets wrong, and who is responsible for it.
It is the difference between shipping an AI system and shipping a responsible AI system.
```markdown
# Model Card: Internal Policy Assistant v2.3

## Overview
- **Purpose**: Answer questions about M&S HR and operational policies
- **Model**: claude-sonnet-4-6 (Anthropic) via Azure OpenAI proxy
- **Retrieval**: Azure AI Search (hybrid dense + sparse)
- **Deployed**: April 2026 | **Owner**: AI Engineering Team

## Intended Use
- ✅ Answering questions about company policies
- ✅ Summarising policy documents
- ❌ Making HR decisions (disciplinary, pay, leave approval)
- ❌ Legal or medical advice
- ❌ Questions outside the indexed document set

## Performance (on golden dataset, April 2026)
| Metric | Score | Threshold |
|--------|-------|-----------|
| Faithfulness | 0.94 | > 0.85 |
| Answer relevancy | 0.91 | > 0.80 |
| Context precision | 0.88 | > 0.75 |
| Avg latency | 1.2s | < 3.0s |

## Known Failure Modes
1. **Multi-policy questions**: questions spanning 3+ policies see 15% lower accuracy
2. **Ambiguous acronyms**: "WFH" correctly resolved, "TC" ambiguous (time-off-in-lieu vs technical change)
3. **Very recent policy updates**: new documents take 24h to appear in the index

## Governance
- PII masking: applied to all inputs before LLM call (regex + Presidio)
- Audit logging: every inference logged to Azure Blob (90-day retention)
- Confidence threshold: responses below 0.6 score trigger fallback message
- Human escalation: "Contact HR" path always available in UI
- Data residency: all processing within UK Azure regions

## Responsible AI Considerations
- No personal data stored in vector index
- No autonomous decisions — system is advisory only
- Output includes source citations for verification
- Users informed they are interacting with an AI assistant

## Contact
AI Engineering: vijay.anand@company.com
HR System Owner: hr-systems@company.com
```

Understanding the EU AI Act
If you are building AI systems in the UK/EU, you need to know this.
The EU AI Act (in force August 2024) classifies AI systems by risk:
```text
UNACCEPTABLE RISK → Banned
  Examples: social scoring systems, real-time biometric surveillance in public
  Action: Do not build these

HIGH RISK → Strict requirements
  Examples: AI in hiring, credit scoring, healthcare, law enforcement
  Requirements:
  - Conformity assessment
  - Human oversight mandatory
  - Audit logs (10-year retention)
  - Accuracy, robustness, cybersecurity requirements
  - Transparency to affected individuals
  - Register in EU AI database

LIMITED RISK → Transparency obligations
  Examples: chatbots, deepfakes
  Requirements:
  - Must disclose that users are interacting with AI
  - Label AI-generated content

MINIMAL RISK → No specific requirements
  Examples: spam filters, AI in video games
  Action: Good practice still recommended
```

Classifying common AI systems:
| System | Risk Tier | Key requirement |
|---|---|---|
| HR CV screening tool | High risk | Human reviews all decisions, audit trail |
| Customer chatbot | Limited risk | "You are chatting with an AI" disclosure |
| Internal document Q&A | Minimal risk | Good practice: confidence scores, fallback |
| Credit scoring | High risk | Explainability, bias testing, human override |
| Product recommendations | Minimal risk | No specific requirements |
| Medical diagnosis support | High risk | CE marking equivalent, clinical validation |
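The classification table above reads naturally as a lookup. Below is a first-pass triage sketch; the system keys and obligation summaries are illustrative restatements of the table, not a legal determination, and any real classification needs expert review.

```python
# Illustrative first-pass EU AI Act triage, mirroring the table above.
# NOT legal advice — classifying a real system needs expert review.
RISK_TIERS: dict[str, str] = {
    "hr_cv_screening": "high",
    "customer_chatbot": "limited",
    "internal_document_qa": "minimal",
    "credit_scoring": "high",
    "product_recommendations": "minimal",
    "medical_diagnosis_support": "high",
}

TIER_OBLIGATIONS: dict[str, str] = {
    "high": "Conformity assessment, human oversight, long-retention audit logs",
    "limited": "Disclose AI interaction, label AI-generated content",
    "minimal": "No specific requirements; good practice still recommended",
}

def triage(system: str) -> tuple[str, str]:
    """Return (tier, headline obligation) for a known system type.
    Unknown systems default to 'high' so they get reviewed, not waved through."""
    tier = RISK_TIERS.get(system, "high")
    return tier, TIER_OBLIGATIONS[tier]
```

Defaulting unknown systems to "high" is deliberate: anything not explicitly triaged triggers a human review rather than getting a free pass.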
Bias Testing — The Step Most Teams Skip
An AI system that performs well on average can still systematically fail for specific groups. You need to test for this.
```python
def bias_audit(test_cases: list[dict]) -> dict:
    """
    Test whether the AI performs consistently across demographic groups.
    test_cases: [{"question": "...", "group": "group_A", "expected": "..."}]
    """
    results_by_group = {}
    for case in test_cases:
        answer = rag_query(case["question"])["answer"]
        correct = case["expected"].lower() in answer.lower()
        group = case["group"]
        if group not in results_by_group:
            results_by_group[group] = {"correct": 0, "total": 0}
        results_by_group[group]["total"] += 1
        if correct:
            results_by_group[group]["correct"] += 1

    # Calculate accuracy per group
    accuracy_by_group = {
        group: data["correct"] / data["total"]
        for group, data in results_by_group.items()
    }

    # Alert if any group is significantly worse than average
    avg_accuracy = sum(accuracy_by_group.values()) / len(accuracy_by_group)
    for group, accuracy in accuracy_by_group.items():
        if accuracy < avg_accuracy - 0.10:  # >10% below average
            print(f"⚠️ Potential bias: {group} accuracy {accuracy:.2f} vs avg {avg_accuracy:.2f}")

    return accuracy_by_group

# Run quarterly
bias_results = bias_audit(bias_test_dataset)
```

The Governance Checklist
Before any AI system goes to production:
Data Protection
- ☐ PII masking on all inputs before LLM call
- ☐ PII validation on all outputs
- ☐ No personal data in vector index
- ☐ Data residency requirements met

Audit Trail
- ☐ Every inference logged (masked question, answer, confidence, model, user_id)
- ☐ Retention period defined and enforced
- ☐ Logs accessible for incident investigation

Quality & Safety
- ☐ Confidence scoring with fallback behaviour
- ☐ Prompt injection detection
- ☐ Output content validation
- ☐ Graceful failure modes tested

Documentation
- ☐ Model card written and approved
- ☐ Intended use and limitations documented
- ☐ Known failure modes documented
- ☐ Human escalation path defined

Compliance
- ☐ EU/UK AI Act risk tier assessed
- ☐ If high-risk: conformity assessment completed
- ☐ User disclosure ("you are interacting with AI") in place if required
- ☐ Bias audit completed on representative test cases

Ongoing
- ☐ Weekly quality evaluation on golden dataset
- ☐ Quarterly bias audit
- ☐ Incident response process defined
- ☐ Model card review cadence set

Putting It All Together
Here is the full governed query pipeline — the code that runs on every single inference:
```python
def governed_query(user_question: str, user_id: str, session_id: str) -> dict:
    start = time.time()
    request_id = str(uuid.uuid4())

    # 1. Input validation
    if len(user_question) > 2000:
        return {"error": "Question too long", "request_id": request_id}

    # 2. Injection detection
    if check_prompt_injection(user_question):
        audit_log({"event": "injection_attempt", "user_id": user_id, "request_id": request_id})
        return {"answer": "I cannot process that request.", "request_id": request_id}

    # 3. PII masking
    masking = mask_pii(user_question)

    # 4. RAG query (using masked text)
    result = rag_query(masking.masked_text)

    # 5. Output validation
    is_safe, rejection_reason = validate_output(result["answer"])
    if not is_safe:
        logger.warning(f"Output rejected [{request_id}]: {rejection_reason}")
        result["answer"] = "Unable to generate a safe response. Please contact support."
        result["confidence"] = 0.0

    # 6. Confidence scoring
    confidence = score_confidence(result["answer"], result.get("chunks", []), result.get("distances", []))
    if confidence.should_fallback:
        result["answer"] = (
            "I don't have enough reliable information to answer this question confidently. "
            "Please check the source documents directly or contact your team."
        )
        result["confidence"] = confidence.score
        result["fallback"] = True
    if confidence.disclaimer:
        result["disclaimer"] = confidence.disclaimer

    # 7. Audit log (everything, with masking)
    latency_ms = int((time.time() - start) * 1000)
    audit.log(
        user_id=user_id,
        session_id=session_id,
        raw_question=user_question,          # used for masking only
        masked_question=masking.masked_text,
        answer=result["answer"],
        sources=result.get("sources", []),
        confidence=result.get("confidence", 0.0),
        model="claude-sonnet-4-6",
        prompt_version="v2.3",
        input_tokens=result.get("input_tokens", 0),
        output_tokens=result.get("output_tokens", 0),
        latency_ms=latency_ms,
        pii_detected=masking.pii_found,
    )

    result["request_id"] = request_id
    result["latency_ms"] = latency_ms
    return result
```

Series Wrap-Up
You have completed the full journey from data/software engineer to AI engineer.
Here is what you now know how to do:
| Part | You can now... |
|---|---|
| 1 — Mindset | Explain why AI systems require different engineering instincts |
| 2 — LLMs | Explain how a language model works, what tokens/embeddings/attention are |
| 3 — Prompts | Write production-grade prompts using 6 core techniques |
| 4 — RAG | Build a complete RAG pipeline with evaluation |
| 5 — Agents | Build an agent with tools using the ReAct pattern |
| 6 — Production | Deploy, monitor, and control costs for an AI system |
| 7 — Governance | Implement PII masking, audit logging, confidence scoring, and model cards |
The AI engineer who implements all seven parts is not just building AI. They are building AI responsibly — and that is what companies pay senior rates for.
What Next?
Now that you have the foundations:
- Build something real — pick one use case from your current work and build it end-to-end using this series as a reference
- Get the Databricks GenAI Associate cert — validates your RAG and LLM knowledge formally
- Study LangGraph — for production-grade stateful agent workflows
- Read the EU AI Act summary — one hour, essential for any enterprise AI role
- Write your own model card — for something you have already built
The gap between knowing these concepts and building with them closes the moment you ship something real.
Written by Vijay Anand Pandian — AI Tech Lead & Senior Data Engineer at M&S Sparks. Building governed AI systems that bridge business and engineering in London, UK.