From Data Engineer to AI Engineer — Part 7: AI Governance — Building AI You Can Trust

Series: From Data/Software Engineer to AI Engineer — Part 7 of 7


Why Governance Is an Engineering Problem

"AI governance" sounds like something for the compliance team. It is not.

Every technical decision you make is a governance decision:

  • How you handle user data in prompts
  • Whether you log model outputs for audit
  • How you handle low-confidence responses
  • Whether you let the model make autonomous decisions

The engineer who says "governance isn't my job" is the one who gets called when the model leaks a customer's personal data or gives a confident, wrong answer that costs the business money.

This is not abstract. It is concrete code you write today.


The Five Governance Pillars (With Code)

Pillar 1: PII Protection — Never Let Personal Data Into the LLM Unmasked

The problem: users ask questions that contain personal information. If you log those queries, you have a data protection issue. If you send them to a third-party LLM API, you may violate GDPR.

The solution: mask before it ever leaves your system.

import re
from dataclasses import dataclass

@dataclass
class MaskingResult:
    masked_text: str
    pii_found: bool
    types_found: list[str]

# UK-specific PII patterns
PII_PATTERNS = {
    "EMAIL": r"[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}",
    "UK_PHONE": r"(\+44\s?|0)(\d\s?){9,10}",
    "UK_NI_NUMBER": r"[A-Z]{2}\s?\d{2}\s?\d{2}\s?\d{2}\s?[A-D]",
    "UK_POSTCODE": r"[A-Z]{1,2}\d[A-Z\d]?\s?\d[A-Z]{2}",
    "CARD_NUMBER": r"\b\d{4}[\s-]?\d{4}[\s-]?\d{4}[\s-]?\d{4}\b",
    "SORT_CODE": r"\d{2}-\d{2}-\d{2}",
    "DATE_OF_BIRTH": r"\b\d{1,2}[\/\-]\d{1,2}[\/\-]\d{4}\b",
}

def mask_pii(text: str) -> MaskingResult:
    masked = text
    found_types = []
    for pii_type, pattern in PII_PATTERNS.items():
        if re.search(pattern, masked, re.IGNORECASE):
            found_types.append(pii_type)
            masked = re.sub(pattern, f"[{pii_type}]", masked, flags=re.IGNORECASE)
    return MaskingResult(
        masked_text=masked,
        pii_found=len(found_types) > 0,
        types_found=found_types,
    )

# Usage — always mask before sending to LLM or logging
user_input = "My email is john.smith@company.com and NI number is AB123456C"
result = mask_pii(user_input)
print(result.masked_text)
# "My email is [EMAIL] and NI number is [UK_NI_NUMBER]"
print(result.pii_found)    # True
print(result.types_found)  # ["EMAIL", "UK_NI_NUMBER"]

# In your query pipeline:
def safe_query(user_question: str, user_id: str) -> dict:
    masking = mask_pii(user_question)
    if masking.pii_found:
        # Log the incident — someone submitted PII
        audit_log({
            "event": "pii_detected",
            "user_id": user_id,
            "pii_types": masking.types_found,
            # Note: log that PII was found, NOT the actual PII
        })
    # Always use masked version for the LLM call
    return rag_query(masking.masked_text)

For production: consider Microsoft Presidio — a more comprehensive, ML-backed PII detection library that handles names, addresses, and more nuanced PII.

# pip install presidio-analyzer presidio-anonymizer
from presidio_analyzer import AnalyzerEngine
from presidio_anonymizer import AnonymizerEngine

analyzer = AnalyzerEngine()
anonymizer = AnonymizerEngine()

def mask_pii_presidio(text: str) -> str:
    results = analyzer.analyze(text=text, language="en")
    anonymized = anonymizer.anonymize(text=text, analyzer_results=results)
    return anonymized.text

# More accurate than regex for names and complex PII
mask_pii_presidio("John Smith from London called us on 07700 900123")
# "<PERSON> from <LOCATION> called us on <PHONE_NUMBER>"

Pillar 2: Audit Logging — Every Inference Is a Record

Every time your AI system produces an output, that output can be:

  • Incorrect and acted upon (business risk)
  • Evidence in a dispute (legal risk)
  • Data for improving the system (operational value)

You need an immutable log of every inference.

import json
import uuid
from datetime import datetime, timezone
from pathlib import Path

class AuditLogger:
    def __init__(self, log_file: str = "audit.jsonl"):
        self.log_file = Path(log_file)

    def log(
        self,
        user_id: str,
        session_id: str,
        raw_question: str,     # NEVER store raw if it contains PII
        masked_question: str,  # Store this instead
        answer: str,
        sources: list[str],
        confidence: float,
        model: str,
        prompt_version: str,
        input_tokens: int,
        output_tokens: int,
        latency_ms: int,
        pii_detected: bool,
    ) -> str:
        request_id = str(uuid.uuid4())
        record = {
            "request_id": request_id,
            "timestamp": datetime.now(timezone.utc).isoformat(),
            "user_id": user_id,
            "session_id": session_id,
            # PII-safe: log masked question, not raw
            "question": masked_question,
            "answer_length": len(answer),  # length, not content, for privacy
            "answer_preview": answer[:100] + "..." if len(answer) > 100 else answer,
            "sources": sources,
            "confidence": round(confidence, 3),
            "model": model,
            "prompt_version": prompt_version,
            "input_tokens": input_tokens,
            "output_tokens": output_tokens,
            # Example per-token rates — substitute your model's actual pricing
            "cost_usd": round((input_tokens * 0.000003) + (output_tokens * 0.000015), 6),
            "latency_ms": latency_ms,
            "pii_detected": pii_detected,
        }
        # JSONL format — one record per line, easy to parse and stream
        with open(self.log_file, "a", encoding="utf-8") as f:
            f.write(json.dumps(record) + "\n")
        return request_id

audit = AuditLogger("logs/audit.jsonl")

# After every inference:
request_id = audit.log(
    user_id="user_123",
    session_id="sess_456",
    raw_question=user_question,  # used for masking check only, not stored
    masked_question=masking.masked_text,
    answer=result["answer"],
    sources=result["sources"],
    confidence=result["confidence"],
    model="claude-sonnet-4-6",
    prompt_version="v2.3",
    input_tokens=response.usage.input_tokens,
    output_tokens=response.usage.output_tokens,
    latency_ms=latency,
    pii_detected=masking.pii_found,
)

Where to store audit logs:

  • Development: local JSONL file
  • Production: Azure Blob Storage, S3, or a dedicated audit database
  • Retention: check your legal requirements — financial services may require 7+ years
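A retention period is only real if something enforces it. For the development-time JSONL file, a minimal pruning job might look like the sketch below (the `timestamp` field name matches the `AuditLogger` record above; in production you would lean on S3 or Blob Storage lifecycle policies instead of rolling your own):

```python
import json
from datetime import datetime, timedelta, timezone
from pathlib import Path

def prune_audit_log(log_file: str, retention_days: int = 90) -> int:
    """Rewrite the JSONL audit log, keeping only records inside the
    retention window. Returns the number of records dropped."""
    path = Path(log_file)
    if not path.exists():
        return 0
    cutoff = datetime.now(timezone.utc) - timedelta(days=retention_days)
    kept: list[str] = []
    dropped = 0
    with open(path, encoding="utf-8") as f:
        for line in f:
            record = json.loads(line)
            ts = datetime.fromisoformat(record["timestamp"])
            if ts >= cutoff:
                kept.append(line)
            else:
                dropped += 1
    # Write to a temp file, then replace, so a crash never truncates the log
    tmp = path.with_suffix(".tmp")
    with open(tmp, "w", encoding="utf-8") as f:
        f.writelines(kept)
    tmp.replace(path)
    return dropped
```

Run it from the same scheduled job that ships logs to long-term storage, so pruning never races a live writer.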

Pillar 3: Confidence Scoring and Graceful Fallbacks

When your AI system is not confident enough to answer, it should say so — not guess.

from dataclasses import dataclass

@dataclass
class ConfidenceResult:
    score: float          # 0.0 to 1.0
    level: str            # "high", "medium", "low"
    should_fallback: bool
    disclaimer: str | None

UNCERTAINTY_SIGNALS = [
    "i don't know", "i'm not sure", "unclear", "uncertain",
    "cannot confirm", "might be", "possibly", "i think",
    "not certain", "may not be accurate",
]

def score_confidence(
    answer: str,
    retrieved_chunks: list[str],
    retrieval_distances: list[float],
) -> ConfidenceResult:
    score = 1.0

    # Penalty 1: uncertainty language in the answer
    answer_lower = answer.lower()
    uncertainty_count = sum(1 for signal in UNCERTAINTY_SIGNALS if signal in answer_lower)
    score -= uncertainty_count * 0.15

    # Penalty 2: poor retrieval (high cosine distance = low similarity)
    if retrieval_distances:
        avg_distance = sum(retrieval_distances) / len(retrieval_distances)
        if avg_distance > 0.3:    # poor similarity
            score -= 0.25
        elif avg_distance > 0.2:
            score -= 0.10

    # Penalty 3: no sources found
    if not retrieved_chunks:
        score = 0.0

    # Bonus: answer grounded in retrieved text
    if retrieved_chunks:
        answer_words = set(answer_lower.split())
        grounding = sum(
            len(answer_words & set(chunk.lower().split()))
            for chunk in retrieved_chunks
        ) / (len(answer_words) + 1)
        score = min(1.0, score + grounding * 0.2)

    score = max(0.0, min(1.0, score))

    # Classify level and determine action
    if score >= 0.8:
        return ConfidenceResult(score=score, level="high", should_fallback=False, disclaimer=None)
    elif score >= 0.6:
        return ConfidenceResult(
            score=score, level="medium", should_fallback=False,
            disclaimer="⚠️ Moderate confidence. Please verify with the source documents.",
        )
    else:
        return ConfidenceResult(
            score=score, level="low", should_fallback=True,
            disclaimer=None,  # We won't show the low-confidence answer at all
        )

# In your query pipeline:
confidence = score_confidence(answer, chunks, distances)
if confidence.should_fallback:
    return {
        "answer": (
            "I don't have enough reliable information to answer this confidently. "
            "Please contact the relevant team directly, or check the source documents."
        ),
        "sources": [],
        "confidence": confidence.score,
        "fallback": True,
    }
response = {
    "answer": answer,
    "sources": sources,
    "confidence": confidence.score,
}
if confidence.disclaimer:
    response["disclaimer"] = confidence.disclaimer
return response

Pillar 4: Content Safety — Input and Output Guardrails

You need to protect against:

  • Prompt injection: users trying to hijack the system prompt
  • Harmful outputs: the model producing dangerous or inappropriate content
  • Jailbreaks: users trying to bypass your constraints

import re

# Input guardrails
INJECTION_PATTERNS = [
    r"ignore previous instructions",
    r"forget your system prompt",
    r"you are now",
    r"act as",
    r"new instruction",
    r"override.*rules",
]

def check_prompt_injection(text: str) -> bool:
    """Returns True if injection attempt detected."""
    text_lower = text.lower()
    return any(re.search(pattern, text_lower) for pattern in INJECTION_PATTERNS)

# Output guardrails
def validate_output(answer: str) -> tuple[bool, str | None]:
    """
    Returns (is_safe, rejection_reason).
    Extend with your specific business rules.
    """
    # Too short to be a real answer
    if len(answer.strip()) < 10:
        return False, "Response too short"

    # Contains refusal markers (model declined to answer)
    refusal_markers = ["i cannot", "i'm unable to", "i won't", "as an ai"]
    if any(marker in answer.lower() for marker in refusal_markers):
        return False, "Model declined"

    # Check for PII in output (model should never output personal data)
    output_masking = mask_pii(answer)
    if output_masking.pii_found:
        return False, f"Output contained PII: {output_masking.types_found}"

    return True, None

# Usage
if check_prompt_injection(user_question):
    audit_log({"event": "injection_attempt", "user_id": user_id})
    return {"answer": "I cannot process that request.", "sources": [], "confidence": 0}

is_safe, rejection_reason = validate_output(answer)
if not is_safe:
    logger.warning(f"Output rejected: {rejection_reason}")
    return {"answer": "Unable to generate a safe response.", "sources": [], "confidence": 0}

Pillar 5: The Model Card — Documenting What You Built

A model card is a one-page document that says: here is what this system does, what it is good at, what it gets wrong, and who is responsible for it.

It is the difference between shipping an AI system and shipping a responsible AI system.

# Model Card: Internal Policy Assistant v2.3
## Overview
- **Purpose**: Answer questions about M&S HR and operational policies
- **Model**: claude-sonnet-4-6 (Anthropic) via Azure OpenAI proxy
- **Retrieval**: Azure AI Search (hybrid dense + sparse)
- **Deployed**: April 2026 | **Owner**: AI Engineering Team
## Intended Use
- ✅ Answering questions about company policies
- ✅ Summarising policy documents
- ❌ Making HR decisions (disciplinary, pay, leave approval)
- ❌ Legal or medical advice
- ❌ Questions outside the indexed document set
## Performance (on golden dataset, April 2026)
| Metric | Score | Threshold |
|--------|-------|-----------|
| Faithfulness | 0.94 | > 0.85 |
| Answer relevancy | 0.91 | > 0.80 |
| Context precision | 0.88 | > 0.75 |
| Avg latency | 1.2s | < 3.0s |
## Known Failure Modes
1. **Multi-policy questions**: questions spanning 3+ policies see 15% lower accuracy
2. **Ambiguous acronyms**: "WFH" correctly resolved, "TC" ambiguous (time-off-in-lieu vs technical change)
3. **Very recent policy updates**: new documents take 24h to appear in the index
## Governance
- PII masking: applied to all inputs before LLM call (regex + Presidio)
- Audit logging: every inference logged to Azure Blob (90-day retention)
- Confidence threshold: responses below 0.6 score trigger fallback message
- Human escalation: "Contact HR" path always available in UI
- Data residency: all processing within UK Azure regions
## Responsible AI Considerations
- No personal data stored in vector index
- No autonomous decisions — system is advisory only
- Output includes source citations for verification
- Users informed they are interacting with an AI assistant
## Contact
AI Engineering: vijay.anand@company.com
HR System Owner: hr-systems@company.com
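One way to keep model cards honest is to lint them in CI. A small sketch that checks a card's markdown for required sections — the section list mirrors the example card above and is this article's convention, not an industry standard:

```python
# Required '## ' sections — mirrors the example card in this article
REQUIRED_SECTIONS = [
    "Overview", "Intended Use", "Performance",
    "Known Failure Modes", "Governance",
    "Responsible AI Considerations", "Contact",
]

def missing_sections(card_markdown: str) -> list[str]:
    """Return the required sections absent from a model card's markdown."""
    headings = {
        line.removeprefix("## ").strip()
        for line in card_markdown.splitlines()
        if line.startswith("## ")
    }
    # startswith() tolerates suffixes like "Performance (on golden dataset)"
    return [s for s in REQUIRED_SECTIONS if not any(h.startswith(s) for h in headings)]
```

Fail the build when the list is non-empty, and the card can never silently drift behind the system it documents.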

Understanding the EU AI Act

If you are building AI systems in the UK/EU, you need to know this.

The EU AI Act (in force August 2024) classifies AI systems by risk:

UNACCEPTABLE RISK → Banned
Examples: social scoring systems, real-time biometric surveillance in public
Action: Do not build these

HIGH RISK → Strict requirements
Examples: AI in hiring, credit scoring, healthcare, law enforcement
Requirements:
- Conformity assessment
- Human oversight mandatory
- Logging and record-keeping (technical documentation retained for 10 years)
- Accuracy, robustness, and cybersecurity requirements
- Transparency to affected individuals
- Registration in the EU database for high-risk AI systems

LIMITED RISK → Transparency obligations
Examples: chatbots, deepfakes
Requirements:
- Must disclose that users are interacting with AI
- Label AI-generated content

MINIMAL RISK → No specific requirements
Examples: spam filters, AI in video games
Action: Good practice still recommended

Classifying common AI systems:

| System | Risk tier | Key requirement |
|--------|-----------|-----------------|
| HR CV screening tool | High risk | Human reviews all decisions, audit trail |
| Customer chatbot | Limited risk | "You are chatting with an AI" disclosure |
| Internal document Q&A | Minimal risk | Good practice: confidence scores, fallback |
| Credit scoring | High risk | Explainability, bias testing, human override |
| Product recommendations | Minimal risk | No specific requirements |
| Medical diagnosis support | High risk | CE marking equivalent, clinical validation |
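If you operate several AI systems, recording each one's tier in code lets a deployment gate enforce tier-specific controls automatically. A minimal sketch — the tier assignments and control names below are illustrative only, and a real classification needs legal review:

```python
from enum import Enum

class RiskTier(Enum):
    UNACCEPTABLE = "unacceptable"
    HIGH = "high"
    LIMITED = "limited"
    MINIMAL = "minimal"

# Controls a CI / deployment gate can check per tier (illustrative names)
TIER_REQUIREMENTS = {
    RiskTier.HIGH: ["human_oversight", "audit_logs", "conformity_assessment"],
    RiskTier.LIMITED: ["ai_disclosure"],
    RiskTier.MINIMAL: [],
}

# Example registry — tiers taken from the classification table above
SYSTEM_TIERS = {
    "hr_cv_screening": RiskTier.HIGH,
    "customer_chatbot": RiskTier.LIMITED,
    "internal_doc_qa": RiskTier.MINIMAL,
}

def required_controls(system: str) -> list[str]:
    """Look up the controls a system must have before deployment."""
    tier = SYSTEM_TIERS[system]
    if tier is RiskTier.UNACCEPTABLE:
        raise ValueError(f"{system}: unacceptable-risk systems must not be built")
    return TIER_REQUIREMENTS[tier]
```

A deployment pipeline can then refuse to ship any system whose required controls are not all present in its config.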

Bias Testing — The Step Most Teams Skip

An AI system that performs well on average can still systematically fail for specific groups. You need to test for this.

def bias_audit(test_cases: list[dict]) -> dict:
    """
    Test whether the AI performs consistently across demographic groups.
    test_cases: [{"question": "...", "group": "group_A", "expected": "..."}]
    """
    results_by_group = {}
    for case in test_cases:
        answer = rag_query(case["question"])["answer"]
        correct = case["expected"].lower() in answer.lower()
        group = case["group"]
        if group not in results_by_group:
            results_by_group[group] = {"correct": 0, "total": 0}
        results_by_group[group]["total"] += 1
        if correct:
            results_by_group[group]["correct"] += 1

    # Calculate accuracy per group
    accuracy_by_group = {
        group: data["correct"] / data["total"]
        for group, data in results_by_group.items()
    }

    # Alert if any group is significantly worse than average
    avg_accuracy = sum(accuracy_by_group.values()) / len(accuracy_by_group)
    for group, accuracy in accuracy_by_group.items():
        if accuracy < avg_accuracy - 0.10:  # >10% below average
            print(f"⚠️ Potential bias: {group} accuracy {accuracy:.2f} vs avg {avg_accuracy:.2f}")

    return accuracy_by_group

# Run quarterly
bias_results = bias_audit(bias_test_dataset)
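The fixed 10% threshold above can fire on noise when per-group sample sizes are small. One refinement is a two-proportion z-test between groups, sketched here with only the standard library (1.96 is the two-sided 5% critical value):

```python
import math

def proportion_gap_significant(
    correct_a: int, total_a: int,
    correct_b: int, total_b: int,
    z_critical: float = 1.96,
) -> bool:
    """Two-proportion z-test: is the accuracy gap between groups A and B
    statistically significant at the given critical value?"""
    p_a = correct_a / total_a
    p_b = correct_b / total_b
    # Pooled proportion under the null hypothesis of equal accuracy
    pooled = (correct_a + correct_b) / (total_a + total_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / total_a + 1 / total_b))
    if se == 0:
        return False  # degenerate case: all correct or all wrong in both groups
    z = (p_a - p_b) / se
    return abs(z) > z_critical
```

With 100 questions per group, a 50% vs 80% gap is clearly significant; with 4 questions per group, a 50% vs 75% gap is not — which is exactly why the raw threshold alone misleads on small test sets.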

The Governance Checklist

Before any AI system goes to production:

Data Protection
☐ PII masking on all inputs before LLM call
☐ PII validation on all outputs
☐ No personal data in vector index
☐ Data residency requirements met
Audit Trail
☐ Every inference logged (masked question, answer, confidence, model, user_id)
☐ Retention period defined and enforced
☐ Logs accessible for incident investigation
Quality & Safety
☐ Confidence scoring with fallback behaviour
☐ Prompt injection detection
☐ Output content validation
☐ Graceful failure modes tested
Documentation
☐ Model card written and approved
☐ Intended use and limitations documented
☐ Known failure modes documented
☐ Human escalation path defined
Compliance
☐ EU/UK AI Act risk tier assessed
☐ If high-risk: conformity assessment completed
☐ User disclosure ("you are interacting with AI") in place if required
☐ Bias audit completed on representative test cases
Ongoing
☐ Weekly quality evaluation on golden dataset
☐ Quarterly bias audit
☐ Incident response process defined
☐ Model card review cadence set
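Some of these boxes can be checked by a machine. A sketch of an automated release gate, assuming a simple deployment-config dict — the key names here are hypothetical and would map to whatever your pipeline actually records:

```python
def governance_preflight(config: dict) -> list[str]:
    """Return blocking failures; an empty list means the automatable
    checks pass. Human-judgment items (bias audit sign-off, model card
    review) still need a manual gate."""
    failures = []
    # Boolean flags the deployment config must assert (hypothetical keys)
    checks = {
        "pii_masking_enabled": "PII masking not enabled on inputs",
        "output_validation_enabled": "Output PII/content validation not enabled",
        "audit_logging_enabled": "Audit logging not enabled",
        "model_card_approved": "Model card not approved",
        "ai_disclosure_shown": "AI-interaction disclosure missing",
    }
    for key, message in checks.items():
        if not config.get(key, False):
            failures.append(message)
    # Non-boolean settings that must be present and sane
    if config.get("confidence_fallback_threshold") is None:
        failures.append("No confidence fallback threshold configured")
    if config.get("log_retention_days", 0) <= 0:
        failures.append("Log retention period not defined")
    return failures
```

Wire it into CI so a deployment fails loudly with the list of missing controls, instead of shipping with a half-completed checklist.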

Putting It All Together

Here is the full governed query pipeline — the code that runs on every single inference:

import time
import uuid

def governed_query(user_question: str, user_id: str, session_id: str) -> dict:
    start = time.time()
    request_id = str(uuid.uuid4())

    # 1. Input validation
    if len(user_question) > 2000:
        return {"error": "Question too long", "request_id": request_id}

    # 2. Injection detection
    if check_prompt_injection(user_question):
        audit_log({"event": "injection_attempt", "user_id": user_id, "request_id": request_id})
        return {"answer": "I cannot process that request.", "request_id": request_id}

    # 3. PII masking
    masking = mask_pii(user_question)

    # 4. RAG query (using masked text)
    result = rag_query(masking.masked_text)

    # 5. Output validation
    is_safe, rejection_reason = validate_output(result["answer"])
    if not is_safe:
        logger.warning(f"Output rejected [{request_id}]: {rejection_reason}")
        result["answer"] = "Unable to generate a safe response. Please contact support."
        result["confidence"] = 0.0

    # 6. Confidence scoring
    confidence = score_confidence(result["answer"], result.get("chunks", []), result.get("distances", []))
    if confidence.should_fallback:
        result["answer"] = (
            "I don't have enough reliable information to answer this question confidently. "
            "Please check the source documents directly or contact your team."
        )
        result["fallback"] = True
    result["confidence"] = confidence.score
    if confidence.disclaimer:
        result["disclaimer"] = confidence.disclaimer

    # 7. Audit log (everything, with masking)
    latency_ms = int((time.time() - start) * 1000)
    audit.log(
        user_id=user_id,
        session_id=session_id,
        raw_question=user_question,  # used for masking only
        masked_question=masking.masked_text,
        answer=result["answer"],
        sources=result.get("sources", []),
        confidence=result.get("confidence", 0.0),
        model="claude-sonnet-4-6",
        prompt_version="v2.3",
        input_tokens=result.get("input_tokens", 0),
        output_tokens=result.get("output_tokens", 0),
        latency_ms=latency_ms,
        pii_detected=masking.pii_found,
    )

    result["request_id"] = request_id
    result["latency_ms"] = latency_ms
    return result

Series Wrap-Up

You have completed the full journey from data/software engineer to AI engineer.

Here is what you now know how to do:

| Part | You can now... |
|------|----------------|
| 1 — Mindset | Explain why AI systems require different engineering instincts |
| 2 — LLMs | Explain how a language model works, what tokens/embeddings/attention are |
| 3 — Prompts | Write production-grade prompts using 6 core techniques |
| 4 — RAG | Build a complete RAG pipeline with evaluation |
| 5 — Agents | Build an agent with tools using the ReAct pattern |
| 6 — Production | Deploy, monitor, and control costs for an AI system |
| 7 — Governance | Implement PII masking, audit logging, confidence scoring, and model cards |

The AI engineer who implements all seven parts is not just building AI. They are building AI responsibly — and that is what companies pay senior rates for.


What Next?

Now that you have the foundations:

  1. Build something real — pick one use case from your current work and build it end-to-end using this series as a reference
  2. Get the Databricks GenAI Associate cert — validates your RAG and LLM knowledge formally
  3. Study LangGraph — for production-grade stateful agent workflows
  4. Read the EU AI Act summary — one hour, essential for any enterprise AI role
  5. Write your own model card — for something you have already built

The gap between knowing these concepts and building with them closes the moment you ship something real.


Written by Vijay Anand Pandian — AI Tech Lead & Senior Data Engineer at M&S Sparks. Building governed AI systems that bridge business and engineering in London, UK.