From Data Engineer to AI Engineer — Part 7: AI Governance — Building AI You Can Trust

Series: From Data/Software Engineer to AI Engineer — Part 7 of 7


Why Governance Is an Engineering Problem

"AI governance" sounds like something for the compliance team. It is not.

Every technical decision you make is a governance decision:

  • How you handle user data in prompts
  • Whether you log model outputs for audit
  • How you handle low-confidence responses
  • Whether you let the model make autonomous decisions

The engineer who says "governance isn't my job" is the one who gets called when the model leaks a customer's personal data or gives a confident, wrong answer that costs the business money.

This is not abstract. It is concrete code you write today.


The Five Governance Pillars (With Code)

Pillar 1: PII Protection — Never Let Personal Data Into the LLM Unmasked

The problem: users ask questions that contain personal information. If you log those queries, you have a data protection issue. If you send them to a third-party LLM API, you may violate GDPR.

The solution: mask before it ever leaves your system.

import re
from dataclasses import dataclass

@dataclass
class MaskingResult:
    masked_text: str
    pii_found: bool
    types_found: list[str]

# UK-specific PII patterns
PII_PATTERNS = {
    "EMAIL": r"[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}",
    "UK_PHONE": r"(\+44\s?|0)(\d\s?){9,10}",
    "UK_NI_NUMBER": r"[A-Z]{2}\s?\d{2}\s?\d{2}\s?\d{2}\s?[A-D]",
    "UK_POSTCODE": r"[A-Z]{1,2}\d[A-Z\d]?\s?\d[A-Z]{2}",
    "CARD_NUMBER": r"\b\d{4}[\s-]?\d{4}[\s-]?\d{4}[\s-]?\d{4}\b",
    "SORT_CODE": r"\d{2}-\d{2}-\d{2}",
    "DATE_OF_BIRTH": r"\b\d{1,2}[\/\-]\d{1,2}[\/\-]\d{4}\b",
}

def mask_pii(text: str) -> MaskingResult:
    masked = text
    found_types = []
    for pii_type, pattern in PII_PATTERNS.items():
        if re.search(pattern, masked, re.IGNORECASE):
            found_types.append(pii_type)
            masked = re.sub(pattern, f"[{pii_type}]", masked, flags=re.IGNORECASE)
    return MaskingResult(
        masked_text=masked,
        pii_found=len(found_types) > 0,
        types_found=found_types,
    )

# Usage — always mask before sending to LLM or logging
user_input = "My email is john.smith@company.com and NI number is AB123456C"
result = mask_pii(user_input)
print(result.masked_text)
# "My email is [EMAIL] and NI number is [UK_NI_NUMBER]"
print(result.pii_found)    # True
print(result.types_found)  # ["EMAIL", "UK_NI_NUMBER"]

# In your query pipeline:
def safe_query(user_question: str, user_id: str) -> dict:
    masking = mask_pii(user_question)
    if masking.pii_found:
        # Log the incident — someone submitted PII
        audit_log({
            "event": "pii_detected",
            "user_id": user_id,
            "pii_types": masking.types_found,
            # Note: log that PII was found, NOT the actual PII
        })
    # Always use masked version for the LLM call
    return rag_query(masking.masked_text)

For production: consider Microsoft Presidio — a more comprehensive, ML-backed PII detection library that handles names, addresses, and more nuanced PII.

# pip install presidio-analyzer presidio-anonymizer
from presidio_analyzer import AnalyzerEngine
from presidio_anonymizer import AnonymizerEngine

analyzer = AnalyzerEngine()
anonymizer = AnonymizerEngine()

def mask_pii_presidio(text: str) -> str:
    results = analyzer.analyze(text=text, language="en")
    anonymized = anonymizer.anonymize(text=text, analyzer_results=results)
    return anonymized.text

# More accurate than regex for names and complex PII
mask_pii_presidio("John Smith from London called us on 07700 900123")
# "<PERSON> from <LOCATION> called us on <PHONE_NUMBER>"

Pillar 2: Audit Logging — Every Inference Is a Record

Every time your AI system produces an output, that output can be:

  • Incorrect and acted upon (business risk)
  • Evidence in a dispute (legal risk)
  • Data for improving the system (operational value)

You need an immutable log of every inference.

import json
import uuid
from datetime import datetime, timezone
from pathlib import Path

class AuditLogger:
    def __init__(self, log_file: str = "audit.jsonl"):
        self.log_file = Path(log_file)

    def log(
        self,
        user_id: str,
        session_id: str,
        raw_question: str,     # NEVER store raw if it contains PII
        masked_question: str,  # Store this instead
        answer: str,
        sources: list[str],
        confidence: float,
        model: str,
        prompt_version: str,
        input_tokens: int,
        output_tokens: int,
        latency_ms: int,
        pii_detected: bool,
    ) -> str:
        request_id = str(uuid.uuid4())
        record = {
            "request_id": request_id,
            "timestamp": datetime.now(timezone.utc).isoformat(),
            "user_id": user_id,
            "session_id": session_id,
            # PII-safe: log masked question, not raw
            "question": masked_question,
            "answer_length": len(answer),  # length, not content, for privacy
            "answer_preview": answer[:100] + "..." if len(answer) > 100 else answer,
            "sources": sources,
            "confidence": round(confidence, 3),
            "model": model,
            "prompt_version": prompt_version,
            "input_tokens": input_tokens,
            "output_tokens": output_tokens,
            # Example per-token rates — substitute your model's actual pricing
            "cost_usd": round((input_tokens * 0.000003) + (output_tokens * 0.000015), 6),
            "latency_ms": latency_ms,
            "pii_detected": pii_detected,
        }
        # JSONL format — one record per line, easy to parse and stream
        with open(self.log_file, "a", encoding="utf-8") as f:
            f.write(json.dumps(record) + "\n")
        return request_id

audit = AuditLogger("logs/audit.jsonl")

# After every inference:
request_id = audit.log(
    user_id="user_123",
    session_id="sess_456",
    raw_question=user_question,  # used for masking check only, not stored
    masked_question=masking.masked_text,
    answer=result["answer"],
    sources=result["sources"],
    confidence=result["confidence"],
    model="claude-sonnet-4-6",
    prompt_version="v2.3",
    input_tokens=response.usage.input_tokens,
    output_tokens=response.usage.output_tokens,
    latency_ms=latency,
    pii_detected=masking.pii_found,
)

Where to store audit logs:

  • Development: local JSONL file
  • Production: Azure Blob Storage, S3, or a dedicated audit database
  • Retention: check your legal requirements — financial services may require 7+ years
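A retention period is only real if something enforces it. For the development-time JSONL file, a minimal pruning job might look like the sketch below (the `timestamp` field name matches the `AuditLogger` record above; in production you would lean on S3 or Blob Storage lifecycle policies instead of rolling your own):

```python
import json
from datetime import datetime, timedelta, timezone
from pathlib import Path

def prune_audit_log(log_file: str, retention_days: int = 90) -> int:
    """Rewrite the JSONL audit log, keeping only records inside the
    retention window. Returns the number of records dropped."""
    path = Path(log_file)
    if not path.exists():
        return 0
    cutoff = datetime.now(timezone.utc) - timedelta(days=retention_days)
    kept: list[str] = []
    dropped = 0
    with open(path, encoding="utf-8") as f:
        for line in f:
            record = json.loads(line)
            ts = datetime.fromisoformat(record["timestamp"])
            if ts >= cutoff:
                kept.append(line)
            else:
                dropped += 1
    # Write to a temp file, then replace, so a crash never truncates the log
    tmp = path.with_suffix(".tmp")
    with open(tmp, "w", encoding="utf-8") as f:
        f.writelines(kept)
    tmp.replace(path)
    return dropped
```

Run it from the same scheduled job that ships logs to long-term storage, so pruning never races a live writer.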

Pillar 3: Confidence Scoring and Graceful Fallbacks

When your AI system is not confident enough to answer, it should say so — not guess.

from dataclasses import dataclass

@dataclass
class ConfidenceResult:
    score: float          # 0.0 to 1.0
    level: str            # "high", "medium", "low"
    should_fallback: bool
    disclaimer: str | None

UNCERTAINTY_SIGNALS = [
    "i don't know", "i'm not sure", "unclear", "uncertain",
    "cannot confirm", "might be", "possibly", "i think",
    "not certain", "may not be accurate",
]

def score_confidence(
    answer: str,
    retrieved_chunks: list[str],
    retrieval_distances: list[float],
) -> ConfidenceResult:
    score = 1.0

    # Penalty 1: uncertainty language in the answer
    answer_lower = answer.lower()
    uncertainty_count = sum(1 for signal in UNCERTAINTY_SIGNALS if signal in answer_lower)
    score -= uncertainty_count * 0.15

    # Penalty 2: poor retrieval (high cosine distance = low similarity)
    if retrieval_distances:
        avg_distance = sum(retrieval_distances) / len(retrieval_distances)
        if avg_distance > 0.3:    # poor similarity
            score -= 0.25
        elif avg_distance > 0.2:
            score -= 0.10

    # Penalty 3: no sources found
    if not retrieved_chunks:
        score = 0.0

    # Bonus: answer grounded in retrieved text
    if retrieved_chunks:
        answer_words = set(answer_lower.split())
        grounding = sum(
            len(answer_words & set(chunk.lower().split()))
            for chunk in retrieved_chunks
        ) / (len(answer_words) + 1)
        score = min(1.0, score + grounding * 0.2)

    score = max(0.0, min(1.0, score))

    # Classify level and determine action
    if score >= 0.8:
        return ConfidenceResult(score=score, level="high", should_fallback=False, disclaimer=None)
    elif score >= 0.6:
        return ConfidenceResult(
            score=score, level="medium", should_fallback=False,
            disclaimer="⚠️ Moderate confidence. Please verify with the source documents.",
        )
    else:
        return ConfidenceResult(
            score=score, level="low", should_fallback=True,
            disclaimer=None,  # We won't show the low-confidence answer at all
        )

# In your query pipeline:
confidence = score_confidence(answer, chunks, distances)
if confidence.should_fallback:
    return {
        "answer": (
            "I don't have enough reliable information to answer this confidently. "
            "Please contact the relevant team directly, or check the source documents."
        ),
        "sources": [],
        "confidence": confidence.score,
        "fallback": True,
    }
response = {
    "answer": answer,
    "sources": sources,
    "confidence": confidence.score,
}
if confidence.disclaimer:
    response["disclaimer"] = confidence.disclaimer
return response

Pillar 4: Content Safety — Input and Output Guardrails

You need to protect against:

  • Prompt injection: users trying to hijack the system prompt
  • Harmful outputs: the model producing dangerous or inappropriate content
  • Jailbreaks: users trying to bypass your constraints

import re

# Input guardrails
INJECTION_PATTERNS = [
    r"ignore previous instructions",
    r"forget your system prompt",
    r"you are now",
    r"act as",
    r"new instruction",
    r"override.*rules",
]

def check_prompt_injection(text: str) -> bool:
    """Returns True if injection attempt detected."""
    text_lower = text.lower()
    return any(re.search(pattern, text_lower) for pattern in INJECTION_PATTERNS)

# Output guardrails
def validate_output(answer: str) -> tuple[bool, str | None]:
    """
    Returns (is_safe, rejection_reason).
    Extend with your specific business rules.
    """
    # Too short to be a real answer
    if len(answer.strip()) < 10:
        return False, "Response too short"

    # Contains refusal markers (model declined to answer)
    refusal_markers = ["i cannot", "i'm unable to", "i won't", "as an ai"]
    if any(marker in answer.lower() for marker in refusal_markers):
        return False, "Model declined"

    # Check for PII in output (model should never output personal data)
    output_masking = mask_pii(answer)
    if output_masking.pii_found:
        return False, f"Output contained PII: {output_masking.types_found}"

    return True, None

# Usage
if check_prompt_injection(user_question):
    audit_log({"event": "injection_attempt", "user_id": user_id})
    return {"answer": "I cannot process that request.", "sources": [], "confidence": 0}

is_safe, rejection_reason = validate_output(answer)
if not is_safe:
    logger.warning(f"Output rejected: {rejection_reason}")
    return {"answer": "Unable to generate a safe response.", "sources": [], "confidence": 0}

Pillar 5: The Model Card — Documenting What You Built

A model card is a one-page document that says: here is what this system does, what it is good at, what it gets wrong, and who is responsible for it.

It is the difference between shipping an AI system and shipping a responsible AI system.

# Model Card: Internal Policy Assistant v2.3
## Overview
- **Purpose**: Answer questions about M&S HR and operational policies
- **Model**: claude-sonnet-4-6 (Anthropic) via Azure OpenAI proxy
- **Retrieval**: Azure AI Search (hybrid dense + sparse)
- **Deployed**: April 2026 | **Owner**: AI Engineering Team
## Intended Use
- ✅ Answering questions about company policies
- ✅ Summarising policy documents
- ❌ Making HR decisions (disciplinary, pay, leave approval)
- ❌ Legal or medical advice
- ❌ Questions outside the indexed document set
## Performance (on golden dataset, April 2026)
| Metric | Score | Threshold |
|--------|-------|-----------|
| Faithfulness | 0.94 | > 0.85 |
| Answer relevancy | 0.91 | > 0.80 |
| Context precision | 0.88 | > 0.75 |
| Avg latency | 1.2s | < 3.0s |
## Known Failure Modes
1. **Multi-policy questions**: questions spanning 3+ policies see 15% lower accuracy
2. **Ambiguous acronyms**: "WFH" correctly resolved, "TC" ambiguous (time-off-in-lieu vs technical change)
3. **Very recent policy updates**: new documents take 24h to appear in the index
## Governance
- PII masking: applied to all inputs before LLM call (regex + Presidio)
- Audit logging: every inference logged to Azure Blob (90-day retention)
- Confidence threshold: responses below 0.6 score trigger fallback message
- Human escalation: "Contact HR" path always available in UI
- Data residency: all processing within UK Azure regions
## Responsible AI Considerations
- No personal data stored in vector index
- No autonomous decisions — system is advisory only
- Output includes source citations for verification
- Users informed they are interacting with an AI assistant
## Contact
AI Engineering: vijay.anand@company.com
HR System Owner: hr-systems@company.com
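One way to keep model cards honest is to lint them in CI. A small sketch that checks a card's markdown for required sections — the section list mirrors the example card above and is this article's convention, not an industry standard:

```python
# Required '## ' sections — mirrors the example card in this article
REQUIRED_SECTIONS = [
    "Overview", "Intended Use", "Performance",
    "Known Failure Modes", "Governance",
    "Responsible AI Considerations", "Contact",
]

def missing_sections(card_markdown: str) -> list[str]:
    """Return the required sections absent from a model card's markdown."""
    headings = {
        line.removeprefix("## ").strip()
        for line in card_markdown.splitlines()
        if line.startswith("## ")
    }
    # startswith() tolerates suffixes like "Performance (on golden dataset)"
    return [s for s in REQUIRED_SECTIONS if not any(h.startswith(s) for h in headings)]
```

Fail the build when the list is non-empty, and the card can never silently drift behind the system it documents.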

Understanding the EU AI Act

If you are building AI systems in the UK/EU, you need to know this.

The EU AI Act (in force August 2024) classifies AI systems by risk:

UNACCEPTABLE RISK → Banned
Examples: social scoring systems, real-time biometric surveillance in public
Action: Do not build these

HIGH RISK → Strict requirements
Examples: AI in hiring, credit scoring, healthcare, law enforcement
Requirements:
- Conformity assessment
- Human oversight mandatory
- Logging and record-keeping (technical documentation retained for 10 years)
- Accuracy, robustness, and cybersecurity requirements
- Transparency to affected individuals
- Registration in the EU database for high-risk AI systems

LIMITED RISK → Transparency obligations
Examples: chatbots, deepfakes
Requirements:
- Must disclose that users are interacting with AI
- Label AI-generated content

MINIMAL RISK → No specific requirements
Examples: spam filters, AI in video games
Action: Good practice still recommended

Classifying common AI systems:

| System | Risk tier | Key requirement |
|--------|-----------|-----------------|
| HR CV screening tool | High risk | Human reviews all decisions, audit trail |
| Customer chatbot | Limited risk | "You are chatting with an AI" disclosure |
| Internal document Q&A | Minimal risk | Good practice: confidence scores, fallback |
| Credit scoring | High risk | Explainability, bias testing, human override |
| Product recommendations | Minimal risk | No specific requirements |
| Medical diagnosis support | High risk | CE marking equivalent, clinical validation |
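If you operate several AI systems, recording each one's tier in code lets a deployment gate enforce tier-specific controls automatically. A minimal sketch — the tier assignments and control names below are illustrative only, and a real classification needs legal review:

```python
from enum import Enum

class RiskTier(Enum):
    UNACCEPTABLE = "unacceptable"
    HIGH = "high"
    LIMITED = "limited"
    MINIMAL = "minimal"

# Controls a CI / deployment gate can check per tier (illustrative names)
TIER_REQUIREMENTS = {
    RiskTier.HIGH: ["human_oversight", "audit_logs", "conformity_assessment"],
    RiskTier.LIMITED: ["ai_disclosure"],
    RiskTier.MINIMAL: [],
}

# Example registry — tiers taken from the classification table above
SYSTEM_TIERS = {
    "hr_cv_screening": RiskTier.HIGH,
    "customer_chatbot": RiskTier.LIMITED,
    "internal_doc_qa": RiskTier.MINIMAL,
}

def required_controls(system: str) -> list[str]:
    """Look up the controls a system must have before deployment."""
    tier = SYSTEM_TIERS[system]
    if tier is RiskTier.UNACCEPTABLE:
        raise ValueError(f"{system}: unacceptable-risk systems must not be built")
    return TIER_REQUIREMENTS[tier]
```

A deployment pipeline can then refuse to ship any system whose required controls are not all present in its config.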

Bias Testing — The Step Most Teams Skip

An AI system that performs well on average can still systematically fail for specific groups. You need to test for this.

def bias_audit(test_cases: list[dict]) -> dict:
    """
    Test whether the AI performs consistently across demographic groups.
    test_cases: [{"question": "...", "group": "group_A", "expected": "..."}]
    """
    results_by_group = {}
    for case in test_cases:
        answer = rag_query(case["question"])["answer"]
        correct = case["expected"].lower() in answer.lower()
        group = case["group"]
        if group not in results_by_group:
            results_by_group[group] = {"correct": 0, "total": 0}
        results_by_group[group]["total"] += 1
        if correct:
            results_by_group[group]["correct"] += 1

    # Calculate accuracy per group
    accuracy_by_group = {
        group: data["correct"] / data["total"]
        for group, data in results_by_group.items()
    }

    # Alert if any group is significantly worse than average
    avg_accuracy = sum(accuracy_by_group.values()) / len(accuracy_by_group)
    for group, accuracy in accuracy_by_group.items():
        if accuracy < avg_accuracy - 0.10:  # >10% below average
            print(f"⚠️ Potential bias: {group} accuracy {accuracy:.2f} vs avg {avg_accuracy:.2f}")

    return accuracy_by_group

# Run quarterly
bias_results = bias_audit(bias_test_dataset)
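The fixed 10% threshold above can fire on noise when per-group sample sizes are small. One refinement is a two-proportion z-test between groups, sketched here with only the standard library (1.96 is the two-sided 5% critical value):

```python
import math

def proportion_gap_significant(
    correct_a: int, total_a: int,
    correct_b: int, total_b: int,
    z_critical: float = 1.96,
) -> bool:
    """Two-proportion z-test: is the accuracy gap between groups A and B
    statistically significant at the given critical value?"""
    p_a = correct_a / total_a
    p_b = correct_b / total_b
    # Pooled proportion under the null hypothesis of equal accuracy
    pooled = (correct_a + correct_b) / (total_a + total_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / total_a + 1 / total_b))
    if se == 0:
        return False  # degenerate case: all correct or all wrong in both groups
    z = (p_a - p_b) / se
    return abs(z) > z_critical
```

With 100 questions per group, a 50% vs 80% gap is clearly significant; with 4 questions per group, a 50% vs 75% gap is not — which is exactly why the raw threshold alone misleads on small test sets.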

The Governance Checklist

Before any AI system goes to production:

Data Protection
☐ PII masking on all inputs before LLM call
☐ PII validation on all outputs
☐ No personal data in vector index
☐ Data residency requirements met
Audit Trail
☐ Every inference logged (masked question, answer, confidence, model, user_id)
☐ Retention period defined and enforced
☐ Logs accessible for incident investigation
Quality & Safety
☐ Confidence scoring with fallback behaviour
☐ Prompt injection detection
☐ Output content validation
☐ Graceful failure modes tested
Documentation
☐ Model card written and approved
☐ Intended use and limitations documented
☐ Known failure modes documented
☐ Human escalation path defined
Compliance
☐ EU/UK AI Act risk tier assessed
☐ If high-risk: conformity assessment completed
☐ User disclosure ("you are interacting with AI") in place if required
☐ Bias audit completed on representative test cases
Ongoing
☐ Weekly quality evaluation on golden dataset
☐ Quarterly bias audit
☐ Incident response process defined
☐ Model card review cadence set
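Some of these boxes can be checked by a machine. A sketch of an automated release gate, assuming a simple deployment-config dict — the key names here are hypothetical and would map to whatever your pipeline actually records:

```python
def governance_preflight(config: dict) -> list[str]:
    """Return blocking failures; an empty list means the automatable
    checks pass. Human-judgment items (bias audit sign-off, model card
    review) still need a manual gate."""
    failures = []
    # Boolean flags the deployment config must assert (hypothetical keys)
    checks = {
        "pii_masking_enabled": "PII masking not enabled on inputs",
        "output_validation_enabled": "Output PII/content validation not enabled",
        "audit_logging_enabled": "Audit logging not enabled",
        "model_card_approved": "Model card not approved",
        "ai_disclosure_shown": "AI-interaction disclosure missing",
    }
    for key, message in checks.items():
        if not config.get(key, False):
            failures.append(message)
    # Non-boolean settings that must be present and sane
    if config.get("confidence_fallback_threshold") is None:
        failures.append("No confidence fallback threshold configured")
    if config.get("log_retention_days", 0) <= 0:
        failures.append("Log retention period not defined")
    return failures
```

Wire it into CI so a deployment fails loudly with the list of missing controls, instead of shipping with a half-completed checklist.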

Putting It All Together

Here is the full governed query pipeline — the code that runs on every single inference:

import time
import uuid

def governed_query(user_question: str, user_id: str, session_id: str) -> dict:
    start = time.time()
    request_id = str(uuid.uuid4())

    # 1. Input validation
    if len(user_question) > 2000:
        return {"error": "Question too long", "request_id": request_id}

    # 2. Injection detection
    if check_prompt_injection(user_question):
        audit_log({"event": "injection_attempt", "user_id": user_id, "request_id": request_id})
        return {"answer": "I cannot process that request.", "request_id": request_id}

    # 3. PII masking
    masking = mask_pii(user_question)

    # 4. RAG query (using masked text)
    result = rag_query(masking.masked_text)

    # 5. Output validation
    is_safe, rejection_reason = validate_output(result["answer"])
    if not is_safe:
        logger.warning(f"Output rejected [{request_id}]: {rejection_reason}")
        result["answer"] = "Unable to generate a safe response. Please contact support."
        result["confidence"] = 0.0

    # 6. Confidence scoring
    confidence = score_confidence(result["answer"], result.get("chunks", []), result.get("distances", []))
    if confidence.should_fallback:
        result["answer"] = (
            "I don't have enough reliable information to answer this question confidently. "
            "Please check the source documents directly or contact your team."
        )
        result["fallback"] = True
    result["confidence"] = confidence.score
    if confidence.disclaimer:
        result["disclaimer"] = confidence.disclaimer

    # 7. Audit log (everything, with masking)
    latency_ms = int((time.time() - start) * 1000)
    audit.log(
        user_id=user_id,
        session_id=session_id,
        raw_question=user_question,  # used for masking only
        masked_question=masking.masked_text,
        answer=result["answer"],
        sources=result.get("sources", []),
        confidence=result.get("confidence", 0.0),
        model="claude-sonnet-4-6",
        prompt_version="v2.3",
        input_tokens=result.get("input_tokens", 0),
        output_tokens=result.get("output_tokens", 0),
        latency_ms=latency_ms,
        pii_detected=masking.pii_found,
    )

    result["request_id"] = request_id
    result["latency_ms"] = latency_ms
    return result

Series Wrap-Up

You have completed the full journey from data/software engineer to AI engineer.

Here is what you now know how to do:

| Part | You can now... |
|------|----------------|
| 1 — Mindset | Explain why AI systems require different engineering instincts |
| 2 — LLMs | Explain how a language model works, what tokens/embeddings/attention are |
| 3 — Prompts | Write production-grade prompts using 6 core techniques |
| 4 — RAG | Build a complete RAG pipeline with evaluation |
| 5 — Agents | Build an agent with tools using the ReAct pattern |
| 6 — Production | Deploy, monitor, and control costs for an AI system |
| 7 — Governance | Implement PII masking, audit logging, confidence scoring, and model cards |

The AI engineer who implements all seven parts is not just building AI. They are building AI responsibly — and that is what companies pay senior rates for.


What Next?

Now that you have the foundations:

  1. Build something real — pick one use case from your current work and build it end-to-end using this series as a reference
  2. Get the Databricks GenAI Associate cert — validates your RAG and LLM knowledge formally
  3. Study LangGraph — for production-grade stateful agent workflows
  4. Read the EU AI Act summary — one hour, essential for any enterprise AI role
  5. Write your own model card — for something you have already built

The gap between knowing these concepts and building with them closes the moment you ship something real.


Written by Vijay Anand Pandian — AI Tech Lead & Senior Data Engineer at M&S Sparks. Building governed AI systems that bridge business and engineering in London, UK.