The Boomers Called It Quality Assurance. Gen AI Calls It Observability. Tomato, Tomahto.

Or: How the 60-Year-Old Principles of Building Reliable Systems Still Apply (Despite What the AGI Hype Says)


The Rebranding Game

I recently sat in a conference room listening to a vendor pitch their "revolutionary AI observability and evaluation platform."

The slides were slick. The terminology was fresh. "Real-time model monitoring." "Performance drift detection." "Systematic quality gates." "Continuous improvement loops."

<p class="aside"> If you've been in this industry long enough, you probably felt the same déjà vu I did. </p>

Halfway through, it hit me: This is just quality control with a new logo.

The boomers who built reliable manufacturing systems in the 1960s would look at modern "AI evaluation frameworks" and nod knowingly. They called it Quality Assurance. They called it Statistical Process Control. They called it Total Quality Management.

We call it Observability and Evaluations.

It's the same thing.

And the fact that we've forgotten this—that we're treating 60-year-old industrial engineering principles as "cutting-edge AI innovation"—tells you everything about where we are in the AI hype cycle.

A Brief History of Reliable Systems

Let's rewind to when building reliable systems was actually hard.

The 1950s-60s: The Quality Revolution

Post-WWII manufacturing had a problem: variability killed profitability.

You couldn't build reliable cars, electronics, or aircraft when every unit that rolled off the line was slightly different. Quality was inconsistent. Failures were unpredictable. Customers were unhappy.

Enter the quality pioneers:

W. Edwards Deming championed Statistical Process Control, building on Walter Shewhart's control charts. His insight? Measure, monitor, detect drift, correct.

<div class="knowing-nod"> Sound familiar? That's literally what every AI monitoring platform sells you today. We've just swapped "manufacturing process" for "model inference" and "control charts" for "dashboards." </div>

Joseph Juran gave us the concept of "fitness for use." Not every defect matters equally. Focus on what impacts outcomes.

(If you're building AI evals, you've reinvented this exact principle—not all model failures are equal. You prioritize what matters to users.)

Kaoru Ishikawa developed cause-and-effect diagrams ("fishbone diagrams") for root cause analysis. When quality fails, trace it back systematically.

(Modern "AI failure analysis" is this, verbatim. You trace back: Was it the data? The prompt? The model? The context? Same framework, new domain.)

The 1980s-90s: Systematic Quality Frameworks

By the time Six Sigma and Total Quality Management became mainstream, the playbook was clear:

PDCA (Plan-Do-Check-Act) - Deming's cycle:

  1. Plan what you'll measure
  2. Do (execute/deploy)
  3. Check results against expectations
  4. Act on deviations

DMAIC (Define-Measure-Analyze-Improve-Control) - Six Sigma's version:

  1. Define quality standards
  2. Measure performance
  3. Analyze gaps
  4. Improve processes
  5. Control for consistency

<p class="subtle-callout"> Anyone who's set up AI evaluation pipelines has done this exact sequence. You just didn't call it DMAIC. </p>

ISO 9001 - The international quality standard:

  • Document your processes
  • Measure your outputs
  • Monitor for drift
  • Continuously improve
  • Audit compliance

This became the lingua franca of reliable systems. Aerospace, automotive, medical devices, software—every industry that couldn't afford failures adopted these principles.

And then AI came along, and we acted like we'd never heard of quality control.

The AI Observability Rebranding

Let's map modern AI terminology to its 60-year-old roots:

| Modern AI Term | Old Quality Term | What It Actually Is |
| --- | --- | --- |
| Observability | Process Monitoring | Watching the system run |
| Evaluation | Quality Inspection | Checking if output meets spec |
| Performance Drift | Process Drift | Detecting when things change |
| Model Monitoring | Statistical Process Control | Tracking metrics over time |
| Quality Gates | Acceptance Criteria | Go/no-go decision points |
| A/B Testing | Experimental Design | Controlled comparison |
| Regression Testing | Quality Assurance | Verify nothing broke |
| Root Cause Analysis | Root Cause Analysis | (literally the same thing) |
| Continuous Improvement | Kaizen | Iterative refinement |
| Human-in-the-Loop | Inspector Oversight | Human checks critical outputs |

See the pattern?

We didn't invent a new discipline. We rebranded an existing one.

And because we forgot the source material, we're making the same mistakes industrial manufacturing made in the 1950s—and taking decades to relearn lessons that were already documented.

The Same Mistakes, Faster

Here's what happens when you ignore 60 years of quality engineering:

Mistake 1: "We'll Test It in Production"

1950s Manufacturing: "We'll catch defects when customers complain."
Result: Recalls, lawsuits, brand damage, bankruptcy.

2024 AI: "We'll monitor in production and fix issues as users report them."
Result: Hallucinations in customer-facing apps, compliance violations, reputational damage.

<div class="knowing-nod"> If you've ever deployed an LLM directly to production without systematic evaluation, then scrambled when users hit edge cases, you've lived this. The manufacturing industry learned this lesson in the 1960s. We're relearning it in the 2020s. </div>

What quality engineering taught us:
Catch defects before they reach customers. Build quality in, don't inspect it in. Prevention > Detection > Correction.

Mistake 2: "We'll Fix Issues Reactively"

1950s Manufacturing: "When something breaks, we'll figure out what went wrong."
Result: Firefighting culture, repeated failures, high costs.

2024 AI: "When the model fails, we'll look at the logs and patch the prompt."
Result: Whack-a-mole debugging, accumulated technical debt, brittle systems.

<p class="aside"> Anyone who's spent a weekend debugging why GPT-4 suddenly started refusing valid requests knows this pain. You patch one prompt, break another. No systematic approach, just reactive chaos. </p>

What quality engineering taught us:
Systematic root cause analysis. Use structured frameworks (5 Whys, fishbone diagrams, DMAIC) to find the true cause, not just the symptom. Fix the process, not the incident.

Mistake 3: "We Don't Need Formal QA, Our Engineers Are Good"

1950s Manufacturing: "Our craftsmen are skilled; we don't need quality inspectors."
Result: Inconsistent quality, no accountability, blame culture.

2024 AI: "Our ML engineers know what good models look like; we don't need formal evals."
Result: Each team has different quality standards, no consistency, deployment roulette.

What quality engineering taught us:
Separate concerns. The people building the system shouldn't be the only ones evaluating it. Independent quality assurance catches what builders miss. Not because builders are bad—because bias is human.

Mistake 4: "Quality Slows Us Down"

1950s Manufacturing: "Quality checks slow production; ship faster, worry later."
Result: Short-term speed, long-term disaster. Recalls cost 10-100x more than prevention.

2024 AI: "Evaluation pipelines slow iteration; we need to move fast."
Result: Technical debt accumulates, production incidents spike, remediation costs explode.

<div class="knowing-nod"> If your org is debating "Should we build evals or just ship faster?", you're having the exact same conversation Detroit had in 1955. They chose speed. It nearly destroyed the American auto industry. Toyota chose quality. They won. </div>

What quality engineering taught us:
Quality is faster in the long run. Prevention is 10x cheaper than detection, 100x cheaper than correction in production. Invest in quality gates upfront, or pay exponentially more later.

The Forgotten Frameworks (That Still Work)

Here's the irony: The frameworks for building reliable AI systems already exist. We just forgot to apply them.

Framework 1: PDCA for AI Development

Plan:

  • Define what "good" looks like (acceptance criteria)
  • Choose evaluation metrics (accuracy, latency, safety)
  • Set quality thresholds (what's acceptable?)

Do:

  • Deploy the model/system
  • Run it with real inputs
  • Collect performance data

Check:

  • Compare actual performance to planned thresholds
  • Identify deviations and drift
  • Flag anomalies

Act:

  • For deviations: root cause analysis
  • Adjust model, data, or prompts
  • Update quality standards if needed
  • Repeat cycle

<p class="subtle-callout"> This is exactly what mature AI teams do. They just don't call it PDCA. They call it "our ML ops process." </p>

Framework 2: Six Sigma DMAIC for AI Quality

Define:

  • What problem are we solving?
  • What does success look like?
  • Who are the stakeholders?

Measure:

  • Baseline performance (before)
  • Key metrics (accuracy, latency, cost)
  • Variation sources (where does quality drift?)

Analyze:

  • What causes failures?
  • What patterns exist in errors?
  • What's the root cause?

Improve:

  • Implement fixes
  • A/B test changes
  • Validate improvement

Control:

  • Monitor in production
  • Set up alerts for drift
  • Maintain quality over time

<div class="knowing-nod"> If you've ever run a "model improvement sprint," you've done DMAIC. You measured bad performance, analyzed why, improved the model, and set up monitoring. That's not innovation—that's industrial engineering applied to AI. </div>

Framework 3: Total Quality Management for AI Systems

Principle 1: Customer Focus
Quality is defined by user needs, not engineering metrics. A 99% accurate model that answers the wrong question is worthless.

Principle 2: Process Approach
Quality comes from process, not heroics. Build systematic evaluation into every stage of development, not just at the end.

Principle 3: Continuous Improvement
There's no "done." Monitor, measure, refine, repeat. Forever.

Principle 4: Evidence-Based Decisions
Gut feel and cherry-picked examples don't count. Use data. Use statistics. Be rigorous.

Principle 5: Relationship Management
Quality is a team sport. Engineering, product, ops, and users all have a role.

These aren't new insights. They're from the 1980s. And they work perfectly for AI systems.

The ISO 9001 for AI (That Doesn't Exist, But Should)

ISO 9001 is the international quality standard. It's how you prove your systems are reliable. Aerospace, medical devices, and automotive all build their sector quality standards on it (AS9100, ISO 13485, IATF 16949).

AI has no equivalent.

We have fragmented best practices. We have vendor-specific tools. We have research papers. But we have no standardized quality framework for AI systems.

<p class="aside"> If you've ever tried to answer "How do we know this AI system is production-ready?", you've felt this gap. Manufacturing had clear answers in 1987. We're still winging it in 2024. </p>

Here's what an ISO 9001-style AI quality standard would look like:

1. Documentation Requirements

  • Document all training data sources and lineage
  • Document model architecture and hyperparameters
  • Document evaluation criteria and thresholds
  • Document deployment process and rollback procedures
  • Version everything.

2. Measurement Requirements

  • Define KPIs before deployment
  • Measure baseline performance
  • Set acceptable quality ranges
  • Track metrics continuously
  • Make it auditable.

3. Monitoring Requirements

  • Real-time performance monitoring
  • Drift detection (data, model, concept)
  • Alert thresholds for quality degradation
  • Automated health checks
  • Catch problems before users do.

4. Quality Gates

  • Pre-deployment evaluation (unit tests for AI)
  • Staging environment validation
  • Production canary testing
  • Go/no-go criteria enforced, not suggested
  • No shortcuts to production.

5. Continuous Improvement

  • Regular performance reviews
  • Root cause analysis for failures
  • Systematic improvement cycles
  • Knowledge sharing across teams
  • Learn from every incident.

6. Audit Trail

  • Log all model versions deployed
  • Log all evaluation results
  • Log all production incidents
  • Enable forensic analysis
  • "What happened and when?" must be answerable.

Sound excessive?

This is standard practice in manufacturing, aerospace, and medical devices. These industries can't afford failures.

Neither can AI. But we act like we can.

Why We Forgot: The AGI Distraction

Here's the uncomfortable question: If quality principles are 60 years old and proven, why did AI forget them?

The answer: We've been too busy chasing AGI to build reliable systems.

The AI field has been in a multi-decade race to artificial general intelligence. The focus has been:

  • "Can we make it smarter?"
  • "Can we scale it bigger?"
  • "Can we make it more general?"

Not:

  • "Can we make it reliable?"
  • "Can we make it consistent?"
  • "Can we make it safe?"

<div class="knowing-nod"> Anyone who's watched leadership prioritize "getting to GPT-5" over "making GPT-4 actually work reliably" has seen this priority mismatch. The field rewards capability advances, not reliability advances. We get better models, not more dependable systems. </div>

The result:

We have incredibly powerful AI that's also incredibly unreliable. Like a race car with no brakes. Impressive speed, terrifying in practice.

Manufacturing went through this exact phase in the 1950s. American auto makers prioritized features and output over quality. They built bigger, flashier cars faster than anyone.

Then Toyota showed up with boring, reliable cars—and won.

The quality revolution wasn't about making better cars in terms of features. It was about making reliable cars. Predictable. Consistent. Trustworthy.

AI is due for the same revolution.

The winners won't be whoever builds the smartest model. They'll be whoever builds the most reliable AI systems.

And to do that, we need to remember what the boomers already figured out.

The Practical Application: AI Quality Playbook

Enough history. How do you actually apply 60-year-old quality principles to modern AI?

Step 1: Define Quality (Before You Build)

Manufacturing approach: Write the spec before you start production.

AI approach: Define acceptance criteria before you train the model.

Practical questions:

  • What does "good" look like? (Not just "accurate"—useful, safe, fair, fast)
  • What's the minimum acceptable performance?
  • What failure modes are unacceptable? (Hallucinations? Bias? Crashes?)
  • Who decides if quality is sufficient? (Not just the ML team)

<p class="subtle-callout"> If you can't answer these questions, you're not ready to build. You're just hoping it'll work. </p>

Step 2: Measure Systematically (Not Ad Hoc)

Manufacturing approach: Statistical Process Control—measure everything, track over time, detect drift.

AI approach: Evaluation pipelines—automated, versioned, continuous.

Practical implementation:

  • Unit tests for AI (eval datasets, not code tests)
  • Integration tests (end-to-end workflows)
  • Regression tests (ensure updates don't break existing behavior)
  • Performance tests (latency, cost, throughput)
  • Safety tests (adversarial inputs, edge cases)

Run them (a pre-deployment gate sketch follows this list):

  • Before every deployment (quality gate)
  • Continuously in production (monitoring)
  • After every incident (regression prevention)
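
Here's a minimal sketch of the pre-deployment gate in pytest style. `model_answer()`, the dataset paths, and the thresholds are hypothetical placeholders for your own application and versioned eval files.

```python
# test_release_gate.py -- illustrative sketch, not a real test suite
import json

import pytest

def model_answer(prompt: str) -> str:
    """Placeholder: call your model or application here."""
    raise NotImplementedError

def load_cases(path: str):
    """Load a versioned eval dataset (one JSON object per line)."""
    with open(path) as f:
        return [json.loads(line) for line in f]

@pytest.mark.parametrize("case", load_cases("evals/regression_v3.jsonl"))
def test_regression_suite(case):
    """Every previously fixed failure must stay fixed."""
    assert case["expected"] in model_answer(case["input"])

def test_refusal_rate_on_valid_requests():
    """Safety check: valid requests should rarely be refused."""
    cases = load_cases("evals/valid_requests_v1.jsonl")
    refusals = sum("can't help" in model_answer(c["input"]).lower() for c in cases)
    assert refusals / len(cases) <= 0.01
```

Wired into CI so a red suite blocks the deploy, this is what "quality gates enforced, not suggested" looks like in practice.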

Step 3: Separate Building from Evaluation

Manufacturing approach: Production line ≠ Quality inspection. Different people, different incentives.

AI approach: ML engineering ≠ ML evaluation. The team building the model shouldn't be the only team evaluating it.

Why this matters:

Builders optimize for what they can see. They miss what they're blind to. Independent evaluation catches:

  • Edge cases builders didn't consider
  • User needs builders don't understand
  • Failure modes builders rationalized away

<div class="knowing-nod"> If you've ever had a model pass internal evals beautifully, then fail immediately in user hands, you know why this separation matters. Your team isn't bad—they're just biased. Everyone is. </div>

Step 4: Implement PDCA Loops

Manufacturing approach: Continuous improvement cycles. Small, frequent adjustments based on measurement.

AI approach: Iterative model refinement based on production data.

Practical cycle:

  1. Plan: Set improvement goal (e.g., "Reduce hallucination rate by 20%")
  2. Do: Implement change (prompt engineering, fine-tuning, RAG, etc.)
  3. Check: Measure actual improvement (did hallucinations drop?)
  4. Act: If it worked, deploy; if not, try something else

Frequency: Weekly or bi-weekly. Small iterations beat big rewrites.

Step 5: Root Cause Analysis (Not Symptom Whack-a-Mole)

Manufacturing approach: When quality fails, use structured frameworks to find the real cause.

AI approach: When the model fails, don't just patch the prompt. Find out why.

5 Whys Example:

Incident: Model gave wrong answer to user query

  1. Why? Model didn't have relevant context
  2. Why? RAG retrieval failed
  3. Why? The query-rewriting step rephrased the user's query poorly
  4. Why? Query rewriting prompt was too generic
  5. Why? We didn't test query rewriting separately

Root cause: Lack of modular evaluation. We tested end-to-end but not components.

Fix: Add unit tests for query rewriting. Not just "fix this prompt."
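
A minimal sketch of that fix, testing the component in isolation. `rewrite_query()` and the expected terms are hypothetical stand-ins for your own rewriting step.

```python
def rewrite_query(user_query: str) -> str:
    """Placeholder: your query-rewriting prompt or chain goes here."""
    raise NotImplementedError

REWRITE_CASES = [
    # (user query, terms the rewritten query must preserve for retrieval to work)
    ("refund policy for the annual plan?", ["refund", "annual"]),
    ("how do I rotate my API key", ["rotate", "api key"]),
]

def test_query_rewriting_preserves_key_terms():
    for query, required_terms in REWRITE_CASES:
        rewritten = rewrite_query(query).lower()
        for term in required_terms:
            assert term in rewritten
```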

Step 6: Make Quality Everyone's Job

Manufacturing approach: TQM—quality is organization-wide, not just the QA department's problem.

AI approach: Everyone owns reliability—engineering, product, ops, leadership.

Cultural shift:

  • Engineering: "Did we evaluate this thoroughly?" is as important as "Does it work?"
  • Product: "What quality metrics matter to users?" drives prioritization
  • Ops: "Can we detect issues before users do?" is a design requirement
  • Leadership: "Quality velocity" (sustainable pace) beats "feature velocity" (fast but brittle)

<p class="aside"> If your org celebrates "we shipped 10 features this quarter" but not "we had zero production incidents this quarter," your incentives are broken. </p>

The Business Case: Quality is Profitable

Let's talk ROI, because quality engineering isn't charity—it's strategic.

Cost Avoidance

Prevention cost: $1
Detection cost: $10
Correction cost (in production): $100
Crisis cost (PR/legal/trust): $1,000+

<div class="knowing-nod"> Anyone who's had to pause a product launch because the AI broke in production knows this math viscerally. The eval pipeline you "didn't have time for" would have cost $50K. The production incident cost $2M in engineering time, customer churn, and delayed revenue. </div>

Real example:

Company A: "We'll build evals later, let's ship fast."

  • Month 1-3: Fast feature velocity, team happy
  • Month 4: Production incident, 2 weeks to fix
  • Month 6: Another incident, trust eroding
  • Month 9: Building the eval infrastructure they should have built in Month 1
  • Cost: 6 months of technical debt, multiple incidents, deferred revenue

Company B: "We'll build evals upfront."

  • Month 1: Slower start, building test infrastructure
  • Month 2-9: Steady velocity, high confidence deployments, zero major incidents
  • Cost: 2 weeks upfront investment
  • Savings: 6+ months of firefighting avoided

The math is obvious. We just don't do the math.

Competitive Moat

In 2024, everyone has access to GPT-4, Claude, Gemini. The models are commoditized.

Differentiation isn't model capability. It's system reliability.

The company that can dependably deliver AI value wins. Not the one with the smartest model—the one with the most predictable model.

Think about it:

Would you rather use:

  • An AI that's brilliant 80% of the time but unpredictable
  • An AI that's good 95% of the time and you know when it'll fail

Enterprise customers choose reliability over capability. Every. Single. Time.

Quality engineering is your competitive advantage. Boomers figured this out in 1960. We're relearning it in 2024.

Conclusion: Everything Old is New Again

The boomers called it Quality Assurance and Control.
Gen AI calls it Observability and Evaluations.
Tomato, tomahto.

The principles of building reliable systems haven't changed:

  • Define quality before you build
  • Measure systematically
  • Detect issues early
  • Fix root causes, not symptoms
  • Improve continuously
  • Make quality everyone's job

These worked in 1960. They work in 2024.

The AI field spent decades chasing AGI. We built incredible capabilities. But we forgot the basics of reliability engineering.

<p class="aside"> If you've ever felt like you're reinventing the wheel as you build AI evaluation systems, you're not wrong. You are reinventing the wheel. The boomers already invented it. We just forgot to read the manual. </p>

The good news: The manual exists. It's called quality engineering. It's proven. It scales. It works.

The better news: You don't have to start from scratch. Just dust off the frameworks, swap "manufacturing process" for "AI model," and apply them.

PDCA. DMAIC. Six Sigma. ISO 9001. These aren't relics. They're blueprints.

The AI systems that win won't be the ones with the most advanced models. They'll be the ones with the most disciplined quality processes.

The boomers knew this. Time we relearn it.


Appendix: Quick Reference - Quality Engineering for AI

PDCA for AI

| Phase | AI Application |
| --- | --- |
| Plan | Define eval metrics, acceptance criteria, quality thresholds |
| Do | Deploy model, run inference, collect data |
| Check | Compare performance to thresholds, detect drift |
| Act | Root cause analysis, implement fixes, update standards |

DMAIC for Model Improvement

| Phase | AI Application |
| --- | --- |
| Define | What problem? What success? Which users? |
| Measure | Baseline metrics, variation sources, failure modes |
| Analyze | Root cause of failures, error patterns, systemic issues |
| Improve | Fix implementation, A/B test, validate results |
| Control | Production monitoring, drift alerts, sustained quality |

Quality Checklist for AI Deployment

  • [ ] Acceptance criteria defined before building
  • [ ] Evaluation dataset created and versioned
  • [ ] Automated eval pipeline in place
  • [ ] Quality gates enforced (cannot deploy without passing)
  • [ ] Independent evaluation (not just builders)
  • [ ] Production monitoring configured
  • [ ] Drift detection alerts set
  • [ ] Incident response process documented
  • [ ] Root cause analysis framework adopted
  • [ ] Continuous improvement cycles scheduled

Cost Comparison

| Approach | Upfront Cost | Incident Cost | Total (1 year) |
| --- | --- | --- | --- |
| No QA | $0 | $500K (5 incidents) | $500K |
| Reactive QA | $50K | $200K (2 incidents) | $250K |
| Proactive QA | $100K | $20K (0-1 minor) | $120K |

ROI of quality engineering: 4x+


Key Takeaways:

  1. AI observability = Quality control (rebranded)
  2. The frameworks already exist—PDCA, DMAIC, Six Sigma, ISO 9001
  3. Prevention is 10-100x cheaper than correction
  4. Reliability is the new competitive moat (not capability)
  5. Make quality everyone's job, not just ML engineering
  6. The boomers solved this in 1960. Read the manual.

The quality revolution happened 60 years ago. AI is just catching up.

Which company will you be—the one that forgets history, or the one that learns from it?