The Boomers Called It Quality Assurance. Gen AI Calls It Observability. Tomato, Tomahto.

Or: How the 60-Year-Old Principles of Building Reliable Systems Still Apply (Despite What the AGI Hype Says)


The Rebranding Game

I recently sat in a conference room listening to a vendor pitch their "revolutionary AI observability and evaluation platform."

The slides were slick. The terminology was fresh. "Real-time model monitoring." "Performance drift detection." "Systematic quality gates." "Continuous improvement loops."

<p class="aside"> If you've been in this industry long enough, you probably felt the same déjà vu I did. </p>

Halfway through, it hit me: This is just quality control with a new logo.

The boomers who built reliable manufacturing systems in the 1960s would look at modern "AI evaluation frameworks" and nod knowingly. They called it Quality Assurance. They called it Statistical Process Control. They called it Total Quality Management.

We call it Observability and Evaluations.

It's the same thing.

And the fact that we've forgotten this—that we're treating 60-year-old industrial engineering principles as "cutting-edge AI innovation"—tells you everything about where we are in the AI hype cycle.

A Brief History of Reliable Systems

Let's rewind to when building reliable systems was actually hard.

The 1950s-60s: The Quality Revolution

Post-WWII manufacturing had a problem: variability killed profitability.

You couldn't build reliable cars, electronics, or aircraft when every unit that rolled off the line was slightly different. Quality was inconsistent. Failures were unpredictable. Customers were unhappy.

Enter the quality pioneers:

W. Edwards Deming championed Statistical Process Control, building on Walter Shewhart's control charts. His insight? Measure, monitor, detect drift, correct.

<div class="knowing-nod"> Sound familiar? That's literally what every AI monitoring platform sells you today. We've just swapped "manufacturing process" for "model inference" and "control charts" for "dashboards." </div>

Joseph Juran gave us the concept of "fitness for use." Not every defect matters equally. Focus on what impacts outcomes.

(If you're building AI evals, you've reinvented this exact principle—not all model failures are equal. You prioritize what matters to users.)

Kaoru Ishikawa developed cause-and-effect diagrams ("fishbone diagrams") for root cause analysis. When quality fails, trace it back systematically.

(Modern "AI failure analysis" is this, verbatim. You trace back: Was it the data? The prompt? The model? The context? Same framework, new domain.)

The 1980s-90s: Systematic Quality Frameworks

By the time Six Sigma and Total Quality Management became mainstream, the playbook was clear:

PDCA (Plan-Do-Check-Act) - Deming's cycle:

  1. Plan what you'll measure
  2. Do (execute/deploy)
  3. Check results against expectations
  4. Act on deviations

DMAIC (Define-Measure-Analyze-Improve-Control) - Six Sigma's version:

  1. Define quality standards
  2. Measure performance
  3. Analyze gaps
  4. Improve processes
  5. Control for consistency

<p class="subtle-callout"> Anyone who's set up AI evaluation pipelines has done this exact sequence. You just didn't call it DMAIC. </p>

ISO 9001 - The international quality standard:

  • Document your processes
  • Measure your outputs
  • Monitor for drift
  • Continuously improve
  • Audit compliance

This became the lingua franca of reliable systems. Aerospace, automotive, medical devices, software—every industry that couldn't afford failures adopted these principles.

And then AI came along, and we acted like we'd never heard of quality control.

The AI Observability Rebranding

Let's map modern AI terminology to its 60-year-old roots:

| Modern AI Term | Old Quality Term | What It Actually Is |
| --- | --- | --- |
| Observability | Process Monitoring | Watching the system run |
| Evaluation | Quality Inspection | Checking if output meets spec |
| Performance Drift | Process Drift | Detecting when things change |
| Model Monitoring | Statistical Process Control | Tracking metrics over time |
| Quality Gates | Acceptance Criteria | Go/no-go decision points |
| A/B Testing | Experimental Design | Controlled comparison |
| Regression Testing | Quality Assurance | Verify nothing broke |
| Root Cause Analysis | Root Cause Analysis | (literally the same thing) |
| Continuous Improvement | Kaizen | Iterative refinement |
| Human-in-the-Loop | Inspector Oversight | Human checks critical outputs |

See the pattern?

We didn't invent a new discipline. We rebranded an existing one.

And because we forgot the source material, we're making the same mistakes industrial manufacturing made in the 1950s—and taking decades to relearn lessons that were already documented.

The Same Mistakes, Faster

Here's what happens when you ignore 60 years of quality engineering:

Mistake 1: "We'll Test It in Production"

1950s Manufacturing: "We'll catch defects when customers complain."
Result: Recalls, lawsuits, brand damage, bankruptcy.

2024 AI: "We'll monitor in production and fix issues as users report them."
Result: Hallucinations in customer-facing apps, compliance violations, reputational damage.

<div class="knowing-nod"> If you've ever deployed an LLM directly to production without systematic evaluation, then scrambled when users hit edge cases, you've lived this. The manufacturing industry learned this lesson in the 1960s. We're relearning it in the 2020s. </div>

What quality engineering taught us:
Catch defects before they reach customers. Build quality in, don't inspect it in. Prevention > Detection > Correction.

Mistake 2: "We'll Fix Issues Reactively"

1950s Manufacturing: "When something breaks, we'll figure out what went wrong."
Result: Firefighting culture, repeated failures, high costs.

2024 AI: "When the model fails, we'll look at the logs and patch the prompt."
Result: Whack-a-mole debugging, accumulated technical debt, brittle systems.

<p class="aside"> Anyone who's spent a weekend debugging why GPT-4 suddenly started refusing valid requests knows this pain. You patch one prompt, break another. No systematic approach, just reactive chaos. </p>

What quality engineering taught us:
Systematic root cause analysis. Use structured frameworks (5 Whys, fishbone diagrams, DMAIC) to find the true cause, not just the symptom. Fix the process, not the incident.

Mistake 3: "We Don't Need Formal QA, Our Engineers Are Good"

1950s Manufacturing: "Our craftsmen are skilled; we don't need quality inspectors."
Result: Inconsistent quality, no accountability, blame culture.

2024 AI: "Our ML engineers know what good models look like; we don't need formal evals."
Result: Each team has different quality standards, no consistency, deployment roulette.

What quality engineering taught us:
Separate concerns. The people building the system shouldn't be the only ones evaluating it. Independent quality assurance catches what builders miss. Not because builders are bad—because bias is human.

Mistake 4: "Quality Slows Us Down"

1950s Manufacturing: "Quality checks slow production; ship faster, worry later."
Result: Short-term speed, long-term disaster. Recalls cost 10-100x more than prevention.

2024 AI: "Evaluation pipelines slow iteration; we need to move fast."
Result: Technical debt accumulates, production incidents spike, remediation costs explode.

<div class="knowing-nod"> If your org is debating "Should we build evals or just ship faster?", you're having the exact same conversation Detroit had in 1955. They chose speed. It nearly destroyed the American auto industry. Toyota chose quality. They won. </div>

What quality engineering taught us:
Quality is faster in the long run. Prevention is 10x cheaper than detection, 100x cheaper than correction in production. Invest in quality gates upfront, or pay exponentially more later.

The Forgotten Frameworks (That Still Work)

Here's the irony: The frameworks for building reliable AI systems already exist. We just forgot to apply them.

Framework 1: PDCA for AI Development

Plan:

  • Define what "good" looks like (acceptance criteria)
  • Choose evaluation metrics (accuracy, latency, safety)
  • Set quality thresholds (what's acceptable?)

Do:

  • Deploy the model/system
  • Run it with real inputs
  • Collect performance data

Check:

  • Compare actual performance to planned thresholds
  • Identify deviations and drift
  • Flag anomalies

Act:

  • For deviations: root cause analysis
  • Adjust model, data, or prompts
  • Update quality standards if needed
  • Repeat cycle

<p class="subtle-callout"> This is exactly what mature AI teams do. They just don't call it PDCA. They call it "our ML ops process." </p>

Framework 2: Six Sigma DMAIC for AI Quality

Define:

  • What problem are we solving?
  • What does success look like?
  • Who are the stakeholders?

Measure:

  • Baseline performance (before)
  • Key metrics (accuracy, latency, cost)
  • Variation sources (where does quality drift?)

Analyze:

  • What causes failures?
  • What patterns exist in errors?
  • What's the root cause?

Improve:

  • Implement fixes
  • A/B test changes
  • Validate improvement

Control:

  • Monitor in production
  • Set up alerts for drift
  • Maintain quality over time

<div class="knowing-nod"> If you've ever run a "model improvement sprint," you've done DMAIC. You measured bad performance, analyzed why, improved the model, and set up monitoring. That's not innovation—that's industrial engineering applied to AI. </div>

Framework 3: Total Quality Management for AI Systems

Principle 1: Customer Focus
Quality is defined by user needs, not engineering metrics. A 99% accurate model that answers the wrong question is worthless.

Principle 2: Process Approach
Quality comes from process, not heroics. Build systematic evaluation into every stage of development, not just at the end.

Principle 3: Continuous Improvement
There's no "done." Monitor, measure, refine, repeat. Forever.

Principle 4: Evidence-Based Decisions
Gut feel and cherry-picked examples don't count. Use data. Use statistics. Be rigorous.

Principle 5: Relationship Management
Quality is a team sport. Engineering, product, ops, and users all have a role.

These aren't new insights. They're from the 1980s. And they work perfectly for AI systems.

The ISO 9001 for AI (That Doesn't Exist, But Should)

ISO 9001 is the international quality standard. It's how you prove your systems are reliable. Aerospace, medical devices, and automotive all build their sector quality standards on it (AS9100, ISO 13485, IATF 16949).

AI has no equivalent.

We have fragmented best practices. We have vendor-specific tools. We have research papers. But we have no standardized quality framework for AI systems.

<p class="aside"> If you've ever tried to answer "How do we know this AI system is production-ready?", you've felt this gap. Manufacturing had clear answers in 1987. We're still winging it in 2024. </p>

Here's what an ISO 9001-style AI quality standard would look like:

1. Documentation Requirements

  • Document all training data sources and lineage
  • Document model architecture and hyperparameters
  • Document evaluation criteria and thresholds
  • Document deployment process and rollback procedures
  • Version everything.

2. Measurement Requirements

  • Define KPIs before deployment
  • Measure baseline performance
  • Set acceptable quality ranges
  • Track metrics continuously
  • Make it auditable.

3. Monitoring Requirements

  • Real-time performance monitoring
  • Drift detection (data, model, concept)
  • Alert thresholds for quality degradation
  • Automated health checks
  • Catch problems before users do.

4. Quality Gates

  • Pre-deployment evaluation (unit tests for AI)
  • Staging environment validation
  • Production canary testing
  • Go/no-go criteria enforced, not suggested
  • No shortcuts to production.

5. Continuous Improvement

  • Regular performance reviews
  • Root cause analysis for failures
  • Systematic improvement cycles
  • Knowledge sharing across teams
  • Learn from every incident.

6. Audit Trail

  • Log all model versions deployed
  • Log all evaluation results
  • Log all production incidents
  • Enable forensic analysis
  • "What happened and when?" must be answerable.

Sound excessive?

This is standard practice in manufacturing, aerospace, and medical devices. These industries can't afford failures.

Neither can AI. But we act like we can.

Why We Forgot: The AGI Distraction

Here's the uncomfortable question: If quality principles are 60 years old and proven, why did AI forget them?

The answer: We've been too busy chasing AGI to build reliable systems.

The AI field has been in a multi-decade race to artificial general intelligence. The focus has been:

  • "Can we make it smarter?"
  • "Can we scale it bigger?"
  • "Can we make it more general?"

Not:

  • "Can we make it reliable?"
  • "Can we make it consistent?"
  • "Can we make it safe?"

<div class="knowing-nod"> Anyone who's watched leadership prioritize "getting to GPT-5" over "making GPT-4 actually work reliably" has seen this priority mismatch. The field rewards capability advances, not reliability advances. We get better models, not more dependable systems. </div>

The result:

We have incredibly powerful AI that's also incredibly unreliable. Like a race car with no brakes. Impressive speed, terrifying in practice.

Manufacturing went through this exact phase in the 1950s. American auto makers prioritized features and output over quality. They built bigger, flashier cars faster than anyone.

Then Toyota showed up with boring, reliable cars—and won.

The quality revolution wasn't about making better cars in terms of features. It was about making reliable cars. Predictable. Consistent. Trustworthy.

AI is due for the same revolution.

The winners won't be whoever builds the smartest model. They'll be whoever builds the most reliable AI systems.

And to do that, we need to remember what the boomers already figured out.

The Practical Application: AI Quality Playbook

Enough history. How do you actually apply 60-year-old quality principles to modern AI?

Step 1: Define Quality (Before You Build)

Manufacturing approach: Write the spec before you start production.

AI approach: Define acceptance criteria before you train the model.

Practical questions:

  • What does "good" look like? (Not just "accurate"—useful, safe, fair, fast)
  • What's the minimum acceptable performance?
  • What failure modes are unacceptable? (Hallucinations? Bias? Crashes?)
  • Who decides if quality is sufficient? (Not just the ML team)

<p class="subtle-callout"> If you can't answer these questions, you're not ready to build. You're just hoping it'll work. </p>

Step 2: Measure Systematically (Not Ad Hoc)

Manufacturing approach: Statistical Process Control—measure everything, track over time, detect drift.

AI approach: Evaluation pipelines—automated, versioned, continuous.

Practical implementation:

  • Unit tests for AI (eval datasets, not code tests)
  • Integration tests (end-to-end workflows)
  • Regression tests (ensure updates don't break existing behavior)
  • Performance tests (latency, cost, throughput)
  • Safety tests (adversarial inputs, edge cases)

Run them (a pre-deployment gate sketch follows this list):

  • Before every deployment (quality gate)
  • Continuously in production (monitoring)
  • After every incident (regression prevention)
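
Here's a minimal sketch of the pre-deployment gate in pytest style. `model_answer()`, the dataset paths, and the thresholds are hypothetical placeholders for your own application and versioned eval files.

```python
# test_release_gate.py -- illustrative sketch, not a real test suite
import json

import pytest

def model_answer(prompt: str) -> str:
    """Placeholder: call your model or application here."""
    raise NotImplementedError

def load_cases(path: str):
    """Load a versioned eval dataset (one JSON object per line)."""
    with open(path) as f:
        return [json.loads(line) for line in f]

@pytest.mark.parametrize("case", load_cases("evals/regression_v3.jsonl"))
def test_regression_suite(case):
    """Every previously fixed failure must stay fixed."""
    assert case["expected"] in model_answer(case["input"])

def test_refusal_rate_on_valid_requests():
    """Safety check: valid requests should rarely be refused."""
    cases = load_cases("evals/valid_requests_v1.jsonl")
    refusals = sum("can't help" in model_answer(c["input"]).lower() for c in cases)
    assert refusals / len(cases) <= 0.01
```

Wired into CI so a red suite blocks the deploy, this is what "quality gates enforced, not suggested" looks like in practice.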

Step 3: Separate Building from Evaluation

Manufacturing approach: Production line ≠ Quality inspection. Different people, different incentives.

AI approach: ML engineering ≠ ML evaluation. The team building the model shouldn't be the only team evaluating it.

Why this matters:

Builders optimize for what they can see. They miss what they're blind to. Independent evaluation catches:

  • Edge cases builders didn't consider
  • User needs builders don't understand
  • Failure modes builders rationalized away

<div class="knowing-nod"> If you've ever had a model pass internal evals beautifully, then fail immediately in user hands, you know why this separation matters. Your team isn't bad—they're just biased. Everyone is. </div>

Step 4: Implement PDCA Loops

Manufacturing approach: Continuous improvement cycles. Small, frequent adjustments based on measurement.

AI approach: Iterative model refinement based on production data.

Practical cycle:

  1. Plan: Set improvement goal (e.g., "Reduce hallucination rate by 20%")
  2. Do: Implement change (prompt engineering, fine-tuning, RAG, etc.)
  3. Check: Measure actual improvement (did hallucinations drop?)
  4. Act: If it worked, deploy; if not, try something else

Frequency: Weekly or bi-weekly. Small iterations beat big rewrites.

Step 5: Root Cause Analysis (Not Symptom Whack-a-Mole)

Manufacturing approach: When quality fails, use structured frameworks to find the real cause.

AI approach: When the model fails, don't just patch the prompt. Find out why.

5 Whys Example:

Incident: Model gave wrong answer to user query

  1. Why? Model didn't have relevant context
  2. Why? RAG retrieval failed
  3. Why? The query-rewriting step rephrased the user's query poorly
  4. Why? Query rewriting prompt was too generic
  5. Why? We didn't test query rewriting separately

Root cause: Lack of modular evaluation. We tested end-to-end but not components.

Fix: Add unit tests for query rewriting. Not just "fix this prompt."
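
A minimal sketch of that fix, testing the component in isolation. `rewrite_query()` and the expected terms are hypothetical stand-ins for your own rewriting step.

```python
def rewrite_query(user_query: str) -> str:
    """Placeholder: your query-rewriting prompt or chain goes here."""
    raise NotImplementedError

REWRITE_CASES = [
    # (user query, terms the rewritten query must preserve for retrieval to work)
    ("refund policy for the annual plan?", ["refund", "annual"]),
    ("how do I rotate my API key", ["rotate", "api key"]),
]

def test_query_rewriting_preserves_key_terms():
    for query, required_terms in REWRITE_CASES:
        rewritten = rewrite_query(query).lower()
        for term in required_terms:
            assert term in rewritten
```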

Step 6: Make Quality Everyone's Job

Manufacturing approach: TQM—quality is organization-wide, not just the QA department's problem.

AI approach: Everyone owns reliability—engineering, product, ops, leadership.

Cultural shift:

  • Engineering: "Did we evaluate this thoroughly?" is as important as "Does it work?"
  • Product: "What quality metrics matter to users?" drives prioritization
  • Ops: "Can we detect issues before users do?" is a design requirement
  • Leadership: "Quality velocity" (sustainable pace) beats "feature velocity" (fast but brittle)

<p class="aside"> If your org celebrates "we shipped 10 features this quarter" but not "we had zero production incidents this quarter," your incentives are broken. </p>

The Business Case: Quality is Profitable

Let's talk ROI, because quality engineering isn't charity—it's strategic.

Cost Avoidance

Prevention cost: $1
Detection cost: $10
Correction cost (in production): $100
Crisis cost (PR/legal/trust): $1,000+

<div class="knowing-nod"> Anyone who's had to pause a product launch because the AI broke in production knows this math viscerally. The eval pipeline you "didn't have time for" would have cost $50K. The production incident cost $2M in engineering time, customer churn, and delayed revenue. </div>

Real example:

Company A: "We'll build evals later, let's ship fast."

  • Month 1-3: Fast feature velocity, team happy
  • Month 4: Production incident, 2 weeks to fix
  • Month 6: Another incident, trust eroding
  • Month 9: Building the eval infrastructure they should have built in Month 1
  • Cost: 6 months of technical debt, multiple incidents, deferred revenue

Company B: "We'll build evals upfront."

  • Month 1: Slower start, building test infrastructure
  • Month 2-9: Steady velocity, high confidence deployments, zero major incidents
  • Cost: 2 weeks upfront investment
  • Savings: 6+ months of firefighting avoided

The math is obvious. We just don't do the math.

Competitive Moat

In 2024, everyone has access to GPT-4, Claude, Gemini. The models are commoditized.

Differentiation isn't model capability. It's system reliability.

The company that can dependably deliver AI value wins. Not the one with the smartest model—the one with the most predictable model.

Think about it:

Would you rather use:

  • An AI that's brilliant 80% of the time but unpredictable
  • An AI that's good 95% of the time and you know when it'll fail

Enterprise customers choose reliability over capability. Every. Single. Time.

Quality engineering is your competitive advantage. Boomers figured this out in 1960. We're relearning it in 2024.

Conclusion: Everything Old is New Again

The boomers called it Quality Assurance and Control.
Gen AI calls it Observability and Evaluations.
Tomato, tomahto.

The principles of building reliable systems haven't changed:

  • Define quality before you build
  • Measure systematically
  • Detect issues early
  • Fix root causes, not symptoms
  • Improve continuously
  • Make quality everyone's job

These worked in 1960. They work in 2024.

The AI field spent decades chasing AGI. We built incredible capabilities. But we forgot the basics of reliability engineering.

<p class="aside"> If you've ever felt like you're reinventing the wheel as you build AI evaluation systems, you're not wrong. You are reinventing the wheel. The boomers already invented it. We just forgot to read the manual. </p>

The good news: The manual exists. It's called quality engineering. It's proven. It scales. It works.

The better news: You don't have to start from scratch. Just dust off the frameworks, swap "manufacturing process" for "AI model," and apply them.

PDCA. DMAIC. Six Sigma. ISO 9001. These aren't relics. They're blueprints.

The AI systems that win won't be the ones with the most advanced models. They'll be the ones with the most disciplined quality processes.

The boomers knew this. Time we relearn it.


Appendix: Quick Reference - Quality Engineering for AI

PDCA for AI

| Phase | AI Application |
| --- | --- |
| Plan | Define eval metrics, acceptance criteria, quality thresholds |
| Do | Deploy model, run inference, collect data |
| Check | Compare performance to thresholds, detect drift |
| Act | Root cause analysis, implement fixes, update standards |

DMAIC for Model Improvement

| Phase | AI Application |
| --- | --- |
| Define | What problem? What success? Which users? |
| Measure | Baseline metrics, variation sources, failure modes |
| Analyze | Root cause of failures, error patterns, systemic issues |
| Improve | Fix implementation, A/B test, validate results |
| Control | Production monitoring, drift alerts, sustained quality |

Quality Checklist for AI Deployment

  • [ ] Acceptance criteria defined before building
  • [ ] Evaluation dataset created and versioned
  • [ ] Automated eval pipeline in place
  • [ ] Quality gates enforced (cannot deploy without passing)
  • [ ] Independent evaluation (not just builders)
  • [ ] Production monitoring configured
  • [ ] Drift detection alerts set
  • [ ] Incident response process documented
  • [ ] Root cause analysis framework adopted
  • [ ] Continuous improvement cycles scheduled

Cost Comparison

| Approach | Upfront Cost | Incident Cost | Total (1 year) |
| --- | --- | --- | --- |
| No QA | $0 | $500K (5 incidents) | $500K |
| Reactive QA | $50K | $200K (2 incidents) | $250K |
| Proactive QA | $100K | $20K (0-1 minor) | $120K |

ROI of quality engineering: 4x+


Key Takeaways:

  1. AI observability = Quality control (rebranded)
  2. The frameworks already exist—PDCA, DMAIC, Six Sigma, ISO 9001
  3. Prevention is 10-100x cheaper than correction
  4. Reliability is the new competitive moat (not capability)
  5. Make quality everyone's job, not just ML engineering
  6. The boomers solved this in 1960. Read the manual.

The quality revolution happened 60 years ago. AI is just catching up.

Which company will you be—the one that forgets history, or the one that learns from it?