Chapter 1: The Anatomy of a Failed Data Project #

"Plans are worthless, but planning is everything." -- Dwight D. Eisenhower


20 min read | All personas | Part I: The Problem

What you'll learn:

- Why a well-funded, well-staffed data project can still fail
- How context is lost at each of five team handoffs
- What coordination overhead actually costs a project
- Why the same failure pattern repeats across the industry


The Problem #

Meet the team at SimShop, a mid-size e-commerce company doing $200M in annual revenue. Customer churn has been climbing -- up 4 percentage points in the last two quarters. The VP of Data, a seasoned executive named David, has secured a $500K budget and six months to build a churn prediction system.

On paper, this project has everything going for it: executive sponsorship, adequate budget, talented people, and a clear business problem. David assembles a team of five:

| Role | Person | Annual Salary | Allocation |
| --- | --- | --- | --- |
| Business Analyst | Raj | $115,000 | 50% for 6 months |
| Data Engineer | Priya | $135,000 | 100% for 6 months |
| Data Scientist | Marcus | $140,000 | 100% for 6 months |
| QA Engineer | Sarah | $110,000 | 50% for 6 months |
| MLOps Engineer | Chen | $145,000 | 50% for 4 months |

Total fully loaded personnel cost: approximately $310,000 over the project timeline. Add infrastructure, tooling licenses, cloud compute, and management overhead, and the $500K budget feels tight but workable.
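The figure can be reproduced from the salary table. Raw salary times allocation comes to roughly $218K; the ~$310K total only reconciles with a fully loaded rate (benefits, payroll taxes, overhead), assumed here at ~1.42x — that multiplier is an assumption for illustration, not a number from the chapter.

```python
# Reproducing the personnel-cost arithmetic from the staffing table.
# The 1.42x loading multiplier is an assumption, not a source figure.
team = [
    # (person, annual salary, allocation, months)
    ("Raj",    115_000, 0.5, 6),
    ("Priya",  135_000, 1.0, 6),
    ("Marcus", 140_000, 1.0, 6),
    ("Sarah",  110_000, 0.5, 6),
    ("Chen",   145_000, 0.5, 4),
]

base = sum(salary * alloc * months / 12 for _, salary, alloc, months in team)
print(f"base allocated salary: ${base:,.0f}")        # ~$217,917
print(f"fully loaded (x1.42): ${base * 1.42:,.0f}")  # ~$309,442
```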

David is cautiously optimistic. He has done this before.

Six months later, the project is dead.

Not officially dead -- it lingers as a "Phase 2" item on the roadmap. But the model never reaches production. The pipeline runs intermittently. The business case has been quietly shelved. $487,000 has been spent.

What happened?

Handoff 1: Business to Engineering (Weeks 1-4) #

Raj, the business analyst, spends three weeks producing a 47-page Business Requirements Document. It is thorough, well-written, and includes stakeholder interviews, competitive analysis, and success criteria. The key metric is clear: reduce monthly churn rate from 8.2% to 6.5% within 90 days of deployment.

Raj delivers the BRD in a 90-minute presentation. Priya and Marcus attend. They nod. They ask a few questions. They leave with a PDF.

Here is what gets lost in translation:

Diagram: Requirements Lost in Translation

```mermaid
flowchart LR
  subgraph BRD ["Raj's BRD (47 pages)"]
    R1["Churn = no purchase AND\nno login for 90 days,\nexcluding seasonal customers"]
    R2["Support ticket sentiment\nas feature, weighted by recency"]
    R3["Model must be explainable\nper GDPR Article 22"]
  end
  subgraph ENG ["Priya's Interpretation"]
    E1["Customer hasn't ordered\nin 90 days\n(login data missing,\nseasonal adjustment unclear)"]
    E2["Add support data later —\nneed NLP pipeline first"]
    E3["SHAP values, probably.\nWill figure out format later."]
  end
  R1 -.->|"context lost"| E1
  R2 -.->|"deferred"| E2
  R3 -.->|"misunderstood"| E3
```
Three critical business rules -- the churn definition, the support sentiment feature, and the explainability requirement -- are either simplified, deferred, or misunderstood. Nobody realizes it yet.

Anti-Pattern

- The PDF Handoff. A 47-page document delivered as a PDF is not a specification -- it is a wish list. Without machine-readable acceptance criteria, there is no way to automatically validate whether the implementation matches the requirement.
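What a machine-readable acceptance criterion might look like for the churn definition above — a minimal sketch with illustrative field names, not the chapter's actual spec format. The point is that the rule is executable, so an implementation that drops the login check or the seasonal exclusion fails validation instead of silently diverging:

```python
from datetime import date, timedelta

# Hypothetical sketch: one BRD requirement as data, not prose in a PDF.
# All ids and field names are illustrative.
CHURN_CRITERION = {
    "id": "REQ-001",
    "description": "Churn = no purchase AND no login for 90 days, "
                   "excluding seasonal customers",
    "window_days": 90,
    "exclusions": ["seasonal"],
}

def is_churned(customer, as_of, criterion=CHURN_CRITERION):
    """Apply the criterion exactly as specified, not a simplified version."""
    if customer["segment"] in criterion["exclusions"]:
        return False
    window = timedelta(days=criterion["window_days"])
    no_purchase = as_of - customer["last_purchase"] > window
    no_login = as_of - customer["last_login"] > window
    return no_purchase and no_login  # AND of both signals, not purchase-only
```

Because the criterion is data, the same object can later drive test generation rather than being re-read (or misremembered) from a document.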

Handoff 2: Data Engineering to Data Science (Weeks 5-12) #

Priya spends eight weeks building the data pipeline. She is excellent at her job. The pipeline ingests data from four source systems, performs 23 transformations, and produces a feature store with 47 features.

But three problems emerge:

Problem 1: The churn definition is wrong. Priya implements "no purchase in 90 days" without the login activity check or the seasonal exclusion. This means the training data is mislabeled for approximately 12% of customers.

Problem 2: Support ticket data is not integrated. The NLP pipeline for sentiment analysis would take an additional four weeks. Priya decides to defer it and document the gap in a Confluence page that Marcus never reads.

Problem 3: Data quality issues are undiscovered. Three of SimShop's source systems have known quality problems -- NULL customer IDs in the CRM (affecting 3.2% of records), timezone inconsistencies in the event stream, and a duplicate detection gap in the order system. Priya does not know about these because there is no data quality monitoring in place.
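The first problem is worth making concrete. Priya's simplified rule and the BRD rule disagree on exactly the customers the exclusions were written for. A minimal sketch with synthetic data (all names and dates illustrative) shows how the two labelings diverge:

```python
from datetime import date, timedelta

AS_OF = date(2024, 6, 1)
WINDOW = timedelta(days=90)

def label_naive(c):
    # Priya's implementation: purchase recency only
    return AS_OF - c["last_purchase"] > WINDOW

def label_spec(c):
    # BRD rule: no purchase AND no login, excluding seasonal customers
    if c["seasonal"]:
        return False
    return (AS_OF - c["last_purchase"] > WINDOW
            and AS_OF - c["last_login"] > WINDOW)

customers = [
    # bought long ago but logs in weekly: naive says churned, spec says active
    {"last_purchase": date(2024, 1, 5), "last_login": date(2024, 5, 28), "seasonal": False},
    # seasonal customer dormant off-season: naive says churned, spec excludes
    {"last_purchase": date(2023, 12, 20), "last_login": date(2023, 12, 22), "seasonal": True},
    # genuinely churned under both definitions
    {"last_purchase": date(2024, 1, 5), "last_login": date(2024, 1, 10), "seasonal": False},
]

mislabeled = sum(label_naive(c) != label_spec(c) for c in customers)
print(f"{mislabeled}/{len(customers)} customers mislabeled by the naive rule")
```

At SimShop's scale the disagreement rate lands around 12% — and every one of those rows trains the model on the wrong answer.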

Diagram: Pipeline Data Flow and Hidden Defects

```mermaid
flowchart LR
  subgraph SRC ["Source Systems"]
    A["Orders\n(good)"]
    B["CRM\n(3.2% NULL IDs)"]
    C["Events\n(TZ bugs)"]
    D["Support Tickets\nDEFERRED"]
  end
  subgraph PIPE ["Priya's Pipeline"]
    T["Transform +\nFeature Eng"]
  end
  subgraph FS ["Feature Store"]
    F1["47 features"]
    F2["Churn def\nWRONG (12%)"]
    F3["Support\nMISSING"]
    F4["Quality\nUNKNOWN"]
  end
  A -->|ETL| T
  B -->|ETL| T
  C -->|ETL| T
  D -.->|"deferred"| FS
  T --> FS
```

Marcus receives the feature store with a Slack message: "Feature store is ready, 47 features, churn label is in the is_churned column. Let me know if you need anything."

He does not know the label is wrong. He does not know the support data is missing. He does not know about the data quality issues.

Insight

- The Invisible Handoff Problem. The most dangerous defects in data projects are the ones that cross team boundaries silently. A mislabeled target variable does not throw an error. It trains a model that confidently predicts the wrong thing.
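A defect like this can be made visible at the boundary: recompute the label from raw signals on a sample and fail the handoff if it disagrees with the delivered column. A minimal sketch — function and field names are hypothetical, not the platform's API:

```python
def audit_labels(rows, recompute, column="is_churned", tolerance=0.01):
    """Fail a handoff if the delivered label disagrees with the spec
    definition on more than `tolerance` of a sample. Illustrative names."""
    disagreements = sum(row[column] != recompute(row) for row in rows)
    rate = disagreements / len(rows)
    if rate > tolerance:
        raise ValueError(
            f"label audit failed: {rate:.1%} of sampled rows "
            f"disagree with the spec definition"
        )
    return rate
```

Run against SimShop's feature store with the BRD rule as `recompute`, this check would have raised on day one of the handoff — a 12% disagreement rate is not subtle once anything is looking for it.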

Handoff 3: Data Science to Quality Assurance (Weeks 13-20) #

Marcus trains 14 models over six weeks. His best performer is a gradient-boosted tree achieving an AUC of 0.79 on the test set. He is satisfied -- it clears the 0.75 threshold in the BRD.

Except it does not. Because the BRD actually specifies AUC > 0.80, and that number was calculated assuming the support sentiment feature would be included. Marcus never went back to the BRD. He remembered "something around 0.75-0.80" from Raj's presentation four months ago.
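The threshold lived only in prose, four months and one presentation away from the person it constrained. Pinned in one versioned place and enforced as a gate, it cannot be misremembered — a minimal sketch, with illustrative structure:

```python
# Hypothetical sketch: the BRD threshold as a machine-checkable gate,
# read from a versioned spec rather than recalled from a meeting.
SPEC = {"metric": "auc", "threshold": 0.80}

def quality_gate(measured_auc, spec=SPEC):
    passed = measured_auc >= spec["threshold"]
    verdict = "PASS" if passed else "FAIL"
    print(f"{verdict}: AUC {measured_auc:.2f} vs required {spec['threshold']:.2f}")
    return passed

quality_gate(0.79)  # Marcus's model fails in CI in Week 13, not Week 16
```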

When Sarah, the QA engineer, begins testing in Week 16, she discovers the discrepancy. This triggers the first of many painful meetings.

The Meeting Cascade:

| Week | Meeting | Duration | Attendees | Outcome |
| --- | --- | --- | --- | --- |
| 16 | "Churn definition alignment" | 2 hours | Raj, Priya, Marcus | Discover the 12% mislabeling |
| 17 | "Pipeline rework planning" | 1.5 hours | Priya, Marcus, David | Decide to fix churn definition |
| 17 | "Support data status review" | 1 hour | Priya, Raj | Agree to defer support data to Phase 2 |
| 18 | "Retrained model review" | 2 hours | Marcus, Sarah, Raj | New AUC is 0.76 -- still below 0.80 |
| 19 | "Scope negotiation" | 3 hours | All + David | Lower AUC threshold to 0.76, cut features |
| 20 | "Go/no-go decision" | 1.5 hours | David + leadership | Conditional go -- but Chen has concerns |

Total meeting time in Weeks 16-20: 11 hours across the team. This does not include the hallway conversations, Slack threads, and email chains.

Anti-Pattern

- Scope Negotiation as Recovery. When a project misses its targets, the most common response is to lower the targets rather than fix the root cause. Lowering AUC from 0.80 to 0.76 does not make the model better -- it makes the stakeholders feel better temporarily.

Handoff 4: QA to Deployment (Weeks 21-24) #

Sarah writes 23 test cases. She has limited context on the business requirements (Raj's BRD) and limited understanding of the model internals (Marcus's notebook). Her tests focus on what she can observe: API response times, data format validation, and basic accuracy checks.

She does not test:

- whether the churn label matches the BRD definition
- whether the model still meets its target without the support sentiment feature
- whether the explainability output satisfies GDPR Article 22
- whether batch and real-time feature computation agree

These are not oversights. Sarah is a competent QA engineer. She simply does not have the context to write these tests, and nobody gave it to her.

Chen, the MLOps engineer, begins deployment work in Week 21. He discovers that:

Deployment Readiness Checklist — Score: 2/9
  • ✓ Model artifact exists
  • ✓ API endpoint defined
  • ✗ Production pipeline matches training — MISMATCH
  • ✗ Serving infrastructure provisioned — NOT BUDGETED
  • ✗ Drift monitoring configured — NOT DEFINED
  • ✗ Rollback procedure documented — NOT WRITTEN
  • ✗ GDPR explainability implemented — FORGOTTEN
  • ✗ Canary deployment plan — NOT PLANNED
  • ✗ On-call runbook — NOT WRITTEN
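A checklist like this only helps if it blocks the release. A minimal sketch of the same nine items as an all-or-nothing deploy gate — the item keys are illustrative, not a real tool's schema:

```python
# Hypothetical sketch: Chen's readiness checklist as a deploy gate,
# evaluated continuously instead of discovered in Week 21.
CHECKLIST = {
    "model_artifact": True,
    "api_endpoint": True,
    "train_serve_parity": False,
    "serving_infra": False,
    "drift_monitoring": False,
    "rollback_procedure": False,
    "gdpr_explainability": False,
    "canary_plan": False,
    "oncall_runbook": False,
}

def deploy_gate(checklist):
    score = sum(checklist.values())
    print(f"readiness: {score}/{len(checklist)}")
    return score == len(checklist)  # all-or-nothing: no partial deploys

deploy_gate(CHECKLIST)  # prints "readiness: 2/9" and returns False
```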

Handoff 5: The Final Collapse (Weeks 24-26) #

By Week 24, the project is in crisis mode. The model is deployed to a staging environment but cannot serve predictions because the feature pipeline has a training-serving skew -- features computed differently in batch (training) versus real-time (serving).
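Training-serving skew is easy to introduce and cheap to detect if anyone tests for it: run both feature implementations over the same fixture rows and require identical output. A minimal sketch, with hypothetical feature and field names:

```python
from datetime import date

# Hypothetical training-serving parity check. The two functions represent
# the batch (training) and real-time (serving) paths for the same feature.
def days_since_purchase_batch(row):
    # batch path: computed against the training snapshot date
    return (row["snapshot"] - row["last_purchase"]).days

def days_since_purchase_serving(row):
    # serving path: computed against request time -- a classic skew source
    return (row["request_time"] - row["last_purchase"]).days

def parity_check(rows, batch_fn, serving_fn):
    """Return the fixture rows where the two paths disagree."""
    return [r for r in rows if batch_fn(r) != serving_fn(r)]
```

Any nonzero result from `parity_check` means the model will see features at serving time that it never saw in training — exactly the failure that stalled SimShop's staging deploy.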

David calls an all-hands meeting. The options are:

  1. Extend the project by 8 weeks to fix the pipeline, add monitoring, implement explainability, and do proper testing. Cost: additional $130,000.
  2. Deploy as-is with known gaps. Risk: GDPR non-compliance, silent model degradation, no rollback capability.
  3. Shelve the project and revisit in Q3 with a different approach.

David chooses option 3. The project is shelved. $487,000 has been spent. Zero business value delivered.

The Coordination Tax #

Looking back at the project timeline, one pattern dominates: coordination overhead.

Diagram: Project Timeline and Coordination Overhead

```mermaid
gantt
  title SimShop Churn Project — 26 Week Timeline
  dateFormat YYYY-MM-DD
  axisFormat Wk %V
  section Phases
    Requirements (4 wk)      :req,  2024-01-01, 28d
    Data Engineering (8 wk)  :de,   2024-01-29, 56d
    Data Science (6 wk)      :ds,   2024-03-25, 42d
    QA (4 wk)                :qa,   2024-05-06, 28d
    Deploy attempt (3 wk)    :dep,  2024-06-03, 21d
  section Crisis Points
    Requirements gap found   :crit, 2024-01-22, 7d
    Pipeline design issue    :crit, 2024-02-19, 7d
    Feature store handoff    :crit, 2024-03-18, 7d
    Churn definition crisis  :crit, 2024-04-15, 7d
    Scope negotiation        :crit, 2024-05-06, 14d
    Project shelved          :crit, 2024-06-10, 7d
```
Coordination Cost Summary
  • Total multi-person meetings: ~26 hours
  • At blended rate of $75/hr × 3 avg attendees: ~$5,850 in meeting time alone
  • Plus context switching, Slack, email: estimated 18% of total project hours
  • Key crises: requirements review (Wk 4), churn definition crisis (Wk 16), scope negotiation (Wk 19), project decision (Wk 24)

The 18% coordination tax is not unique to SimShop. Research from the Standish Group CHAOS Report consistently shows that communication and coordination failures are the leading cause of project failure -- ahead of technology, requirements, and skills.

Insight

- The Coordination Tax Compounds. Each handoff failure does not just cost the time to fix it. It costs the time to discover the failure, the time to coordinate the fix across teams, the time to re-test downstream artifacts, and the morale cost of rework. A single mislabeled column can cascade into weeks of wasted effort.

The Pattern: Why 85% Fail #

SimShop's project is not unusual. It is the median outcome for data and ML projects across industry. The pattern has five consistent elements:

The 85% Failure Pattern
  • Lossy Handoffs — Requirements lose nuance at every boundary: BA → DE → DS, with context lost at each step
  • Deferred Quality — "We'll test it later" means "we won't test it"
  • Invisible Defects — Data bugs don't throw exceptions
  • Sequential Dependency — Can't test until built, can't deploy until tested
  • Scope Erosion — Requirements shrink to fit what was actually built

These five elements interact. Lossy handoffs create invisible defects. Invisible defects are discovered late because quality is deferred. Late discovery forces rework that breaks sequential dependencies. And when the timeline is blown, scope erodes to fit what remains.

This is not a management failure or a skills failure. It is a structural failure -- an inevitable consequence of organizing data projects around sequential human handoffs without machine-enforceable quality gates.

Industry Perspective #

SimShop's story maps directly to findings from multiple industry studies:

Gartner (2024): 85% of AI/ML projects fail to reach production. The leading causes are data quality issues (cited by 67% of respondents), organizational silos (54%), and lack of clear success criteria (48%).

Standish Group CHAOS Report (2023): Agile projects have a 42% success rate, Waterfall projects 13%. But data/ML projects are worse than both because they combine technical complexity with cross-functional coordination requirements that neither methodology was designed to handle.

Anaconda State of Data Science (2023): Data scientists report spending 45% of their time on data preparation and cleaning -- time that produces no model value but is consumed by the gap between engineering and science teams.

McKinsey (2023): Organizations that successfully deploy ML at scale share three characteristics: cross-functional teams (not silos), automated quality checks (not manual testing), and clear ownership models (not shared accountability, which means no accountability).

Try It

- Map Your Own Project. Take your most recent data/ML project and diagram the handoffs. For each handoff, ask: What context was lost? What assumptions were made? What was deferred? You will likely find at least three of the five failure pattern elements.

The Evidence: What Could Have Been #

Here is what makes SimShop's story not just a cautionary tale but a solvable problem.

We took the same churn prediction project -- the same data, the same business requirements, the same success criteria -- and ran it through the Neam data intelligence agent stack on the DataSims platform. The results:

| Metric | SimShop (Manual Team) | Neam Agent Stack |
| --- | --- | --- |
| Total cost | $487,000 spent ($500K budget) | $34,700 (API + compute) |
| Duration | 24 weeks (then shelved) | Hours (simulation time) |
| Phases completed | 3 of 7 (requirements, engineering, partial science) | 7 of 7 (all phases) |
| Model AUC | 0.76 (below 0.80 target) | 0.847 (exceeds target) |
| Churn definition | Wrong (12% mislabeled) | Correct (from BRD specs) |
| Support features | Deferred | Included |
| Test coverage | 23 manual tests | 47 auto-generated tests (94%) |
| Quality gates | None | Formal gates, all passed |
| GDPR compliance | Forgotten | Built into specs |
| Reproducibility | Unknown | 100% (50/50 runs) |

The cost difference is staggering: $34,700 versus $548,000 (the full budget equivalent including rework and incidents). That is a 93.7% reduction. But cost is not even the most important difference.

The most important difference is completeness. The manual team completed 3 of 7 lifecycle phases. The agent stack completed all 7: requirements analysis, data engineering, model training, causal analysis, quality testing, deployment, and monitoring setup. And every phase was connected -- the acceptance criteria from the Data-BA Agent were used by the DataTest Agent to generate test cases. The churn definition from the BRD was used by the ETL Agent to correctly label the target variable. The GDPR requirement was traced from spec to implementation to test.

No handoff failures. No lossy compression. No deferred quality.

Traditional Team
  • BA → (PDF, lossy) → DE → (Slack, lossy) → DS
  • DS → (notebook, incomplete) → QA
  • QA → (manual, no context) → DevOps
  • Result: 3/7 phases, AUC 0.76
Neam Agent Stack
  • Data-BA → (machine-readable spec) → ETL Agent
  • DataScientist → (same specs) → DataTest (auto-generates tests from acceptance criteria)
  • MLOps Agent → canary deploy + monitoring → Production
  • Result: 7/7 phases, AUC 0.847
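The "auto-generates tests from acceptance criteria" link above is the one the manual team never had. The core idea can be sketched in a few lines: every criterion in the spec must map to a measured result, so a requirement can fail loudly but never vanish. All ids and field names below are illustrative, not the platform's actual schema:

```python
# Hypothetical sketch of spec-to-test traceability. A criterion with no
# recorded result counts as failed, never as skipped.
CRITERIA = [
    {"id": "REQ-001", "description": "churn label matches BRD definition"},
    {"id": "REQ-014", "description": "model AUC >= 0.80"},
    {"id": "REQ-022", "description": "explainability output per GDPR Art. 22"},
]

def trace_spec_to_tests(criteria, results):
    """Return {criterion_id: passed} for every criterion in the spec."""
    return {c["id"]: results.get(c["id"], False) for c in criteria}

report = trace_spec_to_tests(CRITERIA, {"REQ-001": True, "REQ-014": True})
print(report)  # REQ-022 surfaces as False instead of being forgotten
```

This is the structural difference: in the manual project, the GDPR requirement lived on page 31 of a PDF; here, forgetting it is impossible because the suite fails until it is addressed.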

These are not theoretical projections. They are experimental results from the DataSims evaluation environment -- a containerized simulation with 164 database tables, 12 schemas, and 10 controlled data quality issues. The experiment was run 50 times across 10 conditions. Every run completed successfully.

You can reproduce these results yourself. The entire environment is open-source at github.com/neam-lang/Data-Sims.

Key Takeaways #

- Well-funded, well-staffed projects fail for structural reasons, not for lack of talent or budget.
- Every human handoff compresses context lossily, and the resulting defects are silent: mislabeled data does not throw an error.
- Coordination overhead consumed an estimated 18% of project hours at SimShop -- before counting rework.
- Machine-readable specifications and enforced quality gates remove the failure modes that shelved SimShop's project.

For Further Exploration #