Chapter 1: The Anatomy of a Failed Data Project #
"Plans are worthless, but planning is everything." -- Dwight D. Eisenhower
20 min read | All personas | Part I: The Problem
What you'll learn:
- How a realistic churn prediction project fails across five team handoffs
- The coordination tax that silently consumes 18% of project time
- Where the 85% failure pattern emerges -- not in algorithms, but in organization
- What the DataSims results show the same project could have achieved
The Problem #
Meet the team at SimShop, a mid-size e-commerce company doing $200M in annual revenue. Customer churn has been climbing -- up 4 percentage points in the last two quarters. The VP of Data, a seasoned executive named David, has secured a $500K budget and six months to build a churn prediction system.
On paper, this project has everything going for it: executive sponsorship, adequate budget, talented people, and a clear business problem. David assembles a team of five:
| Role | Person | Annual Salary | Allocation |
|---|---|---|---|
| Business Analyst | Raj | $115,000 | 50% for 6 months |
| Data Engineer | Priya | $135,000 | 100% for 6 months |
| Data Scientist | Marcus | $140,000 | 100% for 6 months |
| QA Engineer | Sarah | $110,000 | 50% for 6 months |
| MLOps Engineer | Chen | $145,000 | 50% for 4 months |
Total personnel cost, fully loaded with benefits and payroll overhead: approximately $310,000 over the project timeline. Add infrastructure, tooling licenses, cloud compute, and management overhead, and the $500K budget feels tight but workable.
David is cautiously optimistic. He has done this before.
Six months later, the project is dead.
Not officially dead -- it lingers as a "Phase 2" item on the roadmap. But the model never reaches production. The pipeline runs intermittently. The business case has been quietly shelved. $487,000 has been spent.
What happened?
Handoff 1: Business to Engineering (Weeks 1-4) #
Raj, the business analyst, spends three weeks producing a 47-page Business Requirements Document. It is thorough, well-written, and includes stakeholder interviews, competitive analysis, and success criteria. The key metric is clear: reduce monthly churn rate from 8.2% to 6.5% within 90 days of deployment.
Raj delivers the BRD in a 90-minute presentation. Priya and Marcus attend. They nod. They ask a few questions. They leave with a PDF.
Here is what gets lost in translation:
```mermaid
flowchart LR
    subgraph BRD ["Raj's BRD (47 pages)"]
        R1["Churn = no purchase AND\nno login for 90 days,\nexcluding seasonal customers"]
        R2["Support ticket sentiment\nas feature, weighted by recency"]
        R3["Model must be explainable\nper GDPR Article 22"]
    end
    subgraph ENG ["Priya's Interpretation"]
        E1["Customer hasn't ordered\nin 90 days\n(login data missing,\nseasonal adjustment unclear)"]
        E2["Add support data later —\nneed NLP pipeline first"]
        E3["SHAP values, probably.\nWill figure out format later."]
    end
    R1 -.->|"context lost"| E1
    R2 -.->|"deferred"| E2
    R3 -.->|"misunderstood"| E3
```
Three critical business rules -- the churn definition, the support sentiment feature, and the explainability requirement -- are either simplified, deferred, or misunderstood. Nobody realizes it yet.
- The PDF Handoff. A 47-page document delivered as a PDF is not a specification -- it is a wish list. Without machine-readable acceptance criteria, there is no way to automatically validate whether the implementation matches the requirement.
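What would a machine-readable alternative look like? A minimal sketch in Python -- the schema and field names are invented for illustration, and only the thresholds and clauses come from the chapter's description of the BRD:

```python
# Hypothetical machine-readable acceptance criteria for the churn project.
# The schema is invented for illustration; the thresholds and clauses mirror
# the BRD as described in this chapter.
CHURN_SPEC = {
    "churn_definition": {
        "no_purchase_days": 90,
        "no_login_days": 90,               # the clause the pipeline later drops
        "exclude_segments": ["seasonal"],  # the exclusion that goes missing
    },
    "required_features": ["support_ticket_sentiment"],
    "model": {"metric": "auc", "threshold": 0.80},
}

def validate(implementation, spec=CHURN_SPEC):
    """Return a list of spec violations instead of relying on a PDF review."""
    gaps = []
    impl_def = implementation.get("churn_definition", {})
    for key, expected in spec["churn_definition"].items():
        if impl_def.get(key) != expected:
            gaps.append(f"churn_definition.{key}: expected {expected!r}")
    for feature in spec["required_features"]:
        if feature not in implementation.get("features", []):
            gaps.append(f"missing required feature: {feature}")
    if implementation.get("auc", 0.0) < spec["model"]["threshold"]:
        gaps.append(f"AUC below spec threshold {spec['model']['threshold']}")
    return gaps

# Priya's eventual implementation, as described later in the chapter:
print(validate({
    "churn_definition": {"no_purchase_days": 90},
    "features": [],
    "auc": 0.76,
}))  # four violations, surfaced automatically
```

Run against the implementation that eventually shipped, a check like this would have flagged the dropped login clause, the missing seasonal exclusion, the deferred sentiment feature, and the AUC shortfall in Week 5 instead of Week 16.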
Handoff 2: Data Engineering to Data Science (Weeks 5-12) #
Priya spends eight weeks building the data pipeline. She is excellent at her job. The pipeline ingests data from four source systems, performs 23 transformations, and produces a feature store with 47 features.
But three problems emerge:
Problem 1: The churn definition is wrong. Priya implements "no purchase in 90 days" without the login activity check or the seasonal exclusion. This means the training data is mislabeled for approximately 12% of customers.
Problem 2: Support ticket data is not integrated. The NLP pipeline for sentiment analysis would take an additional four weeks. Priya decides to defer it and document the gap in a Confluence page that Marcus never reads.
Problem 3: Data quality issues go undetected. Three of SimShop's source systems have known quality problems -- NULL customer IDs in the CRM (affecting 3.2% of records), timezone inconsistencies in the event stream, and a duplicate detection gap in the order system. Priya does not know about these because there is no data quality monitoring in place.
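The mislabeling in Problem 1 is easy to see in code. A toy sketch with hypothetical column names, contrasting the rule that shipped with the definition in the BRD:

```python
import pandas as pd

# Illustrative sketch of the label drift, with hypothetical column names.
# The BRD defines churn as: no purchase AND no login for 90 days,
# excluding seasonal customers. The shipped pipeline checks purchases only.
customers = pd.DataFrame({
    "customer_id":         [1, 2, 3],
    "days_since_purchase": [120, 120, 120],
    "days_since_login":    [120, 10, 120],   # customer 2 still logs in
    "segment":             ["retail", "retail", "seasonal"],
})

# What the pipeline shipped:
customers["is_churned_impl"] = customers["days_since_purchase"] > 90

# What the BRD specified:
customers["is_churned_spec"] = (
    (customers["days_since_purchase"] > 90)
    & (customers["days_since_login"] > 90)
    & (customers["segment"] != "seasonal")
)

# Customers 2 and 3 are mislabeled: active logins and seasonal buying
# both look like churn to the simplified rule.
mislabeled = customers["is_churned_impl"] != customers["is_churned_spec"]
print(mislabeled.sum())  # 2 of 3 rows disagree in this toy example
```

At SimShop's scale, the same divergence quietly mislabels roughly one customer in eight, and the model trains on it without complaint.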
```mermaid
flowchart LR
    subgraph SRC ["Source Systems"]
        A["Orders\n(good)"]
        B["CRM\n(3.2% NULL IDs)"]
        C["Events\n(TZ bugs)"]
        D["Support Tickets\nDEFERRED"]
    end
    subgraph PIPE ["Priya's Pipeline"]
        T["Transform +\nFeature Eng"]
    end
    subgraph FS ["Feature Store"]
        F1["47 features"]
        F2["Churn def\nWRONG (12%)"]
        F3["Support\nMISSING"]
        F4["Quality\nUNKNOWN"]
    end
    A -->|ETL| T
    B -->|ETL| T
    C -->|ETL| T
    D -.->|"deferred"| FS
    T --> FS
```
Marcus receives the feature store with a Slack message: "Feature store is ready, 47 features, churn label is in the is_churned column. Let me know if you need anything."
He does not know the label is wrong. He does not know the support data is missing. He does not know about the data quality issues.
- The Invisible Handoff Problem. The most dangerous defects in data projects are the ones that cross team boundaries silently. A mislabeled target variable does not throw an error. It trains a model that confidently predicts the wrong thing.
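One cheap defense is a contract check at the boundary itself. The sketch below compares label prevalence against the 8.2% churn baseline from the BRD; the function name and the 3-point tolerance are illustrative assumptions:

```python
# A cheap guardrail at the team boundary: sanity-check the label against a
# known business baseline before any model training starts. The 8.2% monthly
# churn rate is from the BRD; the function and tolerance are illustrative.
def check_label_prevalence(labels, expected=0.082, tolerance=0.03):
    """Raise loudly if the churn label rate is implausibly far from baseline."""
    rate = sum(labels) / len(labels)
    if abs(rate - expected) > tolerance:
        raise ValueError(
            f"churn label rate {rate:.1%} is more than {tolerance:.0%} from "
            f"the {expected:.1%} business baseline -- check the label definition"
        )
    return rate
```

A simplified churn rule that inflates the positive class would likely trip this check in seconds, turning an invisible defect into a visible one at the moment of handoff rather than in Week 16.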
Handoff 3: Data Science to Quality Assurance (Weeks 13-20) #
Marcus trains 14 models over six weeks. His best performer is a gradient-boosted tree achieving an AUC of 0.79 on the test set. He is satisfied -- it clears the 0.75 threshold in the BRD.
Except it does not. Because the BRD actually specifies AUC > 0.80, and that number was calculated assuming the support sentiment feature would be included. Marcus never went back to the BRD. He remembered "something around 0.75-0.80" from Raj's presentation four months ago.
When Sarah, the QA engineer, begins testing in Week 16, she discovers the discrepancy. This triggers the first of many painful meetings.
The Meeting Cascade:
| Week | Meeting | Duration | Attendees | Outcome |
|---|---|---|---|---|
| 16 | "Churn definition alignment" | 2 hours | Raj, Priya, Marcus | Discover the 12% mislabeling |
| 17 | "Pipeline rework planning" | 1.5 hours | Priya, Marcus, David | Decide to fix churn definition |
| 17 | "Support data status review" | 1 hour | Priya, Raj | Agree to defer support data to Phase 2 |
| 18 | "Retrained model review" | 2 hours | Marcus, Sarah, Raj | New AUC is 0.76 -- still below 0.80 |
| 19 | "Scope negotiation" | 3 hours | All + David | Lower AUC threshold to 0.76, cut features |
| 20 | "Go/no-go decision" | 1.5 hours | David + leadership | Conditional go -- but Chen has concerns |
Total meeting time in Weeks 16-20: 11 hours across the team. This does not include the hallway conversations, Slack threads, and email chains.
- Scope Negotiation as Recovery. When a project misses its targets, the most common response is to lower the targets rather than fix the root cause. Lowering AUC from 0.80 to 0.76 does not make the model better -- it makes the stakeholders feel better temporarily.
Handoff 4: QA to Deployment (Weeks 21-24) #
Sarah writes 23 test cases. She has limited context on the business requirements (Raj's BRD) and limited understanding of the model internals (Marcus's notebook). Her tests focus on what she can observe: API response times, data format validation, and basic accuracy checks.
She does not test:
- Whether the churn definition matches the BRD
- Whether the model is explainable per GDPR Article 22
- Whether the feature pipeline handles timezone edge cases
- Whether the model performs equitably across customer segments
These are not oversights. Sarah is a competent QA engineer. She simply does not have the context to write these tests, and nobody gave it to her.
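Each of these gaps maps to a concrete test once the context exists. A hypothetical spec-derived test for the timezone edge case, with the feature function invented for illustration:

```python
from datetime import datetime, timezone, timedelta

# Hypothetical spec-derived test for one of the gaps above: timezone handling
# in a "days since last purchase" feature. Function and field names are
# illustrative, not from an actual SimShop codebase.
def days_since_last_purchase(last_purchase: datetime, now: datetime) -> int:
    # Normalize both timestamps to UTC so events logged in local time
    # cannot shift a customer across the 90-day churn boundary.
    delta = now.astimezone(timezone.utc) - last_purchase.astimezone(timezone.utc)
    return delta.days

def test_timezone_does_not_shift_churn_boundary():
    # A purchase logged at 23:30 UTC-8 is 07:30 UTC the next day; naive
    # date arithmetic on local timestamps would miscount by a day.
    last = datetime(2024, 1, 1, 23, 30, tzinfo=timezone(timedelta(hours=-8)))
    now = datetime(2024, 4, 1, 8, 0, tzinfo=timezone.utc)
    assert days_since_last_purchase(last, now) == 90
```

Sarah could not write this test -- not for lack of skill, but because the timezone requirement lived in Priya's head and a Confluence page she never saw.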
Chen, the MLOps engineer, begins deployment work in Week 21. He discovers that:
- The model expects features that the production pipeline does not yet generate
- The serving infrastructure needs GPU instances that are not budgeted
- There is no monitoring defined for model drift
- There is no rollback procedure documented
Chen's deployment readiness checklist tells the story:
- ✓ Model artifact exists
- ✓ API endpoint defined
- ✗ Production pipeline matches training — MISMATCH
- ✗ Serving infrastructure provisioned — NOT BUDGETED
- ✗ Drift monitoring configured — NOT DEFINED
- ✗ Rollback procedure documented — NOT WRITTEN
- ✗ GDPR explainability implemented — FORGOTTEN
- ✗ Canary deployment plan — NOT PLANNED
- ✗ On-call runbook — NOT WRITTEN
Handoff 5: The Final Collapse (Weeks 24-26) #
By Week 24, the project is in crisis mode. The model is deployed to a staging environment but cannot serve predictions because the feature pipeline has a training-serving skew -- features computed differently in batch (training) versus real-time (serving).
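Training-serving skew is worth making concrete. In this illustrative sketch (feature logic and dates invented), the batch path computes recency against the nightly snapshot while the serving path uses the request timestamp, so production sees feature values the model never saw in training:

```python
from datetime import datetime, timezone

# Illustrative training-serving skew, with hypothetical feature logic.
# Batch (training) computed recency against the nightly snapshot date;
# real-time serving computes it against the request timestamp. The same
# customer gets two different feature values.
SNAPSHOT = datetime(2024, 6, 1, tzinfo=timezone.utc)

def recency_batch(last_purchase: datetime) -> int:
    # Training pipeline: days relative to the snapshot the batch job used.
    return (SNAPSHOT - last_purchase).days

def recency_serving(last_purchase: datetime, request_time: datetime) -> int:
    # Serving path: days relative to "now" at request time.
    return (request_time - last_purchase).days

last = datetime(2024, 3, 1, tzinfo=timezone.utc)
request = datetime(2024, 6, 15, tzinfo=timezone.utc)

print(recency_batch(last))             # 92
print(recency_serving(last, request))  # 106 -- a 14-day skew on one feature
```

Neither path throws an error; the skew only shows up as predictions that quietly stop matching offline evaluation.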
David calls an all-hands meeting. The options are:
- Extend the project by 8 weeks to fix the pipeline, add monitoring, implement explainability, and do proper testing. Cost: additional $130,000.
- Deploy as-is with known gaps. Risk: GDPR non-compliance, silent model degradation, no rollback capability.
- Shelve the project and revisit in Q3 with a different approach.
David chooses option 3. The project is shelved. $487,000 has been spent. Zero business value delivered.
The Coordination Tax #
Looking back at the project timeline, one pattern dominates: coordination overhead.
```mermaid
gantt
    title SimShop Churn Project — 26 Week Timeline
    dateFormat YYYY-MM-DD
    axisFormat Wk %V
    section Phases
    Requirements (4 wk)      :req, 2024-01-01, 28d
    Data Engineering (8 wk)  :de, 2024-01-29, 56d
    Data Science (6 wk)      :ds, 2024-03-25, 42d
    QA (4 wk)                :qa, 2024-05-06, 28d
    Deploy attempt (3 wk)    :dep, 2024-06-03, 21d
    section Crisis Points
    Requirements gap found   :crit, 2024-01-22, 7d
    Pipeline design issue    :crit, 2024-02-19, 7d
    Feature store handoff    :crit, 2024-03-18, 7d
    Churn definition crisis  :crit, 2024-04-15, 7d
    Scope negotiation        :crit, 2024-05-06, 14d
    Project shelved          :crit, 2024-06-10, 7d
```
- Total multi-person meetings: ~26 hours
- At blended rate of $75/hr × 3 avg attendees: ~$5,850 in meeting time alone
- Plus context switching, Slack, email: estimated 18% of total project hours
- Key crises: requirements review (Wk 4), churn definition crisis (Wk 16), scope negotiation (Wk 19), project decision (Wk 24)
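The figures above can be reproduced with back-of-envelope arithmetic. Allocations come from the staffing table earlier in the chapter; the 160 hours/month conversion is an assumption:

```python
# Reproducing the coordination-tax arithmetic from the figures above.
# Allocations come from the staffing table in this chapter; the
# 160 hours/month conversion is an assumption.
HOURS_PER_MONTH = 160

allocations = {        # person: (allocation, months)
    "Raj":    (0.5, 6),
    "Priya":  (1.0, 6),
    "Marcus": (1.0, 6),
    "Sarah":  (0.5, 6),
    "Chen":   (0.5, 4),
}

total_hours = sum(a * m * HOURS_PER_MONTH for a, m in allocations.values())

meeting_cost = 26 * 3 * 75            # 26 hrs x 3 avg attendees x $75/hr
coordination_hours = 0.18 * total_hours

print(total_hours)                # 3200.0 project hours
print(meeting_cost)               # 5850 -- matches the figure above
print(round(coordination_hours))  # 576 hours lost to coordination
```

Valued at the same $75/hr blended rate, those 576 hours represent roughly $43K of the budget producing no artifact at all.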
The 18% coordination tax is not unique to SimShop. Research from the Standish Group CHAOS Report consistently shows that communication and coordination failures are the leading cause of project failure -- ahead of technology, requirements, and skills.
- The Coordination Tax Compounds. Each handoff failure does not just cost the time to fix it. It costs the time to discover the failure, the time to coordinate the fix across teams, the time to re-test downstream artifacts, and the morale cost of rework. A single mislabeled column can cascade into weeks of wasted effort.
The Pattern: Why 85% Fail #
SimShop's project is not unusual. It is the median outcome for data and ML projects across the industry. The pattern has five consistent elements:
- Lossy Handoffs — Requirements lose nuance at every boundary: BA → DE → DS, with context lost at each step
- Deferred Quality — "We'll test it later" means "we won't test it"
- Invisible Defects — Data bugs don't throw exceptions
- Sequential Dependency — Can't test until built, can't deploy until tested
- Scope Erosion — Requirements shrink to fit what was actually built
These five elements interact. Lossy handoffs create invisible defects. Invisible defects are discovered late because quality is deferred. Late discovery forces rework that breaks sequential dependencies. And when the timeline is blown, scope erodes to fit what remains.
This is not a management failure or a skills failure. It is a structural failure -- an inevitable consequence of organizing data projects around sequential human handoffs without machine-enforceable quality gates.
Industry Perspective #
SimShop's story maps directly to findings from multiple industry studies:
Gartner (2024): 85% of AI/ML projects fail to reach production. The leading causes are data quality issues (cited by 67% of respondents), organizational silos (54%), and lack of clear success criteria (48%).
Standish Group CHAOS Report (2023): Agile projects have a 42% success rate, Waterfall projects 13%. But data/ML projects are worse than both because they combine technical complexity with cross-functional coordination requirements that neither methodology was designed to handle.
Anaconda State of Data Science (2023): Data scientists report spending 45% of their time on data preparation and cleaning -- time that produces no model value but is consumed by the gap between engineering and science teams.
McKinsey (2023): Organizations that successfully deploy ML at scale share three characteristics: cross-functional teams (not silos), automated quality checks (not manual testing), and clear ownership models (not shared accountability, which means no accountability).
- Map Your Own Project. Take your most recent data/ML project and diagram the handoffs. For each handoff, ask: What context was lost? What assumptions were made? What was deferred? You will likely find at least three of the five failure pattern elements.
The Evidence: What Could Have Been #
Here is what makes SimShop's story not just a cautionary tale but a solvable problem.
We took the same churn prediction project -- the same data, the same business requirements, the same success criteria -- and ran it through the Neam data intelligence agent stack on the DataSims platform. The results:
| Metric | SimShop (Manual Team) | Neam Agent Stack |
|---|---|---|
| Total cost | $487,000 spent ($500K budget) | $34,700 (API + compute) |
| Duration | 24 weeks (then shelved) | Hours (simulation time) |
| Phases completed | 3 of 7 (requirements, engineering, partial science) | 7/7 (all phases) |
| Model AUC | 0.76 (below 0.80 target) | 0.847 (exceeds target) |
| Churn definition | Wrong (12% mislabeled) | Correct (from BRD specs) |
| Support features | Deferred | Included |
| Test coverage | 23 manual tests | 47 auto-generated tests (94%) |
| Quality gates | None | Formal gates, all passed |
| GDPR compliance | Forgotten | Built into specs |
| Reproducibility | Unknown | 100% (50/50 runs) |
The cost difference is staggering: $34,700 versus $548,000 (the full budget equivalent including rework and incidents). That is a 93.7% reduction. But cost is not even the most important difference.
The most important difference is completeness. The manual team completed 3 of 7 lifecycle phases. The agent stack completed all 7: requirements analysis, data engineering, model training, causal analysis, quality testing, deployment, and monitoring setup. And every phase was connected -- the acceptance criteria from the Data-BA Agent were used by the DataTest Agent to generate test cases. The churn definition from the BRD was used by the ETL Agent to correctly label the target variable. The GDPR requirement was traced from spec to implementation to test.
No handoff failures. No lossy compression. No deferred quality.
The manual chain:
- BA → (PDF, lossy) → DE → (Slack, lossy) → DS
- DS → (notebook, incomplete) → QA
- QA → (manual, no context) → DevOps
- Result: 3/7 phases, AUC 0.76

The agent chain:
- Data-BA → (machine-readable spec) → ETL Agent
- DataScientist → (same specs) → DataTest (auto-generates tests from acceptance criteria)
- MLOps Agent → canary deploy + monitoring → Production
- Result: 7/7 phases, AUC 0.847
These are not theoretical projections. They are experimental results from the DataSims evaluation environment -- a containerized simulation with 164 database tables, 12 schemas, and 10 controlled data quality issues. The experiment was run 50 times across 10 conditions. Every run completed successfully.
You can reproduce these results yourself. The entire environment is open-source at github.com/neam-lang/Data-Sims.
Key Takeaways #
- The 85% failure rate is structural, not accidental. It emerges from organizing data projects around sequential human handoffs without machine-enforceable quality gates.
- Five elements drive failure: lossy handoffs, deferred quality, invisible defects, sequential dependencies, and scope erosion. Most failed projects exhibit at least three of the five.
- The coordination tax is real and measurable. At SimShop, 18% of project time was consumed by meetings, context-switching, and rework coordination -- and this is below industry average.
- The same project, run through spec-driven agents, achieves dramatically better outcomes. DataSims demonstrates: $34.7K vs $548K cost, 7/7 phases completed (vs 3/7), AUC of 0.847 (vs 0.76), and 100% reproducibility.
- The solution is not better people or better tools. It is a fundamentally different organizational model -- one where specifications drive execution and quality gates enforce standards at every boundary.
For Further Exploration #
- Neam: The AI-Native Programming Language -- The language documentation for the agent system
- DataSims Repository -- Clone the SimShop environment and run the churn prediction experiment
- Gartner, "Why Do 85% of Machine Learning Projects Fail?" (2024)
- Standish Group, CHAOS Report (2023) -- Project success rate benchmarks
- Anaconda, State of Data Science Report (2023) -- Time allocation data for data scientists