Chapter 28 — The Road Ahead: From Demonstration to Production #
"The best way to predict the future is to invent it." -- Alan Kay
📖 15 min read | 👤 All personas | 🏷️ Part VIII: Vision
What you'll learn:
- What must happen to move from simulated environments to production deployments
- Six research frontiers that will shape the next generation of data intelligence
- The vision for the Spec-Driven Development movement
- How you can contribute
Where We Stand #
Over 27 chapters, we have built up a complete picture: the problem (85% failure rate), the architecture (14 agents + 1 orchestrator), the evidence (DataSims ablation study), and the results (93.7% cost reduction, 90.6% risk reduction, 100% reproducibility).
But honesty demands that we acknowledge what we have not yet proven. The DataSims environment is a simulation. The cost comparisons are modeled. The agents are orchestrated by LLMs, which means they inherit the strengths -- and the limitations -- of the underlying models.
This chapter maps the path from demonstration to production. It identifies six research frontiers where work is needed, and it invites the community to participate.
Frontier 1: Production Runtime Engines #
The DataSims experiments ran on a single machine with simulated data. Production deployments demand a different runtime:
| Dimension | Current State | Future State |
|---|---|---|
| Infrastructure | Single machine | Distributed cluster |
| Data | Simulated data | Real production data |
| Scale | 5 agent crew | 14+ agents at scale |
| Duration | Minutes runtime | Continuous operation |
| Cost | $23.50 per run | Enterprise-grade SLAs |
What Is Needed #
Distributed agent execution. In production, agents need to run across multiple machines, with fault tolerance and load balancing. The DIO must coordinate agents that may be in different data centers, operating on different schedules, with different latency characteristics.
Persistent state management. The current checkpoint system works for single-run experiments. Production systems need persistent state that survives process restarts, infrastructure failures, and version upgrades.
Real-time event processing. The current batch-oriented architecture processes complete tasks. Production data platforms generate continuous streams of events -- schema changes, data arrivals, drift alerts -- that agents must react to in real time.
```mermaid
flowchart TB
    subgraph CP["CONTROL PLANE"]
        DIO["DIO Scheduler"]
        AR["Agent Registry"]
        SS["State Store"]
    end
    subgraph DP["DATA PLANE"]
        AN1["Agent Node 1"]
        AN2["Agent Node 2"]
        AN3["Agent Node 3"]
        ANN["Agent Node N"]
    end
    CP --> DP
```
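To make the persistent-state and event-processing requirements concrete, here is a minimal sketch, not part of the Neam codebase: an agent checkpoints its state to SQLite so it survives process restarts, and reacts to individual stream events rather than batch tasks. All names (`CheckpointStore`, `handle_event`) are hypothetical.

```python
import json
import sqlite3

class CheckpointStore:
    """Persists agent state so it survives process restarts."""

    def __init__(self, path=":memory:"):
        self.conn = sqlite3.connect(path)
        self.conn.execute(
            "CREATE TABLE IF NOT EXISTS state (agent TEXT PRIMARY KEY, blob TEXT)"
        )

    def save(self, agent, state):
        self.conn.execute(
            "INSERT OR REPLACE INTO state VALUES (?, ?)", (agent, json.dumps(state))
        )
        self.conn.commit()

    def load(self, agent):
        row = self.conn.execute(
            "SELECT blob FROM state WHERE agent = ?", (agent,)
        ).fetchone()
        return json.loads(row[0]) if row else {}

def handle_event(store, agent, event):
    """React to a single stream event (schema change, drift alert, ...)."""
    state = store.load(agent)      # resume from the last checkpoint
    state.setdefault("events_seen", 0)
    state["events_seen"] += 1
    state["last_event"] = event["type"]
    store.save(agent, state)       # checkpoint after every event
    return state

store = CheckpointStore()
handle_event(store, "quality-agent", {"type": "schema_change"})
result = handle_event(store, "quality-agent", {"type": "drift_alert"})
print(result["events_seen"], result["last_event"])  # 2 drift_alert
```

A production state store would of course be a replicated database rather than SQLite, but the contract is the same: every event handler resumes from, and commits to, durable state.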
Frontier 2: Human Evaluation Studies #
The DataSims experiments measure system performance against automated metrics. What they do not measure is how humans interact with the system:
Questions That Need Answers #
- Trust: Do data engineers trust agent-generated pipelines enough to deploy them?
- Collaboration: Do data scientists treat agents as collaborators that augment their work, or as replacements for it?
- Onboarding: How long does it take a new team member to become productive with spec-driven development?
- Error recovery: When agents make mistakes, how effectively can humans diagnose and fix them?
- Organizational change: What resistance patterns emerge when introducing agent-assisted workflows?
Study Design #
| Element | Details |
|---|---|
| Participants | 30 data professionals (10 engineers, 10 scientists, 10 analysts) |
| Groups | A (manual workflow), B (Neam agents), C (Neam + human-in-loop) |
| Tasks | 3 data lifecycle tasks of increasing complexity |
| Measures | Completion time, quality score, trust survey, NASA-TLX workload |
| Duration | 2 weeks per participant |
| Outcome | Quantified human factors data for spec-driven development |
This study would produce the first empirical evidence of how spec-driven development affects human data practitioners -- not just the automated metrics, but the subjective experience of working alongside intelligent agents.
Frontier 3: Open-Source LLM Evaluation #
The current DataSims experiments use OpenAI models (GPT-4o, o3-mini). A critical research question: how does agent quality change with different LLM backbones?
Proposed Evaluation Matrix #
| LLM Provider | Model | Cost/1K tokens | Status |
|---|---|---|---|
| OpenAI | GPT-4o | $0.0025-0.01 | Tested |
| OpenAI | o3-mini | $0.0011-0.0044 | Tested |
| Anthropic | Claude 3.5 | $0.003-0.015 | Planned |
| Meta | Llama 3.1 405B | Self-hosted | Planned |
| Mistral | Large 2 | $0.002-0.006 | Planned |
| Google | Gemini 2.5 Pro | $0.00125-0.01 | Planned |
| Alibaba | Qwen 2.5 | Self-hosted | Planned |
Research Questions #
- Quality floor: What is the minimum LLM capability required for each agent type?
- Cost-quality tradeoff: Can smaller, cheaper models handle simpler tasks (e.g., Data-BA) while reserving expensive models for complex tasks (e.g., Causal)?
- Self-hosted viability: Can organizations run the entire agent stack on-premises with open-source models?
- Provider diversity: Does using multiple providers improve reliability through failover?
Neam's provider-agnostic architecture makes this evaluation straightforward -- change the provider and model fields in the agent declaration, re-run the experiments, and compare CES scores.
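As an illustration of what provider-agnostic wiring looks like in general (the names below are invented for this sketch, not Neam's actual API), swapping LLM backbones reduces to changing two fields in a declaration:

```python
from dataclasses import dataclass

# Hypothetical registry mapping providers to client constructors.
# In a real evaluation these would wrap the OpenAI, Anthropic, etc. SDKs.
PROVIDERS = {
    "openai": lambda model: f"openai-client:{model}",
    "anthropic": lambda model: f"anthropic-client:{model}",
    "self-hosted": lambda model: f"vllm-client:{model}",
}

@dataclass
class AgentDeclaration:
    name: str
    provider: str
    model: str

def build_client(decl):
    """Resolve an agent declaration to a concrete LLM client."""
    return PROVIDERS[decl.provider](decl.model)

# Re-running an ablation with a different backbone is a two-field change:
baseline = AgentDeclaration("Data-BA", provider="openai", model="gpt-4o")
variant = AgentDeclaration("Data-BA", provider="anthropic", model="claude-3-5")
print(build_client(baseline))  # openai-client:gpt-4o
print(build_client(variant))   # anthropic-client:claude-3-5
```

Because the agent logic never touches a provider SDK directly, the same experiment script can sweep the entire evaluation matrix above.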
Frontier 4: Multi-Domain Validation #
SimShop simulates an e-commerce company. The question remains: does spec-driven development generalize to other domains?
Proposed Domains #
| Domain | Characteristics | Key Challenges |
|---|---|---|
| Healthcare | HIPAA compliance, patient data, clinical models | Regulatory strictness, data sensitivity |
| Finance | Real-time trading data, risk models, SOX compliance | Latency requirements, model explainability |
| Manufacturing | IoT sensor data, predictive maintenance, supply chain | Time-series at scale, edge computing |
| Government | Census data, policy analysis, public accountability | Data sovereignty, transparency requirements |
| Telecommunications | Network logs, customer behavior, 5G data volumes | Volume scale, real-time processing |
What Multi-Domain Validation Requires #
For each domain, we need:
- A domain-specific DataSims environment -- equivalent to SimShop but with healthcare, finance, or manufacturing data schemas
- Domain-specific Agent.MD files -- encoding the institutional knowledge of each industry
- Domain-specific quality issues -- healthcare has missing diagnoses; finance has late trade confirmations; manufacturing has sensor drift
- Domain-specific compliance requirements -- HIPAA, SOX, ISO 27001, GDPR variants
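To make "domain-specific quality issues" concrete, here is an illustrative fault injector in the spirit of DataSims; the function names and rates are invented for the example. Each domain gets its own signature corruption pattern:

```python
import random

# Hypothetical per-domain corruption patterns, mirroring the issues above.
def inject_missing_diagnosis(record):          # healthcare
    record["diagnosis_code"] = None
    return record

def inject_late_confirmation(record):          # finance
    record["confirmed_at"] = record["traded_at"] + 86_400  # one day late
    return record

def inject_sensor_drift(record, drift=0.05):   # manufacturing
    record["reading"] *= 1 + drift
    return record

DOMAIN_ISSUES = {
    "healthcare": inject_missing_diagnosis,
    "finance": inject_late_confirmation,
    "manufacturing": inject_sensor_drift,
}

def corrupt(records, domain, rate, seed=0):
    """Apply the domain's signature quality issue to a fraction of records."""
    rng = random.Random(seed)
    return [
        DOMAIN_ISSUES[domain](dict(r)) if rng.random() < rate else dict(r)
        for r in records
    ]

readings = [{"reading": 100.0} for _ in range(1000)]
drifted = corrupt(readings, "manufacturing", rate=0.1)
print(sum(1 for r in drifted if r["reading"] > 100))  # roughly 100 of 1000
```

A domain-specific DataSims environment would ship such injectors alongside its schemas, so agent crews are graded against the failure modes that actually occur in that industry.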
```mermaid
flowchart LR
    A["2026 Q2\nSimShop\n(e-commerce)\nCOMPLETE"] --> B["2026 Q3\nFinSim\n(financial services)\nIN DESIGN"]
    B --> C["2026 Q4\nHealthSim\n(healthcare)\nPLANNED"]
    C --> D["2027 Q1\nMfgSim\n(manufacturing)\nPLANNED"]
    D --> E["2027 Q2\nCross-domain\ncomparison\nPLANNED"]
```
Frontier 5: Distributed DataSims #
The current DataSims runs on a single Docker Compose stack. Production enterprises have distributed data architectures:
```mermaid
flowchart TB
    subgraph CURRENT["CURRENT: Single-Node DataSims"]
        DC["Docker Compose (1 machine)\nAll 10 services co-located"]
    end
    subgraph FUTURE["FUTURE: Distributed DataSims"]
        subgraph US["Region: US"]
            US_PG["PostgreSQL"]
            US_ML["MLflow"]
            US_AG["Agents"]
        end
        subgraph EU["Region: EU"]
            EU_PG["PostgreSQL"]
            EU_ML["MLflow"]
            EU_AG["Agents"]
        end
        subgraph APAC["Region: APAC"]
            APAC_PG["PostgreSQL"]
            APAC_ML["MLflow"]
            APAC_AG["Agents"]
        end
        GDIO["Global DIO Coordination"]
        US --> GDIO
        EU --> GDIO
        APAC --> GDIO
    end
```
Research Questions for Distributed DataSims #
- Data sovereignty: How do agents handle data that cannot leave a geographic region?
- Latency: How does cross-region agent coordination affect end-to-end completion time?
- Consistency: How do distributed agents maintain consistent state when network partitions occur?
- Federation: Can multiple organizations share agents without sharing data?
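A data-sovereignty-aware scheduler is one concrete shape an answer could take. The sketch below (all names hypothetical) pins each task to the region that holds its dataset, so raw records never cross a boundary; only aggregates flow up to the global DIO.

```python
# Hypothetical region-pinned scheduling: tasks run where their data lives,
# and only aggregated results cross region boundaries.
DATASET_REGION = {
    "eu_customers": "EU",
    "us_orders": "US",
    "apac_sensors": "APAC",
}

def schedule(task):
    """Pin a task to the region owning its dataset (data never moves)."""
    region = DATASET_REGION[task["dataset"]]
    return {"run_in": region, "task": task["name"]}

def federate(regional_counts):
    """The global DIO sees only aggregates, never raw records."""
    return sum(regional_counts.values())

placement = schedule({"name": "churn-features", "dataset": "eu_customers"})
total = federate({"EU": 120, "US": 340, "APAC": 95})
print(placement["run_in"], total)  # EU 555
```

The open questions above are exactly what this toy omits: what happens when a partition separates a region from the global DIO, and how placement interacts with cross-region latency.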
Frontier 6: Cross-Organization Agent.MD Sharing #
Agent.MD files encode domain knowledge. Currently, each organization writes its own. But much of this knowledge is not proprietary -- it is industry best practice:
| Category | Examples |
|---|---|
| Organization-Specific | Table schemas, Column naming conventions, Business rules, Data quality thresholds |
| Industry-Shared | Common churn prediction approaches, Standard feature engineering patterns, Regulatory compliance checklists, Deployment best practices |
The Vision: Agent.MD Marketplace #
```mermaid
flowchart TB
    subgraph MP["Agent.MD Marketplace"]
        subgraph EC["E-Commerce"]
            EC1["Churn prediction"]
            EC2["Rec engine"]
            EC3["LTV model"]
        end
        subgraph HC["Healthcare"]
            HC1["Patient risk"]
            HC2["Readmission"]
            HC3["Drug interaction"]
        end
        subgraph FN["Finance"]
            FN1["Fraud detection"]
            FN2["Credit risk"]
            FN3["AML"]
        end
        subgraph MF["Manufacturing"]
            MF1["Predictive maintenance"]
            MF2["Quality control"]
        end
    end
    MP --- NOTE["Community-contributed, peer-reviewed,\nversion-controlled domain knowledge"]
```
An Agent.MD marketplace would accelerate adoption by letting new organizations start with proven, peer-reviewed domain knowledge rather than encoding everything from scratch. Organizations would contribute back their refinements, creating a positive feedback loop.
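One way a marketplace entry could carry "peer-reviewed, version-controlled" guarantees is through a small manifest attached to each shared Agent.MD. The schema below is purely speculative, a sketch of what a publication gate might check:

```python
from dataclasses import dataclass, field

@dataclass
class AgentMDManifest:
    """Speculative metadata for a shared Agent.MD marketplace entry."""
    name: str
    domain: str
    version: str
    reviewers: list = field(default_factory=list)
    license: str = "Apache-2.0"

def is_publishable(m):
    """Minimal publication gate: semantic version plus at least two reviews."""
    return len(m.version.split(".")) == 3 and len(m.reviewers) >= 2

entry = AgentMDManifest(
    name="churn-prediction",
    domain="e-commerce",
    version="1.2.0",
    reviewers=["org-a", "org-b"],
)
print(is_publishable(entry))  # True
```

The feedback loop described above lives in the `reviewers` and `version` fields: an organization's refinement lands as a new reviewed version rather than a private fork.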
The Spec-Driven Development Movement #
This book has argued for a specific thesis: the bottleneck in data engineering is not code generation but understanding what to build, why, and how to validate it. Spec-driven development addresses this bottleneck by encoding human expertise in structured, machine-readable specifications that agents execute within defined boundaries.
This is not just a technology choice. It is a philosophical position about the relationship between humans and AI agents in the data lifecycle:
- SPECIFICATIONS, NOT PROMPTS: Human expertise is encoded in structured specs, not ad-hoc prompts. Specs are versioned, reviewed, and auditable.
- AGENTS WITHIN BOUNDARIES: Agents execute within the spec's boundaries. They can be creative within those boundaries but cannot exceed them without escalation.
- QUALITY GATES, NOT TRUST: We do not trust agents to produce correct output. We verify. Every phase has quality gates that must be passed.
- TRACEABILITY, NOT OBSERVABILITY: It is not enough to observe what agents do. We must trace WHY they did it, back to the requirement that justified the action.
- COST AS A FIRST-CLASS CONSTRAINT: Budget is not an afterthought. It is an architectural constraint that shapes agent behavior and prevents runaway spending.
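The "quality gates, not trust" principle can be sketched as follows: agent output is never accepted on faith; each phase runs explicit checks, and any failure triggers escalation to a human. The gate names and threshold below are illustrative, not Neam's actual gates.

```python
# Illustrative quality gates: verify agent output, never trust it.
def gate_schema(output):
    """The output must contain the fields the next phase depends on."""
    return {"model_uri", "auc"} <= output.keys()

def gate_quality(output):
    """Illustrative acceptance threshold on model quality."""
    return output.get("auc", 0) >= 0.75

GATES = [("schema", gate_schema), ("quality", gate_quality)]

def run_gates(output):
    """Return (passed, failures); failures trigger human escalation."""
    failures = [name for name, check in GATES if not check(output)]
    return (not failures, failures)

ok, failed = run_gates({"model_uri": "runs:/abc/model", "auc": 0.81})
print(ok, failed)   # True []
ok, failed = run_gates({"model_uri": "runs:/abc/model", "auc": 0.62})
print(ok, failed)   # False ['quality']
```

The same pattern generalizes to the traceability principle: each gate result can be logged against the requirement ID that justified the phase.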
What Success Looks Like #
If spec-driven development succeeds, the data industry will look different in five years:
- Data projects will have an 80%+ production success rate, up from 15% today
- Time-to-production will be measured in days, not months
- Every deployed model will have complete traceability from business requirement to production monitoring
- Data teams will be smaller but more effective, focusing on specification quality rather than code volume
- Reproducibility will be standard, not exceptional
These are ambitious claims. The DataSims results provide early evidence. The research frontiers outlined in this chapter describe the work needed to validate them at scale.
How You Can Contribute #
The Neam ecosystem is open source. Here is how different personas can contribute:
| Persona | Contribution |
|---|---|
| Data Engineers | Build domain-specific DataSims environments for new industries |
| Data Scientists | Evaluate the agent stack on novel problem types beyond churn prediction |
| Researchers | Design and conduct human evaluation studies |
| ML Engineers | Test open-source LLM backbones and contribute performance benchmarks |
| Business Analysts | Create Agent.MD templates for common business domains |
| VP/Directors | Pilot spec-driven development in your organization and share results |
Getting Started #
```bash
# Clone the DataSims repository
git clone https://github.com/neam-lang/Data-Sims.git

# Set up the environment
cd Data-Sims
./scripts/setup.sh small

# Run the experiments
python3 evaluation/run_experiments.py

# Read the results
cat evaluation/reports/experiment_report.md
```
Every contribution -- a new ablation experiment, a domain-specific Agent.MD, a human evaluation study, an open-source LLM benchmark -- brings us closer to solving the 85% problem.
A Final Thought #
We started this book with a number: 85%. Eighty-five percent of machine learning projects fail to reach production. Not because the algorithms are wrong. Not because the infrastructure is lacking. Because the organizational machinery that translates business intent into production systems is fundamentally broken.
Spec-driven development is our answer. It is not the only answer, and it is not yet a complete answer. But it is an answer backed by evidence, implemented in code, and open for the world to evaluate and improve.
The 85% problem is real. The solution space is open. And now you have the tools to contribute.
Let us build.
*Praveen Govindaraj, Creator of Neam, March 2026*
Key Takeaways #
- Six research frontiers stand between demonstration and production: runtime engines, human evaluation, open-source LLMs, multi-domain validation, distributed environments, and Agent.MD sharing
- Production runtime requires distributed agent execution, persistent state, and real-time event processing
- Human evaluation studies will quantify trust, collaboration, and onboarding for spec-driven development
- Multi-domain validation (healthcare, finance, manufacturing) will test generalizability beyond e-commerce
- The Agent.MD marketplace vision enables knowledge sharing across organizations
- The Spec-Driven Development movement rests on five principles: specs not prompts, agents within boundaries, quality gates not trust, traceability not observability, cost as a first-class constraint
- The entire ecosystem is open source and invites community contribution
For Further Exploration #
- DataSims Repository -- Clone, run, contribute
- Neam Language Documentation
- Neam Nightly Repository -- Latest development builds
- Chapter 00 -- Where it all started: the 85% problem