Chapter 28 — The Road Ahead: From Demonstration to Production #

"The best way to predict the future is to invent it." -- Alan Kay


📖 15 min read | 👤 All personas | 🏷️ Part VIII: Vision

What you'll learn:
  - Where the DataSims evidence stands today, and what it does not yet prove
  - Six research frontiers between demonstration and production
  - The principles behind the spec-driven development movement
  - How each persona can contribute to the open-source ecosystem


Where We Stand #

Over 27 chapters, we have built up a complete picture: the problem (85% failure rate), the architecture (14 agents + 1 orchestrator), the evidence (DataSims ablation study), and the results (93.7% cost reduction, 90.6% risk reduction, 100% reproducibility).

But honesty demands that we acknowledge what we have not yet proven. The DataSims environment is a simulation. The cost comparisons are modeled. The agents are orchestrated by LLMs, which means they inherit the strengths -- and the limitations -- of the underlying models.

This chapter maps the path from demonstration to production. It identifies six research frontiers where work is needed, and it invites the community to participate.


Frontier 1: Production Runtime Engines #

The DataSims experiments ran on a single machine with simulated data. Production deployments demand a different runtime:

| Dimension | Current State | Future State |
| --- | --- | --- |
| Infrastructure | Single machine | Distributed cluster |
| Data | Simulated data | Real production data |
| Scale | 5-agent crew | 14+ agents at scale |
| Duration | Minutes of runtime | Continuous operation |
| Cost | $23.50 per run | Enterprise-grade SLAs |

What Is Needed #

Distributed agent execution. In production, agents need to run across multiple machines, with fault tolerance and load balancing. The DIO must coordinate agents that may be in different data centers, operating on different schedules, with different latency characteristics.

Persistent state management. The current checkpoint system works for single-run experiments. Production systems need persistent state that survives process restarts, infrastructure failures, and version upgrades.

Real-time event processing. The current batch-oriented architecture processes complete tasks. Production data platforms generate continuous streams of events -- schema changes, data arrivals, drift alerts -- that agents must react to in real time.
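To illustrate the event-driven shift, here is a minimal sketch of agents subscribing to platform events. The `EventBus` class and the event kinds shown are hypothetical stand-ins, not part of the current DataSims codebase; a production system would likely sit on a broker such as Kafka rather than an in-process bus.

```python
from collections import defaultdict
from dataclasses import dataclass

@dataclass
class Event:
    """A platform event an agent may need to react to."""
    kind: str      # e.g. "schema_change", "data_arrival", "drift_alert"
    payload: dict

class EventBus:
    """Minimal in-process pub/sub; production would use Kafka or similar."""
    def __init__(self):
        self._handlers = defaultdict(list)

    def subscribe(self, kind, handler):
        self._handlers[kind].append(handler)

    def publish(self, event):
        # Deliver the event to every agent registered for this kind.
        for handler in self._handlers[event.kind]:
            handler(event)

# Example: a hypothetical drift-monitoring agent reacting in real time
bus = EventBus()
alerts = []
bus.subscribe("drift_alert", lambda e: alerts.append(e.payload["feature"]))
bus.publish(Event("drift_alert", {"feature": "avg_order_value", "psi": 0.31}))
print(alerts)  # -> ['avg_order_value']
```

The key design point is inversion of control: instead of agents polling for completed tasks in batches, the platform pushes events to whichever agents declared an interest.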

DIAGRAM Production Runtime Architecture
flowchart TB
  subgraph CP["CONTROL PLANE"]
    DIO["DIO Scheduler"]
    AR["Agent Registry"]
    SS["State Store"]
  end
  subgraph DP["DATA PLANE"]
    AN1["Agent Node 1"]
    AN2["Agent Node 2"]
    AN3["Agent Node 3"]
    ANN["Agent Node N"]
  end
  CP --> DP

Frontier 2: Human Evaluation Studies #

The DataSims experiments measure system performance against automated metrics. What they do not measure is how humans interact with the system:

Questions That Need Answers #
  - Do practitioners complete data lifecycle tasks faster with agents than without?
  - Does agent involvement improve or degrade the quality of the resulting work?
  - Do practitioners trust agent output, and does that trust grow or erode with use?
  - How does working alongside agents affect cognitive workload?

Study Design #

Proposed Human Evaluation Study

| Parameter | Detail |
| --- | --- |
| Participants | 30 data professionals (10 engineers, 10 scientists, 10 analysts) |
| Groups | A (manual workflow), B (Neam agents), C (Neam + human-in-loop) |
| Tasks | 3 data lifecycle tasks of increasing complexity |
| Measures | Completion time, quality score, trust survey, NASA-TLX workload |
| Duration | 2 weeks per participant |
| Outcome | Quantified human factors data for spec-driven development |

This study would produce the first empirical evidence of how spec-driven development affects human data practitioners -- not just the automated metrics, but the subjective experience of working alongside intelligent agents.


Frontier 3: Open-Source LLM Evaluation #

The current DataSims experiments use OpenAI models (GPT-4o, o3-mini). A critical research question: how does agent quality change with different LLM backbones?

Proposed Evaluation Matrix #

| LLM Provider | Model | Cost/1K tokens | Status |
| --- | --- | --- | --- |
| OpenAI | GPT-4o | $0.0025-0.01 | Tested |
| OpenAI | o3-mini | $0.0011-0.0044 | Tested |
| Anthropic | Claude 3.5 | $0.003-0.015 | Planned |
| Meta | Llama 3.1 405B | Self-hosted | Planned |
| Mistral | Large 2 | $0.002-0.006 | Planned |
| Google | Gemini 2.5 Pro | $0.00125-0.01 | Planned |
| Alibaba | Qwen 2.5 | Self-hosted | Planned |

Research Questions #

  1. Quality floor: What is the minimum LLM capability required for each agent type?
  2. Cost-quality tradeoff: Can smaller, cheaper models handle simpler tasks (e.g., Data-BA) while reserving expensive models for complex tasks (e.g., Causal)?
  3. Self-hosted viability: Can organizations run the entire agent stack on-premises with open-source models?
  4. Provider diversity: Does using multiple providers improve reliability through failover?

Neam's provider-agnostic architecture makes this evaluation straightforward -- change the provider and model fields in the agent declaration, re-run the experiments, and compare CES scores.
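A sketch of what that evaluation loop might look like: the declaration below mirrors the `provider` and `model` fields described above, but the dictionary shape and the `with_backbone` helper are hypothetical, not Neam's actual API.

```python
# Hypothetical agent declaration: only `provider` and `model` change per run.
BASE_AGENT = {
    "name": "data-ba",
    "role": "Requirements analysis",
    "provider": "openai",
    "model": "gpt-4o",
}

# Candidate backbones drawn from the evaluation matrix above.
CANDIDATES = [
    ("openai", "gpt-4o"),
    ("anthropic", "claude-3-5-sonnet"),
    ("meta", "llama-3.1-405b"),
]

def with_backbone(agent, provider, model):
    """Return a copy of the declaration pointing at a different LLM backbone."""
    return {**agent, "provider": provider, "model": model}

# One declaration per backbone; each would be run through the experiments
# and its CES score recorded for comparison.
matrix = [with_backbone(BASE_AGENT, p, m) for p, m in CANDIDATES]
for decl in matrix:
    print(decl["provider"], decl["model"])
```

Because the rest of the declaration is untouched, any CES difference between runs can be attributed to the backbone rather than to the spec.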


Frontier 4: Multi-Domain Validation #

SimShop, the simulated company at the heart of DataSims, is an e-commerce business. The question remains: does spec-driven development generalize to other domains?

Proposed Domains #

| Domain | Characteristics | Key Challenges |
| --- | --- | --- |
| Healthcare | HIPAA compliance, patient data, clinical models | Regulatory strictness, data sensitivity |
| Finance | Real-time trading data, risk models, SOX compliance | Latency requirements, model explainability |
| Manufacturing | IoT sensor data, predictive maintenance, supply chain | Time-series at scale, edge computing |
| Government | Census data, policy analysis, public accountability | Data sovereignty, transparency requirements |
| Telecommunications | Network logs, customer behavior, 5G data volumes | Volume scale, real-time processing |

What Multi-Domain Validation Requires #

For each domain, we need:

  1. A domain-specific DataSims environment -- equivalent to SimShop but with healthcare, finance, or manufacturing data schemas
  2. Domain-specific Agent.MD files -- encoding the institutional knowledge of each industry
  3. Domain-specific quality issues -- healthcare has missing diagnoses; finance has late trade confirmations; manufacturing has sensor drift
  4. Domain-specific compliance requirements -- HIPAA, SOX, ISO 27001, GDPR variants
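Item 3 can be made concrete with a small sketch: a quality-issue injector that corrupts a fraction of rows with each domain's characteristic defect. The `QUALITY_ISSUES` registry and `inject` helper are illustrative assumptions, not part of the DataSims codebase.

```python
import random

# Hypothetical registry mapping each domain to its characteristic defect.
QUALITY_ISSUES = {
    "healthcare": lambda row: {**row, "diagnosis": None},  # missing diagnoses
    "finance": lambda row: {**row, "confirmed_at": row["trade_at"] + 86_400},  # late confirmations
    "manufacturing": lambda row: {**row, "sensor_value": row["sensor_value"] * 1.05},  # sensor drift
}

def inject(domain, rows, rate, seed=0):
    """Corrupt a fraction `rate` of rows with the domain's characteristic issue."""
    rng = random.Random(seed)  # seeded for reproducible experiments
    corrupt = QUALITY_ISSUES[domain]
    return [corrupt(r) if rng.random() < rate else r for r in rows]

rows = [{"diagnosis": "I10"}, {"diagnosis": "E11"}]
dirty = inject("healthcare", rows, rate=1.0)
print(dirty)  # every diagnosis dropped at rate=1.0
```

Keeping the injector seeded preserves the reproducibility property that the single-domain DataSims experiments already rely on.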
DIAGRAM Domain Validation Roadmap
flowchart LR
  A["2026 Q2\nSimShop\n(e-commerce)\nCOMPLETE"] --> B["2026 Q3\nFinSim\n(financial services)\nIN DESIGN"]
  B --> C["2026 Q4\nHealthSim\n(healthcare)\nPLANNED"]
  C --> D["2027 Q1\nMfgSim\n(manufacturing)\nPLANNED"]
  D --> E["2027 Q2\nCross-domain\ncomparison\nPLANNED"]

Frontier 5: Distributed DataSims #

The current DataSims runs on a single Docker Compose stack. Production enterprises have distributed data architectures:

DIAGRAM Current vs Future DataSims Architecture
flowchart TB
  subgraph CURRENT["CURRENT: Single-Node DataSims"]
    DC["Docker Compose (1 machine)\nAll 10 services co-located"]
  end

  subgraph FUTURE["FUTURE: Distributed DataSims"]
    subgraph US["Region: US"]
      US_PG["PostgreSQL"]
      US_ML["MLflow"]
      US_AG["Agents"]
    end
    subgraph EU["Region: EU"]
      EU_PG["PostgreSQL"]
      EU_ML["MLflow"]
      EU_AG["Agents"]
    end
    subgraph APAC["Region: APAC"]
      APAC_PG["PostgreSQL"]
      APAC_ML["MLflow"]
      APAC_AG["Agents"]
    end
    GDIO["Global DIO Coordination"]
    US --> GDIO
    EU --> GDIO
    APAC --> GDIO
  end

Research Questions for Distributed DataSims #
  - How should the global DIO coordinate agents across regions with different latency characteristics and schedules?
  - How are data-sovereignty constraints (e.g., EU data remaining in the EU) enforced when agents span regions?
  - How gracefully does the system degrade when a region or its state store becomes unreachable?


Frontier 6: Cross-Organization Agent.MD Sharing #

Agent.MD files encode domain knowledge. Currently, each organization writes its own. But much of this knowledge is not proprietary -- it is industry best practice:

Agent.MD Knowledge Categories

| Category | Examples |
| --- | --- |
| Organization-Specific | Table schemas, column naming conventions, business rules, data quality thresholds |
| Industry-Shared | Common churn prediction approaches, standard feature engineering patterns, regulatory compliance checklists, deployment best practices |
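One way this split could work in practice is layered composition: an organization starts from a shared industry template and overrides only the organization-specific sections. The sketch below is hypothetical; Agent.MD files are markdown, and the dictionaries here stand in for their parsed sections.

```python
# Shared industry template (illustrative content, not a real marketplace entry).
INDUSTRY_SHARED = {
    "churn_definition": "No purchase within 90 days",
    "feature_patterns": "RFM features, rolling 30/60/90-day windows",
    "compliance": "GDPR data-minimisation checklist",
}

# Organization-specific sections: schemas, conventions, overridden rules.
ORG_SPECIFIC = {
    "churn_definition": "No purchase within 60 days",  # org overrides the default
    "schema": "orders(order_id, customer_id, placed_at, total)",
}

def compose_agent_md(shared, local):
    """Local (organization-specific) sections override shared industry ones."""
    return {**shared, **local}

spec = compose_agent_md(INDUSTRY_SHARED, ORG_SPECIFIC)
print(spec["churn_definition"])  # -> No purchase within 60 days
```

The override order matters: proprietary knowledge always wins over the shared baseline, so pulling an updated marketplace template never silently changes an organization's own rules.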

The Vision: Agent.MD Marketplace #

DIAGRAM Agent.MD Marketplace
flowchart TB
  subgraph MP["Agent.MD Marketplace"]
    subgraph EC["E-Commerce"]
      EC1["Churn prediction"]
      EC2["Rec engine"]
      EC3["LTV model"]
    end
    subgraph HC["Healthcare"]
      HC1["Patient risk"]
      HC2["Readmission"]
      HC3["Drug interaction"]
    end
    subgraph FN["Finance"]
      FN1["Fraud detection"]
      FN2["Credit risk"]
      FN3["AML"]
    end
    subgraph MF["Manufacturing"]
      MF1["Predictive maintenance"]
      MF2["Quality control"]
    end
  end
  MP --- NOTE["Community-contributed, peer-reviewed,\nversion-controlled domain knowledge"]

An Agent.MD marketplace would accelerate adoption by letting new organizations start with proven, peer-reviewed domain knowledge rather than encoding everything from scratch. Organizations would contribute back their refinements, creating a positive feedback loop.


The Spec-Driven Development Movement #

This book has argued for a specific thesis: the bottleneck in data engineering is not code generation but understanding what to build, why, and how to validate it. Spec-driven development addresses this bottleneck by encoding human expertise in structured, machine-readable specifications that agents execute within defined boundaries.

This is not just a technology choice. It is a philosophical position about the relationship between humans and AI agents in the data lifecycle:

The Spec-Driven Development Principles
  1. SPECIFICATIONS, NOT PROMPTS
    Human expertise is encoded in structured specs, not ad-hoc prompts. Specs are versioned, reviewed, and auditable.
  2. AGENTS WITHIN BOUNDARIES
    Agents execute within the spec's boundaries. They can be creative within those boundaries but cannot exceed them without escalation.
  3. QUALITY GATES, NOT TRUST
    We do not trust agents to produce correct output. We verify. Every phase has quality gates that must be passed.
  4. TRACEABILITY, NOT OBSERVABILITY
    It is not enough to observe what agents do. We must trace WHY they did it, back to the requirement that justified the action.
  5. COST AS A FIRST-CLASS CONSTRAINT
    Budget is not an afterthought. It is an architectural constraint that shapes agent behavior and prevents runaway spending.

What Success Looks Like #

If spec-driven development succeeds, the data industry will look different in five years:

  - ML project failure rates fall well below today's 85% baseline
  - Specifications, not ad-hoc prompts, become the standard interface between human expertise and AI agents
  - Cost, risk, and reproducibility gains like those measured in DataSims hold at production scale, across domains

These are ambitious claims. The DataSims results provide early evidence. The research frontiers outlined in this chapter describe the work needed to validate them at scale.


How You Can Contribute #

The Neam ecosystem is open source. Here is how different personas can contribute:

| Persona | Contribution |
| --- | --- |
| Data Engineers | Build domain-specific DataSims environments for new industries |
| Data Scientists | Evaluate the agent stack on novel problem types beyond churn prediction |
| Researchers | Design and conduct human evaluation studies |
| ML Engineers | Test open-source LLM backbones and contribute performance benchmarks |
| Business Analysts | Create Agent.MD templates for common business domains |
| VP/Directors | Pilot spec-driven development in your organization and share results |

Getting Started #

BASH
# Clone the DataSims repository
git clone https://github.com/neam-lang/Data-Sims.git

# Set up the environment
cd Data-Sims
./scripts/setup.sh small

# Run the experiments
python3 evaluation/run_experiments.py

# Read the results
cat evaluation/reports/experiment_report.md

Every contribution -- a new ablation experiment, a domain-specific Agent.MD, a human evaluation study, an open-source LLM benchmark -- brings us closer to solving the 85% problem.


A Final Thought #

We started this book with a number: 85%. Eighty-five percent of machine learning projects fail to reach production. Not because the algorithms are wrong. Not because the infrastructure is lacking. Because the organizational machinery that translates business intent into production systems is fundamentally broken.

Spec-driven development is our answer. It is not the only answer, and it is not yet a complete answer. But it is an answer backed by evidence, implemented in code, and open for the world to evaluate and improve.

The 85% problem is real. The solution space is open. And now you have the tools to contribute.

Let us build.


Praveen Govindaraj
Creator of Neam
March 2026


Key Takeaways #

  - The DataSims results are a demonstration, not yet a production validation: the environment is simulated and the cost comparisons are modeled
  - Six frontiers separate demonstration from production: production runtime engines, human evaluation studies, open-source LLM evaluation, multi-domain validation, distributed DataSims, and cross-organization Agent.MD sharing
  - Spec-driven development rests on five principles: specifications not prompts, agents within boundaries, quality gates not trust, traceability not observability, and cost as a first-class constraint
  - The Neam ecosystem is open source, and every persona has a concrete way to contribute

For Further Exploration #