Chapter 25 — DataSims: A Simulated Enterprise for Agent Evaluation #

"All models are wrong, but some are useful." -- George Box


📖 25 min read | 👤 Dr. Chen (Researcher), David (VP Data), All personas | 🏷️ Part VII: Proof

What you'll learn:


The Problem: How Do You Test an Orchestra? #

Dr. Chen, the researcher, faces a fundamental challenge. She wants to evaluate the Neam agent stack -- but on what? Production data is confidential, messy, and unreproducible. Toy datasets are too simple to stress-test multi-agent coordination. Academic benchmarks test individual capabilities, not lifecycle orchestration.

She needs something that does not exist in the literature: a complete, realistic, controlled, reproducible enterprise data environment where she can run experiments, ablate components, and measure outcomes with statistical rigor.

DataSims is that environment.


What Is DataSims? #

DataSims is a fully containerized simulation of a mid-size e-commerce company called SimShop. It provides everything a real enterprise data platform would have -- databases, ETL pipelines, ML infrastructure, metadata governance, monitoring -- but in a controlled environment where every variable is known and every experiment is reproducible.

DataSims at a Glance

| Property | Value |
|---|---|
| Company | SimShop (simulated e-commerce) |
| Revenue | $50M annual |
| Customers | 100K registered, 30K active monthly |
| Products | 10K SKUs across 50 categories |
| Data Period | Jan 2024 - Dec 2025 (24 months) |
| Channels | Web (70%), Mobile (25%), App (5%) |
| Markets | US (60%), EU (25%), APAC (15%) |
| Database | 164 tables across 12 schemas |
| ETL Pipelines | 15 scheduled jobs |
| Quality Issues | 10 controlled injections |
| Docker Services | 10 containers |
| Repository | https://github.com/neam-lang/Data-Sims |
Architecture Diagram

Database Schema Architecture #

The SimShop database is built on PostgreSQL 16 with 12 schemas representing different functional areas of a modern data platform:

DIAGRAM SimShop Database Schema Architecture
flowchart TB
  ROOT["simshop (PostgreSQL 16)"]

  OLTP["simshop_oltp (20 tables)\nSource transactional system"]
  OLTP_ITEMS["customers, customer_addresses\nproducts, product_categories, inventory\norders, order_items, order_returns\npayments, coupons\nevents (clickstream)\nsupport_tickets\ncampaigns, campaign_sends\nproduct_reviews\nsuppliers, wishlists"]

  STAGING["simshop_staging (intermediate)\nCleaned, validated"]

  DW["simshop_dw (15 tables)\nStar schema warehouse"]
  DW_ITEMS["dim_date, dim_customers (SCD2), dim_products\ndim_channels, dim_geography, dim_campaigns\nfact_orders, fact_customer_activity\nfact_campaign_performance, fact_support\nfact_inventory_snapshot\nagg_daily_revenue, agg_monthly_segment, agg_product_performance"]

  MLF["ml_features (3 tables)\nFeature store"]
  MLF_ITEMS["churn_features (47 features)\nrecommendation_features\nltv_features"]

  MLP["ml_predictions (2 tables)\nModel outputs"]
  MLP_ITEMS["churn_scores (with SHAP drivers)\nrecommendation_scores"]

  MLM["ml_monitoring (2 tables)\nDrift & performance"]
  MLM_ITEMS["drift_checks\nmodel_performance"]

  DC["data_catalog (6 tables)\nUnity Catalog simulation"]
  DC_ITEMS["schemas, tables, columns\nlineage, glossary, access_policies"]

  DQ["data_quality (2 tables)\nQuality checks"]
  DQ_ITEMS["check_results\nprofiling_results"]

  OP["operational (3 tables)\nPipeline metadata"]
  OP_ITEMS["pipeline_definitions (15 ETL jobs)\npipeline_runs\nalerts"]

  AU["audit (2 tables)\nCompliance trail"]
  AU_ITEMS["data_access_log\nchange_log"]

  ROOT --> OLTP --> OLTP_ITEMS
  ROOT --> STAGING
  ROOT --> DW --> DW_ITEMS
  ROOT --> MLF --> MLF_ITEMS
  ROOT --> MLP --> MLP_ITEMS
  ROOT --> MLM --> MLM_ITEMS
  ROOT --> DC --> DC_ITEMS
  ROOT --> DQ --> DQ_ITEMS
  ROOT --> OP --> OP_ITEMS
  ROOT --> AU --> AU_ITEMS

Table Count by Schema #

| Schema | Tables | Purpose |
|---|---|---|
| simshop_oltp | 20 | Source transactional data |
| simshop_staging | ~20 | Cleaned intermediate tables |
| simshop_dw | 15 | Star schema warehouse |
| ml_features | 3 | Feature store |
| ml_predictions | 2 | Model outputs |
| ml_monitoring | 2 | Drift and performance tracking |
| data_catalog | 6 | Metadata governance |
| data_quality | 2 | Quality check results |
| operational | 3 | Pipeline definitions and runs |
| audit | 2 | Compliance trail |
| **Total** | **~164** | Complete data platform |

💡 Why 164 tables? This matches the scale of a real mid-size enterprise data platform. Academic benchmarks typically use 5-20 tables. Production enterprises use 500-5000. DataSims sits at the complexity level where multi-agent coordination is necessary but experiments remain tractable.


ETL Pipeline Catalog #

SimShop has 15 ETL pipelines that mirror real enterprise scheduling:

| Time | Pipeline | Source → Target |
|---|---|---|
| 2 AM | raw_to_staging_customers | OLTP → Staging |
| 2 AM | raw_to_staging_orders | OLTP → Staging |
| 3 AM | raw_to_staging_events | OLTP → Staging |
| 4 AM | staging_to_dim_customers (SCD2) | Staging → DW |
| 4 AM | staging_to_dim_products | Staging → DW |
| 5 AM | staging_to_fact_orders | Staging → DW |
| 5 AM | staging_to_fact_activity | Staging → DW |
| 6 AM | dw_to_churn_features | DW → ML Features |
| 6 AM | dw_to_rec_features | DW → ML Features |
| 6 AM | dw_to_ltv_features | DW → ML Features |
| 7 AM | churn_model_scoring | Features → Predictions |
| 8 AM | daily_revenue_agg | DW → Reports |
| 1st of month | monthly_segment_agg | DW → Reports |
| 9 AM | data_quality_checks | DW → Quality |
| 10 AM | drift_detection | Predictions → Monitoring |
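The staging_to_dim_customers job maintains customer history as a Slowly Changing Dimension Type 2 table. A minimal sketch of the close-and-insert pattern, in plain Python with hypothetical column names (the actual job runs in SQL against dim_customers):

```python
from datetime import date

def scd2_upsert(dim_rows, incoming, today=None):
    """Apply an SCD Type 2 update: when a tracked attribute changes,
    close the current row and append a new current row.

    dim_rows -- list of dicts with keys: customer_id, segment,
                valid_from, valid_to, is_current
    incoming -- dict with keys: customer_id, segment
    """
    today = today or date.today()
    for row in dim_rows:
        if row["customer_id"] == incoming["customer_id"] and row["is_current"]:
            if row["segment"] == incoming["segment"]:
                return dim_rows          # no change: keep history as-is
            row["valid_to"] = today      # expire the old version
            row["is_current"] = False
            break
    dim_rows.append({
        "customer_id": incoming["customer_id"],
        "segment": incoming["segment"],
        "valid_from": today,
        "valid_to": None,                # open-ended current row
        "is_current": True,
    })
    return dim_rows
```

The key property for downstream analysis is that old rows are never overwritten, so queries can reconstruct what a customer's segment was on any historical date.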

The pipeline dependencies form a DAG:

DIAGRAM ETL Pipeline Dependency DAG
flowchart LR
  A["OLTP"] --> B["Staging"]
  B --> C["DW"]
  C --> D["Features"]
  D --> E["Predictions"]
  E --> F["Monitoring"]
  C --> G["Reports"]
  C --> H["Quality Checks"]
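The same dependency structure can be expressed as an adjacency map and scheduled with a topological sort. A sketch using Python's standard graphlib, with illustrative stage names matching the diagram:

```python
from graphlib import TopologicalSorter

# Each stage maps to the set of stages it depends on,
# mirroring the DAG above (names are illustrative).
deps = {
    "staging":     {"oltp"},
    "dw":          {"staging"},
    "features":    {"dw"},
    "predictions": {"features"},
    "monitoring":  {"predictions"},
    "reports":     {"dw"},
    "quality":     {"dw"},
}

# static_order() yields stages so that every dependency
# runs before the stages that consume its output.
order = list(TopologicalSorter(deps).static_order())
```

A real scheduler would additionally run independent branches (reports, quality checks, features) in parallel once the DW stage completes, which is exactly what the staggered 2 AM-10 AM timetable above approximates.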

The 10 Controlled Data Quality Issues #

DataSims intentionally injects quality issues at known rates and times. This is the key differentiator from real production data: every issue is controlled, measurable, and reproducible.

DIAGRAM Quality Issue Timeline
gantt
  title Quality Issue Injection Phases (Months 1-24)
  dateFormat X
  axisFormat %s

  section Phases
  Phase 1 - Baseline (Clean data, No issues)           :done, p1, 1, 12
  Phase 2 - Degradation (Nulls, dupes, late data)      :active, p2, 13, 18
  Phase 3+4 - Schema drift, concept drift              :p3, 19, 24

| # | Phase | Issue | Rate/Details | Agent Challenge |
|---|---|---|---|---|
| 1 | 2 (Mo 13-18) | Null values in non-key columns | 3% of rows | Test imputation handling |
| 2 | 2 (Mo 13-18) | Duplicate records | 1% of rows | Test deduplication logic |
| 3 | 2 (Mo 13-18) | Late-arriving data | 2% arrive 24-48h late | Test temporal consistency |
| 4 | 3 (Mo 18) | Column added to products | product_weight_kg | Test schema drift detection |
| 5 | 3 (Mo 18) | Column renamed in events | event_type to action_type | Test schema change handling |
| 6 | 4 (Mo 20-22) | Feature distribution shift | Spending patterns change | Test data drift detection |
| 7 | 4 (Mo 22) | Target pattern change | Churn rate shifts | Test concept drift detection |
| 8 | 5 (Mo 12) | Black Friday volume spike | 5x normal volume | Test scalability |
| 9 | 5 (Mo 15) | Events source outage | 48-hour gap | Test pipeline resilience |
| 10 | All | Bot traffic in events | ~5% of events | Test filtering logic |

🎯 Why controlled issues matter: In production, data quality issues are discovered after they cause damage. In DataSims, issues are injected at known rates so we can measure exactly how well the agent stack detects, classifies, and handles each type of issue.
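The injections are reproducible because every corruption pass is seeded. A toy sketch of the null-injection pattern (my illustration; the function and column names are hypothetical, not the DataSims generator API):

```python
import random

def inject_nulls(rows, columns, rate=0.03, seed=42):
    """Null out the given non-key columns at a fixed rate.

    Using a fixed seed makes the corruption exactly reproducible:
    every run nulls the same cells, so detection metrics are
    comparable across experiments.
    """
    rng = random.Random(seed)
    for row in rows:
        for col in columns:
            if rng.random() < rate:
                row[col] = None
    return rows
```

Because the true injection rate is known (here 3%), an agent's reported null rate can be scored against ground truth rather than eyeballed.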


Docker Architecture #

DataSims runs as 10 Docker containers orchestrated by Docker Compose:

DIAGRAM DataSims Docker Environment
flowchart TB
  subgraph ENV["DataSims Docker Environment"]
    direction TB
    subgraph ROW1[" "]
      direction LR
      PG["PostgreSQL 16\n164 tables\n12 schemas\n:5432"]
      ML["MLflow 2.x\nTracking +\nModel Registry\n:5000"]
      UC["Unity Catalog\nMetadata +\nLineage +\nGovernance\n:8070"]
    end
    subgraph ROW2[" "]
      direction LR
      EV["Evidently\nML Monitor\nData Quality\n:8000"]
      MS["Model Serving\nFlask API\nPredictions\n:8080"]
      DG["Data Gen\nSynthetic\nData Engine\n(batch)"]
    end
    subgraph ROW3[" "]
      direction LR
      JL["Jupyter Lab\nNotebooks\n:8888"]
      PR["Prometheus\nMetrics\n:9090"]
      GR["Grafana\nDashboards\n:3000"]
    end
    subgraph AGENTS["Neam Agents (external, connect to Docker services)"]
      AG["DIO → Data-BA → DS → Causal → DataTest → MLOps"]
    end
  end

Service Details #

| Service | Image | Port | Purpose |
|---|---|---|---|
| PostgreSQL | postgres:16-alpine | 5432 | All data storage (164 tables, 12 schemas) |
| MLflow | ghcr.io/mlflow/mlflow:v2.18.0 | 5000 | Experiment tracking, model registry |
| Unity Catalog | unitycatalog/unitycatalog:0.2.1 | 8070 | Metadata, lineage, governance |
| Evidently | evidentlyai/evidently-ui:latest | 8000 | ML monitoring, data quality dashboards |
| Model Serving | Custom (Flask) | 8080 | Prediction API endpoints |
| Data Generator | Custom (Python) | -- | Synthetic data generation |
| Jupyter Lab | jupyter/datascience-notebook | 8888 | Interactive analysis |
| Prometheus | prom/prometheus:latest | 9090 | Infrastructure metrics |
| Grafana | grafana/grafana:latest | 3000 | Dashboards and visualization |

Resource Requirements #

| Resource | Minimum | Recommended |
|---|---|---|
| Docker | 24.0+ | Latest stable |
| Disk | 10 GB | 20 GB (for large scale) |
| RAM | 8 GB | 16 GB |
| CPU | 4 cores | 8 cores |

Setting Up DataSims #

Quick Start (5 minutes) #

BASH
# 1. Clone the repository
git clone https://github.com/neam-lang/Data-Sims.git
cd Data-Sims

# 2. Run the automated setup script
./scripts/setup.sh small     # small / medium / large

# 3. Verify services are running
cd docker
docker compose ps

Manual Setup #

BASH
# 1. Start all 10 services
cd Data-Sims/docker
docker compose up -d --build

# 2. Wait for PostgreSQL health check
echo "Waiting for PostgreSQL..."
until docker compose exec -T postgres pg_isready -U datasims -d simshop 2>/dev/null
do sleep 2; done
echo "Ready!"

# 3. Generate synthetic data
docker compose exec datagen python /app/generators/generate_all.py --scale small

Data Scale Options #

| Scale | Customers | Products | Orders | Events | Setup Time |
|---|---|---|---|---|---|
| small | 10K | 1K | 200K | 5M | ~2 min |
| medium | 100K | 10K | 2M | 50M | ~15 min |
| large | 1M | 50K | 20M | 500M | ~2 hours |

⚠️ Start with small for development and testing. The medium scale is used for the DataSims experiments cited throughout this book. The large scale is for production-scale benchmarking.

Verify the Environment #

BASH
# Check table counts
docker compose exec -T postgres psql -U datasims -d simshop -c "
  SELECT schemaname, COUNT(*) as table_count
  FROM pg_tables
  WHERE schemaname LIKE 'simshop%'
     OR schemaname IN ('ml_features','ml_predictions',
        'ml_monitoring','data_catalog','data_quality',
        'operational','audit')
  GROUP BY schemaname
  ORDER BY schemaname;
"

# Test the Model API
curl http://localhost:8080/health

# Test MLflow
curl -s http://localhost:5000/api/2.0/mlflow/experiments/search

Running Neam Agents Against DataSims #

Once the environment is running, point the Neam agents at it:

BASH
# Set environment variables
export SIMSHOP_PG_URL="postgresql://datasims:datasims_2026@localhost:5432/simshop"
export MLFLOW_TRACKING_URI="http://localhost:5000"
export UNITY_CATALOG_URI="http://localhost:8070"
export EVIDENTLY_URI="http://localhost:8000"

# Compile and run the churn prediction orchestration
cd Data-Sims/neam-agents
neamc programs/simshop_churn.neam -o /tmp/simshop_churn.neamb
neam /tmp/simshop_churn.neamb

What to Check After a Run #

| What | Where | Command/URL |
|---|---|---|
| Trained models | MLflow | http://localhost:5000 |
| Feature tables | PostgreSQL | SELECT * FROM ml_features.churn_features LIMIT 10; |
| Predictions | PostgreSQL | SELECT * FROM ml_predictions.churn_scores LIMIT 20; |
| Data quality | Evidently | http://localhost:8000 |
| Drift checks | PostgreSQL | SELECT * FROM ml_monitoring.drift_checks; |
| Pipeline runs | PostgreSQL | SELECT * FROM operational.pipeline_runs; |
| Lineage | Unity Catalog | http://localhost:8070 |
| API predictions | Model API | curl http://localhost:8080/v1/churn/predict -d '{"customer_id": 42}' |
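The ml_monitoring.drift_checks table is populated by the 10 AM drift_detection job. One common statistic for this kind of check is the Population Stability Index; the sketch below is my illustration of it, not necessarily the method the job actually uses:

```python
import math

def psi(expected, actual, bins=10):
    """Population Stability Index between two numeric samples.

    Rule of thumb: < 0.1 stable, 0.1-0.25 moderate shift,
    > 0.25 significant drift.
    """
    lo = min(min(expected), min(actual))
    hi = max(max(expected), max(actual))
    width = (hi - lo) / bins or 1.0

    def frac(sample, b):
        # Fraction of the sample falling into bin b, floored to
        # avoid log(0) for empty bins.
        inside = sum(1 for x in sample
                     if lo + b * width <= x < lo + (b + 1) * width
                     or (b == bins - 1 and x == hi))
        return max(inside / len(sample), 1e-6)

    return sum((frac(actual, b) - frac(expected, b))
               * math.log(frac(actual, b) / frac(expected, b))
               for b in range(bins))
```

Run against the Phase 4 injection window (months 20-22), a statistic like this should flare above its threshold exactly when the spending-pattern shift is injected, which is what makes the detection measurable.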

The Five Problem Statements #

DataSims defines five problem statements of varying complexity:

| # | Problem | Complexity | Key Agents | Weight |
|---|---|---|---|---|
| 1 | Customer Churn Prediction | High | Data-BA, ETL, DS, Causal, DataTest, MLOps | 25% |
| 2 | Product Recommendation | Medium | Data-BA, ETL, DS, DataTest | 20% |
| 3 | Revenue Anomaly Root Cause Analysis | Medium | Analyst, Causal, DataScientist | 20% |
| 4 | Pipeline Failure Investigation | Medium | DataOps, Causal, ETL | 20% |
| 5 | GDPR Compliance Audit | Low | Governance, DataTest, Data-BA | 15% |

The churn prediction problem (Problem 1) is the primary evaluation task used throughout this book. Chapters 26 and 27 walk through it in complete detail.


The 7-Dimension Evaluation Framework #

Every experiment is scored across 7 dimensions:

| Dimension | Weight | What It Measures |
|---|---|---|
| Speed | 20% | Time to completion |
| Quality | 25% | Output quality scores (0-100) |
| Reliability | 15% | Error detection, recovery time |
| Traceability | 15% | Requirements → tests coverage |
| Documentation | 10% | BRD, specs completeness |
| Cost Efficiency | 10% | Compute + LLM cost |
| Adaptability | 5% | Response to quality issues |

These dimensions combine into the Composite Effectiveness Score (CES), a single number from 0 to 1 that captures overall system effectiveness.
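Assuming each dimension is normalized to [0, 1] before weighting, the combination is a straightforward weighted sum. A minimal sketch:

```python
# Weights from the 7-dimension framework (they sum to 1.0).
WEIGHTS = {
    "speed": 0.20, "quality": 0.25, "reliability": 0.15,
    "traceability": 0.15, "documentation": 0.10,
    "cost_efficiency": 0.10, "adaptability": 0.05,
}

def ces(scores):
    """Composite Effectiveness Score: weighted sum of per-dimension
    scores, each already normalized to the [0, 1] range."""
    missing = set(WEIGHTS) - set(scores)
    if missing:
        raise ValueError(f"missing dimensions: {sorted(missing)}")
    return sum(WEIGHTS[d] * scores[d] for d in WEIGHTS)
```

The per-dimension normalization (e.g., mapping completion time or 0-100 quality scores onto [0, 1]) is where the real methodology lives; the sketch only shows the final weighted aggregation.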


Reproducibility Guarantee #

DataSims achieves 100% reproducibility across 50 runs (10 conditions x 5 repetitions):

| Property | Value |
|---|---|
| Total runs | 50 |
| Total successes | 50 |
| Success rate | 100% |
| Platform | Darwin arm64 |
| Neam version | 0.8.0 |

The deterministic design ensures that every researcher running python3 evaluation/run_experiments.py on the same DataSims environment will produce identical results.
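That determinism rests on seeding every source of randomness in the data generator. A toy illustration of the pattern (hypothetical function, not the actual generator code):

```python
import random

def generate_orders(n, seed=2026):
    """Deterministic synthetic order amounts.

    A fixed seed makes the output byte-identical across runs and
    machines, which is what makes re-running the evaluation suite
    produce the same numbers.
    """
    rng = random.Random(seed)  # isolated RNG; never use the global one
    return [round(rng.uniform(5.0, 500.0), 2) for _ in range(n)]
```

The same discipline applies to sampling quality-issue injection points and train/test splits: any unseeded call anywhere in the pipeline would silently break the reproducibility guarantee.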

🎯 Reproducibility is non-negotiable for scientific claims. Every number cited in this book can be independently verified by cloning the DataSims repository and running the evaluation suite.


Key Takeaways #

For Further Exploration #