Chapter 25 — DataSims: A Simulated Enterprise for Agent Evaluation #
"All models are wrong, but some are useful." -- George Box
📖 25 min read | 👤 Dr. Chen (Researcher), David (VP Data), All personas | 🏷️ Part VII: Proof
What you'll learn:
- Why a simulated enterprise is necessary for rigorous agent evaluation
- The SimShop platform: 164 tables, 12 schemas, 15 ETL pipelines
- The 10 controlled data quality issues and why each exists
- The Docker architecture: 10 services working together
- How to set up the environment and run your own experiments
The Problem: How Do You Test an Orchestra? #
Dr. Chen, the researcher, faces a fundamental challenge. She wants to evaluate the Neam agent stack -- but on what? Production data is confidential, messy, and unreproducible. Toy datasets are too simple to stress-test multi-agent coordination. Academic benchmarks test individual capabilities, not lifecycle orchestration.
She needs something that does not exist in the literature: a complete, realistic, controlled, reproducible enterprise data environment where she can run experiments, ablate components, and measure outcomes with statistical rigor.
DataSims is that environment.
What Is DataSims? #
DataSims is a fully containerized simulation of a mid-size e-commerce company called SimShop. It provides everything a real enterprise data platform would have -- databases, ETL pipelines, ML infrastructure, metadata governance, monitoring -- but in a controlled environment where every variable is known and every experiment is reproducible.
| Attribute | Value |
|---|---|
| Company | SimShop (simulated e-commerce) |
| Revenue | $50M annual |
| Customers | 100K registered, 30K active monthly |
| Products | 10K SKUs across 50 categories |
| Data Period | Jan 2024 - Dec 2025 (24 months) |
| Channels | Web (70%), Mobile (25%), App (5%) |
| Markets | US (60%), EU (25%), APAC (15%) |
| Database | 164 tables across 12 schemas |
| ETL Pipelines | 15 scheduled jobs |
| Quality Issues | 10 controlled injections |
| Docker Services | 10 containers |
| Repository | https://github.com/neam-lang/Data-Sims |
Database Schema Architecture #
The SimShop database is built on PostgreSQL 16 with 12 schemas representing different functional areas of a modern data platform:
flowchart TB
    ROOT["simshop (PostgreSQL 16)"]
    OLTP["simshop_oltp (20 tables)\nSource transactional system"]
    OLTP_ITEMS["customers, customer_addresses\nproducts, product_categories, inventory\norders, order_items, order_returns\npayments, coupons\nevents (clickstream)\nsupport_tickets\ncampaigns, campaign_sends\nproduct_reviews\nsuppliers, wishlists"]
    STAGING["simshop_staging (intermediate)\nCleaned, validated"]
    DW["simshop_dw (15 tables)\nStar schema warehouse"]
    DW_ITEMS["dim_date, dim_customers (SCD2), dim_products\ndim_channels, dim_geography, dim_campaigns\nfact_orders, fact_customer_activity\nfact_campaign_performance, fact_support\nfact_inventory_snapshot\nagg_daily_revenue, agg_monthly_segment, agg_product_performance"]
    MLF["ml_features (3 tables)\nFeature store"]
    MLF_ITEMS["churn_features (47 features)\nrecommendation_features\nltv_features"]
    MLP["ml_predictions (2 tables)\nModel outputs"]
    MLP_ITEMS["churn_scores (with SHAP drivers)\nrecommendation_scores"]
    MLM["ml_monitoring (2 tables)\nDrift & performance"]
    MLM_ITEMS["drift_checks\nmodel_performance"]
    DC["data_catalog (6 tables)\nUnity Catalog simulation"]
    DC_ITEMS["schemas, tables, columns\nlineage, glossary, access_policies"]
    DQ["data_quality (2 tables)\nQuality checks"]
    DQ_ITEMS["check_results\nprofiling_results"]
    OP["operational (3 tables)\nPipeline metadata"]
    OP_ITEMS["pipeline_definitions (15 ETL jobs)\npipeline_runs\nalerts"]
    AU["audit (2 tables)\nCompliance trail"]
    AU_ITEMS["data_access_log\nchange_log"]
    ROOT --> OLTP --> OLTP_ITEMS
    ROOT --> STAGING
    ROOT --> DW --> DW_ITEMS
    ROOT --> MLF --> MLF_ITEMS
    ROOT --> MLP --> MLP_ITEMS
    ROOT --> MLM --> MLM_ITEMS
    ROOT --> DC --> DC_ITEMS
    ROOT --> DQ --> DQ_ITEMS
    ROOT --> OP --> OP_ITEMS
    ROOT --> AU --> AU_ITEMS
Table Count by Schema #
| Schema | Tables | Purpose |
|---|---|---|
| simshop_oltp | 20 | Source transactional data |
| simshop_staging | ~20 | Cleaned intermediate tables |
| simshop_dw | 15 | Star schema warehouse |
| ml_features | 3 | Feature store |
| ml_predictions | 2 | Model outputs |
| ml_monitoring | 2 | Drift and performance tracking |
| data_catalog | 6 | Metadata governance |
| data_quality | 2 | Quality check results |
| operational | 3 | Pipeline definitions and runs |
| audit | 2 | Compliance trail |
| Total | ~164 | Complete data platform |
💡 Why 164 tables? This matches the scale of a real mid-size enterprise data platform. Academic benchmarks typically use 5-20 tables. Production enterprises use 500-5000. DataSims sits at the complexity level where multi-agent coordination is necessary but experiments remain tractable.
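To make the star-schema layout concrete, here is a minimal sketch of the kind of rollup query `simshop_dw` supports, run against a toy in-memory SQLite stand-in for `fact_orders` and `dim_customers`. The table and column names follow the diagram above; the schema details and sample rows are illustrative, not the actual DataSims DDL.

```python
import sqlite3

# Toy in-memory stand-in for two simshop_dw tables.
# Table names match the diagram; columns and rows are invented for illustration.
con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE dim_customers (
    customer_key INTEGER PRIMARY KEY,
    customer_id  INTEGER,
    segment      TEXT,
    is_current   INTEGER   -- SCD2 current-row flag
);
CREATE TABLE fact_orders (
    order_key    INTEGER PRIMARY KEY,
    customer_key INTEGER REFERENCES dim_customers(customer_key),
    order_total  REAL
);
INSERT INTO dim_customers VALUES (1, 42, 'loyal', 1), (2, 43, 'new', 1);
INSERT INTO fact_orders VALUES (10, 1, 120.0), (11, 1, 80.0), (12, 2, 50.0);
""")

# Typical star-schema rollup: revenue by customer segment,
# joining the fact table to the current dimension rows.
rows = con.execute("""
    SELECT d.segment, SUM(f.order_total) AS revenue
    FROM fact_orders f
    JOIN dim_customers d ON d.customer_key = f.customer_key
    WHERE d.is_current = 1
    GROUP BY d.segment
    ORDER BY d.segment
""").fetchall()
print(rows)  # [('loyal', 200.0), ('new', 50.0)]
```

The same query shape runs unchanged against the real warehouse via `psql` once the environment is up.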
ETL Pipeline Catalog #
SimShop has 15 ETL pipelines that mirror real enterprise scheduling:
| Time | Pipeline | Source → Target |
|---|---|---|
| 2 AM | raw_to_staging_customers | OLTP → Staging |
| 2 AM | raw_to_staging_orders | OLTP → Staging |
| 3 AM | raw_to_staging_events | OLTP → Staging |
| 4 AM | staging_to_dim_customers (SCD2) | Staging → DW |
| 4 AM | staging_to_dim_products | Staging → DW |
| 5 AM | staging_to_fact_orders | Staging → DW |
| 5 AM | staging_to_fact_activity | Staging → DW |
| 6 AM | dw_to_churn_features | DW → ML Features |
| 6 AM | dw_to_rec_features | DW → ML Features |
| 6 AM | dw_to_ltv_features | DW → ML Features |
| 7 AM | churn_model_scoring | Features → Predictions |
| 8 AM | daily_revenue_agg | DW → Reports |
| 1st of month | monthly_segment_agg | DW → Reports |
| 9 AM | data_quality_checks | DW → Quality |
| 10 AM | drift_detection | Predictions → Monitoring |
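The 4 AM `staging_to_dim_customers` job is marked SCD2 (Slowly Changing Dimension, Type 2): instead of overwriting a changed customer attribute, it expires the current dimension row and appends a new version. A minimal sketch of that logic, with hypothetical row fields (real jobs also handle surrogate keys, change hashing, and batching):

```python
from datetime import date

def scd2_apply(dim_rows, customer_id, new_attrs, change_date):
    """Apply an SCD Type 2 change: expire the current row, append a new version.
    Simplified sketch of what a staging_to_dim_customers-style job does."""
    for row in dim_rows:
        if row["customer_id"] == customer_id and row["is_current"]:
            row["is_current"] = False          # expire the old version
            row["valid_to"] = change_date
    dim_rows.append({
        "customer_id": customer_id,
        **new_attrs,
        "valid_from": change_date,
        "valid_to": None,                      # open-ended current row
        "is_current": True,
    })
    return dim_rows

dim = [{"customer_id": 42, "segment": "new", "valid_from": date(2024, 1, 1),
        "valid_to": None, "is_current": True}]
scd2_apply(dim, 42, {"segment": "loyal"}, date(2024, 6, 1))
current = [r for r in dim if r["is_current"]]
print(len(dim), current[0]["segment"])  # 2 loyal
```

History is preserved: the old "new" row remains queryable with its validity window, which is what makes point-in-time joins from `fact_orders` possible.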
The pipeline dependencies form a DAG:
flowchart LR
    A["OLTP"] --> B["Staging"]
    B --> C["DW"]
    C --> D["Features"]
    D --> E["Predictions"]
    E --> F["Monitoring"]
    C --> G["Reports"]
    C --> H["Quality Checks"]
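A DAG implies a valid execution order, which is exactly what the staggered 2 AM-10 AM schedule above encodes. A sketch using the standard-library `graphlib.TopologicalSorter` (Python 3.9+) over the stage-level dependencies, mapping each stage to its upstream sources:

```python
from graphlib import TopologicalSorter

# Stage-level dependencies from the diagram: stage -> set of upstream stages.
deps = {
    "Staging": {"OLTP"},
    "DW": {"Staging"},
    "Features": {"DW"},
    "Predictions": {"Features"},
    "Monitoring": {"Predictions"},
    "Reports": {"DW"},
    "Quality Checks": {"DW"},
}

# static_order() yields every stage after all of its upstream stages.
order = list(TopologicalSorter(deps).static_order())
print(order)
```

Any scheduler respecting this order (whether cron offsets, as here, or an orchestrator like Airflow) guarantees each pipeline reads fully refreshed inputs.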
The 10 Controlled Data Quality Issues #
DataSims intentionally injects quality issues at known rates and times. This is the key differentiator from real production data: every issue is controlled, measurable, and reproducible.
gantt
    title Quality Issue Injection Phases (Months 1-24)
    dateFormat X
    axisFormat %s
    section Phases
    Phase 1 - Baseline (Clean data, No issues) :done, p1, 1, 12
    Phase 2 - Degradation (Nulls, dupes, late data) :active, p2, 13, 18
    Phase 3+4 - Schema drift, concept drift :p3, 19, 24
| # | Phase | Issue | Rate/Details | Agent Challenge |
|---|---|---|---|---|
| 1 | 2 (Mo 13-18) | Null values in non-key columns | 3% of rows | Test imputation handling |
| 2 | 2 (Mo 13-18) | Duplicate records | 1% of rows | Test deduplication logic |
| 3 | 2 (Mo 13-18) | Late-arriving data | 2% arrive 24-48h late | Test temporal consistency |
| 4 | 3 (Mo 18) | Column added to products | product_weight_kg | Test schema drift detection |
| 5 | 3 (Mo 18) | Column renamed in events | event_type to action_type | Test schema change handling |
| 6 | 4 (Mo 20-22) | Feature distribution shift | Spending patterns change | Test data drift detection |
| 7 | 4 (Mo 22) | Target pattern change | Churn rate shifts | Test concept drift detection |
| 8 | 5 (Mo 12) | Black Friday volume spike | 5x normal volume | Test scalability |
| 9 | 5 (Mo 15) | Events source outage | 48-hour gap | Test pipeline resilience |
| 10 | All | Bot traffic in events | ~5% of events | Test filtering logic |
🎯 Why controlled issues matter: In production, data quality issues are discovered after they cause damage. In DataSims, issues are injected at known rates so we can measure exactly how well the agent stack detects, classifies, and handles each type of issue.
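Controlled injection hinges on seeded randomness: with a fixed seed, the same rows are corrupted in every run, so detection rates can be compared across experiments. A sketch of how a null-injection step (issue #1, 3% of rows) can be made reproducible; this is illustrative, not the actual generator code from the Data-Sims repository:

```python
import random

def inject_nulls(rows, columns, rate, seed=0):
    """Null out the given non-key columns at a fixed rate.
    A fixed seed makes the corruption fully reproducible across runs."""
    rng = random.Random(seed)
    for row in rows:
        for col in columns:
            if rng.random() < rate:
                row[col] = None
    return rows

# 1,000 toy customer rows; corrupt 'email' at the configured 3% rate.
rows = [{"id": i, "email": f"u{i}@example.com"} for i in range(1000)]
inject_nulls(rows, ["email"], rate=0.03, seed=42)
null_count = sum(1 for r in rows if r["email"] is None)
print(null_count / len(rows))  # close to the configured 3% rate
```

Rerunning with `seed=42` nulls exactly the same rows, so an agent's imputation behavior can be measured against a known ground truth.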
Docker Architecture #
DataSims runs as 10 Docker containers orchestrated by Docker Compose:
flowchart TB
subgraph ENV["DataSims Docker Environment"]
direction TB
subgraph ROW1[" "]
direction LR
PG["PostgreSQL 16\n164 tables\n12 schemas\n:5432"]
ML["MLflow 2.x\nTracking +\nModel Registry\n:5000"]
UC["Unity Catalog\nMetadata +\nLineage +\nGovernance\n:8070"]
end
subgraph ROW2[" "]
direction LR
EV["Evidently\nML Monitor\nData Quality\n:8000"]
MS["Model Serving\nFlask API\nPredictions\n:8080"]
DG["Data Gen\nSynthetic\nData Engine\n(batch)"]
end
subgraph ROW3[" "]
direction LR
JL["Jupyter Lab\nNotebooks\n:8888"]
PR["Prometheus\nMetrics\n:9090"]
GR["Grafana\nDashboards\n:3000"]
end
subgraph AGENTS["Neam Agents (external, connect to Docker services)"]
AG["DIO → Data-BA → DS → Causal → DataTest → MLOps"]
end
end
Service Details #
| Service | Image | Port | Purpose |
|---|---|---|---|
| PostgreSQL | postgres:16-alpine | 5432 | All data storage (164 tables, 12 schemas) |
| MLflow | ghcr.io/mlflow/mlflow:v2.18.0 | 5000 | Experiment tracking, model registry |
| Unity Catalog | unitycatalog/unitycatalog:0.2.1 | 8070 | Metadata, lineage, governance |
| Evidently | evidentlyai/evidently-ui:latest | 8000 | ML monitoring, data quality dashboards |
| Model Serving | Custom (Flask) | 8080 | Prediction API endpoints |
| Data Generator | Custom (Python) | -- | Synthetic data generation |
| Jupyter Lab | jupyter/datascience-notebook | 8888 | Interactive analysis |
| Prometheus | prom/prometheus:latest | 9090 | Infrastructure metrics |
| Grafana | grafana/grafana:latest | 3000 | Dashboards and visualization |
Resource Requirements #
| Resource | Minimum | Recommended |
|---|---|---|
| Docker | 24.0+ | Latest stable |
| Disk | 10 GB | 20 GB (for large scale) |
| RAM | 8 GB | 16 GB |
| CPU | 4 cores | 8 cores |
Setting Up DataSims #
Quick Start (5 minutes) #
# 1. Clone the repository
git clone https://github.com/neam-lang/Data-Sims.git
cd Data-Sims
# 2. Run the automated setup script
./scripts/setup.sh small # small / medium / large
# 3. Verify services are running
cd docker
docker compose ps
Manual Setup #
# 1. Start all 10 services
cd Data-Sims/docker
docker compose up -d --build
# 2. Wait for PostgreSQL health check
echo "Waiting for PostgreSQL..."
until docker compose exec -T postgres pg_isready -U datasims -d simshop 2>/dev/null
do sleep 2; done
echo "Ready!"
# 3. Generate synthetic data
docker compose exec datagen python /app/generators/generate_all.py --scale small
Data Scale Options #
| Scale | Customers | Products | Orders | Events | Setup Time |
|---|---|---|---|---|---|
| small | 10K | 1K | 200K | 5M | ~2 min |
| medium | 100K | 10K | 2M | 50M | ~15 min |
| large | 1M | 50K | 20M | 500M | ~2 hours |
⚠️ Start with `small` for development and testing. The `medium` scale is used for the DataSims experiments cited throughout this book. The `large` scale is for production-scale benchmarking.
Verify the Environment #
# Check table counts
docker compose exec -T postgres psql -U datasims -d simshop -c "
SELECT schemaname, COUNT(*) as table_count
FROM pg_tables
WHERE schemaname LIKE 'simshop%'
OR schemaname IN ('ml_features','ml_predictions',
'ml_monitoring','data_catalog','data_quality',
'operational','audit')
GROUP BY schemaname
ORDER BY schemaname;
"
# Test the Model API
curl http://localhost:8080/health
# Test MLflow
curl -s http://localhost:5000/api/2.0/mlflow/experiments/search
Running Neam Agents Against DataSims #
Once the environment is running, point the Neam agents at it:
# Set environment variables
export SIMSHOP_PG_URL="postgresql://datasims:datasims_2026@localhost:5432/simshop"
export MLFLOW_TRACKING_URI="http://localhost:5000"
export UNITY_CATALOG_URI="http://localhost:8070"
export EVIDENTLY_URI="http://localhost:8000"
# Compile and run the churn prediction orchestration
cd Data-Sims/neam-agents
neamc programs/simshop_churn.neam -o /tmp/simshop_churn.neamb
neam /tmp/simshop_churn.neamb
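A misconfigured endpoint is the most common cause of a failed first run, so it is worth failing fast before launching the agents. A small pre-flight sketch (a convenience helper, not part of the Neam runtime) that checks the four variables exported above are present:

```python
import os

# Service endpoints the agents expect, as exported in this section.
REQUIRED_VARS = ["SIMSHOP_PG_URL", "MLFLOW_TRACKING_URI",
                 "UNITY_CATALOG_URI", "EVIDENTLY_URI"]

def load_config(env=None):
    """Collect the required endpoint variables, raising if any are unset."""
    env = os.environ if env is None else env
    missing = [v for v in REQUIRED_VARS if not env.get(v)]
    if missing:
        raise RuntimeError(f"Missing environment variables: {missing}")
    return {v: env[v] for v in REQUIRED_VARS}
```

Calling `load_config()` before `neam` runs turns a cryptic mid-orchestration connection error into an immediate, named failure.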
What to Check After a Run #
| What | Where | Command/URL |
|---|---|---|
| Trained models | MLflow | http://localhost:5000 |
| Feature tables | PostgreSQL | SELECT * FROM ml_features.churn_features LIMIT 10; |
| Predictions | PostgreSQL | SELECT * FROM ml_predictions.churn_scores LIMIT 20; |
| Data quality | Evidently | http://localhost:8000 |
| Drift checks | PostgreSQL | SELECT * FROM ml_monitoring.drift_checks; |
| Pipeline runs | PostgreSQL | SELECT * FROM operational.pipeline_runs; |
| Lineage | Unity Catalog | http://localhost:8070 |
| API predictions | Model API | curl http://localhost:8080/v1/churn/predict -d '{"customer_id": 42}' |
The Five Problem Statements #
DataSims defines five problem statements of varying complexity:
| # | Problem | Complexity | Key Agents | Weight |
|---|---|---|---|---|
| 1 | Customer Churn Prediction | High | Data-BA, ETL, DS, Causal, DataTest, MLOps | 25% |
| 2 | Product Recommendation | Medium | Data-BA, ETL, DS, DataTest | 20% |
| 3 | Revenue Anomaly Root Cause Analysis | Medium | Analyst, Causal, DS | 20% |
| 4 | Pipeline Failure Investigation | Medium | DataOps, Causal, ETL | 20% |
| 5 | GDPR Compliance Audit | Low | Governance, DataTest, Data-BA | 15% |
The churn prediction problem (Problem 1) is the primary evaluation task used throughout this book. Chapters 26 and 27 walk through it in complete detail.
The 7-Dimension Evaluation Framework #
Every experiment is scored across 7 dimensions:
| Dimension | Weight | What It Measures |
|---|---|---|
| Speed | 20% | Time to completion |
| Quality | 25% | Output quality scores (0-100) |
| Reliability | 15% | Error detection, recovery time |
| Traceability | 15% | Requirements → tests coverage |
| Documentation | 10% | BRD, specs completeness |
| Cost Efficiency | 10% | Compute + LLM cost |
| Adaptability | 5% | Response to quality issues |
These dimensions combine into the Composite Effectiveness Score (CES), a single number from 0 to 1 that captures overall system effectiveness.
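As a weighted combination of normalized dimension scores, the CES is straightforward to compute. A sketch using the weights from the table above, with illustrative per-dimension scores on a [0, 1] scale (the exact normalization DataSims applies to raw measurements may differ):

```python
# Weights from the 7-dimension table above (they sum to 1.0).
WEIGHTS = {
    "speed": 0.20, "quality": 0.25, "reliability": 0.15,
    "traceability": 0.15, "documentation": 0.10,
    "cost_efficiency": 0.10, "adaptability": 0.05,
}

def composite_effectiveness_score(scores):
    """Weighted average of per-dimension scores, each normalized to [0, 1]."""
    assert abs(sum(WEIGHTS.values()) - 1.0) < 1e-9
    return sum(WEIGHTS[d] * scores[d] for d in WEIGHTS)

# Illustrative (invented) dimension scores for one experimental run.
example = {"speed": 0.8, "quality": 0.9, "reliability": 0.7,
           "traceability": 0.95, "documentation": 0.6,
           "cost_efficiency": 0.85, "adaptability": 0.5}
print(round(composite_effectiveness_score(example), 4))  # 0.8025
```

Because Quality carries the largest weight (25%), a run that is fast but produces poor outputs scores lower than a slower, higher-quality one, which is the intended trade-off.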
Reproducibility Guarantee #
DataSims achieves 100% reproducibility across 50 runs (10 conditions x 5 repetitions):
| Property | Value |
|---|---|
| Total runs | 50 |
| Total successes | 50 |
| Success rate | 100% |
| Platform | Darwin arm64 |
| Neam version | 0.8.0 |
The deterministic design ensures that every researcher running python3 evaluation/run_experiments.py on the same DataSims environment will produce identical results.
🎯 Reproducibility is non-negotiable for scientific claims. Every number cited in this book can be independently verified by cloning the DataSims repository and running the evaluation suite.
Key Takeaways #
- DataSims is a fully containerized simulation of a mid-size e-commerce company (SimShop)
- 164 database tables across 12 schemas provide realistic enterprise complexity
- 15 ETL pipelines mirror real scheduling patterns and dependency chains
- 10 controlled data quality issues test agent robustness at known rates
- 10 Docker services provide the complete ML platform stack (PostgreSQL, MLflow, Unity Catalog, Evidently, and more)
- Setup takes 5 minutes with `./scripts/setup.sh small`
- All experiments are 100% reproducible (50/50 runs successful)
- The 7-dimension evaluation framework provides rigorous, multi-faceted scoring
For Further Exploration #
- DataSims Repository -- Clone and run the experiments yourself
- Chapter 26 -- The complete churn prediction experiment, end to end
- Chapter 27 -- Ablation study proving every agent matters