Chapter 25 — DataSims: A Simulated Enterprise for Agent Evaluation #
"All models are wrong, but some are useful." -- George Box
📖 25 min read | 👤 Dr. Chen (Researcher), David (VP Data), All personas | 🏷️ Part VII: Proof
What you'll learn:
- Why a simulated enterprise is necessary for rigorous agent evaluation
- The SimShop platform: 164 tables, 12 schemas, 15 ETL pipelines
- The 10 controlled data quality issues and why each exists
- The Docker architecture: 10 services working together
- How to set up the environment and run your own experiments
The Problem: How Do You Test an Orchestra? #
Dr. Chen, the researcher, faces a fundamental challenge. She wants to evaluate the Neam agent stack -- but on what? Production data is confidential, messy, and unreproducible. Toy datasets are too simple to stress-test multi-agent coordination. Academic benchmarks test individual capabilities, not lifecycle orchestration.
She needs something that does not exist in the literature: a complete, realistic, controlled, reproducible enterprise data environment where she can run experiments, ablate components, and measure outcomes with statistical rigor.
DataSims is that environment.
What Is DataSims? #
DataSims is a fully containerized simulation of a mid-size e-commerce company called SimShop. It provides everything a real enterprise data platform would have -- databases, ETL pipelines, ML infrastructure, metadata governance, monitoring -- but in a controlled environment where every variable is known and every experiment is reproducible.
| Attribute | Value |
|---|---|
| Company | SimShop (simulated e-commerce) |
| Revenue | $50M annual |
| Customers | 100K registered, 30K active monthly |
| Products | 10K SKUs across 50 categories |
| Data Period | Jan 2024 - Dec 2025 (24 months) |
| Channels | Web (70%), Mobile (25%), App (5%) |
| Markets | US (60%), EU (25%), APAC (15%) |
| Database | 164 tables across 12 schemas |
| ETL Pipelines | 15 scheduled jobs |
| Quality Issues | 10 controlled injections |
| Docker Services | 10 containers |
| Repository | https://github.com/neam-lang/Data-Sims |
Database Schema Architecture #
The SimShop database is built on PostgreSQL 16 with 12 schemas representing different functional areas of a modern data platform:
flowchart TB
    ROOT["simshop (PostgreSQL 16)"]
    OLTP["simshop_oltp (20 tables)\nSource transactional system"]
    OLTP_ITEMS["customers, customer_addresses\nproducts, product_categories, inventory\norders, order_items, order_returns\npayments, coupons\nevents (clickstream)\nsupport_tickets\ncampaigns, campaign_sends\nproduct_reviews\nsuppliers, wishlists"]
    STAGING["simshop_staging (intermediate)\nCleaned, validated"]
    DW["simshop_dw (15 tables)\nStar schema warehouse"]
    DW_ITEMS["dim_date, dim_customers (SCD2), dim_products\ndim_channels, dim_geography, dim_campaigns\nfact_orders, fact_customer_activity\nfact_campaign_performance, fact_support\nfact_inventory_snapshot\nagg_daily_revenue, agg_monthly_segment, agg_product_performance"]
    MLF["ml_features (3 tables)\nFeature store"]
    MLF_ITEMS["churn_features (47 features)\nrecommendation_features\nltv_features"]
    MLP["ml_predictions (2 tables)\nModel outputs"]
    MLP_ITEMS["churn_scores (with SHAP drivers)\nrecommendation_scores"]
    MLM["ml_monitoring (2 tables)\nDrift & performance"]
    MLM_ITEMS["drift_checks\nmodel_performance"]
    DC["data_catalog (6 tables)\nUnity Catalog simulation"]
    DC_ITEMS["schemas, tables, columns\nlineage, glossary, access_policies"]
    DQ["data_quality (2 tables)\nQuality checks"]
    DQ_ITEMS["check_results\nprofiling_results"]
    OP["operational (3 tables)\nPipeline metadata"]
    OP_ITEMS["pipeline_definitions (15 ETL jobs)\npipeline_runs\nalerts"]
    AU["audit (2 tables)\nCompliance trail"]
    AU_ITEMS["data_access_log\nchange_log"]
    ROOT --> OLTP --> OLTP_ITEMS
    ROOT --> STAGING
    ROOT --> DW --> DW_ITEMS
    ROOT --> MLF --> MLF_ITEMS
    ROOT --> MLP --> MLP_ITEMS
    ROOT --> MLM --> MLM_ITEMS
    ROOT --> DC --> DC_ITEMS
    ROOT --> DQ --> DQ_ITEMS
    ROOT --> OP --> OP_ITEMS
    ROOT --> AU --> AU_ITEMS
Table Count by Schema #
| Schema | Tables | Purpose |
|---|---|---|
| simshop_oltp | 20 | Source transactional data |
| simshop_staging | ~20 | Cleaned intermediate tables |
| simshop_dw | 15 | Star schema warehouse |
| ml_features | 3 | Feature store |
| ml_predictions | 2 | Model outputs |
| ml_monitoring | 2 | Drift and performance tracking |
| data_catalog | 6 | Metadata governance |
| data_quality | 2 | Quality check results |
| operational | 3 | Pipeline definitions and runs |
| audit | 2 | Compliance trail |
| Total | ~164 | Complete data platform |
💡 Why 164 tables? This matches the scale of a real mid-size enterprise data platform. Academic benchmarks typically use 5-20 tables. Production enterprises use 500-5000. DataSims sits at the complexity level where multi-agent coordination is necessary but experiments remain tractable.
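To make the star-schema layout concrete, here is a minimal sketch of the kind of rollup query `simshop_dw` supports, run against a toy in-memory SQLite stand-in for `fact_orders` and `dim_customers`. The table and column names follow the diagram above; the schema details and sample rows are illustrative, not the actual DataSims DDL.

```python
import sqlite3

# Toy in-memory stand-in for two simshop_dw tables.
# Table names match the diagram; columns and rows are invented for illustration.
con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE dim_customers (
    customer_key INTEGER PRIMARY KEY,
    customer_id  INTEGER,
    segment      TEXT,
    is_current   INTEGER   -- SCD2 current-row flag
);
CREATE TABLE fact_orders (
    order_key    INTEGER PRIMARY KEY,
    customer_key INTEGER REFERENCES dim_customers(customer_key),
    order_total  REAL
);
INSERT INTO dim_customers VALUES (1, 42, 'loyal', 1), (2, 43, 'new', 1);
INSERT INTO fact_orders VALUES (10, 1, 120.0), (11, 1, 80.0), (12, 2, 50.0);
""")

# Typical star-schema rollup: revenue by customer segment,
# joining the fact table to the current dimension rows.
rows = con.execute("""
    SELECT d.segment, SUM(f.order_total) AS revenue
    FROM fact_orders f
    JOIN dim_customers d ON d.customer_key = f.customer_key
    WHERE d.is_current = 1
    GROUP BY d.segment
    ORDER BY d.segment
""").fetchall()
print(rows)  # [('loyal', 200.0), ('new', 50.0)]
```

The same query shape runs unchanged against the real warehouse via `psql` once the environment is up.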
ETL Pipeline Catalog #
SimShop has 15 ETL pipelines that mirror real enterprise scheduling:
| Time | Pipeline | Source → Target |
|---|---|---|
| 2 AM | raw_to_staging_customers | OLTP → Staging |
| 2 AM | raw_to_staging_orders | OLTP → Staging |
| 3 AM | raw_to_staging_events | OLTP → Staging |
| 4 AM | staging_to_dim_customers (SCD2) | Staging → DW |
| 4 AM | staging_to_dim_products | Staging → DW |
| 5 AM | staging_to_fact_orders | Staging → DW |
| 5 AM | staging_to_fact_activity | Staging → DW |
| 6 AM | dw_to_churn_features | DW → ML Features |
| 6 AM | dw_to_rec_features | DW → ML Features |
| 6 AM | dw_to_ltv_features | DW → ML Features |
| 7 AM | churn_model_scoring | Features → Predictions |
| 8 AM | daily_revenue_agg | DW → Reports |
| 1st of month | monthly_segment_agg | DW → Reports |
| 9 AM | data_quality_checks | DW → Quality |
| 10 AM | drift_detection | Predictions → Monitoring |
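The 4 AM `staging_to_dim_customers` job is marked SCD2 (Slowly Changing Dimension, Type 2): instead of overwriting a changed customer attribute, it expires the current dimension row and appends a new version. A minimal sketch of that logic, with hypothetical row fields (real jobs also handle surrogate keys, change hashing, and batching):

```python
from datetime import date

def scd2_apply(dim_rows, customer_id, new_attrs, change_date):
    """Apply an SCD Type 2 change: expire the current row, append a new version.
    Simplified sketch of what a staging_to_dim_customers-style job does."""
    for row in dim_rows:
        if row["customer_id"] == customer_id and row["is_current"]:
            row["is_current"] = False          # expire the old version
            row["valid_to"] = change_date
    dim_rows.append({
        "customer_id": customer_id,
        **new_attrs,
        "valid_from": change_date,
        "valid_to": None,                      # open-ended current row
        "is_current": True,
    })
    return dim_rows

dim = [{"customer_id": 42, "segment": "new", "valid_from": date(2024, 1, 1),
        "valid_to": None, "is_current": True}]
scd2_apply(dim, 42, {"segment": "loyal"}, date(2024, 6, 1))
current = [r for r in dim if r["is_current"]]
print(len(dim), current[0]["segment"])  # 2 loyal
```

History is preserved: the old "new" row remains queryable with its validity window, which is what makes point-in-time joins from `fact_orders` possible.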
The pipeline dependencies form a DAG:
flowchart LR
    A["OLTP"] --> B["Staging"]
    B --> C["DW"]
    C --> D["Features"]
    D --> E["Predictions"]
    E --> F["Monitoring"]
    C --> G["Reports"]
    C --> H["Quality Checks"]
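A DAG implies a valid execution order, which is exactly what the staggered 2 AM-10 AM schedule above encodes. A sketch using the standard-library `graphlib.TopologicalSorter` (Python 3.9+) over the stage-level dependencies, mapping each stage to its upstream sources:

```python
from graphlib import TopologicalSorter

# Stage-level dependencies from the diagram: stage -> set of upstream stages.
deps = {
    "Staging": {"OLTP"},
    "DW": {"Staging"},
    "Features": {"DW"},
    "Predictions": {"Features"},
    "Monitoring": {"Predictions"},
    "Reports": {"DW"},
    "Quality Checks": {"DW"},
}

# static_order() yields every stage after all of its upstream stages.
order = list(TopologicalSorter(deps).static_order())
print(order)
```

Any scheduler respecting this order (whether cron offsets, as here, or an orchestrator like Airflow) guarantees each pipeline reads fully refreshed inputs.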
The 10 Controlled Data Quality Issues #
DataSims intentionally injects quality issues at known rates and times. This is the key differentiator from real production data: every issue is controlled, measurable, and reproducible.
gantt
    title Quality Issue Injection Phases (Months 1-24)
    dateFormat X
    axisFormat %s
    section Phases
    Phase 1 - Baseline (Clean data, No issues) :done, p1, 1, 12
    Phase 2 - Degradation (Nulls, dupes, late data) :active, p2, 13, 18
    Phase 3+4 - Schema drift, concept drift :p3, 19, 24
| # | Phase | Issue | Rate/Details | Agent Challenge |
|---|---|---|---|---|
| 1 | 2 (Mo 13-18) | Null values in non-key columns | 3% of rows | Test imputation handling |
| 2 | 2 (Mo 13-18) | Duplicate records | 1% of rows | Test deduplication logic |
| 3 | 2 (Mo 13-18) | Late-arriving data | 2% arrive 24-48h late | Test temporal consistency |
| 4 | 3 (Mo 18) | Column added to products | product_weight_kg | Test schema drift detection |
| 5 | 3 (Mo 18) | Column renamed in events | event_type to action_type | Test schema change handling |
| 6 | 4 (Mo 20-22) | Feature distribution shift | Spending patterns change | Test data drift detection |
| 7 | 4 (Mo 22) | Target pattern change | Churn rate shifts | Test concept drift detection |
| 8 | 5 (Mo 12) | Black Friday volume spike | 5x normal volume | Test scalability |
| 9 | 5 (Mo 15) | Events source outage | 48-hour gap | Test pipeline resilience |
| 10 | All | Bot traffic in events | ~5% of events | Test filtering logic |
🎯 Why controlled issues matter: In production, data quality issues are discovered after they cause damage. In DataSims, issues are injected at known rates so we can measure exactly how well the agent stack detects, classifies, and handles each type of issue.
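Controlled injection hinges on seeded randomness: with a fixed seed, the same rows are corrupted in every run, so detection rates can be compared across experiments. A sketch of how a null-injection step (issue #1, 3% of rows) can be made reproducible; this is illustrative, not the actual generator code from the Data-Sims repository:

```python
import random

def inject_nulls(rows, columns, rate, seed=0):
    """Null out the given non-key columns at a fixed rate.
    A fixed seed makes the corruption fully reproducible across runs."""
    rng = random.Random(seed)
    for row in rows:
        for col in columns:
            if rng.random() < rate:
                row[col] = None
    return rows

# 1,000 toy customer rows; corrupt 'email' at the configured 3% rate.
rows = [{"id": i, "email": f"u{i}@example.com"} for i in range(1000)]
inject_nulls(rows, ["email"], rate=0.03, seed=42)
null_count = sum(1 for r in rows if r["email"] is None)
print(null_count / len(rows))  # close to the configured 3% rate
```

Rerunning with `seed=42` nulls exactly the same rows, so an agent's imputation behavior can be measured against a known ground truth.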
Docker Architecture #
DataSims runs as 10 Docker containers orchestrated by Docker Compose:
flowchart TB
subgraph ENV["DataSims Docker Environment"]
direction TB
subgraph ROW1[" "]
direction LR
PG["PostgreSQL 16\n164 tables\n12 schemas\n:5432"]
ML["MLflow 2.x\nTracking +\nModel Registry\n:5000"]
UC["Unity Catalog\nMetadata +\nLineage +\nGovernance\n:8070"]
end
subgraph ROW2[" "]
direction LR
EV["Evidently\nML Monitor\nData Quality\n:8000"]
MS["Model Serving\nFlask API\nPredictions\n:8080"]
DG["Data Gen\nSynthetic\nData Engine\n(batch)"]
end
subgraph ROW3[" "]
direction LR
JL["Jupyter Lab\nNotebooks\n:8888"]
PR["Prometheus\nMetrics\n:9090"]
GR["Grafana\nDashboards\n:3000"]
end
subgraph AGENTS["Neam Agents (external, connect to Docker services)"]
AG["DIO → Data-BA → DS → Causal → DataTest → MLOps"]
end
end
Service Details #
| Service | Image | Port | Purpose |
|---|---|---|---|
| PostgreSQL | postgres:16-alpine | 5432 | All data storage (164 tables, 12 schemas) |
| MLflow | ghcr.io/mlflow/mlflow:v2.18.0 | 5000 | Experiment tracking, model registry |
| Unity Catalog | unitycatalog/unitycatalog:0.2.1 | 8070 | Metadata, lineage, governance |
| Evidently | evidentlyai/evidently-ui:latest | 8000 | ML monitoring, data quality dashboards |
| Model Serving | Custom (Flask) | 8080 | Prediction API endpoints |
| Data Generator | Custom (Python) | -- | Synthetic data generation |
| Jupyter Lab | jupyter/datascience-notebook | 8888 | Interactive analysis |
| Prometheus | prom/prometheus:latest | 9090 | Infrastructure metrics |
| Grafana | grafana/grafana:latest | 3000 | Dashboards and visualization |
Resource Requirements #
| Resource | Minimum | Recommended |
|---|---|---|
| Docker | 24.0+ | Latest stable |
| Disk | 10 GB | 20 GB (for large scale) |
| RAM | 8 GB | 16 GB |
| CPU | 4 cores | 8 cores |
Setting Up DataSims #
Quick Start (5 minutes) #
# 1. Clone the repository
git clone https://github.com/neam-lang/Data-Sims.git
cd Data-Sims
# 2. Run the automated setup script
./scripts/setup.sh small # small / medium / large
# 3. Verify services are running
cd docker
docker compose ps
Manual Setup #
# 1. Start all 10 services
cd Data-Sims/docker
docker compose up -d --build
# 2. Wait for PostgreSQL health check
echo "Waiting for PostgreSQL..."
until docker compose exec -T postgres pg_isready -U datasims -d simshop 2>/dev/null
do sleep 2; done
echo "Ready!"
# 3. Generate synthetic data
docker compose exec datagen python /app/generators/generate_all.py --scale small
Data Scale Options #
| Scale | Customers | Products | Orders | Events | Setup Time |
|---|---|---|---|---|---|
| small | 10K | 1K | 200K | 5M | ~2 min |
| medium | 100K | 10K | 2M | 50M | ~15 min |
| large | 1M | 50K | 20M | 500M | ~2 hours |
⚠️ Start with `small` for development and testing. The `medium` scale is used for the DataSims experiments cited throughout this book. The `large` scale is for production-scale benchmarking.
Verify the Environment #
# Check table counts
docker compose exec -T postgres psql -U datasims -d simshop -c "
SELECT schemaname, COUNT(*) as table_count
FROM pg_tables
WHERE schemaname LIKE 'simshop%'
OR schemaname IN ('ml_features','ml_predictions',
'ml_monitoring','data_catalog','data_quality',
'operational','audit')
GROUP BY schemaname
ORDER BY schemaname;
"
# Test the Model API
curl http://localhost:8080/health
# Test MLflow
curl -s http://localhost:5000/api/2.0/mlflow/experiments/search
Running Neam Agents Against DataSims #
Once the environment is running, point the Neam agents at it:
# Set environment variables
export SIMSHOP_PG_URL="postgresql://datasims:datasims_2026@localhost:5432/simshop"
export MLFLOW_TRACKING_URI="http://localhost:5000"
export UNITY_CATALOG_URI="http://localhost:8070"
export EVIDENTLY_URI="http://localhost:8000"
# Compile and run the churn prediction orchestration
cd Data-Sims/neam-agents
neamc programs/simshop_churn.neam -o /tmp/simshop_churn.neamb
neam /tmp/simshop_churn.neamb
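A misconfigured endpoint is the most common cause of a failed first run, so it is worth failing fast before launching the agents. A small pre-flight sketch (a convenience helper, not part of the Neam runtime) that checks the four variables exported above are present:

```python
import os

# Service endpoints the agents expect, as exported in this section.
REQUIRED_VARS = ["SIMSHOP_PG_URL", "MLFLOW_TRACKING_URI",
                 "UNITY_CATALOG_URI", "EVIDENTLY_URI"]

def load_config(env=None):
    """Collect the required endpoint variables, raising if any are unset."""
    env = os.environ if env is None else env
    missing = [v for v in REQUIRED_VARS if not env.get(v)]
    if missing:
        raise RuntimeError(f"Missing environment variables: {missing}")
    return {v: env[v] for v in REQUIRED_VARS}
```

Calling `load_config()` before `neam` runs turns a cryptic mid-orchestration connection error into an immediate, named failure.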
What to Check After a Run #
| What | Where | Command/URL |
|---|---|---|
| Trained models | MLflow | http://localhost:5000 |
| Feature tables | PostgreSQL | SELECT * FROM ml_features.churn_features LIMIT 10; |
| Predictions | PostgreSQL | SELECT * FROM ml_predictions.churn_scores LIMIT 20; |
| Data quality | Evidently | http://localhost:8000 |
| Drift checks | PostgreSQL | SELECT * FROM ml_monitoring.drift_checks; |
| Pipeline runs | PostgreSQL | SELECT * FROM operational.pipeline_runs; |
| Lineage | Unity Catalog | http://localhost:8070 |
| API predictions | Model API | curl http://localhost:8080/v1/churn/predict -d '{"customer_id": 42}' |
The Five Problem Statements #
DataSims defines five problem statements of varying complexity:
| # | Problem | Complexity | Key Agents | Weight |
|---|---|---|---|---|
| 1 | Customer Churn Prediction | High | Data-BA, ETL, DS, Causal, DataTest, MLOps | 25% |
| 2 | Product Recommendation | Medium | Data-BA, ETL, DS, DataTest | 20% |
| 3 | Revenue Anomaly Root Cause Analysis | Medium | Analyst, Causal, DS | 20% |
| 4 | Pipeline Failure Investigation | Medium | DataOps, Causal, ETL | 20% |
| 5 | GDPR Compliance Audit | Low | Governance, DataTest, Data-BA | 15% |
The churn prediction problem (Problem 1) is the primary evaluation task used throughout this book. Chapters 26 and 27 walk through it in complete detail.
The 7-Dimension Evaluation Framework #
Every experiment is scored across 7 dimensions:
| Dimension | Weight | What It Measures |
|---|---|---|
| Speed | 20% | Time to completion |
| Quality | 25% | Output quality scores (0-100) |
| Reliability | 15% | Error detection, recovery time |
| Traceability | 15% | Requirements → tests coverage |
| Documentation | 10% | BRD, specs completeness |
| Cost Efficiency | 10% | Compute + LLM cost |
| Adaptability | 5% | Response to quality issues |
These dimensions combine into the Composite Effectiveness Score (CES), a single number from 0 to 1 that captures overall system effectiveness.
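As a weighted combination of normalized dimension scores, the CES is straightforward to compute. A sketch using the weights from the table above, with illustrative per-dimension scores on a [0, 1] scale (the exact normalization DataSims applies to raw measurements may differ):

```python
# Weights from the 7-dimension table above (they sum to 1.0).
WEIGHTS = {
    "speed": 0.20, "quality": 0.25, "reliability": 0.15,
    "traceability": 0.15, "documentation": 0.10,
    "cost_efficiency": 0.10, "adaptability": 0.05,
}

def composite_effectiveness_score(scores):
    """Weighted average of per-dimension scores, each normalized to [0, 1]."""
    assert abs(sum(WEIGHTS.values()) - 1.0) < 1e-9
    return sum(WEIGHTS[d] * scores[d] for d in WEIGHTS)

# Illustrative (invented) dimension scores for one experimental run.
example = {"speed": 0.8, "quality": 0.9, "reliability": 0.7,
           "traceability": 0.95, "documentation": 0.6,
           "cost_efficiency": 0.85, "adaptability": 0.5}
print(round(composite_effectiveness_score(example), 4))  # 0.8025
```

Because Quality carries the largest weight (25%), a run that is fast but produces poor outputs scores lower than a slower, higher-quality one, which is the intended trade-off.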
Reproducibility Guarantee #
DataSims achieves 100% reproducibility across 50 runs (10 conditions x 5 repetitions):
| Property | Value |
|---|---|
| Total runs | 50 |
| Total successes | 50 |
| Success rate | 100% |
| Platform | Darwin arm64 |
| Neam version | 0.8.0 |
The deterministic design ensures that every researcher running python3 evaluation/run_experiments.py on the same DataSims environment will produce identical results.
🎯 Reproducibility is non-negotiable for scientific claims. Every number cited in this book can be independently verified by cloning the DataSims repository and running the evaluation suite.
Key Takeaways #
- DataSims is a fully containerized simulation of a mid-size e-commerce company (SimShop)
- 164 database tables across 12 schemas provide realistic enterprise complexity
- 15 ETL pipelines mirror real scheduling patterns and dependency chains
- 10 controlled data quality issues test agent robustness at known rates
- 10 Docker services provide the complete ML platform stack (PostgreSQL, MLflow, Unity Catalog, Evidently, and more)
- Setup takes 5 minutes with `./scripts/setup.sh small`
- All experiments are 100% reproducible (50/50 runs successful)
- The 7-dimension evaluation framework provides rigorous, multi-faceted scoring
For Further Exploration #
- DataSims Repository -- Clone and run the experiments yourself
- Chapter 26 -- The complete churn prediction experiment, end to end
- Chapter 27 -- Ablation study proving every agent matters