Chapter 17 -- The DataScientist Agent: From EDA to AutoML #
"All models are wrong, but some are useful." -- George E. P. Box
30 min read | Marcus (DS), Sarah (MLOps), Priya (DE), Dr. Chen (Researcher) | Part V: Analytical Intelligence
What you'll learn:
- How the `problem_statement` declaration frames business problems as ML-solvable tasks
- Hypothesis testing as a first-class language construct
- EDA-driven technique selection that adapts to data characteristics
- Volume-aware compute routing: from pandas to GPU clusters in a single declaration
- Feature engineering pipelines with customer_360, behavioral, and RFM patterns
- ML experiment management with AutoML, SHAP explainability, and model registry
- Python virtual environment code execution within the Neam VM
- DataSims proof: AUC=0.847, F1=0.723, 47 features with quality score 0.96
The Problem: The Notebook Graveyard #
Marcus has 347 Jupyter notebooks in his team's shared drive. Forty-two of them contain models that were once considered "ready for production." Three of those models actually made it to production. The rest exist in a state of suspended animation -- too valuable to delete, too disconnected from any deployment pipeline to be useful.
This is the notebook graveyard, and every data science team has one. The notebooks contain brilliant work: careful EDA, clever feature engineering, hyperparameter sweeps that took days of GPU time. But none of that work is reproducible without Marcus personally explaining which cells to run in which order, which environment variables to set, which version of scikit-learn to install, and which of the three final_model_v2_ACTUALLY_FINAL.pkl files is the real one.
The DataScientist Agent solves this by encoding the complete data science workflow -- from problem framing through model registry -- in structured, declarative Neam specifications that are reproducible, auditable, and directly consumable by downstream agents (DataTest for validation, MLOps for deployment).
Problem Framing: The problem_statement Declaration #
Every data science project begins with a question. The problem_statement declaration forces that question to be precise, measurable, and bounded.
problem_statement ChurnPrediction {
business_context: "SimShop enterprise churn has increased from 8% to 14%",
objective: "Predict which customers will churn within 90 days",
problem_type: "binary_classification",
target_variable: "churned_90d",
success_metrics: {
primary: { metric: "auc_roc", threshold: 0.80 },
secondary: [
{ metric: "f1", threshold: 0.65 },
{ metric: "precision_at_10", threshold: 0.75 }
]
},
constraints: {
latency_ms: 200,
fairness: { protected_attributes: ["gender", "age_group"],
max_disparity: 0.1 },
explainability: "shap_per_prediction"
},
timeline: "8 weeks to MVP"
}
Key Insight: The `problem_statement` is not documentation -- it is a contract. The DataTest Agent (Chapter 19) reads `success_metrics.primary.threshold` and automatically generates tests that fail if AUC drops below 0.80. The MLOps Agent (Chapter 20) reads `constraints.latency_ms` and configures serving infrastructure accordingly.
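To make the "contract" idea concrete, here is a minimal sketch of how a downstream agent could enforce the thresholds declared above. The `spec` dict mirrors the ChurnPrediction declaration; `check_success_metrics` is an illustrative helper, not Neam's or DataTest's actual API.

```python
# Hypothetical sketch: enforcing success_metrics thresholds from a
# problem_statement. The spec dict mirrors ChurnPrediction above.

spec = {
    "success_metrics": {
        "primary": {"metric": "auc_roc", "threshold": 0.80},
        "secondary": [
            {"metric": "f1", "threshold": 0.65},
            {"metric": "precision_at_10", "threshold": 0.75},
        ],
    }
}

def check_success_metrics(spec, observed):
    """Return (passed, failed_metrics) across primary + secondary thresholds."""
    checks = [spec["success_metrics"]["primary"],
              *spec["success_metrics"]["secondary"]]
    failures = [c["metric"] for c in checks
                if observed.get(c["metric"], float("-inf")) < c["threshold"]]
    return (not failures, failures)

# Observed metrics from the DataSims run reported later in the chapter.
passed, failures = check_success_metrics(
    spec, {"auc_roc": 0.847, "f1": 0.723, "precision_at_10": 0.82}
)
```

A model with AUC below 0.80 would fail the primary check, which is exactly the kind of test DataTest generates automatically from the declaration.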
Hypothesis Testing as a First-Class Construct #
Before training any model, the DataScientist Agent formulates and tests hypotheses about the data.
hypothesis_test ChurnDrivers {
hypotheses: [
{
id: "H1",
statement: "Customers with declining login frequency over 30 days
have higher churn rates",
test: "chi_squared",
variables: ["login_trend_30d", "churned_90d"],
significance: 0.05
},
{
id: "H2",
statement: "Support ticket volume in the last 30 days is positively
correlated with churn probability",
test: "point_biserial",
variables: ["support_tickets_30d", "churned_90d"],
significance: 0.05
},
{
id: "H3",
statement: "Days since last order is the strongest single predictor
of churn",
test: "mann_whitney_u",
variables: ["days_since_last_order", "churned_90d"],
significance: 0.01
}
]
}
Hypotheses are tested before feature engineering begins. If H1 fails (the chi-squared test cannot reject the null hypothesis of no association between login trend and churn at the 0.05 level), the agent does not waste compute building login-based features. This is EDA-driven technique selection -- the data tells the agent what to build.
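The three tests named in ChurnDrivers map directly onto SciPy. The sketch below runs them on synthetic data: the column semantics follow the declaration, but the data itself (and the way churn drives each variable) is fabricated for illustration.

```python
# Minimal sketch of the ChurnDrivers tests using SciPy on synthetic data.
import numpy as np
from scipy.stats import chi2_contingency, pointbiserialr, mannwhitneyu

rng = np.random.default_rng(42)
n = 2000
churned = rng.integers(0, 2, n)                       # churned_90d (binary)
login_trend = np.where(churned == 1,                  # login_trend_30d bucket:
                       rng.integers(0, 2, n),         # churners skew low
                       rng.integers(1, 3, n))
tickets = churned * rng.poisson(3, n) + rng.poisson(1, n)   # support_tickets_30d
days_since = churned * rng.exponential(60, n) + rng.exponential(20, n)

# H1: chi-squared on the login-trend-bucket vs churn contingency table
table = np.array([[np.sum((login_trend == b) & (churned == c))
                   for c in (0, 1)] for b in np.unique(login_trend)])
chi2, p1, _, _ = chi2_contingency(table)

# H2: point-biserial correlation between ticket volume and churn
r, p2 = pointbiserialr(churned, tickets)

# H3: Mann-Whitney U comparing days-since-last-order across the two groups
u, p3 = mannwhitneyu(days_since[churned == 1], days_since[churned == 0])

rejected_null = {"H1": p1 < 0.05, "H2": p2 < 0.05, "H3": p3 < 0.01}
```

On this synthetic data all three nulls are rejected, so all three hypothesized drivers would proceed to feature engineering.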
Volume-Aware Compute Routing #
Data science compute requirements vary by six orders of magnitude. A 10,000-row dataset needs pandas. A 10-billion-row dataset needs a distributed cluster with GPU acceleration. The volume_router declaration handles this automatically.
volume_router SmartCompute {
tiers: [
{
name: "local_pandas",
condition: { rows: "<100K", columns: "<500" },
engine: "pandas",
runtime: "local"
},
{
name: "local_duckdb",
condition: { rows: "100K-10M", columns: "<1000" },
engine: "duckdb",
runtime: "local"
},
{
name: "distributed_spark",
condition: { rows: "10M-100M" },
engine: "pyspark",
runtime: "databricks",
cluster: "ml-standard-4x"
},
{
name: "large_scale",
condition: { rows: ">100M" },
engine: "databricks_sql",
runtime: "databricks",
cluster: "ml-large-8x"
},
{
name: "gpu_accelerated",
condition: { model_type: "deep_learning" },
engine: "pytorch",
runtime: "gpu",
accelerator: "auto" // CUDA > Metal > OpenCL
}
],
fallback: "local_pandas"
}
flowchart TD
A["Dataset Detected"] --> B{"Row Count?"}
B -->|"< 100K"| C["pandas (local)"]
B -->|"100K - 10M"| D["DuckDB (local)"]
B -->|"> 10M"| E{"> 100M?"}
    E -->|"Yes"| F["Databricks SQL (large cluster)"]
E -->|"No"| G["PySpark (cluster)"]
In the SimShop experiment, the customer dataset (100K records) routed to pandas, while the events table (50M records) routed to PySpark for aggregation before joining back to the feature table.
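The routing logic itself is a cascade of size conditions. This is an illustrative re-implementation of the SmartCompute declaration in plain Python; `select_tier` is a hypothetical helper, not the Neam runtime's actual dispatcher, and the rule that `model_type` overrides row-based routing is an assumption.

```python
# Illustrative tier selection mirroring the SmartCompute declaration.
def select_tier(rows, cols, model_type=None, fallback="local_pandas"):
    if model_type == "deep_learning":
        return "gpu_accelerated"          # model_type overrides size (assumed)
    if rows < 100_000 and cols < 500:
        return "local_pandas"
    if 100_000 <= rows < 10_000_000 and cols < 1000:
        return "local_duckdb"
    if 10_000_000 <= rows < 100_000_000:
        return "distributed_spark"
    if rows >= 100_000_000:
        return "large_scale"
    return fallback                        # no tier matched (e.g. narrow rows, wide columns)

tier = select_tier(50_000_000, 30)         # an events-sized table routes to Spark
```

Note the fallback: a 90,000-row table with 600 columns matches no tier's condition, so it drops to `local_pandas` per the declaration's `fallback` field.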
Cost Implication: Volume routing is not just about performance -- it is about cost. Running pandas on a $2/hr local machine versus spinning up a Databricks cluster at $14/hr per node matters at scale. The Neam budget system (Chapter 9) enforces spend limits regardless of which tier is selected.
Feature Engineering #
The feature_engineering declaration defines reusable feature pipelines. The customer_360 pipeline is the most common pattern for customer analytics.
feature_engineering ChurnFeatures {
pipeline: "customer_360",
source_tables: [
"simshop_oltp.customers",
"simshop_oltp.orders",
"simshop_oltp.events",
"simshop_oltp.support_tickets"
],
feature_groups: [
{
name: "demographic",
features: ["tenure_days", "company_size", "industry_segment",
"plan_tier"]
},
{
name: "behavioral",
features: ["login_count_7d", "login_count_30d", "login_trend_30d",
"page_views_30d", "feature_adoption_score"],
windows: ["7d", "14d", "30d", "90d"]
},
{
name: "transactional",
features: ["order_count_30d", "spend_total_30d", "spend_trend_30d",
"days_since_last_order", "avg_order_value",
"cart_abandonment_rate"]
},
{
name: "support",
features: ["support_tickets_30d", "avg_resolution_hours",
"escalation_count", "csat_score"]
},
{
name: "rfm",
features: ["recency_score", "frequency_score", "monetary_score",
"rfm_segment"]
}
],
target: {
table: "ml_features.customer_360",
primary_key: "customer_id"
},
quality: {
null_threshold: 0.05,
correlation_check: true,
vif_threshold: 10.0
}
}
The SimShop experiment generated 47 features across the 5 feature groups, with a quality score of 0.96 (96% of features passed all quality checks including null rate, correlation, and variance inflation factor).
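The rfm group is the most mechanical of the five and makes a compact example. The sketch below computes recency/frequency/monetary scores with pandas quintile ranks; the column names mirror the declaration, but the 1-5 quantile-bin scoring and concatenated segment label are one common RFM convention, not necessarily what the pipeline does internally, and the order data is fabricated.

```python
# Sketch of the rfm feature group using pandas quintile scores.
import pandas as pd

orders = pd.DataFrame({
    "customer_id": [1, 1, 2, 3, 3, 3, 4, 5],
    "order_date": pd.to_datetime([
        "2025-09-01", "2025-09-20", "2025-06-15", "2025-09-25",
        "2025-09-27", "2025-09-29", "2025-03-10", "2025-08-01"]),
    "amount": [120.0, 80.0, 40.0, 300.0, 150.0, 90.0, 25.0, 60.0],
})

as_of = pd.Timestamp("2025-09-30")
rfm = orders.groupby("customer_id").agg(
    recency_days=("order_date", lambda d: (as_of - d.max()).days),
    frequency=("order_date", "size"),
    monetary=("amount", "sum"),
)

# Rank-based 1-5 scores (5 = best); recency is inverted: fewer days = better.
rfm["recency_score"] = pd.qcut(rfm["recency_days"].rank(method="first"),
                               5, labels=[5, 4, 3, 2, 1]).astype(int)
rfm["frequency_score"] = pd.qcut(rfm["frequency"].rank(method="first"),
                                 5, labels=[1, 2, 3, 4, 5]).astype(int)
rfm["monetary_score"] = pd.qcut(rfm["monetary"].rank(method="first"),
                                5, labels=[1, 2, 3, 4, 5]).astype(int)
rfm["rfm_segment"] = (rfm["recency_score"].astype(str)
                      + rfm["frequency_score"].astype(str)
                      + rfm["monetary_score"].astype(str))
```

Customer 3 (three recent, high-value orders) lands in segment "555", the classic best-customer cell, while the long-dormant customer 4 scores 1 on recency.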
ML Experiment Pipeline #
ml_experiment ChurnModel {
problem: ChurnPrediction,
features: ChurnFeatures,
compute: SmartCompute,
split: {
strategy: "temporal",
train_end: "2025-09-30",
test_start: "2025-10-01",
test_end: "2025-12-31"
},
algorithms: [
{ name: "LogisticRegression", baseline: true },
{ name: "RandomForest", n_estimators: [100, 500],
max_depth: [5, 10, 15] },
{ name: "XGBoost", learning_rate: [0.01, 0.05, 0.1],
n_estimators: [100, 300, 500],
max_depth: [3, 5, 7] },
{ name: "LightGBM", num_leaves: [31, 63, 127] }
],
evaluation: {
metrics: ["auc_roc", "f1", "precision_at_10", "recall",
"log_loss"],
cross_validation: { folds: 5, strategy: "stratified" },
calibration: "isotonic"
},
explainability: {
method: "shap",
global: true,
per_prediction: true,
top_k: 5
},
registry: {
backend: "mlflow",
experiment_name: "simshop_churn",
auto_log: true,
promote_threshold: { auc_roc: 0.80 }
}
}
AutoML Configuration #
For teams that prefer automated model selection, the automl_config declaration provides a search space with time and compute constraints.
automl_config ChurnAutoML {
problem: ChurnPrediction,
features: ChurnFeatures,
search: {
strategy: "bayesian", // bayesian | grid | random
max_trials: 100,
time_budget_minutes: 60,
early_stopping: { patience: 10, min_delta: 0.001 }
},
search_space: {
algorithms: ["XGBoost", "LightGBM", "CatBoost",
"RandomForest", "ExtraTrees"],
preprocessing: ["standard_scaler", "robust_scaler", "none"],
feature_selection: ["boruta", "recursive", "none"]
},
constraints: {
max_model_size_mb: 100,
max_inference_ms: 50,
interpretability: "high" // prefers simpler models
}
}
Python Code Execution #
The DataScientist Agent executes Python code in isolated virtual environments managed by the Neam VM. This is critical for reproducibility -- every experiment runs in a clean, versioned environment.
code_interpreter ChurnInterpreter {
runtime: "python",
version: "3.11",
venv: {
name: "churn_model_env",
packages: [
"pandas==2.1.0", "scikit-learn==1.3.0",
"xgboost==2.0.0", "shap==0.43.0",
"duckdb==0.9.0", "mlflow==2.8.0"
],
isolation: "full"
},
security: {
network: "restricted", // can reach MLflow, not internet
filesystem: "sandbox", // write to /tmp only
max_memory_mb: 8192,
max_cpu_seconds: 3600
}
}
Why Neam Manages Python: Data science is a Python-dominant field. Rather than replacing Python, Neam orchestrates it -- managing environments, enforcing security boundaries, tracking costs, and ensuring reproducibility. The DataScientist Agent writes Python code, but Neam governs how that code runs.
SHAP Explainability #
Every prediction comes with an explanation. The DataScientist Agent generates both global feature importance and per-prediction SHAP values.
| Feature | SHAP Importance |
|---|---|
| days_since_last_order | 0.24 |
| support_tickets_30d | 0.20 |
| login_trend_30d | 0.15 |
| spend_trend_30d | 0.12 |
| cart_abandonment_rate | 0.10 |
| tenure_days | 0.08 |
| avg_resolution_hours | 0.06 |
| rfm_segment | 0.03 |
| plan_tier | 0.02 |
The top 5 features identified by SHAP match the expert intuition: customers who have not ordered recently, who contact support frequently, whose login frequency is declining, whose spending is dropping, and who abandon carts are the most likely to churn.
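SHAP proper requires the shap package; as a lighter stand-in for the same idea (ranking features by their contribution to predictions), this sketch uses scikit-learn's permutation importance on held-out data. The feature names echo the table above, but the data, the fitted model, and the technique swap are all illustrative.

```python
# Feature attribution via permutation importance (a lighter stand-in
# for SHAP) on synthetic churn-like data.
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(7)
n = 3000
days_since = rng.exponential(30, n)
tickets = rng.poisson(2, n)
noise = rng.normal(size=n)
# Churn driven mostly by recency, secondarily by ticket volume.
y = ((0.04 * days_since + 0.3 * tickets + rng.normal(0, 1, n)) > 2).astype(int)
X = np.column_stack([days_since, tickets, noise])
names = ["days_since_last_order", "support_tickets_30d", "random_noise"]

# Importance is measured on a held-out split so the noise feature
# cannot earn credit through overfitting.
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)
model = RandomForestClassifier(n_estimators=100, random_state=0).fit(X_tr, y_tr)
result = permutation_importance(model, X_te, y_te, n_repeats=10, random_state=0)
ranking = [names[i] for i in np.argsort(result.importances_mean)[::-1]]
```

Recency dominates and the pure-noise column ranks last, mirroring the qualitative ordering in the SHAP table above.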
Industry Perspective #
The DataScientist Agent sits at the intersection of two industry trends. First, the rise of AutoML platforms (DataRobot, H2O.ai, Google AutoML) that automate model selection. Second, the growing demand for explainable AI (EU AI Act, NIST AI RMF) that requires models to justify their predictions.
Most AutoML platforms handle one or the other. They either automate model training without adequate explainability, or they provide explainability tools that require manual integration. The DataScientist Agent handles both in a single specification: automl_config for automated search, explainability for SHAP/LIME, and problem_statement.constraints.fairness for bias detection.
According to McKinsey (2024), organizations that combine AutoML with structured experiment management see 3x faster time-to-production for ML models. The DataScientist Agent encodes this combination as a language primitive.
Evidence: DataSims Experimental Proof #
Experiment: Full System -- DataScientist Agent Results #
Setup: The complete SimShop churn prediction workflow was run 5 times with all agents active.
DataScientist Agent Outputs:
| Metric | Result | Threshold | Status |
|---|---|---|---|
| AUC-ROC | 0.847 | >= 0.80 | PASS |
| F1 Score | 0.723 | >= 0.65 | PASS |
| Precision@10 | 0.82 | >= 0.75 | PASS |
| Features Created | 47 | -- | -- |
| Feature Quality | 0.96 | >= 0.90 | PASS |
| Algorithm Selected | XGBoost | -- | -- |
Feature Engineering Breakdown:
| Feature Group | Count | Quality |
|---|---|---|
| Demographic | 4 | 1.00 |
| Behavioral (4 windows) | 20 | 0.95 |
| Transactional | 10 | 0.97 |
| Support | 8 | 0.94 |
| RFM | 5 | 0.98 |
| Total | 47 | 0.96 |
Top 5 Predictive Features (SHAP):
1. `days_since_last_order` -- purchase recency is the strongest churn signal
2. `support_tickets_30d` -- support load indicates frustration
3. `login_trend_30d` -- declining engagement precedes churn
4. `spend_trend_30d` -- spending decline signals disengagement
5. `cart_abandonment_rate` -- intent without purchase indicates friction
Comparison: Ablation A5 (No Agent.MD):
When the DataScientist Agent's agent.md knowledge layer was removed (ablation no_agentmd), AUC dropped from 0.847 to 0.782 -- a 7.7% decrease. The agent still trained a model, but without domain-specific guidance in its knowledge file, it made suboptimal feature engineering choices.
Reproducibility: 5/5 runs succeeded with identical results (std=0.0 across all metrics). Full data at github.com/neam-lang/Data-Sims in evaluation/results/full_system.json.
Key Takeaways #
- The `problem_statement` declaration frames business problems as ML-solvable tasks with measurable success criteria that downstream agents enforce
- Hypothesis testing as a first-class construct prevents wasted compute on features that do not correlate with the target variable
- Volume-aware compute routing automatically selects the right engine (pandas/DuckDB/PySpark/Databricks/GPU) based on data size, saving cost and time
- Feature engineering pipelines produce 47 features across 5 groups (demographic, behavioral, transactional, support, RFM) with 96% quality score
- ML experiments manage the full lifecycle: splitting, training, evaluation, calibration, explainability, and model registry
- AutoML provides automated search with constraints on model size, inference latency, and interpretability
- Python code execution runs in isolated, versioned virtual environments managed by the Neam VM
- SHAP explainability is built in -- every prediction comes with per-customer feature attributions
- DataSims proves: AUC=0.847, F1=0.723, 47 features at quality 0.96, with 100% reproducibility across 5 runs