Chapter 17 -- The DataScientist Agent: From EDA to AutoML #
"All models are wrong, but some are useful." -- George E. P. Box
30 min read | Marcus (DS), Sarah (MLOps), Priya (DE), Dr. Chen (Researcher) | Part V: Analytical Intelligence
What you'll learn:
- How the `problem_statement` declaration frames business problems as ML-solvable tasks
- Hypothesis testing as a first-class language construct
- EDA-driven technique selection that adapts to data characteristics
- Volume-aware compute routing: from pandas to GPU clusters in a single declaration
- Feature engineering pipelines with customer_360, behavioral, and RFM patterns
- ML experiment management with AutoML, SHAP explainability, and model registry
- Python virtual environment code execution within the Neam VM
- DataSims proof: AUC=0.847, F1=0.723, 47 features with quality score 0.96
The Problem: The Notebook Graveyard #
Marcus has 347 Jupyter notebooks in his team's shared drive. Forty-two of them contain models that were once considered "ready for production." Three of those models actually made it to production. The rest exist in a state of suspended animation -- too valuable to delete, too disconnected from any deployment pipeline to be useful.
This is the notebook graveyard, and every data science team has one. The notebooks contain brilliant work: careful EDA, clever feature engineering, hyperparameter sweeps that took days of GPU time. But none of that work is reproducible without Marcus personally explaining which cells to run in which order, which environment variables to set, which version of scikit-learn to install, and which of the three final_model_v2_ACTUALLY_FINAL.pkl files is the real one.
The DataScientist Agent solves this by encoding the complete data science workflow -- from problem framing through model registry -- in structured, declarative Neam specifications that are reproducible, auditable, and directly consumable by downstream agents (DataTest for validation, MLOps for deployment).
Problem Framing: The problem_statement Declaration #
Every data science project begins with a question. The problem_statement declaration forces that question to be precise, measurable, and bounded.
problem_statement ChurnPrediction {
business_context: "SimShop enterprise churn has increased from 8% to 14%",
objective: "Predict which customers will churn within 90 days",
problem_type: "binary_classification",
target_variable: "churned_90d",
success_metrics: {
primary: { metric: "auc_roc", threshold: 0.80 },
secondary: [
{ metric: "f1", threshold: 0.65 },
{ metric: "precision_at_10", threshold: 0.75 }
]
},
constraints: {
latency_ms: 200,
fairness: { protected_attributes: ["gender", "age_group"],
max_disparity: 0.1 },
explainability: "shap_per_prediction"
},
timeline: "8 weeks to MVP"
}
Key Insight: The `problem_statement` is not documentation -- it is a contract. The DataTest Agent (Chapter 19) reads `success_metrics.primary.threshold` and automatically generates tests that fail if AUC drops below 0.80. The MLOps Agent (Chapter 20) reads `constraints.latency_ms` and configures serving infrastructure accordingly.
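To make the "contract" idea concrete, here is a minimal sketch of how a downstream agent could enforce the thresholds declared above. The `spec` dict mirrors the ChurnPrediction declaration; `check_success_metrics` is an illustrative helper, not Neam's or DataTest's actual API.

```python
# Hypothetical sketch: enforcing success_metrics thresholds from a
# problem_statement. The spec dict mirrors ChurnPrediction above.

spec = {
    "success_metrics": {
        "primary": {"metric": "auc_roc", "threshold": 0.80},
        "secondary": [
            {"metric": "f1", "threshold": 0.65},
            {"metric": "precision_at_10", "threshold": 0.75},
        ],
    }
}

def check_success_metrics(spec, observed):
    """Return (passed, failed_metrics) across primary + secondary thresholds."""
    checks = [spec["success_metrics"]["primary"],
              *spec["success_metrics"]["secondary"]]
    failures = [c["metric"] for c in checks
                if observed.get(c["metric"], float("-inf")) < c["threshold"]]
    return (not failures, failures)

# Observed metrics from the DataSims run reported later in the chapter.
passed, failures = check_success_metrics(
    spec, {"auc_roc": 0.847, "f1": 0.723, "precision_at_10": 0.82}
)
```

A model with AUC below 0.80 would fail the primary check, which is exactly the kind of test DataTest generates automatically from the declaration.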
Hypothesis Testing as a First-Class Construct #
Before training any model, the DataScientist Agent formulates and tests hypotheses about the data.
hypothesis_test ChurnDrivers {
hypotheses: [
{
id: "H1",
statement: "Customers with declining login frequency over 30 days
have higher churn rates",
test: "chi_squared",
variables: ["login_trend_30d", "churned_90d"],
significance: 0.05
},
{
id: "H2",
statement: "Support ticket volume in the last 30 days is positively
correlated with churn probability",
test: "point_biserial",
variables: ["support_tickets_30d", "churned_90d"],
significance: 0.05
},
{
id: "H3",
statement: "Days since last order is the strongest single predictor
of churn",
test: "mann_whitney_u",
variables: ["days_since_last_order", "churned_90d"],
significance: 0.01
}
]
}
Hypotheses are tested before feature engineering begins. If H1 fails (the chi-squared test cannot reject the null hypothesis of no association between login trend and churn at the 0.05 level), the agent does not waste compute building login-based features. This is EDA-driven technique selection -- the data tells the agent what to build.
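The three tests named in ChurnDrivers map directly onto SciPy. The sketch below runs them on synthetic data: the column semantics follow the declaration, but the data itself (and the way churn drives each variable) is fabricated for illustration.

```python
# Minimal sketch of the ChurnDrivers tests using SciPy on synthetic data.
import numpy as np
from scipy.stats import chi2_contingency, pointbiserialr, mannwhitneyu

rng = np.random.default_rng(42)
n = 2000
churned = rng.integers(0, 2, n)                       # churned_90d (binary)
login_trend = np.where(churned == 1,                  # login_trend_30d bucket:
                       rng.integers(0, 2, n),         # churners skew low
                       rng.integers(1, 3, n))
tickets = churned * rng.poisson(3, n) + rng.poisson(1, n)   # support_tickets_30d
days_since = churned * rng.exponential(60, n) + rng.exponential(20, n)

# H1: chi-squared on the login-trend-bucket vs churn contingency table
table = np.array([[np.sum((login_trend == b) & (churned == c))
                   for c in (0, 1)] for b in np.unique(login_trend)])
chi2, p1, _, _ = chi2_contingency(table)

# H2: point-biserial correlation between ticket volume and churn
r, p2 = pointbiserialr(churned, tickets)

# H3: Mann-Whitney U comparing days-since-last-order across the two groups
u, p3 = mannwhitneyu(days_since[churned == 1], days_since[churned == 0])

rejected_null = {"H1": p1 < 0.05, "H2": p2 < 0.05, "H3": p3 < 0.01}
```

On this synthetic data all three nulls are rejected, so all three hypothesized drivers would proceed to feature engineering.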
Volume-Aware Compute Routing #
Data science compute requirements vary by six orders of magnitude. A 10,000-row dataset needs pandas. A 10-billion-row dataset needs a distributed cluster with GPU acceleration. The volume_router declaration handles this automatically.
volume_router SmartCompute {
tiers: [
{
name: "local_pandas",
condition: { rows: "<100K", columns: "<500" },
engine: "pandas",
runtime: "local"
},
{
name: "local_duckdb",
condition: { rows: "100K-10M", columns: "<1000" },
engine: "duckdb",
runtime: "local"
},
{
name: "distributed_spark",
condition: { rows: "10M-100M" },
engine: "pyspark",
runtime: "databricks",
cluster: "ml-standard-4x"
},
{
name: "large_scale",
condition: { rows: ">100M" },
engine: "databricks_sql",
runtime: "databricks",
cluster: "ml-large-8x"
},
{
name: "gpu_accelerated",
condition: { model_type: "deep_learning" },
engine: "pytorch",
runtime: "gpu",
accelerator: "auto" // CUDA > Metal > OpenCL
}
],
fallback: "local_pandas"
}
flowchart TD
A["Dataset Detected"] --> B{"Row Count?"}
B -->|"< 100K"| C["pandas (local)"]
B -->|"100K - 10M"| D["DuckDB (local)"]
B -->|"> 10M"| E{"> 100M?"}
    E -->|"Yes"| F["Databricks SQL (large cluster)"]
E -->|"No"| G["PySpark (cluster)"]
In the SimShop experiment, the customer dataset (100K records) routed to pandas, while the events table (50M records) routed to PySpark for aggregation before joining back to the feature table.
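The routing logic itself is a cascade of size conditions. This is an illustrative re-implementation of the SmartCompute declaration in plain Python; `select_tier` is a hypothetical helper, not the Neam runtime's actual dispatcher, and the rule that `model_type` overrides row-based routing is an assumption.

```python
# Illustrative tier selection mirroring the SmartCompute declaration.
def select_tier(rows, cols, model_type=None, fallback="local_pandas"):
    if model_type == "deep_learning":
        return "gpu_accelerated"          # model_type overrides size (assumed)
    if rows < 100_000 and cols < 500:
        return "local_pandas"
    if 100_000 <= rows < 10_000_000 and cols < 1000:
        return "local_duckdb"
    if 10_000_000 <= rows < 100_000_000:
        return "distributed_spark"
    if rows >= 100_000_000:
        return "large_scale"
    return fallback                        # no tier matched (e.g. narrow rows, wide columns)

tier = select_tier(50_000_000, 30)         # an events-sized table routes to Spark
```

Note the fallback: a 90,000-row table with 600 columns matches no tier's condition, so it drops to `local_pandas` per the declaration's `fallback` field.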
Cost Implication: Volume routing is not just about performance -- it is about cost. Running pandas on a $2/hr local machine versus spinning up a Databricks cluster at $14/hr per node matters at scale. The Neam budget system (Chapter 9) enforces spend limits regardless of which tier is selected.
Feature Engineering #
The feature_engineering declaration defines reusable feature pipelines. The customer_360 pipeline is the most common pattern for customer analytics.
feature_engineering ChurnFeatures {
pipeline: "customer_360",
source_tables: [
"simshop_oltp.customers",
"simshop_oltp.orders",
"simshop_oltp.events",
"simshop_oltp.support_tickets"
],
feature_groups: [
{
name: "demographic",
features: ["tenure_days", "company_size", "industry_segment",
"plan_tier"]
},
{
name: "behavioral",
features: ["login_count_7d", "login_count_30d", "login_trend_30d",
"page_views_30d", "feature_adoption_score"],
windows: ["7d", "14d", "30d", "90d"]
},
{
name: "transactional",
features: ["order_count_30d", "spend_total_30d", "spend_trend_30d",
"days_since_last_order", "avg_order_value",
"cart_abandonment_rate"]
},
{
name: "support",
features: ["support_tickets_30d", "avg_resolution_hours",
"escalation_count", "csat_score"]
},
{
name: "rfm",
features: ["recency_score", "frequency_score", "monetary_score",
"rfm_segment"]
}
],
target: {
table: "ml_features.customer_360",
primary_key: "customer_id"
},
quality: {
null_threshold: 0.05,
correlation_check: true,
vif_threshold: 10.0
}
}
The SimShop experiment generated 47 features across the 5 feature groups, with a quality score of 0.96 (96% of features passed all quality checks including null rate, correlation, and variance inflation factor).
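The rfm group is the most mechanical of the five and makes a compact example. The sketch below computes recency/frequency/monetary scores with pandas quintile ranks; the column names mirror the declaration, but the 1-5 quantile-bin scoring and concatenated segment label are one common RFM convention, not necessarily what the pipeline does internally, and the order data is fabricated.

```python
# Sketch of the rfm feature group using pandas quintile scores.
import pandas as pd

orders = pd.DataFrame({
    "customer_id": [1, 1, 2, 3, 3, 3, 4, 5],
    "order_date": pd.to_datetime([
        "2025-09-01", "2025-09-20", "2025-06-15", "2025-09-25",
        "2025-09-27", "2025-09-29", "2025-03-10", "2025-08-01"]),
    "amount": [120.0, 80.0, 40.0, 300.0, 150.0, 90.0, 25.0, 60.0],
})

as_of = pd.Timestamp("2025-09-30")
rfm = orders.groupby("customer_id").agg(
    recency_days=("order_date", lambda d: (as_of - d.max()).days),
    frequency=("order_date", "size"),
    monetary=("amount", "sum"),
)

# Rank-based 1-5 scores (5 = best); recency is inverted: fewer days = better.
rfm["recency_score"] = pd.qcut(rfm["recency_days"].rank(method="first"),
                               5, labels=[5, 4, 3, 2, 1]).astype(int)
rfm["frequency_score"] = pd.qcut(rfm["frequency"].rank(method="first"),
                                 5, labels=[1, 2, 3, 4, 5]).astype(int)
rfm["monetary_score"] = pd.qcut(rfm["monetary"].rank(method="first"),
                                5, labels=[1, 2, 3, 4, 5]).astype(int)
rfm["rfm_segment"] = (rfm["recency_score"].astype(str)
                      + rfm["frequency_score"].astype(str)
                      + rfm["monetary_score"].astype(str))
```

Customer 3 (three recent, high-value orders) lands in segment "555", the classic best-customer cell, while the long-dormant customer 4 scores 1 on recency.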
ML Experiment Pipeline #
ml_experiment ChurnModel {
problem: ChurnPrediction,
features: ChurnFeatures,
compute: SmartCompute,
split: {
strategy: "temporal",
train_end: "2025-09-30",
test_start: "2025-10-01",
test_end: "2025-12-31"
},
algorithms: [
{ name: "LogisticRegression", baseline: true },
{ name: "RandomForest", n_estimators: [100, 500],
max_depth: [5, 10, 15] },
{ name: "XGBoost", learning_rate: [0.01, 0.05, 0.1],
n_estimators: [100, 300, 500],
max_depth: [3, 5, 7] },
{ name: "LightGBM", num_leaves: [31, 63, 127] }
],
evaluation: {
metrics: ["auc_roc", "f1", "precision_at_10", "recall",
"log_loss"],
cross_validation: { folds: 5, strategy: "stratified" },
calibration: "isotonic"
},
explainability: {
method: "shap",
global: true,
per_prediction: true,
top_k: 5
},
registry: {
backend: "mlflow",
experiment_name: "simshop_churn",
auto_log: true,
promote_threshold: { auc_roc: 0.80 }
}
}
AutoML Configuration #
For teams that prefer automated model selection, the automl_config declaration provides a search space with time and compute constraints.
automl_config ChurnAutoML {
problem: ChurnPrediction,
features: ChurnFeatures,
search: {
strategy: "bayesian", // bayesian | grid | random
max_trials: 100,
time_budget_minutes: 60,
early_stopping: { patience: 10, min_delta: 0.001 }
},
search_space: {
algorithms: ["XGBoost", "LightGBM", "CatBoost",
"RandomForest", "ExtraTrees"],
preprocessing: ["standard_scaler", "robust_scaler", "none"],
feature_selection: ["boruta", "recursive", "none"]
},
constraints: {
max_model_size_mb: 100,
max_inference_ms: 50,
interpretability: "high" // prefers simpler models
}
}
Python Code Execution #
The DataScientist Agent executes Python code in isolated virtual environments managed by the Neam VM. This is critical for reproducibility -- every experiment runs in a clean, versioned environment.
code_interpreter ChurnInterpreter {
runtime: "python",
version: "3.11",
venv: {
name: "churn_model_env",
packages: [
"pandas==2.1.0", "scikit-learn==1.3.0",
"xgboost==2.0.0", "shap==0.43.0",
"duckdb==0.9.0", "mlflow==2.8.0"
],
isolation: "full"
},
security: {
network: "restricted", // can reach MLflow, not internet
filesystem: "sandbox", // write to /tmp only
max_memory_mb: 8192,
max_cpu_seconds: 3600
}
}
Why Neam Manages Python: Data science is a Python-dominant field. Rather than replacing Python, Neam orchestrates it -- managing environments, enforcing security boundaries, tracking costs, and ensuring reproducibility. The DataScientist Agent writes Python code, but Neam governs how that code runs.
SHAP Explainability #
Every prediction comes with an explanation. The DataScientist Agent generates both global feature importance and per-prediction SHAP values.
| Feature | SHAP Importance |
|---|---|
| days_since_last_order | 0.24 |
| support_tickets_30d | 0.20 |
| login_trend_30d | 0.15 |
| spend_trend_30d | 0.12 |
| cart_abandonment_rate | 0.10 |
| tenure_days | 0.08 |
| avg_resolution_hours | 0.06 |
| rfm_segment | 0.03 |
| plan_tier | 0.02 |
The top 5 features identified by SHAP match the expert intuition: customers who have not ordered recently, who contact support frequently, whose login frequency is declining, whose spending is dropping, and who abandon carts are the most likely to churn.
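SHAP proper requires the shap package; as a lighter stand-in for the same idea (ranking features by their contribution to predictions), this sketch uses scikit-learn's permutation importance on held-out data. The feature names echo the table above, but the data, the fitted model, and the technique swap are all illustrative.

```python
# Feature attribution via permutation importance (a lighter stand-in
# for SHAP) on synthetic churn-like data.
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(7)
n = 3000
days_since = rng.exponential(30, n)
tickets = rng.poisson(2, n)
noise = rng.normal(size=n)
# Churn driven mostly by recency, secondarily by ticket volume.
y = ((0.04 * days_since + 0.3 * tickets + rng.normal(0, 1, n)) > 2).astype(int)
X = np.column_stack([days_since, tickets, noise])
names = ["days_since_last_order", "support_tickets_30d", "random_noise"]

# Importance is measured on a held-out split so the noise feature
# cannot earn credit through overfitting.
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)
model = RandomForestClassifier(n_estimators=100, random_state=0).fit(X_tr, y_tr)
result = permutation_importance(model, X_te, y_te, n_repeats=10, random_state=0)
ranking = [names[i] for i in np.argsort(result.importances_mean)[::-1]]
```

Recency dominates and the pure-noise column ranks last, mirroring the qualitative ordering in the SHAP table above.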
Industry Perspective #
The DataScientist Agent sits at the intersection of two industry trends. First, the rise of AutoML platforms (DataRobot, H2O.ai, Google AutoML) that automate model selection. Second, the growing demand for explainable AI (EU AI Act, NIST AI RMF) that requires models to justify their predictions.
Most AutoML platforms handle one or the other. They either automate model training without adequate explainability, or they provide explainability tools that require manual integration. The DataScientist Agent handles both in a single specification: automl_config for automated search, explainability for SHAP/LIME, and problem_statement.constraints.fairness for bias detection.
According to McKinsey (2024), organizations that combine AutoML with structured experiment management see 3x faster time-to-production for ML models. The DataScientist Agent encodes this combination as a language primitive.
Evidence: DataSims Experimental Proof #
Experiment: Full System -- DataScientist Agent Results #
Setup: The complete SimShop churn prediction workflow was run 5 times with all agents active.
DataScientist Agent Outputs:
| Metric | Result | Threshold | Status |
|---|---|---|---|
| AUC-ROC | 0.847 | >= 0.80 | PASS |
| F1 Score | 0.723 | >= 0.65 | PASS |
| Precision@10 | 0.82 | >= 0.75 | PASS |
| Features Created | 47 | -- | -- |
| Feature Quality | 0.96 | >= 0.90 | PASS |
| Algorithm Selected | XGBoost | -- | -- |
Feature Engineering Breakdown:
| Feature Group | Count | Quality |
|---|---|---|
| Demographic | 4 | 1.00 |
| Behavioral (4 windows) | 20 | 0.95 |
| Transactional | 10 | 0.97 |
| Support | 8 | 0.94 |
| RFM | 5 | 0.98 |
| Total | 47 | 0.96 |
Top 5 Predictive Features (SHAP):
1. `days_since_last_order` -- purchase recency is the strongest churn signal
2. `support_tickets_30d` -- support load indicates frustration
3. `login_trend_30d` -- declining engagement precedes churn
4. `spend_trend_30d` -- spending decline signals disengagement
5. `cart_abandonment_rate` -- intent without purchase indicates friction
Comparison: Ablation A5 (No Agent.MD):
When the DataScientist Agent's agent.md knowledge layer was removed (ablation no_agentmd), AUC dropped from 0.847 to 0.782 -- a 7.7% decrease. The agent still trained a model, but without domain-specific guidance in its knowledge file, it made suboptimal feature engineering choices.
Reproducibility: 5/5 runs succeeded with identical results (std=0.0 across all metrics). Full data at github.com/neam-lang/Data-Sims in evaluation/results/full_system.json.
Key Takeaways #
- The `problem_statement` declaration frames business problems as ML-solvable tasks with measurable success criteria that downstream agents enforce
- Hypothesis testing as a first-class construct prevents wasted compute on features that do not correlate with the target variable
- Volume-aware compute routing automatically selects the right engine (pandas/DuckDB/PySpark/Databricks/GPU) based on data size, saving cost and time
- Feature engineering pipelines produce 47 features across 5 groups (demographic, behavioral, transactional, support, RFM) with 96% quality score
- ML experiments manage the full lifecycle: splitting, training, evaluation, calibration, explainability, and model registry
- AutoML provides automated search with constraints on model size, inference latency, and interpretability
- Python code execution runs in isolated, versioned virtual environments managed by the Neam VM
- SHAP explainability is built in -- every prediction comes with per-customer feature attributions
- DataSims proves: AUC=0.847, F1=0.723, 47 features at quality 0.96, with 100% reproducibility across 5 runs