Chapter 2: Three Paradigms -- Vibe Coding, Agentic Coding, and Spec-Driven Development #

"The purpose of abstraction is not to be vague, but to create a new semantic level in which one can be absolutely precise." -- Edsger W. Dijkstra


25 min read | All personas | Part I: The Problem

What you'll learn:

  • The defining workflow of each paradigm -- vibe coding, agentic coding, and spec-driven development
  • Why "produces output" and "produces correct output" are different standards
  • The control-speed-quality trilemma and how each paradigm trades it off
  • The difference between a prompt and a specification
  • How to match paradigm to project stakes


The Problem #

It is 2026, and everyone has an opinion about how AI should be used to build software. The conversation has split into camps. Some say: "Just describe what you want and let the AI write the code." Others say: "Give the AI agency -- let it plan, execute, and iterate." A smaller group says: "Encode your expertise first, then let agents execute within boundaries."

These are not just philosophical preferences. They produce fundamentally different outcomes in reliability, governance, and maintainability. For data engineering and ML -- where a silent bug in a feature pipeline can cost millions in bad decisions -- the choice of paradigm is the most consequential architectural decision an organization can make.

This chapter examines all three paradigms honestly: their strengths, their limitations, and the evidence for where each one works and where it breaks.

Paradigm 1: Vibe Coding #

What It Is #

Vibe coding -- a term popularized in early 2025 to describe prompt-driven code generation -- is the most accessible entry point to AI-assisted development. The workflow is simple: describe what you want in natural language, receive generated code, run it. If it does not work, refine the prompt and try again.

For a churn prediction project, vibe coding looks like this:

Prompt
PROMPT:
"Build me a customer churn prediction model. Use the orders table
and customer table from our PostgreSQL database. Train a gradient
boosted model and deploy it as a REST API."

The AI generates a Python script -- perhaps 200 lines -- that connects to the database, queries the tables, engineers some features, trains an XGBoost model, and wraps it in a Flask API. It might even work on the first try.

A Vibe Coding Example #

Here is what a vibe-coded churn pipeline might produce:

Python
# Generated by AI from prompt -- vibe coding approach
import pandas as pd
import xgboost as xgb
from flask import Flask, jsonify, request
from sqlalchemy import create_engine

engine = create_engine("postgresql://user:pass@localhost/simshop")

# Load data
orders = pd.read_sql("SELECT * FROM orders", engine)
customers = pd.read_sql("SELECT * FROM customers", engine)

# Feature engineering (AI-generated, no business context)
features = orders.groupby('customer_id').agg(
    total_orders=('order_id', 'count'),
    total_spend=('amount', 'sum'),
    last_order=('order_date', 'max'),
    avg_order_value=('amount', 'mean')
).reset_index()

merged = customers.merge(features, on='customer_id', how='left')

# Churn label -- simple recency-based definition
merged['is_churned'] = (
    pd.Timestamp.now() - pd.to_datetime(merged['last_order'])
).dt.days > 90

# Train model
X = merged[['total_orders', 'total_spend', 'avg_order_value']].fillna(0)
y = merged['is_churned'].astype(int)

model = xgb.XGBClassifier(n_estimators=100)
model.fit(X, y)

# Deploy as API
app = Flask(__name__)

@app.route('/predict', methods=['POST'])
def predict():
    data = request.json
    pred = model.predict_proba([list(data.values())])[0][1]
    return jsonify({'churn_probability': float(pred)})

app.run(host='0.0.0.0', port=8080)

This code runs. It produces predictions. And it has at least seven critical problems:

Vibe-Coded Pipeline — Hidden Defects
  • Churn Definition — No login activity check, no seasonal exclusion. Mislabels ~12% of customers.
  • Data Quality — No NULL handling for customer_id. fillna(0) masks missing data silently.
  • No Train/Test Split — Model evaluated on training data. Reported accuracy is meaningless.
  • No Support Data — Business requirement for support ticket sentiment is completely absent.
  • No Explainability — GDPR Article 22 requires explanation of automated decisions. Not implemented.
  • Hardcoded Credentials — Database password in source code. Security vulnerability.
  • No Monitoring — No drift detection, no alerting, no rollback. Silent degradation.
Anti-Pattern

- The "It Works" Fallacy. Vibe-coded pipelines produce output that looks correct. A model that returns probabilities between 0 and 1 feels like it is working. But "produces output" and "produces correct output" are entirely different standards. In data work, the bugs that matter most are the ones that produce plausible-looking wrong answers.
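Three of these defects (hardcoded credentials, silent NULL masking, evaluation on training data) are cheap to fix once named. A minimal sketch, assuming the same tables as the script above; `CHURN_DB_URL` and the helper names are illustrative, and the model is passed in so the sketch stays library-agnostic:

```python
import os
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.metrics import roc_auc_score

def load_engine():
    # Credentials come from the environment, never from source code.
    # CHURN_DB_URL is a hypothetical variable name.
    from sqlalchemy import create_engine
    return create_engine(os.environ["CHURN_DB_URL"])

def prepare(merged: pd.DataFrame) -> pd.DataFrame:
    # Exclude rows with NULL keys instead of silently imputing them
    return merged.dropna(subset=["customer_id"])

def evaluate(model, X: pd.DataFrame, y: pd.Series) -> float:
    # Held-out, stratified evaluation: the reported number
    # is no longer just training accuracy
    X_tr, X_te, y_tr, y_te = train_test_split(
        X, y, test_size=0.2, random_state=42, stratify=y
    )
    model.fit(X_tr, y_tr)
    return roc_auc_score(y_te, model.predict_proba(X_te)[:, 1])
```

None of this requires a different paradigm -- but nothing in the vibe-coding workflow ever asks for it either.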

Where Vibe Coding Works #

To be fair, vibe coding has legitimate use cases:

  • Prototyping and throwaway scripts, where errors are cheap
  • Ad-hoc, one-off analysis that will never reach production
  • Learning and exploring unfamiliar APIs, libraries, or datasets

The problem is not vibe coding itself. The problem is using vibe coding for production systems where reliability, governance, and traceability matter.

Paradigm 2: Agentic Coding #

What It Is #

Agentic coding emerged in 2024-2025 with systems like Devin, SWE-agent, and OpenHands. Instead of generating code from a single prompt, an agentic system creates a plan, executes steps using tools (file editing, terminal commands, web browsing), observes results, and iterates until tests pass.

For the same churn prediction task, an agentic coder would:

  1. Read the project description and form a plan
  2. Explore the database schema
  3. Write the feature engineering pipeline
  4. Write the model training code
  5. Run the code, observe errors, fix them
  6. Write tests, run tests, fix failures
  7. Iterate until the test suite passes
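The loop behind those steps can be sketched in a few lines. This is a hedged abstraction, not any specific framework's API: `plan`, `execute`, and `is_done` stand in for real planner and tool calls.

```python
from dataclasses import dataclass, field
from typing import Callable, List

@dataclass
class AgentLoop:
    # Hypothetical stand-ins for a real planner, executor, and stop check
    plan: Callable[[str], List[str]]
    execute: Callable[[str], str]      # runs one step, returns an observation
    is_done: Callable[[str], bool]     # e.g. "the test suite passes"
    max_iters: int = 10
    trace: List[str] = field(default_factory=list)

    def run(self, task: str) -> bool:
        steps = self.plan(task)
        for _ in range(self.max_iters):
            for step in steps:
                obs = self.execute(step)
                self.trace.append(f"{step} -> {obs}")
                if self.is_done(obs):
                    return True
            # Replan from the latest observation and try again
            steps = self.plan(task + " | last observation: " + self.trace[-1])
        return False
```

The key property is the feedback edge: observations flow back into planning, which is exactly what single-prompt vibe coding lacks.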

An Agentic Coding Example #

An agentic system might produce better code than vibe coding -- it can self-correct:

Python
# Agent-generated churn pipeline -- agentic coding approach
# Agent explored the database schema first and discovered
# the support_tickets and login_events tables

import os
import pandas as pd
import xgboost as xgb
from sklearn.model_selection import train_test_split
from sklearn.metrics import roc_auc_score
from sqlalchemy import create_engine
import mlflow

# Agent read credentials from the environment rather than hardcoding them
# (environment variable name illustrative)
engine = create_engine(os.environ["SIMSHOP_DB_URL"])

# Agent found these tables by exploring the schema
orders = pd.read_sql("SELECT * FROM sales.orders", engine)
customers = pd.read_sql("SELECT * FROM crm.customers", engine)
logins = pd.read_sql("SELECT * FROM analytics.login_events", engine)
tickets = pd.read_sql("SELECT * FROM support.tickets", engine)

# Better feature engineering (agent iterated after first attempt;
# build_features is defined earlier in the agent's module)
features = build_features(orders, customers, logins, tickets)

# Churn definition -- agent found "90 days no purchase AND no login"
# by reading a requirements doc it found in the repo
features['is_churned'] = (
    (features['days_since_last_order'] > 90) &
    (features['days_since_last_login'] > 90)
)

# Proper, stratified train/test split (agent added after test failure)
X_train, X_test, y_train, y_test = train_test_split(
    features.drop('is_churned', axis=1),
    features['is_churned'],
    test_size=0.2, random_state=42,
    stratify=features['is_churned']
)

# Model training with MLflow tracking
with mlflow.start_run():
    model = xgb.XGBClassifier(n_estimators=200, max_depth=6)
    model.fit(X_train, y_train)
    auc = roc_auc_score(y_test, model.predict_proba(X_test)[:, 1])
    mlflow.log_metric("auc", auc)
    mlflow.sklearn.log_model(model, "churn_model")

This is genuinely better. The agent discovered the login and support tables. It implemented a proper train/test split after its initial tests failed. It added MLflow tracking. The code is more robust.

What Agentic Coding Still Misses #

But agentic coding has structural limitations for enterprise data work:

Agentic Coding — Structural Gaps
  • No Formal Specs — The agent inferred requirements by reading docs it found. If the docs are wrong or absent, the agent builds the wrong thing.
  • No Quality Gates — The agent validates against its own tests. There is no independent critic. Self-assessment is unreliable.
  • No Governance — No data classification, no access control audit, no GDPR compliance check. The agent does not know about regulatory requirements.
  • No Accountability — If the model fails in production, there is no RACI trace. Who decided the churn definition? The agent. Who approved the model? The agent. Who tested it? The agent.
  • No Domain Knowledge — The agent starts fresh every time. It does not know that SimShop's CRM data has NULL customer_ids, or that November-December orders are seasonal.
  • No Lifecycle Awareness — The agent built and trained a model. It did not set up drift monitoring, canary deployment, or retraining triggers. It solved a task, not a lifecycle.
Insight

- The Self-Validation Problem. When the same entity that builds the system also tests the system, the tests are biased toward what was built rather than what was specified. This is why every engineering discipline separates construction from inspection. Agentic coding collapses this separation.
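Separating construction from inspection can be made mechanical: the critic checks the spec's bar, never the builder's own test results. A minimal sketch, assuming a spec reduced to one metric and threshold (the field names are illustrative):

```python
from dataclasses import dataclass
from typing import Dict

@dataclass
class Spec:
    # What the system must achieve, stated independently of any implementation
    metric: str
    threshold: float

def independent_review(spec: Spec, measured: Dict[str, float]) -> bool:
    """A critic that validates against the spec, not the builder's self-tests.
    A missing metric counts as a failure, never a pass."""
    return measured.get(spec.metric, float("-inf")) >= spec.threshold

# The builder reports its own tests all pass; the critic is unmoved,
# because the spec's bar (AUC >= 0.80) is what matters
builder_claim = {"auc_roc": 0.74, "self_test_pass_rate": 1.0}
approved = independent_review(Spec("auc_roc", 0.80), builder_claim)
```

The builder's `self_test_pass_rate` of 1.0 is simply ignored: self-assessment is not evidence against an independent criterion.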

Paradigm 3: Spec-Driven Development #

What It Is #

Spec-driven development, as implemented in Neam, introduces a fundamentally different workflow:

  1. Humans encode expertise in structured specifications -- business requirements, acceptance criteria, domain knowledge (Agent.MD), quality gates, and governance policies
  2. Specialist agents execute within the boundaries defined by those specifications
  3. Independent agents validate results against the original specifications
  4. Full traceability connects every artifact to the business need that motivated it

The specifications are not prompts. They are machine-readable, compiler-validated declarations that agents reason about, not just text that agents interpret.
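What "machine-readable, compiler-validated" can mean in practice: the spec is structured data that a program can reject before any agent runs against it. A minimal Python sketch under that assumption -- the field names only loosely mirror the Neam example and are not Neam's actual schema:

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class AnalystSpec:
    objective: str
    primary_metric: str
    threshold: float
    data_sources: List[str] = field(default_factory=list)

    def validate(self) -> List[str]:
        """Reject vague or incomplete specs before any agent executes."""
        errors = []
        if not self.objective.strip():
            errors.append("objective must be non-empty")
        if not 0.0 < self.threshold <= 1.0:
            errors.append("threshold must be in (0, 1]")
        if not self.data_sources:
            errors.append("at least one data source is required")
        return errors

spec = AnalystSpec(
    objective="Predict customer churn to enable proactive retention",
    primary_metric="auc_roc",
    threshold=0.80,
    data_sources=["sales.orders", "crm.customers"],
)
```

A prompt with a missing success threshold is still a prompt; a spec with a missing threshold does not compile.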

A Spec-Driven Example in Neam #

Here is the same churn prediction project in Neam's spec-driven approach:

Neam
// Step 1: Business requirements -- authored by humans, executed by agents

analyst_spec ChurnPrediction {
    objective: "Predict customer churn to enable proactive retention",
    success_criteria: {
        primary_metric: "auc_roc",
        threshold: 0.80,
        business_impact: "Reduce monthly churn from 8.2% to 6.5%"
    },
    acceptance_criteria: [
        given("a customer with no purchase AND no login for 90 days")
        when("excluding seasonal gift-only purchasers in Nov-Dec")
        then("classify as churned"),

        given("a trained churn model")
        when("evaluated on holdout test set")
        then("AUC-ROC exceeds 0.80"),

        given("a churn prediction for any customer")
        when("requested by a non-technical stakeholder")
        then("provide SHAP-based explanation per GDPR Article 22")
    ],
    data_sources: ["sales.orders", "crm.customers",
                   "analytics.login_events", "support.tickets"]
}
Neam
// Step 2: Agent.MD -- human domain knowledge encoded for agents

// @agent-md ChurnDomainKnowledge
// ## @causal-domain-knowledge
// - Support quality -> customer satisfaction -> retention (established)
// - Price sensitivity varies by customer segment (test before assuming)
// - November-December purchases are often gifts, not regular behavior
//
// ## @known-data-issues
// - CRM customer_id has 3.2% NULL rate (exclude, do not impute)
// - Event stream has timezone inconsistencies (normalize to UTC)
// - Order deduplication gap exists in bulk import batches
//
// ## @methodology-preferences
// - Prefer gradient boosting for tabular churn prediction
// - Always check for class imbalance before training
// - Use Bayesian approaches for causal analysis when sample allows
Neam
// Step 3: Quality gates -- blocking criteria that prevent bad deployments

quality_gate ChurnQualityGate {
    data_quality_gate: {
        required_pass_rate: 1.0,
        blocking: true,
        checks: ["no_null_keys", "timezone_normalized", "dedup_verified"]
    },
    model_quality_gate: {
        required_metrics: { auc_roc: "> 0.80", precision: "> 0.70" },
        blocking: true
    },
    explainability_gate: {
        shap_values_available: true,
        gdpr_article_22_compliant: true,
        blocking: true
    },
    api_quality_gate: {
        latency_p99: "< 200ms",
        error_rate: "< 0.1%",
        blocking: true
    }
}
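The blocking semantics above can be paraphrased in ordinary code: deployment proceeds only if every blocking gate passes, and a single blocking failure vetoes it. A hedged sketch -- not Neam's runtime -- with each gate's checks reduced to one pass/fail boolean:

```python
from typing import Dict, List, Tuple

def evaluate_gates(results: Dict[str, bool],
                   gates: List[Tuple[str, bool]]) -> Tuple[bool, List[str]]:
    """Return (deployable, failures). A gate with no result counts as failed."""
    failures = [name for name, blocking in gates
                if blocking and not results.get(name, False)]
    return (len(failures) == 0, failures)

gates = [
    ("data_quality_gate", True),
    ("model_quality_gate", True),
    ("explainability_gate", True),   # GDPR Art. 22 explanation available
    ("api_quality_gate", True),
]
results = {"data_quality_gate": True, "model_quality_gate": True,
           "explainability_gate": False, "api_quality_gate": True}
ok, failures = evaluate_gates(results, gates)   # blocked by explainability
```

Note the default: an unevaluated gate fails. Gates that can be skipped silently are not gates.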
Neam
// Step 4: Orchestration -- the DIO assembles the right agents

dio_task ChurnProject {
    spec: ChurnPrediction,
    quality_gate: ChurnQualityGate,
    phases: [
        { agent: "data_ba",       role: "requirements_analysis" },
        { agent: "etl",           role: "data_engineering" },
        { agent: "data_scientist", role: "model_training" },
        { agent: "causal",        role: "root_cause_analysis" },
        { agent: "data_test",     role: "quality_validation" },
        { agent: "mlops",         role: "deployment" },
        { agent: "dataops",       role: "monitoring" }
    ],
    coordination: "centralized_raci"
}

Notice what is different. The churn definition is not buried in a Python comment -- it is a formal acceptance criterion that the DataTest Agent will generate test cases from. The data quality issues are not discovered in Week 16 -- they are declared in Agent.MD so every agent knows about them from the start. The GDPR requirement is not forgotten -- it is a blocking quality gate that prevents deployment without explainability.
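To make "generate test cases from the acceptance criterion" concrete, here is the kind of check the first given/when/then could compile into. The labeling function is a hypothetical reference implementation for illustration, not Neam's generated output:

```python
import pandas as pd

def label_churn(row: pd.Series) -> bool:
    """Churned = no purchase AND no login for 90 days,
    excluding seasonal gift-only purchasers (Nov-Dec orders only)."""
    inactive = (row["days_since_last_order"] > 90
                and row["days_since_last_login"] > 90)
    return bool(inactive and not row["gift_only_nov_dec"])

# Test cases derived mechanically from the acceptance criterion
cases = pd.DataFrame([
    {"days_since_last_order": 120, "days_since_last_login": 120,
     "gift_only_nov_dec": False, "expected": True},   # fully inactive
    {"days_since_last_order": 120, "days_since_last_login": 10,
     "gift_only_nov_dec": False, "expected": False},  # still logging in
    {"days_since_last_order": 200, "days_since_last_login": 200,
     "gift_only_nov_dec": True, "expected": False},   # seasonal exclusion
])
```

The second and third cases are exactly the ones the vibe-coded pipeline mislabels, because its recency-only definition never saw the login or seasonality clauses.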

Insight

- Specs as the Source of Truth. In vibe coding, the prompt is the spec (and it disappears). In agentic coding, the code is the spec (and it has no business context). In spec-driven development, the spec is the spec -- it persists, is versioned, is reviewed, and drives both implementation and validation.

The Control-Speed-Quality Trilemma #

Every software development approach makes tradeoffs between three properties:

DIAGRAM Control-Speed-Quality Trilemma by Paradigm
flowchart TD
  subgraph Legend ["Paradigm Positioning"]
    V["Vibe Coding\nMax Speed\nLow Control, Low Quality"]
    A["Agentic Coding\nMedium Speed + Quality\nLimited Control"]
    S["Spec-Driven\nHigh Control + Quality + Speed\n(via upfront specs)"]
  end
  V -->|"add planning, tools,\nand self-iteration"| A
  A -->|"add formal specs and\nindependent validation"| S

Vibe coding maximizes speed at the expense of control and quality. You get code in seconds, but you have no governance and limited reliability.

Agentic coding improves on quality (through self-iteration) while maintaining speed, but control remains limited. The agent makes autonomous decisions with no formal accountability.

Spec-driven development achieves all three -- but only because the upfront investment in specifications and Agent.MD creates the constraints that make speed, quality, and control simultaneously possible.

Side-by-Side Comparison #

| Dimension | Vibe Coding | Agentic Coding | Spec-Driven |
| --- | --- | --- | --- |
| Input | Natural lang prompt | Task + tools | Specs + Agent.MD + quality gates |
| Who decides what to build | The AI | The agent | Humans (specs) |
| Who decides how to build | The AI | The agent | Agents (within spec boundaries) |
| Who validates | The human (manually) | The agent (self-test) | Independent DataTest Agent |
| Traceability | None | Code-level | Business need → Spec → Code → Test → Deploy |
| Quality gates | None | Agent-defined | Formal, blocking, compiler-checked |
| Governance | None | None | GDPR, RBAC, audit trail |
| Domain knowledge | Ephemeral (in prompt) | Per-session (in memory) | Persistent (in Agent.MD) |
| Accountability when it fails | Nobody | The agent | RACI-assigned with audit trail |
| Lifecycle coverage | Single task | Single task (iterated) | Full lifecycle (7 phases) |
| Reproducibility | Low | Medium | 100% (50/50 in DataSims) |

The Key Distinction: Prompts vs. Specifications #

The most important conceptual difference between these paradigms is the distinction between a prompt and a specification.

A prompt is ephemeral, ambiguous, and interpreted once: it disappears after use, and every gap in it becomes a silent decision by the AI.

A specification is persistent, precise, and validated: it is versioned, reviewed, and drives both implementation and testing.

Prompt
  • "Build a churn model that predicts which customers will leave"
  • Ambiguous: what is "leave"?
  • Incomplete: what data?
  • No success criteria
  • No quality gates
  • No governance
  • Disappears after use
Specification
  • Precise churn definition: no purchase AND no login for 90 days, excluding seasonal gift-only purchasers
  • Testable acceptance criteria
  • Measurable success threshold: AUC-ROC > 0.80
  • Versioned in source control
  • Compiler-validated
Anti-Pattern

- Treating Prompts as Specs. When an organization uses chat-based AI to generate production data pipelines, they are implicitly treating the prompt as the specification. Every ambiguity in the prompt becomes a decision that the AI makes silently -- without review, without accountability, without traceability. This is the root cause of the "it works on my laptop" syndrome at AI-assisted scale.

When to Use Which Paradigm #

None of these paradigms is universally wrong. The right choice depends on the stakes:

Match Paradigm to Stakes
  • Low Stakes → Vibe Coding: Prototyping, ad-hoc analysis, learning/exploration. Fast, cheap, disposable. No governance needed. Errors are cheap.
  • Medium Stakes → Agentic Coding: Internal tools, proof of concepts, non-regulated domains. Self-correcting, more robust than vibe coding. Acceptable when full traceability not required.
  • High Stakes → Spec-Driven: Production ML pipelines, regulated industries, revenue-impacting decisions, multi-team coordination, long-lived systems. Full traceability, quality gates, governance, RACI accountability. Reproducible outcomes.

The 85% failure rate reported by Gartner applies to high-stakes projects -- the ones that involve production deployment, cross-functional teams, and business-critical decisions. These are precisely the projects where vibe coding and agentic coding fall short, and where spec-driven development provides the structural guarantees needed for success.

Try It

- Classify Your Projects. List your team's last five data/ML initiatives. For each one, classify the stakes (low, medium, high) and the paradigm that was actually used. How many high-stakes projects used a low-stakes paradigm? That mismatch is a leading indicator of the 85% failure pattern.
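The exercise above can be run as a one-liner audit. A minimal sketch, assuming the stakes-to-paradigm mapping from the callout; the project list is purely illustrative:

```python
# Which paradigms are acceptable at each stakes level,
# per the "Match Paradigm to Stakes" mapping above
ALLOWED = {
    "low":    {"vibe", "agentic", "spec"},   # anything goes at low stakes
    "medium": {"agentic", "spec"},
    "high":   {"spec"},                      # high stakes demand spec-driven
}

def mismatches(projects):
    """Flag projects whose paradigm is too weak for their stakes."""
    return [name for name, stakes, paradigm in projects
            if paradigm not in ALLOWED[stakes]]

portfolio = [
    ("churn-model",       "high",   "vibe"),     # the classic mismatch
    ("weekly-report-bot", "low",    "vibe"),
    ("feature-store-v2",  "high",   "spec"),
    ("poc-recommender",   "medium", "agentic"),
]
risky = mismatches(portfolio)
```

Every name in `risky` is a high-stakes project built with a low-stakes paradigm -- a candidate for the 85% failure pattern.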

Industry Perspective #

The paradigm debate is not limited to data engineering. It mirrors conversations happening across software development:

Infrastructure as Code (Terraform, Pulumi) replaced ad-hoc server provisioning with declarative specifications. The same servers get deployed, but now the infrastructure is versioned, reviewable, and reproducible. Spec-driven development applies this principle to the entire data lifecycle.

BABOK v3 (Business Analysis Body of Knowledge) defines 50 elicitation techniques and emphasizes that requirements must be traceable, testable, and verifiable. Spec-driven development encodes these principles as first-class language constructs rather than process guidelines that teams may or may not follow.

DAMA-DMBOK (Data Management Body of Knowledge) emphasizes data governance, quality management, and metadata management as organizational capabilities -- not afterthoughts. Spec-driven development makes governance a compile-time concern, not a post-deployment audit.

MLOps Maturity Model (Google/Microsoft) defines Level 0 (manual) through Level 4 (fully automated). Most organizations are at Level 0-1. Spec-driven development with autonomous agents targets Level 3-4 by making automation the default and manual intervention the exception.

The Evidence #

The three paradigms produce measurably different outcomes. Using the DataSims churn prediction experiment, we can compare what each paradigm would produce:

MetricVibe CodingAgentic CodingSpec-Driven (Neam)
Time to first outputMinutesHoursHours
Churn definition correctNo (simplified)Partial (inferred)Yes (from spec)
Data quality handledNoPartial (discovered)Yes (from Agent.MD)
Support features includedNoMaybe (if discovered)Yes (from spec)
Train/test splitNo (common omission)Yes (after iteration)Yes (standard practice)
Model AUC~0.68 (mislabeled data)~0.76 (partial features)0.847 (full pipeline)
GDPR complianceNoNoYes (quality gate)
Quality gate passageN/A (no gates)N/A (self-assessed)Yes (independent agent)
Test coverage0%~40% (agent-written)94% (auto-generated)
ReproducibilityLowMedium100% (50/50 runs)
Production-readyNoPartiallyYes

The spec-driven approach does not just produce better code. It produces a better outcome because the specifications prevent the errors that vibe coding and agentic coding allow through by default -- wrong definitions, missing features, absent governance, and untested assumptions.

These results come from the DataSims evaluation platform, where the full Neam agent stack was tested 50 times across 10 experimental conditions with 100% success rate.

Key Takeaways #

  • Vibe coding trades control and quality for speed; it belongs in prototypes, not production.
  • Agentic coding adds self-iteration but collapses the separation between construction and inspection.
  • Spec-driven development makes the spec -- not the prompt or the code -- the persistent source of truth.
  • A prompt's ambiguities become silent AI decisions; a specification's criteria become enforceable, blocking gates.
  • Match paradigm to stakes: the highest-stakes projects are the ones most often built with the weakest paradigm.

For Further Exploration #