Chapter 2: Three Paradigms -- Vibe Coding, Agentic Coding, and Spec-Driven Development #
"The purpose of abstraction is not to be vague, but to create a new semantic level in which one can be absolutely precise." -- Edsger W. Dijkstra
25 min read | All personas | Part I: The Problem
What you'll learn:
- The three paradigms of AI-assisted development and what distinguishes them
- Why prompts are not specifications and why that distinction matters
- How each paradigm handles the control-speed-quality trilemma
- Concrete code examples showing the same churn prediction task in all three paradigms
- Why spec-driven development is the only paradigm that scales for enterprise data work
The Problem #
It is 2026, and everyone has an opinion about how AI should be used to build software. The conversation has split into camps. Some say: "Just describe what you want and let the AI write the code." Others say: "Give the AI agency -- let it plan, execute, and iterate." A smaller group says: "Encode your expertise first, then let agents execute within boundaries."
These are not just philosophical preferences. They produce fundamentally different outcomes in reliability, governance, and maintainability. For data engineering and ML -- where a silent bug in a feature pipeline can cost millions in bad decisions -- the choice of paradigm is the most consequential architectural decision an organization can make.
This chapter examines all three paradigms honestly: their strengths, their limitations, and the evidence for where each one works and where it breaks.
Paradigm 1: Vibe Coding #
What It Is #
Vibe coding -- a term popularized in early 2025 to describe prompt-driven code generation -- is the most accessible entry point to AI-assisted development. The workflow is simple: describe what you want in natural language, receive generated code, and run it. If it does not work, refine the prompt and try again.
For a churn prediction project, vibe coding looks like this:
```text
PROMPT:
"Build me a customer churn prediction model. Use the orders table
and customer table from our PostgreSQL database. Train a gradient
boosted model and deploy it as a REST API."
```
The AI generates a Python script -- perhaps 200 lines -- that connects to the database, queries the tables, engineers some features, trains an XGBoost model, and wraps it in a Flask API. It might even work on the first try.
A Vibe Coding Example #
Here is what a vibe-coded churn pipeline might produce:
```python
# Generated by AI from prompt -- vibe coding approach
import pandas as pd
import xgboost as xgb
from flask import Flask, jsonify, request
from sqlalchemy import create_engine

engine = create_engine("postgresql://user:pass@localhost/simshop")

# Load data
orders = pd.read_sql("SELECT * FROM orders", engine)
customers = pd.read_sql("SELECT * FROM customers", engine)

# Feature engineering (AI-generated, no business context)
features = orders.groupby('customer_id').agg(
    total_orders=('order_id', 'count'),
    total_spend=('amount', 'sum'),
    last_order=('order_date', 'max'),
    avg_order_value=('amount', 'mean')
).reset_index()

merged = customers.merge(features, on='customer_id', how='left')

# Churn label -- simple recency-based definition
merged['is_churned'] = (
    pd.Timestamp.now() - pd.to_datetime(merged['last_order'])
).dt.days > 90

# Train model
X = merged[['total_orders', 'total_spend', 'avg_order_value']].fillna(0)
y = merged['is_churned'].astype(int)
model = xgb.XGBClassifier(n_estimators=100)
model.fit(X, y)

# Deploy as API
app = Flask(__name__)

@app.route('/predict', methods=['POST'])
def predict():
    data = request.json
    pred = model.predict_proba([list(data.values())])[0][1]
    return jsonify({'churn_probability': float(pred)})

app.run(host='0.0.0.0', port=8080)
```
This code runs. It produces predictions. And it has at least seven critical problems:
- Churn Definition — No login activity check, no seasonal exclusion. Mislabels ~12% of customers.
- Data Quality — No NULL handling for customer_id. `fillna(0)` masks missing data silently.
- No Train/Test Split — Model evaluated on training data. Reported accuracy is meaningless.
- No Support Data — Business requirement for support ticket sentiment is completely absent.
- No Explainability — GDPR Article 22 requires explanation of automated decisions. Not implemented.
- Hardcoded Credentials — Database password in source code. Security vulnerability.
- No Monitoring — No drift detection, no alerting, no rollback. Silent degradation.
- The "It Works" Fallacy. Vibe-coded pipelines produce output that looks correct. A model that returns probabilities between 0 and 1 feels like it is working. But "produces output" and "produces correct output" are entirely different standards. In data work, the bugs that matter most are the ones that produce plausible-looking wrong answers.
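Three of the seven problems are mechanical to fix once they are named. A minimal sketch of the credential, NULL-handling, and train/test-split corrections -- the `SIMSHOP_DB_URL` environment-variable name and the 80/20 split are illustrative choices, not from the prompt:

```python
# Minimal corrections for problems 2, 3, and 6 above.
# SIMSHOP_DB_URL and the 80/20 split ratio are illustrative choices.
import os
import pandas as pd
from sklearn.model_selection import train_test_split

# Problem 6: credentials come from the environment, not source code
DB_URL = os.environ.get("SIMSHOP_DB_URL", "postgresql://localhost/simshop")

def prepare_training_data(merged: pd.DataFrame):
    # Problem 2: drop rows with missing keys instead of silently imputing
    clean = merged.dropna(subset=["customer_id"])
    X = clean[["total_orders", "total_spend", "avg_order_value"]]
    y = clean["is_churned"].astype(int)
    # Problem 3: hold out a test set so reported metrics mean something
    return train_test_split(X, y, test_size=0.2, random_state=42, stratify=y)
```

Note what this does not touch: the churn definition, the missing support data, and explainability all require business context that no amount of local code cleanup supplies.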
Where Vibe Coding Works #
To be fair, vibe coding has legitimate use cases:
- Rapid prototyping -- exploring whether an approach is feasible before investing in production engineering
- One-off analyses -- ad-hoc queries and visualizations that will not be reused
- Learning -- understanding how a technique works by generating and studying example code
The problem is not vibe coding itself. The problem is using vibe coding for production systems where reliability, governance, and traceability matter.
Paradigm 2: Agentic Coding #
What It Is #
Agentic coding took off in 2024 with systems like Devin, SWE-Agent, and OpenHands. Instead of generating code from a single prompt, an agentic system creates a plan, executes steps using tools (file editing, terminal commands, web browsing), observes results, and iterates until tests pass.
For the same churn prediction task, an agentic coder would:
- Read the project description and form a plan
- Explore the database schema
- Write the feature engineering pipeline
- Write the model training code
- Run the code, observe errors, fix them
- Write tests, run tests, fix failures
- Iterate until the test suite passes
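Compressed to its control flow, these steps form a plan-execute-observe loop. A minimal sketch, where `execute`, `observe`, and `revise` are hypothetical stand-ins for the agent's real tool calls (file edits, test runs, patches):

```python
def agentic_loop(task, execute, observe, revise, max_iterations=5):
    """Run execute(task), check it with observe(); on failure, let the
    agent revise the task and try again -- steps 5-7 above, compressed."""
    for attempt in range(1, max_iterations + 1):
        result = execute(task)
        ok, feedback = observe(result)
        if ok:
            return result, attempt      # tests pass: done
        task = revise(task, feedback)   # self-correct and iterate
    raise RuntimeError(f"gave up after {max_iterations} iterations")
```

Notice that the loop terminates when `observe()` is satisfied -- the same process defines both the work and the acceptance check, a structural point this chapter returns to.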
An Agentic Coding Example #
An agentic system might produce better code than vibe coding -- it can self-correct:
```python
# Agent-generated churn pipeline -- agentic coding approach
# Agent explored the database schema first and discovered
# the support_tickets and login_events tables
import pandas as pd
import xgboost as xgb
from sklearn.model_selection import train_test_split
from sklearn.metrics import roc_auc_score
from sqlalchemy import create_engine
import mlflow

engine = create_engine("postgresql://localhost/simshop")

# Agent found these tables by exploring the schema
orders = pd.read_sql("SELECT * FROM sales.orders", engine)
customers = pd.read_sql("SELECT * FROM crm.customers", engine)
logins = pd.read_sql("SELECT * FROM analytics.login_events", engine)
tickets = pd.read_sql("SELECT * FROM support.tickets", engine)

# Better feature engineering (agent iterated after first attempt;
# build_features is a helper the agent wrote elsewhere, elided here)
features = build_features(orders, customers, logins, tickets)

# Churn definition -- agent found "90 days no purchase AND no login"
# by reading a requirements doc it found in the repo
features['is_churned'] = (
    (features['days_since_last_order'] > 90) &
    (features['days_since_last_login'] > 90)
)

# Proper train/test split (agent added after test failure)
X_train, X_test, y_train, y_test = train_test_split(
    features.drop('is_churned', axis=1),
    features['is_churned'],
    test_size=0.2, random_state=42
)

# Model training with MLflow tracking
with mlflow.start_run():
    model = xgb.XGBClassifier(n_estimators=200, max_depth=6)
    model.fit(X_train, y_train)
    auc = roc_auc_score(y_test, model.predict_proba(X_test)[:, 1])
    mlflow.log_metric("auc", auc)
    mlflow.sklearn.log_model(model, "churn_model")
```
This is genuinely better. The agent discovered the login and support tables. It implemented a proper train/test split after its initial tests failed. It added MLflow tracking. The code is more robust.
What Agentic Coding Still Misses #
But agentic coding has structural limitations for enterprise data work:
- No Formal Specs — The agent inferred requirements by reading docs it found. If the docs are wrong or absent, the agent builds the wrong thing.
- No Quality Gates — The agent validates against its own tests. There is no independent critic. Self-assessment is unreliable.
- No Governance — No data classification, no access control audit, no GDPR compliance check. The agent does not know about regulatory requirements.
- No Accountability — If the model fails in production, there is no RACI trace. Who decided the churn definition? The agent. Who approved the model? The agent. Who tested it? The agent.
- No Domain Knowledge — The agent starts fresh every time. It does not know that SimShop's CRM data has NULL customer_ids, or that November-December orders are seasonal.
- No Lifecycle Awareness — The agent built and trained a model. It did not set up drift monitoring, canary deployment, or retraining triggers. It solved a task, not a lifecycle.
- The Self-Validation Problem. When the same entity that builds the system also tests the system, the tests are biased toward what was built rather than what was specified. This is why every engineering discipline separates construction from inspection. Agentic coding collapses this separation.
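The lifecycle gap is concrete: nothing in the agent's output watches the model after deployment. A drift monitor is one of the missing pieces. Here is a minimal sketch using the population stability index -- the ~0.2 alert threshold is a common rule of thumb, not something from this chapter:

```python
import numpy as np

def population_stability_index(expected, actual, bins=10):
    """PSI between a training-time feature distribution ('expected')
    and live traffic ('actual'). Values above ~0.2 are conventionally
    treated as significant drift worth an alert."""
    edges = np.histogram_bin_edges(expected, bins=bins)
    e_pct = np.histogram(expected, bins=edges)[0] / len(expected)
    a_pct = np.histogram(actual, bins=edges)[0] / len(actual)
    # Floor each bucket to avoid log(0) on empty bins
    e_pct = np.clip(e_pct, 1e-6, None)
    a_pct = np.clip(a_pct, 1e-6, None)
    return float(np.sum((a_pct - e_pct) * np.log(a_pct / e_pct)))
```

A real deployment would also need alert routing, a rollback path, and retraining triggers. The point is that none of this exists unless someone specifies that the lifecycle, not just the task, is in scope.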
Paradigm 3: Spec-Driven Development #
What It Is #
Spec-driven development, as implemented in Neam, introduces a fundamentally different workflow:
- Humans encode expertise in structured specifications -- business requirements, acceptance criteria, domain knowledge (Agent.MD), quality gates, and governance policies
- Specialist agents execute within the boundaries defined by those specifications
- Independent agents validate results against the original specifications
- Full traceability connects every artifact to the business need that motivated it
The specifications are not prompts. They are machine-readable, compiler-validated declarations that agents reason about, not just text that agents interpret.
A Spec-Driven Example in Neam #
Here is the same churn prediction project in Neam's spec-driven approach:
```neam
// Step 1: Business requirements -- authored by humans, executed by agents
analyst_spec ChurnPrediction {
  objective: "Predict customer churn to enable proactive retention",
  success_criteria: {
    primary_metric: "auc_roc",
    threshold: 0.80,
    business_impact: "Reduce monthly churn from 8.2% to 6.5%"
  },
  acceptance_criteria: [
    given("a customer with no purchase AND no login for 90 days")
      when("excluding seasonal gift-only purchasers in Nov-Dec")
      then("classify as churned"),
    given("a trained churn model")
      when("evaluated on holdout test set")
      then("AUC-ROC exceeds 0.80"),
    given("a churn prediction for any customer")
      when("requested by a non-technical stakeholder")
      then("provide SHAP-based explanation per GDPR Article 22")
  ],
  data_sources: ["sales.orders", "crm.customers",
                 "analytics.login_events", "support.tickets"]
}

// Step 2: Agent.MD -- human domain knowledge encoded for agents
// @agent-md ChurnDomainKnowledge
// ## @causal-domain-knowledge
// - Support quality -> customer satisfaction -> retention (established)
// - Price sensitivity varies by customer segment (test before assuming)
// - November-December purchases are often gifts, not regular behavior
//
// ## @known-data-issues
// - CRM customer_id has 3.2% NULL rate (exclude, do not impute)
// - Event stream has timezone inconsistencies (normalize to UTC)
// - Order deduplication gap exists in bulk import batches
//
// ## @methodology-preferences
// - Prefer gradient boosting for tabular churn prediction
// - Always check for class imbalance before training
// - Use Bayesian approaches for causal analysis when sample allows

// Step 3: Quality gates -- blocking criteria that prevent bad deployments
quality_gate ChurnQualityGate {
  data_quality_gate: {
    required_pass_rate: 1.0,
    blocking: true,
    checks: ["no_null_keys", "timezone_normalized", "dedup_verified"]
  },
  model_quality_gate: {
    required_metrics: { auc_roc: "> 0.80", precision: "> 0.70" },
    blocking: true
  },
  explainability_gate: {
    shap_values_available: true,
    gdpr_article_22_compliant: true,
    blocking: true
  },
  api_quality_gate: {
    latency_p99: "< 200ms",
    error_rate: "< 0.1%",
    blocking: true
  }
}

// Step 4: Orchestration -- the DIO assembles the right agents
dio_task ChurnProject {
  spec: ChurnPrediction,
  quality_gate: ChurnQualityGate,
  phases: [
    { agent: "data_ba", role: "requirements_analysis" },
    { agent: "etl", role: "data_engineering" },
    { agent: "data_scientist", role: "model_training" },
    { agent: "causal", role: "root_cause_analysis" },
    { agent: "data_test", role: "quality_validation" },
    { agent: "mlops", role: "deployment" },
    { agent: "dataops", role: "monitoring" }
  ],
  coordination: "centralized_raci"
}
```
Notice what is different. The churn definition is not buried in a Python comment -- it is a formal acceptance criterion that the DataTest Agent will generate test cases from. The data quality issues are not discovered in Week 16 -- they are declared in Agent.MD so every agent knows about them from the start. The GDPR requirement is not forgotten -- it is a blocking quality gate that prevents deployment without explainability.
- Specs as the Source of Truth. In vibe coding, the prompt is the spec (and it disappears). In agentic coding, the code is the spec (and it has no business context). In spec-driven development, the spec is the spec -- it persists, is versioned, is reviewed, and drives both implementation and validation.
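To make "drives both implementation and validation" concrete: a blocking gate like ChurnQualityGate's model_quality_gate reduces to a mechanical threshold check. A hypothetical sketch -- `check_gate` and its dict format are invented for illustration, not Neam's actual API, though the thresholds are the ones from the spec:

```python
def check_gate(metrics: dict, gate: dict) -> tuple[bool, list[str]]:
    """Return (passed, failures) for a set of blocking '>' thresholds.
    Illustrative helper, not Neam API."""
    failures = [
        f"{name}: {metrics.get(name)} is not > {threshold}"
        for name, threshold in gate.items()
        if metrics.get(name, float("-inf")) <= threshold
    ]
    return (not failures, failures)

# Thresholds from ChurnQualityGate's model_quality_gate
churn_model_gate = {"auc_roc": 0.80, "precision": 0.70}
```

Because the gate is data rather than prose, an independent agent can evaluate it without consulting whoever built the model.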
The Control-Speed-Quality Trilemma #
Every software development approach makes tradeoffs between three properties:
```mermaid
flowchart TD
    subgraph Legend ["Paradigm Positioning"]
        V["Vibe Coding\nMax Speed\nLow Control, Low Quality"]
        A["Agentic Coding\nMedium Speed + Quality\nLimited Control"]
        S["Spec-Driven\nHigh Control + Quality + Speed\n(via upfront specs)"]
    end
    S -->|"upfront spec investment\nenables all three"| V
    V -. "low governance\nhigh risk" .-> A
    A -. "no formal specs\nno independent validation" .-> S
```
Vibe coding maximizes speed at the expense of control and quality. You get code in seconds, but you have no governance and limited reliability.
Agentic coding improves on quality (through self-iteration) while maintaining speed, but control remains limited. The agent makes autonomous decisions with no formal accountability.
Spec-driven development achieves all three -- but only because the upfront investment in specifications and Agent.MD creates the constraints that make speed, quality, and control simultaneously possible.
Side-by-Side Comparison #
| Dimension | Vibe Coding | Agentic Coding | Spec-Driven |
|---|---|---|---|
| Input | Natural lang prompt | Task + tools | Specs + Agent.MD + quality gates |
| Who decides what to build | The AI | The agent | Humans (specs) |
| Who decides how to build | The AI | The agent | Agents (within spec boundaries) |
| Who validates | The human (manually) | The agent (self-test) | Independent DataTest Agent |
| Traceability | None | Code-level | Business need → Spec → Code → Test → Deploy |
| Quality gates | None | Agent-defined | Formal, blocking, compiler-checked |
| Governance | None | None | GDPR, RBAC, audit trail |
| Domain knowledge | Ephemeral (in prompt) | Per-session (in memory) | Persistent (in Agent.MD) |
| Accountability when it fails | Nobody | The agent | RACI-assigned with audit trail |
| Lifecycle coverage | Single task | Single task (iterated) | Full lifecycle (7 phases) |
| Reproducibility | Low | Medium | 100% (50/50 in DataSims) |
The Key Distinction: Prompts vs. Specifications #
The most important conceptual difference between these paradigms is the distinction between a prompt and a specification.
A prompt is:
- Natural language, ambiguous by nature
- Ephemeral -- it exists for one interaction
- Unversioned -- no history, no review process
- Unvalidatable -- there is no way to automatically check whether the output matches the prompt's intent
A specification is:
- Structured, machine-readable, and precise
- Persistent -- it lives in version control alongside the code
- Reviewable -- teams can inspect, discuss, and approve it
- Validatable -- acceptance criteria can be automatically tested
Contrast the same requirement expressed both ways.

The prompt: "Build a churn model that predicts which customers will leave"
- Ambiguous: what is "leave"?
- Incomplete: what data?
- No success criteria
- No quality gates
- No governance
- Disappears after use

The specification:
- Precise churn definition: no purchase AND no login for 90 days, excluding seasonal gift-only purchasers
- Testable acceptance criteria
- Measurable success threshold: AUC-ROC > 0.80
- Versioned in source control
- Compiler-validated
- Treating Prompts as Specs. When an organization uses chat-based AI to generate production data pipelines, they are implicitly treating the prompt as the specification. Every ambiguity in the prompt becomes a decision that the AI makes silently -- without review, without accountability, without traceability. This is the root cause of the "it works on my laptop" syndrome at AI-assisted scale.
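The payoff of that precision is that the spec's churn definition translates directly into code and into generated tests. A sketch assuming pandas, the feature names from the agentic example, and a hypothetical `gift_only_nov_dec` flag computed upstream:

```python
import pandas as pd

def label_churn(features: pd.DataFrame, window_days: int = 90) -> pd.Series:
    """Spec-defined churn: no purchase AND no login for `window_days`,
    excluding seasonal gift-only purchasers. The Nov-Dec exclusion is
    assumed precomputed as the boolean column `gift_only_nov_dec`."""
    inactive = (
        (features["days_since_last_order"] > window_days)
        & (features["days_since_last_login"] > window_days)
    )
    return inactive & ~features["gift_only_nov_dec"]
```

There is exactly one place where "churned" is defined, and any test agent can exercise it against the acceptance criteria without guessing at intent.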
When to Use Which Paradigm #
None of these paradigms is universally wrong. The right choice depends on the stakes:
- Low Stakes → Vibe Coding: Prototyping, ad-hoc analysis, learning/exploration. Fast, cheap, disposable. No governance needed. Errors are cheap.
- Medium Stakes → Agentic Coding: Internal tools, proof of concepts, non-regulated domains. Self-correcting, more robust than vibe coding. Acceptable when full traceability not required.
- High Stakes → Spec-Driven: Production ML pipelines, regulated industries, revenue-impacting decisions, multi-team coordination, long-lived systems. Full traceability, quality gates, governance, RACI accountability. Reproducible outcomes.
The 85% failure rate reported by Gartner applies to high-stakes projects -- the ones that involve production deployment, cross-functional teams, and business-critical decisions. These are precisely the projects where vibe coding and agentic coding fall short, and where spec-driven development provides the structural guarantees needed for success.
- Classify Your Projects. List your team's last five data/ML initiatives. For each one, classify the stakes (low, medium, high) and the paradigm that was actually used. How many high-stakes projects used a low-stakes paradigm? That mismatch is a leading indicator of the 85% failure pattern.
Industry Perspective #
The paradigm debate is not limited to data engineering. It mirrors conversations happening across software development:
Infrastructure as Code (Terraform, Pulumi) replaced ad-hoc server provisioning with declarative specifications. The same servers get deployed, but now the infrastructure is versioned, reviewable, and reproducible. Spec-driven development applies this principle to the entire data lifecycle.
BABOK v3 (Business Analysis Body of Knowledge) defines 50 elicitation techniques and emphasizes that requirements must be traceable, testable, and verifiable. Spec-driven development encodes these principles as first-class language constructs rather than process guidelines that teams may or may not follow.
DAMA-DMBOK (Data Management Body of Knowledge) emphasizes data governance, quality management, and metadata management as organizational capabilities -- not afterthoughts. Spec-driven development makes governance a compile-time concern, not a post-deployment audit.
MLOps Maturity Model (Google/Microsoft) defines Level 0 (manual) through Level 4 (fully automated). Most organizations are at Level 0-1. Spec-driven development with autonomous agents targets Level 3-4 by making automation the default and manual intervention the exception.
The Evidence #
The three paradigms produce measurably different outcomes. Using the DataSims churn prediction experiment, we can compare what each paradigm would produce:
| Metric | Vibe Coding | Agentic Coding | Spec-Driven (Neam) |
|---|---|---|---|
| Time to first output | Minutes | Hours | Hours |
| Churn definition correct | No (simplified) | Partial (inferred) | Yes (from spec) |
| Data quality handled | No | Partial (discovered) | Yes (from Agent.MD) |
| Support features included | No | Maybe (if discovered) | Yes (from spec) |
| Train/test split | No (common omission) | Yes (after iteration) | Yes (standard practice) |
| Model AUC | ~0.68 (mislabeled data) | ~0.76 (partial features) | 0.847 (full pipeline) |
| GDPR compliance | No | No | Yes (quality gate) |
| Quality gate passage | N/A (no gates) | N/A (self-assessed) | Yes (independent agent) |
| Test coverage | 0% | ~40% (agent-written) | 94% (auto-generated) |
| Reproducibility | Low | Medium | 100% (50/50 runs) |
| Production-ready | No | Partially | Yes |
The spec-driven approach does not just produce better code. It produces a better outcome because the specifications prevent the errors that vibe coding and agentic coding allow through by default -- wrong definitions, missing features, absent governance, and untested assumptions.
These results come from the DataSims evaluation platform, where the full Neam agent stack was tested 50 times across 10 experimental conditions with 100% success rate.
Key Takeaways #
- Three paradigms exist for AI-assisted development: vibe coding (prompt to code), agentic coding (agent plans and executes), and spec-driven development (specs guide agents within boundaries).
- The critical distinction is prompts vs. specifications. Prompts are ambiguous, ephemeral, and unvalidatable. Specifications are precise, persistent, versioned, and automatically testable.
- The control-speed-quality trilemma is real -- but spec-driven development resolves it by investing upfront in specifications that enable both autonomous execution and formal validation.
- Match the paradigm to the stakes. Vibe coding for prototypes, agentic coding for internal tools, spec-driven for production systems where reliability, governance, and traceability matter.
- Spec-driven development is the only paradigm that covers the full data lifecycle -- from requirements through deployment and monitoring -- with formal accountability at every boundary.
For Further Exploration #
- Neam: The AI-Native Programming Language -- See the full Neam language syntax for specifications, quality gates, and agent declarations
- DataSims Repository -- Compare paradigm outcomes using the SimShop evaluation environment
- BABOK v3, International Institute of Business Analysis -- The standard for requirements traceability
- DAMA-DMBOK 2.0 -- Data Management Body of Knowledge
- Google MLOps Maturity Model -- Levels 0-4 for ML operations automation