Your Text-to-SQL Is Lying to You
Text-to-SQL is the most requested enterprise AI capability. It's also one of the most dangerous to ship without guardrails.
Academic benchmarks report 90%+ accuracy. Real enterprise schemas? 10-31%. The gap isn't a model problem. It's an architecture problem.
The Danger: Errors You Don't Catch
Syntax errors are fine — the database throws an exception, you retry. The real threat is semantic errors: queries that execute perfectly and return plausible-looking numbers that are completely wrong.
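The retry loop for syntax errors is mechanical. A minimal sketch, assuming hypothetical `llm_generate_sql` and `execute` callables (stand-ins for your model call and database driver, not any specific library's API):

```python
# Sketch of syntax-error recovery. `llm_generate_sql` and `execute` are
# hypothetical stand-ins for your model call and database driver.
def run_with_retry(question, llm_generate_sql, execute, max_attempts=3):
    error = None
    for _ in range(max_attempts):
        sql = llm_generate_sql(question, previous_error=error)
        try:
            return execute(sql)  # the database raises on invalid SQL
        except Exception as exc:
            error = str(exc)  # feed the error back so the model can self-correct
    raise RuntimeError(f"no valid query after {max_attempts} attempts: {error}")
```

Note what this loop cannot catch: a semantically wrong query returns normally and never reaches the `except` branch.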
The "Fan Trap" Example
A user asks: "What's our total revenue this quarter?"
The LLM generates:
```sql
SELECT SUM(amount) FROM transactions
WHERE created_at >= '2026-01-01';
```
This returns a number. It looks reasonable. But it's wrong because:
- It includes refunded transactions
- It includes voided orders
- It sums `amount` instead of `net_revenue` (which accounts for discounts)
- It doesn't filter by `status = 'completed'`
The correct query requires tribal knowledge the model doesn't have:
```sql
SELECT SUM(net_revenue) FROM transactions
WHERE created_at >= '2026-01-01'
  AND status = 'completed'
  AND refund_id IS NULL;
```
The delta between these two numbers could be millions of dollars. And nothing in the system flags it as wrong.
Why Raw Schemas Fail
Problem 1: No Business Context
A DDL dump tells the model column names and types. It cannot convey:
- What "active user" means at your company (logged in last 30 days? has a subscription? completed onboarding?)
- Which joins are safe vs. which create duplicates
- Which columns are deprecated but still in the schema
- Implicit filters your analysts always apply
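The same phrase can compile to incompatible filters. A sketch of the ambiguity, using hypothetical column names; nothing in a DDL dump tells the model which definition is correct:

```python
# Three plausible, mutually incompatible definitions of "active user".
# Column names are illustrative; a raw schema can't arbitrate between them.
ACTIVE_USER_DEFINITIONS = {
    "logged_in_recently": "last_login_at >= CURRENT_DATE - INTERVAL '30 days'",
    "paying":             "subscription_status = 'active'",
    "onboarded":          "onboarding_completed_at IS NOT NULL",
}
```

All three produce executable SQL; at most one matches what your analysts mean.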
Problem 2: Context Window Saturation
Real enterprise databases have thousands of tables and tens of thousands of columns. Dumping the full DDL:
- Blows up token costs
- Introduces the "lost-in-the-middle" effect — models lose focus on relevant tables buried in the noise
- Degrades reasoning quality roughly in proportion to context length
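A standard mitigation is to retrieve only the tables relevant to the question before prompting. A minimal sketch, assuming an `embed` callable as a stand-in for whatever embedding model you use:

```python
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norms = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norms

def relevant_tables(question, table_descriptions, embed, k=5):
    """Return the k tables whose descriptions best match the question.

    table_descriptions: {table_name: natural-language description}
    embed: hypothetical embedding function, text -> vector
    """
    q_vec = embed(question)
    ranked = sorted(
        table_descriptions,
        key=lambda name: cosine(q_vec, embed(table_descriptions[name])),
        reverse=True,
    )
    return ranked[:k]
```

Only those k tables' DDL enters the prompt, which keeps token costs flat and keeps the relevant schema out of the middle of a long context.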
The Fix: A Semantic Layer as Code
Tools like dbt Semantic Layer, Cube, and AtScale define business metrics as version-controlled code. Instead of the model navigating raw tables, it queries predefined, governed metrics.
```yaml
# dbt semantic layer definition
metrics:
  - name: total_revenue
    description: "Net revenue from completed, non-refunded transactions"
    type: sum
    sql: net_revenue
    timestamp: created_at
    filters:
      - field: status
        operator: "="
        value: "'completed'"
      - field: refund_id
        operator: "is"
        value: "null"
```
Now when the LLM handles "total revenue," it doesn't guess. It calls the total_revenue metric, and the semantic layer compiles the correct SQL with all the tribal knowledge baked in.
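Conceptually, the model's output shrinks to a structured metric request, and deterministic code expands it. A toy sketch of that compilation step, mimicking what a semantic layer does internally (this is not dbt's actual API):

```python
# Toy metric compiler: the tribal knowledge lives in data, not in the model.
METRICS = {
    "total_revenue": {
        "agg": "SUM(net_revenue)",
        "table": "transactions",
        "time_column": "created_at",
        "filters": ["status = 'completed'", "refund_id IS NULL"],
    },
}

def compile_metric(name, start_date):
    m = METRICS[name]
    where = [f"{m['time_column']} >= '{start_date}'"] + m["filters"]
    return f"SELECT {m['agg']} FROM {m['table']} WHERE {' AND '.join(where)};"

# The LLM only has to emit: {"metric": "total_revenue", "start_date": "2026-01-01"}
# compile_metric then reproduces the correct query from the example above.
print(compile_metric("total_revenue", "2026-01-01"))
```

Guessing a column name now becomes a `KeyError` instead of a silently wrong number.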
| Approach | Accuracy | Failure Mode |
|---|---|---|
| Raw DDL injection | 10-31% | Semantic hallucinations, wrong joins |
| Semantic layer | 90-99% | Only fails on undefined metrics |
The Multi-Agent Validation Loop
Even with a semantic layer, a single LLM call isn't enough for production. The architecture that works is a pipeline of specialized agents, each with a single responsibility:
- Schema Agent: Retrieves only the relevant metrics/dimensions (keeps context small)
- Query Generator: Writes SQL against semantic metrics (not raw tables)
- Validation Critic: Reviews the query for logical errors before execution
This pipeline catches errors that a single model call would silently propagate.
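As a skeleton, the control flow looks roughly like this; each agent function is a placeholder for an LLM call with its own narrow prompt:

```python
# Skeleton of the three-agent pipeline; only the control flow is real here.
def answer_question(question, schema_agent, query_generator, validation_critic,
                    execute, max_rounds=2):
    context = schema_agent(question)  # 1. retrieve only relevant metrics/dimensions
    feedback = None
    for _ in range(max_rounds):
        sql = query_generator(question, context, feedback)   # 2. SQL over metrics
        verdict = validation_critic(question, context, sql)  # 3. review pre-execution
        if verdict["ok"]:
            return execute(sql)
        feedback = verdict["issues"]  # route the critique back to the generator
    raise ValueError(f"query rejected by critic: {feedback}")
```

The critic never executes anything; it checks the query against the metric definitions, so a missing status filter or a row-multiplying join is rejected before it can produce a plausible wrong number.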
Practical Takeaways
- Never expose raw schemas to an LLM — if you don't have a semantic layer, start with a metrics dictionary as a minimum viable alternative (see the sketch after this list)
- Treat Text-to-SQL output as untrusted — every generated query needs a validation step before it touches a dashboard
- Measure semantic accuracy, not syntax accuracy — your benchmark should compare query results against ground truth, not just whether the SQL parses
- Invest in the semantic layer first — it's a data engineering problem masquerading as an AI problem
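A metrics dictionary can be as small as a reviewed, version-controlled mapping from business terms to canonical SQL that you inject into the prompt instead of raw DDL. A minimal sketch, with hypothetical table and column names:

```python
# Minimum viable metrics dictionary: version-control this file, review changes
# like code, and inject the relevant entries into the prompt instead of raw DDL.
METRICS_DICTIONARY = {
    "total_revenue": {
        "definition": "Net revenue from completed, non-refunded transactions",
        "sql": (
            "SELECT SUM(net_revenue) FROM transactions "
            "WHERE status = 'completed' AND refund_id IS NULL"
        ),
        "owner": "finance-analytics",  # who to ask when the definition drifts
    },
    "active_users": {
        "definition": "Users who logged in within the last 30 days",
        "sql": (
            "SELECT COUNT(DISTINCT user_id) FROM logins "
            "WHERE logged_in_at >= CURRENT_DATE - INTERVAL '30 days'"
        ),
        "owner": "growth",
    },
}
```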
The LLM is not the bottleneck. The absence of codified business knowledge is.
Shipping Text-to-SQL in production? I'd love to hear about the edge cases you've hit — hello@sowmith.dev