Your Text-to-SQL Is Lying to You
Text-to-SQL is the most requested enterprise AI capability. It's also one of the most dangerous to ship without guardrails.
Academic benchmarks report 90%+ accuracy. Real enterprise schemas? 10-31%. The gap isn't a model problem. It's an architecture problem.
The Danger: Errors You Don't Catch
Syntax errors are fine — the database throws an exception, you retry. The real threat is semantic errors: queries that execute perfectly and return plausible-looking numbers that are completely wrong.
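The retry loop for syntax errors is mechanical. A minimal sketch, assuming hypothetical `llm_generate_sql` and `execute` callables (stand-ins for your model call and database driver, not any specific library's API):

```python
# Sketch of syntax-error recovery. `llm_generate_sql` and `execute` are
# hypothetical stand-ins for your model call and database driver.
def run_with_retry(question, llm_generate_sql, execute, max_attempts=3):
    error = None
    for _ in range(max_attempts):
        sql = llm_generate_sql(question, previous_error=error)
        try:
            return execute(sql)  # the database raises on invalid SQL
        except Exception as exc:
            error = str(exc)  # feed the error back so the model can self-correct
    raise RuntimeError(f"no valid query after {max_attempts} attempts: {error}")
```

Note what this loop cannot catch: a semantically wrong query returns normally and never reaches the `except` branch.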
The "Fan Trap" Example
A user asks: "What's our total revenue this quarter?"
The LLM generates:
```sql
SELECT SUM(amount) FROM transactions
WHERE created_at >= '2026-01-01';
```
This returns a number. It looks reasonable. But it's wrong because:
- It includes refunded transactions
- It includes voided orders
- It sums `amount` instead of `net_revenue` (which accounts for discounts)
- It doesn't filter by `status = 'completed'`
The correct query requires tribal knowledge the model doesn't have:
```sql
SELECT SUM(net_revenue) FROM transactions
WHERE created_at >= '2026-01-01'
  AND status = 'completed'
  AND refund_id IS NULL;
```
The delta between these two numbers could be millions of dollars. And nothing in the system flags it as wrong.
Why Raw Schemas Fail
Problem 1: No Business Context
A DDL dump tells the model column names and types. It cannot convey:
- What "active user" means at your company (logged in last 30 days? has a subscription? completed onboarding?)
- Which joins are safe vs. which create duplicates
- Which columns are deprecated but still in the schema
- Implicit filters your analysts always apply
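The same phrase can compile to incompatible filters. A sketch of the ambiguity, using hypothetical column names; nothing in a DDL dump tells the model which definition is correct:

```python
# Three plausible, mutually incompatible definitions of "active user".
# Column names are illustrative; a raw schema can't arbitrate between them.
ACTIVE_USER_DEFINITIONS = {
    "logged_in_recently": "last_login_at >= CURRENT_DATE - INTERVAL '30 days'",
    "paying":             "subscription_status = 'active'",
    "onboarded":          "onboarding_completed_at IS NOT NULL",
}
```

All three produce executable SQL; at most one matches what your analysts mean.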
Problem 2: Context Window Saturation
Real enterprise databases have thousands of tables and tens of thousands of columns. Dumping the full DDL:
- Blows up token costs
- Introduces the "lost-in-the-middle" effect — models lose focus on relevant tables buried in the noise
- Degrades reasoning quality roughly in proportion to context length
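A standard mitigation is to retrieve only the tables relevant to the question before prompting. A minimal sketch, assuming an `embed` callable as a stand-in for whatever embedding model you use:

```python
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norms = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norms

def relevant_tables(question, table_descriptions, embed, k=5):
    """Return the k tables whose descriptions best match the question.

    table_descriptions: {table_name: natural-language description}
    embed: hypothetical embedding function, text -> vector
    """
    q_vec = embed(question)
    ranked = sorted(
        table_descriptions,
        key=lambda name: cosine(q_vec, embed(table_descriptions[name])),
        reverse=True,
    )
    return ranked[:k]
```

Only those k tables' DDL enters the prompt, which keeps token costs flat and keeps the relevant schema out of the middle of a long context.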
The Fix: A Semantic Layer as Code
Tools like dbt Semantic Layer, Cube, and AtScale define business metrics as version-controlled code. Instead of the model navigating raw tables, it queries predefined, governed metrics.
```yaml
# dbt semantic layer definition
metrics:
  - name: total_revenue
    description: "Net revenue from completed, non-refunded transactions"
    type: sum
    sql: net_revenue
    timestamp: created_at
    filters:
      - field: status
        operator: "="
        value: "'completed'"
      - field: refund_id
        operator: "is"
        value: "null"
```
Now when the LLM handles "total revenue," it doesn't guess. It calls the total_revenue metric, and the semantic layer compiles the correct SQL with all the tribal knowledge baked in.
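Conceptually, the model's output shrinks to a structured metric request, and deterministic code expands it. A toy sketch of that compilation step, mimicking what a semantic layer does internally (this is not dbt's actual API):

```python
# Toy metric compiler: the tribal knowledge lives in data, not in the model.
METRICS = {
    "total_revenue": {
        "agg": "SUM(net_revenue)",
        "table": "transactions",
        "time_column": "created_at",
        "filters": ["status = 'completed'", "refund_id IS NULL"],
    },
}

def compile_metric(name, start_date):
    m = METRICS[name]
    where = [f"{m['time_column']} >= '{start_date}'"] + m["filters"]
    return f"SELECT {m['agg']} FROM {m['table']} WHERE {' AND '.join(where)};"

# The LLM only has to emit: {"metric": "total_revenue", "start_date": "2026-01-01"}
# compile_metric then reproduces the correct query from the example above.
print(compile_metric("total_revenue", "2026-01-01"))
```

Guessing a column name now becomes a `KeyError` instead of a silently wrong number.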
| Approach | Accuracy | Failure Mode |
|---|---|---|
| Raw DDL injection | 10-31% | Semantic hallucinations, wrong joins |
| Semantic layer | 90-99% | Only fails on undefined metrics |
The Multi-Agent Validation Loop
Even with a semantic layer, a single LLM call isn't enough for production. The architecture that works is a pipeline of specialized agents, each with a single responsibility:
- Schema Agent: Retrieves only the relevant metrics/dimensions (keeps context small)
- Query Generator: Writes SQL against semantic metrics (not raw tables)
- Validation Critic: Reviews the query for logical errors before execution
This pipeline catches errors that a single model call would silently propagate.
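As a skeleton, the control flow looks roughly like this; each agent function is a placeholder for an LLM call with its own narrow prompt:

```python
# Skeleton of the three-agent pipeline; only the control flow is real here.
def answer_question(question, schema_agent, query_generator, validation_critic,
                    execute, max_rounds=2):
    context = schema_agent(question)  # 1. retrieve only relevant metrics/dimensions
    feedback = None
    for _ in range(max_rounds):
        sql = query_generator(question, context, feedback)   # 2. SQL over metrics
        verdict = validation_critic(question, context, sql)  # 3. review pre-execution
        if verdict["ok"]:
            return execute(sql)
        feedback = verdict["issues"]  # route the critique back to the generator
    raise ValueError(f"query rejected by critic: {feedback}")
```

The critic never executes anything; it checks the query against the metric definitions, so a missing status filter or a row-multiplying join is rejected before it can produce a plausible wrong number.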
Practical Takeaways
- Never expose raw schemas to an LLM — if you don't have a semantic layer, start with a metrics dictionary as a minimum viable alternative (see the sketch after this list)
- Treat Text-to-SQL output as untrusted — every generated query needs a validation step before it touches a dashboard
- Measure semantic accuracy, not syntax accuracy — your benchmark should compare query results against ground truth, not just whether the SQL parses
- Invest in the semantic layer first — it's a data engineering problem masquerading as an AI problem
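A metrics dictionary can be as small as a reviewed, version-controlled mapping from business terms to canonical SQL that you inject into the prompt instead of raw DDL. A minimal sketch, with hypothetical table and column names:

```python
# Minimum viable metrics dictionary: version-control this file, review changes
# like code, and inject the relevant entries into the prompt instead of raw DDL.
METRICS_DICTIONARY = {
    "total_revenue": {
        "definition": "Net revenue from completed, non-refunded transactions",
        "sql": (
            "SELECT SUM(net_revenue) FROM transactions "
            "WHERE status = 'completed' AND refund_id IS NULL"
        ),
        "owner": "finance-analytics",  # who to ask when the definition drifts
    },
    "active_users": {
        "definition": "Users who logged in within the last 30 days",
        "sql": (
            "SELECT COUNT(DISTINCT user_id) FROM logins "
            "WHERE logged_in_at >= CURRENT_DATE - INTERVAL '30 days'"
        ),
        "owner": "growth",
    },
}
```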
The LLM is not the bottleneck. The absence of codified business knowledge is.
Shipping Text-to-SQL in production? I'd love to hear about the edge cases you've hit — hello@sowmith.dev