
Your Text-to-SQL Is Lying to You

Academic benchmarks show 90%+ accuracy. Production systems hit 10-31%. The dangerous gap is semantic errors that look correct but aren't.


Text-to-SQL is the most requested enterprise AI capability. It's also one of the most dangerous to ship without guardrails.

Academic benchmarks report 90%+ accuracy. Real enterprise schemas? 10-31%. The gap isn't a model problem. It's an architecture problem.

The Danger: Errors You Don't Catch

Syntax errors are fine — the database throws an exception, you retry. The real threat is semantic errors: queries that execute perfectly and return plausible-looking numbers that are completely wrong.

A "Plausible but Wrong" Example

A user asks: "What's our total revenue this quarter?"

The LLM generates:

SELECT SUM(amount) FROM transactions
WHERE created_at >= '2026-01-01';

This returns a number. It looks reasonable. But it's wrong because:

  • It includes refunded transactions
  • It includes voided orders
  • It sums amount instead of net_revenue (which accounts for discounts)
  • It doesn't filter by status = 'completed'

The correct query requires tribal knowledge the model doesn't have:

SELECT SUM(net_revenue) FROM transactions
WHERE created_at >= '2026-01-01'
  AND status = 'completed'
  AND refund_id IS NULL;

The delta between these two numbers could be millions of dollars. And nothing in the system flags it as wrong.
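The gap is easy to reproduce. Here is a toy sqlite3 sketch of the scenario, assuming a simplified `transactions` table with the columns the article names; the three sample rows are invented. Both queries execute cleanly, and only one matches what finance would report.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE transactions (
        amount REAL, net_revenue REAL, status TEXT,
        refund_id INTEGER, created_at TEXT
    )
""")
conn.executemany(
    "INSERT INTO transactions VALUES (?, ?, ?, ?, ?)",
    [
        (100.0, 90.0, "completed", None, "2026-01-15"),  # real sale, 10% discount
        (250.0, 225.0, "completed", 7, "2026-02-01"),    # later refunded
        (80.0, 80.0, "voided", None, "2026-02-10"),      # voided order
    ],
)

naive = conn.execute(
    "SELECT SUM(amount) FROM transactions WHERE created_at >= '2026-01-01'"
).fetchone()[0]
correct = conn.execute(
    """SELECT SUM(net_revenue) FROM transactions
       WHERE created_at >= '2026-01-01'
         AND status = 'completed' AND refund_id IS NULL"""
).fetchone()[0]

print(naive)    # 430.0 -- plausible, and silently wrong
print(correct)  # 90.0  -- what finance would actually report
```

No exception, no warning: just a 4.7x overstatement that only a human who knows the business rules would catch.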

Why Raw Schemas Fail

Problem 1: No Business Context

A DDL dump tells the model column names and types. It cannot convey:

  • What "active user" means at your company (logged in last 30 days? has a subscription? completed onboarding?)
  • Which joins are safe vs. which create duplicates
  • Which columns are deprecated but still in the schema
  • Implicit filters your analysts always apply
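To make the first bullet concrete: the same term can map to different SQL depending on who is asking, and a raw schema carries none of that. A minimal sketch, with entirely invented definitions — the point is that the model sees all three candidate columns and nothing tells it which one "active" means:

```python
# Three plausible, mutually incompatible definitions of "active user".
# A DDL dump exposes last_login_at, subscription_status, and
# onboarding_completed, but encodes none of these choices.
ACTIVE_USER_DEFINITIONS = {
    "product": "last_login_at >= date('now', '-30 days')",
    "finance": "subscription_status = 'active'",
    "growth": "onboarding_completed = 1",
}

# Without a codified definition, the LLM is guessing which one you meant.
chosen = ACTIVE_USER_DEFINITIONS["finance"]
print(chosen)  # subscription_status = 'active'
```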

Problem 2: Context Window Saturation

Real enterprise databases have thousands of tables and tens of thousands of columns. Dumping the full DDL:

  • Blows up token costs
  • Introduces the "lost-in-the-middle" effect — models lose focus on relevant tables buried in the noise
  • Degrades reasoning quality proportional to context length
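The standard mitigation is to retrieve only the relevant slice of the schema before the model ever sees it. A minimal sketch using word overlap — production systems typically use embeddings, and the table descriptions here are invented:

```python
# Map each table to a short human-written description of what it covers.
SCHEMA_DOCS = {
    "transactions": "revenue amount net_revenue status refund payments orders",
    "users": "user account email signup login subscription",
    "audit_log": "internal event trace debugging",
}

def relevant_tables(question: str, top_k: int = 2) -> list[str]:
    """Score tables by word overlap with the question; keep only top hits."""
    q_words = set(question.lower().split())
    scored = {
        table: len(q_words & set(doc.split()))
        for table, doc in SCHEMA_DOCS.items()
    }
    ranked = sorted(scored, key=scored.get, reverse=True)
    # Tables with zero overlap never enter the prompt.
    return [t for t in ranked[:top_k] if scored[t] > 0]

print(relevant_tables("what is our total revenue this quarter"))  # ['transactions']
```

Even this crude version keeps the prompt at one table instead of thousands, which is the entire point.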

The Fix: A Semantic Layer as Code

Tools like dbt Semantic Layer, Cube, and AtScale define business metrics as version-controlled code. Instead of the model navigating raw tables, it queries predefined, governed metrics.

# dbt semantic layer definition
metrics:
  - name: total_revenue
    description: "Net revenue from completed, non-refunded transactions"
    type: sum
    sql: net_revenue
    timestamp: created_at
    filters:
      - field: status
        operator: "="
        value: "'completed'"
      - field: refund_id
        operator: "is"
        value: "null"

Now when the LLM handles "total revenue," it doesn't guess. It calls the total_revenue metric, and the semantic layer compiles the correct SQL with all the tribal knowledge baked in.
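What "compiles the correct SQL" might look like under the hood — a simplified sketch, not dbt's actual compiler. The metric dict mirrors the YAML above; the `table` key is an assumption, since the YAML snippet doesn't name the underlying model:

```python
# A metric definition mirroring the YAML above (table name assumed).
TOTAL_REVENUE = {
    "name": "total_revenue",
    "type": "SUM",
    "sql": "net_revenue",
    "table": "transactions",
    "timestamp": "created_at",
    "filters": [
        {"field": "status", "operator": "=", "value": "'completed'"},
        {"field": "refund_id", "operator": "is", "value": "null"},
    ],
}

def compile_metric(metric: dict, since: str) -> str:
    """Render a metric definition into SQL with its filters baked in."""
    where = [f"{metric['timestamp']} >= '{since}'"]
    where += [
        f"{f['field']} {f['operator']} {f['value']}" for f in metric["filters"]
    ]
    return (
        f"SELECT {metric['type']}({metric['sql']}) "
        f"FROM {metric['table']} WHERE " + " AND ".join(where)
    )

print(compile_metric(TOTAL_REVENUE, "2026-01-01"))
```

The LLM's job shrinks from "reconstruct the business logic" to "pick the right metric name" — a far easier task to get right.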

Approach          | Accuracy | Failure Mode
Raw DDL injection | 10-31%   | Semantic hallucinations, wrong joins
Semantic Layer    | 90-99%   | Only fails on undefined metrics

The Multi-Agent Validation Loop

Even with a semantic layer, a single LLM call isn't enough for production. What works is a pipeline of three specialized agents, each with a single responsibility:

  • Schema Agent: Retrieves only the relevant metrics/dimensions (keeps context small)
  • Query Generator: Writes SQL against semantic metrics (not raw tables)
  • Validation Critic: Reviews the query for logical errors before execution

This pipeline catches errors that a single model call would silently propagate.
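The control flow of that pipeline, as a skeleton. The agent internals are stubbed — in practice each step is an LLM call — but the shape is the point: nothing executes until the critic approves. All names here are illustrative, not from any framework:

```python
def schema_agent(question: str) -> list[str]:
    # Retrieve only the relevant metrics/dimensions (stubbed).
    return ["total_revenue"]

def query_generator(question: str, metrics: list[str]) -> str:
    # Generate SQL against semantic metrics, not raw tables (stubbed).
    return "SELECT SUM(net_revenue) FROM transactions WHERE status = 'completed'"

def validation_critic(question: str, sql: str) -> tuple[bool, str]:
    # Review the query before execution (stubbed with one hard-coded rule).
    if "status = 'completed'" not in sql:
        return False, "missing completed-status filter"
    return True, "ok"

def answer(question: str) -> str:
    metrics = schema_agent(question)
    sql = query_generator(question, metrics)
    approved, reason = validation_critic(question, sql)
    if not approved:
        raise ValueError(f"query rejected: {reason}")
    return sql  # only an approved query would ever touch the database

print(answer("total revenue this quarter"))
```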

Practical Takeaways

  1. Never expose raw schemas to an LLM — if you don't have a semantic layer, start with a metrics dictionary as a minimum viable alternative
  2. Treat Text-to-SQL output as untrusted — every generated query needs a validation step before it touches a dashboard
  3. Measure semantic accuracy, not syntax accuracy — your benchmark should compare query results against ground truth, not just whether the SQL parses
  4. Invest in the semantic layer first — it's a data engineering problem masquerading as an AI problem
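Takeaway 3 can be sketched as a harness: score a generated query by comparing its *results* against a ground-truth query, not by whether it parses. Rows are compared as multisets so ordering doesn't matter; the table and data are invented:

```python
import sqlite3
from collections import Counter

def execution_match(conn, generated_sql: str, gold_sql: str) -> bool:
    """True iff the generated query returns the same rows as the gold query."""
    try:
        got = Counter(conn.execute(generated_sql).fetchall())
    except sqlite3.Error:
        return False  # queries that don't run count as failures too
    return got == Counter(conn.execute(gold_sql).fetchall())

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (amount REAL, status TEXT)")
conn.executemany("INSERT INTO t VALUES (?, ?)",
                 [(100.0, "completed"), (50.0, "voided")])

gold = "SELECT SUM(amount) FROM t WHERE status = 'completed'"
naive = "SELECT SUM(amount) FROM t"

print(execution_match(conn, naive, gold))  # False: parses fine, wrong answer
print(execution_match(conn, gold, gold))   # True
```

A benchmark built this way catches exactly the class of error this post is about: SQL that runs and lies.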

The LLM is not the bottleneck. The absence of codified business knowledge is.


Shipping Text-to-SQL in production? I'd love to hear about the edge cases you've hit — hello@sowmith.dev

Sowmith Mandadi

Full-Stack Developer & AI Engineer