Why Text-to-SQL Demos Fail in Production
What context actually matters when LLMs write database queries
The premise of text-to-SQL is appealing: a user asks a question in plain English, and an AI writes the SQL. The demos look impressive. But how much of that holds up under realistic conditions?
This post is an empirical test. We run the same questions through progressively richer context layers and measure what actually changes. The structure is simple: start with the baseline (schema only), add descriptions and examples, and track how accuracy shifts at each step. The goal is to make visible what kinds of context matter, and where the remaining failures cluster.
Each section includes live test results from a real database, with all phrasing variants and failure modes shown. The tests are reproducible. The analysis is based on what actually happened, not what should have happened.
Results Summary
We tested three approaches across three query types, running each phrasing variant 20 times. Here's what we found:
| Approach | Accuracy | Change |
|---|---|---|
| 1. Schema Only | 61% | baseline |
| 2. Schema + Descriptions | 79% | +18 |
| 3. Schema + Descriptions + Examples | 95% | +16 |
Important Caveats
These results come with significant limitations. The test cases are carefully selected, single-table or two-table queries with clear expected outputs. Real users don't ask questions this cleanly. They ask follow-up questions, reference previous context, use company-specific terminology, and expect the system to understand implicit constraints.
Scaling these approaches to arbitrary queries is a different problem entirely. Our RAG examples work because we curated roughly 30 question-SQL pairs that cover the query patterns we're testing. For a real product, you'd need examples covering every way users might ask about every metric in your database. That's a combinatorial explosion. And every time someone asks a question that doesn't match your examples well, accuracy drops back toward the baseline.
What we didn't include: Several query types failed so consistently that we excluded them from the final test suite. Queries requiring multi-step reasoning ("What percentage of our revenue comes from customers who joined this year?") rarely produced correct SQL. Queries with relative time references ("How did last month compare to the month before?") failed because the model doesn't know what "today" is. Queries requiring business logic not in the schema ("Which customers are at risk of churning?") had no path to success. And queries that needed to combine data across more than two tables often produced valid SQL that returned wrong results due to incorrect join conditions.
The 95% accuracy in Approach 3 is best understood as a ceiling for these specific, well-defined queries with extensive prompt engineering and curated examples. In a free-form production environment, expect significantly lower accuracy. The gap between demo and production is real.
What the Results Do Show
The pattern is consistent: richer context improves accuracy. But the improvements aren't automatic. Descriptions only help when they target actual failure modes. Examples only help when they don't conflict with each other. And some failures persist regardless of context, particularly questions that are fundamentally ambiguous.
The hardest failure mode to solve is entity disambiguation. "Who are the people we work with?" could mean customers or vendors. No amount of schema documentation resolves that ambiguity because the question itself is underspecified. This suggests that 100% accuracy on arbitrary natural language isn't achievable without either constraining the input or adding a clarification step.
The Setup
All queries run against financial data for a fictional startup called DataPulse Analytics: bank accounts, transactions, invoices, bills, customers, and vendors. The data lives in a Postgres database hosted on Supabase.
Model: Claude Sonnet 4 (claude-sonnet-4-20250514) via the Anthropic API. Each query is a single LLM call with no post-processing: we send the prompt, extract the SQL from the response, and execute it directly. No validation, no retry logic, no agentic refinement. This represents the simplest possible implementation, which makes the results easier to interpret but also means there's room for improvement with more sophisticated pipelines.
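The "extract the SQL from the response" step can be sketched in a few lines. The helper below is hypothetical (not the post's actual harness code): it pulls a fenced SQL block out of the model's reply, falling back to treating the whole response as SQL.

```python
import re

def extract_sql(response_text: str) -> str:
    """Pull SQL out of a model response (hypothetical helper).

    Prefers a fenced ```sql block; falls back to the raw response text
    if no fence is present.
    """
    match = re.search(r"`{3}sql\s*(.*?)`{3}", response_text, re.DOTALL)
    if match:
        return match.group(1).strip()
    return response_text.strip()
```

Because there is no validation or retry step, whatever this function returns is executed directly against the database.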
Evaluation: Each test has a known correct SQL query. We run both the generated SQL and the expected SQL against the live database, then compare results:
- For list queries: We match rows by the id column if present, or by identifying columns like display_name or email. Column names can differ as long as the same records are returned.
- For aggregation queries: We compare numeric values directly, with a small tolerance for floating-point differences. Column aliases don't matter. If the expected query returns 720000.00 as net_cash_flow and the generated query returns the same value as total, that's a pass.
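The two comparison rules above can be sketched roughly as follows. The function names and the dict-per-row representation are assumptions for illustration, not the actual evaluation code:

```python
def rows_match(expected, generated, key_candidates=("id", "display_name", "email")):
    """Compare list-query results by a shared identifying column.

    Both inputs are lists of dicts (column -> value). Column names may
    differ overall; only one identifying column has to line up.
    """
    for key in key_candidates:
        if all(key in r for r in expected) and all(key in r for r in generated):
            return {r[key] for r in expected} == {r[key] for r in generated}
    return False  # no shared identifying column found

def aggregates_match(expected_value, generated_value, tolerance=0.01):
    """Compare aggregation results numerically; aliases are ignored upstream."""
    return abs(float(expected_value) - float(generated_value)) <= tolerance
```

Under this scheme, a generated query that aliases net_cash_flow as total still passes, while a query that returns the wrong rows or a materially different number fails.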
Phrasing Variants
Users ask questions in different ways. To test robustness, we run each test with five phrasing styles:
- Direct: Clear, straightforward question (e.g., "Show me our 5 most recent customers")
- Ambiguous: Omits specifics, uses imprecise language (e.g., "Who are some of the newer people we work with?")
- Verbose: Over-specified with extra context and criteria (e.g., "Retrieve the 5 most recently added customer records, sorted by creation date descending...")
- Explicit: Uses SQL terminology and structure hints (e.g., "SELECT from customers ORDER BY created_at DESC LIMIT 5")
- Conversational: Informal tone, colloquial language (e.g., "Hey, who are our newest customers? Just show me like 5 of them.")
For each variant, we run a minimum of 10 tests to measure consistency. The charts below show accuracy across all variants.
Approach 1: Schema Only
Results summary: 61% overall accuracy across three test cases. Simple queries with explicit entity names succeed (80% for "recent customers"). Queries requiring domain knowledge fail (20% for vendor payments). The main failure mode is entity disambiguation: "people we work with" could mean customers or vendors, and the model has no way to know which.
The simplest approach. Before generating SQL, we fetch the database schema from Postgres. The model receives:
- Table names and column names
- Data types
- Foreign key relationships
- Enum values
That's it. No descriptions, no examples, no business context. Just the raw structure.
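The structure might be rendered into the prompt along these lines. The format here is illustrative only; the exact context block passed to Claude may be laid out differently:

```python
def schema_to_prompt(tables: dict) -> str:
    """Render a minimal schema description for the prompt.

    `tables` maps table name -> {"columns": {name: type}, "foreign_keys": [...]}.
    This shape is an assumption for illustration, not the post's real format.
    """
    lines = []
    for table, info in tables.items():
        lines.append(f"TABLE {table}")
        for col, coltype in info["columns"].items():
            lines.append(f"  {col} {coltype}")
        for fk in info.get("foreign_keys", []):
            lines.append(f"  FK: {fk}")
    return "\n".join(lines)
```

The point is what's absent: the rendered text carries names and types, but nothing that says what a column means or when to use it.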
Test: Recent Customers
We start with something that sounds simple but requires understanding ordering and limits. The model needs to know what "most recent" means and how to limit results.
We test this query with 5 phrasing styles:
- Direct
- "Show me our 5 most recent customers"
- Ambiguous
- "Who are some of the newer people we work with?"
- Verbose
- "Retrieve the 5 most recently added customer records from the database, sorted by their creation date in descending order, including their name, company, and contact information"
- Explicit
- "SELECT from customers ORDER BY created_at DESC LIMIT 5"
- Conversational
- "Hey, who are our newest customers? Just show me like 5 of them."
Expected behavior: Query bi.customers filtered by business_id, order by created_at DESC, and limit to 5 rows.
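The expected query can be demonstrated against a toy SQLite stand-in. The real bi.customers table lives in Postgres; the data below is invented, and SQLite flattens away the schema prefix:

```python
import sqlite3

# Miniature stand-in for bi.customers (invented data).
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE customers (
    id INTEGER PRIMARY KEY,
    business_id INTEGER,
    display_name TEXT,
    created_at TEXT
);
INSERT INTO customers VALUES
    (1, 1, 'Acme',     '2024-01-05'),
    (2, 1, 'Globex',   '2024-02-10'),
    (3, 1, 'Initech',  '2024-03-15'),
    (4, 1, 'Umbrella', '2024-04-20'),
    (5, 1, 'Hooli',    '2024-05-25'),
    (6, 1, 'Stark',    '2024-06-30');
""")

# Expected shape: filter by business, newest first, capped at 5.
rows = conn.execute("""
    SELECT id, display_name
    FROM customers
    WHERE business_id = 1
    ORDER BY created_at DESC
    LIMIT 5
""").fetchall()
```

The oldest customer (Acme) falls outside the LIMIT, which is exactly the behavior "most recent 5" is testing for.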
Results
Analysis
At 80% overall accuracy, this query performs well when the question contains clear entity references. Four out of five phrasing styles achieve 100% accuracy. The model correctly identifies the customers table, applies the right ordering, and limits results appropriately.
The single point of failure is vague phrasing (0%). "Who are some of the newer people we work with?" doesn't contain the word "customers." Without that anchor, the model has to guess which entity type the question refers to, and it consistently guesses wrong, querying vendors instead.
Why It Works (When It Works)
The model succeeds when keywords map directly to schema elements. "Customers" maps to the customers table. "Most recent" maps to ORDER BY created_at DESC. "5" maps to LIMIT 5. When these pieces are present, the model assembles them correctly.
Direct, detailed, technical, and casual (100%): All these variants contain either "customers" explicitly or enough context to infer it. The model handles variations in tone and verbosity well when the core entity is identifiable.
Why It Fails
Ambiguous phrasing (0%): "Who are some of the newer people we work with?" loses every anchor. The model doesn't know if "people" means customers, vendors, or employees. In our tests, it consistently queried the vendors table instead, returning companies like WeWork and AWS instead of customer records. The phrase "work with" can mean either direction of a business relationship.
Sometimes the model hedged by querying both tables with a UNION, returning 16 rows instead of 5. This is a reasonable interpretation given the ambiguity, but still fails our test, which expects exactly 5 customers.
Test: Spending by Category
Now something harder. We want to see spending broken down by category. This requires the model to:
- Identify that bank_transactions is the right table
- Join through bank_accounts to filter by business_id
- Understand Plaid's sign convention (positive amounts = money out)
- Group by category and order by total
We test this query with 5 phrasing styles:
- Direct
- "What is our spending by category?"
- Ambiguous
- "What are we doing with our money in different areas?"
- Verbose
- "Provide a comprehensive breakdown of total expenditures categorized by expense type, showing dollar amounts for each category, sorted by total amount spent in descending order"
- Explicit
- "Execute an aggregate query on bank_transactions to calculate SUM of amounts grouped by expense category, filtering for debit transactions"
- Conversational
- "Hey, what are we spending our money on? Can you break it down by category?"
Expected behavior: Join bank_transactions to bank_accounts, filter for positive amounts (outflows), group by category_primary, return totals in descending order.
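The expected query, demonstrated on a toy SQLite version of the two tables (table and column names follow the post's schema; the data is invented):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE bank_accounts (id INTEGER PRIMARY KEY, business_id INTEGER);
CREATE TABLE bank_transactions (
    id INTEGER PRIMARY KEY,
    account_id INTEGER REFERENCES bank_accounts(id),
    amount REAL,              -- Plaid convention: positive = money out
    category_primary TEXT
);
INSERT INTO bank_accounts VALUES (1, 1);
INSERT INTO bank_transactions VALUES
    (1, 1,  500.0, 'SOFTWARE'),
    (2, 1,  300.0, 'SOFTWARE'),
    (3, 1,  200.0, 'TRAVEL'),
    (4, 1, -900.0, 'INCOME');  -- negative = deposit, excluded below
""")

rows = conn.execute("""
    SELECT t.category_primary, SUM(t.amount) AS total_spent
    FROM bank_transactions t
    JOIN bank_accounts a ON a.id = t.account_id
    WHERE a.business_id = 1 AND t.amount > 0
    GROUP BY t.category_primary
    ORDER BY total_spent DESC
""").fetchall()
```

Note that the sign-convention filter (amount > 0) is what keeps the deposit out of the spending totals; miss it and income gets counted as an expense category.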
Results
Analysis
At 83% overall accuracy, this query performs well across most phrasing styles. The keyword "spending" or "category" provides a strong anchor to bank_transactions.category_primary. When either term appears, the model finds the right path.
The main failure mode is vague phrasing (25%). "What are we doing with our money in different areas?" drops the word "category," and the model sometimes over-specifies the grouping. It adds merchant_name or category_detailed to the GROUP BY, which changes the result set.
Why It Works (When It Works)
Direct and detailed phrasing (100%): "Spending by category" maps cleanly. The model identifies bank_transactions.category_primary, infers the join through bank_accounts for the business filter, and applies the correct sign convention. Foreign keys in the schema provide enough structural hints.
Technical phrasing (100%): SQL-like syntax guides the model directly to the correct structure.
Casual phrasing (95%): The word "spending" anchors the query even in casual phrasing. "What are we spending our money on?" contains the same semantic content as the direct version.
Why It Fails
Vague phrasing (25%): "What are we doing with our money in different areas?" is ambiguous about granularity. The model sometimes groups by merchant_name in addition to category, or includes category_detailed in the output. These queries are technically reasonable interpretations but don't match our expected output.
Some failures are SQL errors. The model generates invalid GROUP BY clauses where aggregated columns aren't properly grouped. This suggests the vague phrasing makes the model less careful about SQL correctness.
Test: Vendor Payments
This query tests whether the model understands the relationship between vendors and bills. The challenge is knowing that bills.amount_paid tracks actual payments to vendors, not bills.amount (total owed) or bank transactions.
We test this query with 5 phrasing styles:
- Direct
- "How much have we paid to each vendor?"
- Ambiguous
- "Who have we paid?"
- Verbose
- "List each vendor and the total amount we have paid them based on our bill payment records."
- Explicit
- "SELECT vendors JOIN bills, SUM(amount_paid) GROUP BY vendor"
- Conversational
- "Show me how much we have paid each of our vendors"
Expected behavior: Join vendors to bills, sum amount_paid (not amount), group by vendor, order by total paid.
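The expected query on a toy SQLite stand-in (invented data), with the crucial distinction between amount and amount_paid spelled out in comments:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE vendors (id INTEGER PRIMARY KEY, display_name TEXT);
CREATE TABLE bills (
    id INTEGER PRIMARY KEY,
    vendor_id INTEGER REFERENCES vendors(id),
    amount REAL,        -- total owed on the bill
    amount_paid REAL    -- portion actually paid: the column that matters here
);
INSERT INTO vendors VALUES (1, 'AWS'), (2, 'WeWork');
INSERT INTO bills VALUES
    (1, 1, 1000.0, 1000.0),   -- fully paid
    (2, 1,  500.0,  250.0),   -- partially paid
    (3, 2, 2000.0,  800.0);   -- partially paid
""")

rows = conn.execute("""
    SELECT v.display_name, SUM(b.amount_paid) AS total_paid
    FROM vendors v
    JOIN bills b ON b.vendor_id = v.id
    GROUP BY v.id, v.display_name
    ORDER BY total_paid DESC
""").fetchall()
```

Summing amount instead of amount_paid would report AWS at 1500 and WeWork at 2000, the mistake the schema-only model keeps making.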
Results
Analysis
At 20% overall accuracy, this query exposes a fundamental limitation of schema-only context. The model consistently fails because nothing in the schema explains that amount_paid on bills represents actual payments made to vendors.
The only variant that succeeds is the explicit technical phrasing (100%), which directly mentions "JOIN bills, SUM(amount_paid)". When the question mirrors the SQL structure exactly, the model echoes it back. Every other phrasing fails completely.
This is a key finding: some queries require domain knowledge that cannot be inferred from structure alone. The schema shows that bills have both amount and amount_paid columns, but doesn't explain when to use which. Without annotations, the model guesses, and guesses wrong.
Why It Fails
Direct phrasing (0%): "How much have we paid to each vendor?" sounds clear, but the model typically queries bank_transactions looking for payments to vendors, or uses bills.amount (what we owe) instead of bills.amount_paid (what we've actually paid).
Vague phrasing (0%): "Who have we paid?" is too ambiguous. The model doesn't know if we want customer refunds, vendor payments, or employee salaries.
Detailed phrasing (0%): Even mentioning "bill payment records" doesn't help. The model still uses the wrong column (amount vs amount_paid).
Approach 2: Schema + Descriptions
Results summary: 79% overall accuracy (+18 from Approach 1). The biggest improvement is vendor payments, jumping from 20% to 61%. Spending by category also improves from 83% to 96%. Recent customers stays flat at 80% because the schema already contained enough information for that query. The remaining failures are mostly vague phrasings that require entity disambiguation.
Approach 1 gave the model structure but no meaning. This approach adds documentation: what each table represents, what each column means, example values, and usage notes. The hypothesis is that business context will help the model make better inferences.
It's worth noting that these annotations didn't work on the first try. We iterated through several versions, running tests, identifying failure patterns, and rewriting descriptions before landing on documentation that actually improved accuracy. More context isn't automatically better. It needs to target the specific failure modes, and poorly written annotations can make things worse.
The model now receives:
- Everything from Approach 1 (table names, column types, foreign keys, enums)
- Table descriptions: What each table represents in business terms
- Column descriptions: What each column means and how to interpret it
- Usage notes: Important business rules (like Plaid's sign convention)
- Examples: Sample values to clarify data formats
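The annotation layer might be structured along these lines. The wording is paraphrased from the descriptions quoted in this post, and the shape of the structure is an assumption; the exact format passed to Claude isn't shown here:

```python
# Illustrative annotation structure (paraphrased, not the exact text).
ANNOTATIONS = {
    "bank_transactions": {
        "description": "Individual bank transactions synced from Plaid.",
        "columns": {
            "amount": (
                "Plaid sign convention: POSITIVE = money leaving the account "
                "(expense), NEGATIVE = money entering the account (income/deposit)."
            ),
            "category_primary": (
                "High-level expense classification; use this for "
                "category-level spending breakdowns."
            ),
        },
    },
    "bills": {
        "description": "Obligations to vendors.",
        "columns": {
            "amount": "Total owed on the bill.",
            "amount_paid": "The portion of the bill that has been paid.",
        },
    },
}

def render_annotations(annotations: dict) -> str:
    """Flatten annotations into plain prompt text."""
    out = []
    for table, meta in annotations.items():
        out.append(f"{table}: {meta['description']}")
        for col, desc in meta["columns"].items():
            out.append(f"  - {col}: {desc}")
    return "\n".join(out)
```

This is the information the schema alone couldn't carry: the sign convention, and when to prefer amount_paid over amount.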
Test: Recent Customers
The same test as Approach 1. The annotations now explain that "customers" are "people or companies that owe you money" and that created_at indicates "when the customer was added."
We test this query with 5 phrasing styles:
- Direct
- "Show me our 5 most recent customers"
- Ambiguous
- "Who are some of the newer people we work with?"
- Verbose
- "Retrieve the 5 most recently added customer records from the database, sorted by their creation date in descending order, including their name, company, and contact information"
- Explicit
- "SELECT from customers ORDER BY created_at DESC LIMIT 5"
- Conversational
- "Hey, who are our newest customers? Just show me like 5 of them."
Results
Analysis
At 80% overall accuracy (unchanged from Approach 1), this test shows that annotations don't always help. The list-customers query was already performing well with schema-only context. Both approaches achieve 100% on direct, detailed, technical, and casual phrasings. Both fail completely on vague phrasing.
This is an important finding: annotations add value when they provide information the model can't infer from structure. For this query, the schema already contains everything needed. The customers table is clearly named, created_at is a standard timestamp, and the foreign key to businesses is explicit.
Why It Works (When It Works)
All structured phrasings (100%): Direct, detailed, technical, and casual variants all succeed. Despite their differences in tone and verbosity, each contains either "customers" or enough context to map to the right table.
The annotations don't hurt. They just don't add value for this particular query. The column descriptions confirm what the schema already makes clear.
Why It Fails
Ambiguous phrasing (0%): Annotations don't solve entity disambiguation. The phrase "people we work with" could mean customers (people who pay us) or vendors (people we pay). Without either term in the question, the model has no anchor.
In our tests, the model consistently interpreted "people we work with" as vendors. The annotations even describe customers as "people or companies that owe you money" and vendors as "people or companies you owe money to." But which one matches "people we work with" requires business context the model doesn't have: DataPulse Analytics is a B2B company where "working with" typically refers to clients, not suppliers.
This failure mode is instructive. Annotations help when the question can be mapped to schema elements. They don't help when the mapping itself is ambiguous. Solving this would require either terminology mappings ("people we work with" → customers) or a conversational system that asks for clarification.
Test: Spending by Category
The annotations now explicitly document Plaid's sign convention: "POSITIVE = money leaving account (expense), NEGATIVE = money entering account (income/deposit)." This should resolve the ambiguity that caused failures in Approach 1.
We test this query with 5 phrasing styles:
- Direct
- "What is our spending by category?"
- Ambiguous
- "What are we doing with our money in different areas?"
- Verbose
- "Provide a comprehensive breakdown of total expenditures categorized by expense type, showing dollar amounts for each category, sorted by total amount spent in descending order"
- Explicit
- "Execute an aggregate query on bank_transactions to calculate SUM of amounts grouped by expense category, filtering for debit transactions"
- Conversational
- "Hey, what are we spending our money on? Can you break it down by category?"
Results
Analysis
At 96% overall accuracy (up from 83% in Approach 1), the annotations provide meaningful improvement. The biggest gain is in vague phrasing: from 25% to 95%. The annotations explain that category_primary represents "high-level expense classification," which helps the model choose the right grouping level.
This is a +13 percentage point improvement overall, driven almost entirely by better handling of ambiguous questions. When the question doesn't specify granularity, the annotations provide a clear default.
Why It Works (When It Works)
Direct, detailed, and technical phrasing (100%): These variants already performed well in Approach 1. The annotations reinforce correct behavior without changing outcomes.
Vague phrasing (95%): This is the big win. "What are we doing with our money in different areas?" now consistently groups by category_primary because the annotations describe it as the appropriate level for expense breakdowns. The model no longer adds merchant names or detailed categories to the grouping.
Why It Fails
Casual phrasing (85%): Interestingly, casual phrasing dropped from 95% to 85%. The informal language ("hey, what are we spending our money on?") sometimes triggers more creative interpretations that don't match the expected output exactly.
This suggests annotations can occasionally overconstrain. When the model tries to be helpful by including extra context from the annotations, it may deviate from the simpler query we expect.
Test: Vendor Payments
The annotations now explain that bills.amount_paid represents "the portion of the bill that has been paid" and that bills track "obligations to vendors." This should clarify which column to use.
We test this query with 5 phrasing styles:
- Direct
- "How much have we paid to each vendor?"
- Ambiguous
- "Who have we paid?"
- Verbose
- "List each vendor and the total amount we have paid them based on our bill payment records."
- Explicit
- "SELECT vendors JOIN bills, SUM(amount_paid) GROUP BY vendor"
- Conversational
- "Show me how much we have paid each of our vendors"
Results
Analysis
At 61% overall accuracy (up from 20% in Approach 1), annotations provide the largest improvement of any test case: +41 percentage points. The model now understands that amount_paid on bills represents actual payments made to vendors.
Direct phrasing jumps from 0% to 95%. "How much have we paid to each vendor?" now correctly maps to bills.amount_paid because the annotations explain what that column represents. Technical phrasing remains at 100%, and casual improves to 65%.
But vague phrasing still fails completely (0%). "Who have we paid?" doesn't contain enough context to determine that we want vendor payments specifically. And detailed phrasing underperforms (45%) because mentioning "bill payment records" sometimes causes the model to over-specify the query.
This test case demonstrates where annotations add the most value: when domain knowledge cannot be inferred from schema structure. The schema shows that bills have both amount and amount_paid, but doesn't explain when to use which. Annotations bridge that gap, but only when the question contains enough signal to trigger the right interpretation.
Approach 3: Schema + Descriptions + Examples
Results summary: 95% overall accuracy (+16 from Approach 2). Spending by category hits 100%. Vendor payments jumps from 61% to 96%, with vague phrasings improving from 0% to 85%. The only remaining failure mode is entity disambiguation in the customers test, where vague phrasings still hover around 50%.
Approach 2 required significant iteration. The annotations didn't work on the first try, and it took several rounds of testing and rewriting before they started helping. This approach tests whether we can shortcut that process by showing the model examples of correct queries instead of describing what columns mean.
The model now receives:
- Everything from Approach 2 (schema, column descriptions, usage notes)
- Semantically similar examples: For each question, we retrieve the 5 most similar questions from our example database and include their correct SQL
We maintain a database of roughly 30 question-SQL pairs covering common query patterns. When a new question comes in, we generate an embedding and find the closest matches using cosine similarity. Those examples are included in the prompt as few-shot demonstrations.
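The retrieval step is standard: embed the incoming question, rank the stored pairs by cosine similarity, and keep the top five. A minimal sketch with toy vectors (the real system uses OpenAI embeddings, so the vectors would be far longer; the tuple layout is an assumption):

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def top_k_examples(question_vec, examples, k=5):
    """examples: list of (embedding, question, sql) tuples.

    Returns the k (question, sql) pairs closest to the incoming question,
    to be spliced into the prompt as few-shot demonstrations.
    """
    ranked = sorted(examples, key=lambda ex: cosine(question_vec, ex[0]), reverse=True)
    return [(q, sql) for _, q, sql in ranked[:k]]
```

With only ~30 stored pairs a brute-force scan like this is plenty; a vector index only becomes worth it at much larger example counts.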
The hypothesis is that by showing the model how similar questions were correctly answered, it should pick up the right patterns (entity disambiguation, column selection, grouping conventions) without us having to document every edge case.
There's a risk here worth noting. User feedback can conflict. The vague question "who are the people we work with" could mean customers (people who pay us) or vendors (people we pay). One user might expect customers, another might expect vendors. If we collect feedback from both, the examples could cancel each other out or actively make things worse. Small sample sizes amplify this problem.
Test: Recent Customers
The same test as before. Our RAG examples include variations of "who are our customers" and "show recent customers" but use different phrasings than the test variants. The test uses novel phrasings to avoid testing on "training data."
We test this query with 5 phrasing styles:
- Direct
- "Show me our 5 most recent customers"
- Ambiguous
- "Who are some of the newer people we work with?"
- Verbose
- "Retrieve the 5 most recently added customer records from the database, sorted by their creation date in descending order, including their name, company, and contact information"
- Explicit
- "SELECT from customers ORDER BY created_at DESC LIMIT 5"
- Conversational
- "Hey, who are our newest customers? Just show me like 5 of them."
Results
Analysis
At 90% overall accuracy (+10 from Approach 2), examples help the model handle more phrasing variations. Vague phrasing improves from 0% to 50%, which is notable given that we deliberately excluded the test phrasings from our example database.
The remaining failures illustrate the conflict risk. Our RAG examples include "who are the people we work with" mapped to customers. But a different organization might expect that to mean vendors. If we collected feedback from both types of users, the examples would contradict each other, and accuracy could actually decrease.
The 50% accuracy on vague phrasing suggests the model is genuinely uncertain. Sometimes the examples guide it correctly, sometimes not. This is the expected behavior when the question is fundamentally ambiguous and the examples don't fully resolve that ambiguity.
Test: Spending by Category
Our examples include queries for expense breakdowns and category groupings. The test checks whether similar examples help the model maintain consistent grouping behavior.
We test this query with 5 phrasing styles:
- Direct
- "What is our spending by category?"
- Ambiguous
- "What are we doing with our money in different areas?"
- Verbose
- "Provide a comprehensive breakdown of total expenditures categorized by expense type, showing dollar amounts for each category, sorted by total amount spent in descending order"
- Explicit
- "Execute an aggregate query on bank_transactions to calculate SUM of amounts grouped by expense category, filtering for debit transactions"
- Conversational
- "Hey, what are we spending our money on? Can you break it down by category?"
Results
Analysis
At 100% accuracy across all phrasing styles, this is the first test case to achieve perfect scores. The examples provide strong guidance on grouping by category_primary rather than category_detailed or merchant name.
This is the best-case scenario for RAG: the examples match the query pattern closely, and there's no ambiguity in what the user wants. When someone asks about spending by category, the semantically similar examples all agree on how to answer it.
Test: Vendor Payments
This test was where Approach 1 struggled most (20% accuracy). Examples include vendor payment queries that explicitly use bills.amount_paid. The question is whether seeing correct examples transfers to novel phrasings.
We test this query with 5 phrasing styles:
- Direct
- "How much have we paid to each vendor?"
- Ambiguous
- "Who have we paid?"
- Verbose
- "List each vendor and the total amount we have paid them based on our bill payment records."
- Explicit
- "SELECT vendors JOIN bills, SUM(amount_paid) GROUP BY vendor"
- Conversational
- "Show me how much we have paid each of our vendors"
Results
Analysis
At 96% overall accuracy (+35 from Approach 2), this is the biggest improvement of any test case. Vague phrasing jumps from 0% to 85%. The examples teach the model that "companies we sent money to" means vendors, and that the correct column is amount_paid on bills.
The remaining 15% failure rate on vague phrasing is informative. The test uses "which companies did we send money to?" which is close to but not identical to our example phrasings. Sometimes the model still misinterprets this as asking about bank transactions instead of bills.
This result suggests that RAG examples are most effective when they cover the conceptual space well but don't need to match exact phrasings. The model generalizes from patterns, not from verbatim matches.
Conclusions
The experiments suggest a few things worth noting about text-to-SQL in practice.
Context quality matters more than context quantity. Adding descriptions improved accuracy by 18 percentage points, but only after several iterations of testing and rewriting. Our first attempt at annotations didn't help at all. The descriptions that worked were the ones that addressed specific failure modes we'd already observed. Generic documentation is less useful than targeted clarification.
Examples transfer surprisingly well to novel phrasings. The RAG approach uses semantic similarity to find relevant examples, not keyword matching. A question like "which companies did we send money to?" can benefit from an example about "vendor payments" even though the words don't overlap much. The model generalizes from the pattern, not the exact phrasing.
Some ambiguity can't be resolved with more context. "Who are the people we work with?" fails at 50% even with examples, because the question is genuinely ambiguous. Different users would expect different answers. This isn't a context problem; it's an input problem. Solving it would require either constraining what users can ask or adding a clarification step.
User feedback can conflict. RAG examples are powerful, but collecting them from user feedback introduces risk. If one user marks "people we work with" as customers and another marks it as vendors, the examples contradict each other. In small samples, this can make accuracy worse rather than better. Production systems would need curation workflows or statistical filtering.
Potential Improvements
Several approaches could push accuracy higher, though each comes with tradeoffs.
More capable models. We used Claude Sonnet 4, which is optimized for speed and cost. A more capable model like Claude Opus might handle ambiguous queries better, particularly ones requiring multi-step reasoning. The tradeoff is latency and cost, both of which matter for interactive applications.
Agentic pipelines. Instead of a single prompt, the model could decompose complex queries into steps: first identify the intent, then select relevant tables, then generate SQL, then validate the result. Each step can include self-correction. The MAC-SQL paper shows this approach reaching higher accuracy on benchmarks, at the cost of multiple LLM calls per query.
Clarifying questions. When a query is ambiguous, the system could ask for clarification before generating SQL. "By 'people we work with,' do you mean customers or vendors?" This turns a 50% guess into a reliable answer, at the cost of adding friction to the interaction. For enterprise analytics tools, this tradeoff is probably worth it.
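A minimal sketch of such a gate, under loud assumptions: the phrase table and helper below are hypothetical, and a production system would detect ambiguity with the model itself rather than a string lookup.

```python
# Hypothetical ambiguity gate: if a question uses a phrase that maps to
# more than one entity, ask instead of guessing.
AMBIGUOUS_TERMS = {
    "people we work with": ["customers", "vendors"],
    "who have we paid": ["vendors", "employees"],
}

def maybe_clarify(question: str):
    """Return a clarifying question if the input is ambiguous, else None."""
    q = question.lower()
    for phrase, entities in AMBIGUOUS_TERMS.items():
        if phrase in q:
            options = " or ".join(entities)
            return f"By '{phrase}', do you mean {options}?"
    return None  # unambiguous: proceed to SQL generation
```

Even this crude version converts the worst failure mode in our tests (a coin-flip guess on entity) into a single extra turn of conversation.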
Query validation and refinement. After generating SQL, the system could execute it on a sample of data and show the user what it found. If the results look wrong, the user can provide feedback and the model can refine the query. This creates a feedback loop without requiring the user to understand SQL.
Semantic layers remain the most reliable path. Instead of generating raw SQL, the model would select from pre-defined business metrics. This eliminates the SQL generation failure mode entirely, at the cost of flexibility. It also makes the system auditable: you can see exactly what metrics exist and how they're calculated, rather than trusting arbitrary generated queries.
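A sketch of the pattern, with invented metric names and definitions: the model's output is constrained to a catalog key, and anything outside the catalog is rejected rather than executed.

```python
# Hypothetical metric catalog: each entry is verified SQL written by a human.
# The model only ever chooses a key; it never writes SQL.
METRICS = {
    "spending_by_category": """
        SELECT t.category_primary, SUM(t.amount) AS total_spent
        FROM bank_transactions t
        JOIN bank_accounts a ON a.id = t.account_id
        WHERE a.business_id = :business_id AND t.amount > 0
        GROUP BY t.category_primary
        ORDER BY total_spent DESC
    """,
    "total_paid_per_vendor": """
        SELECT v.display_name, SUM(b.amount_paid) AS total_paid
        FROM vendors v JOIN bills b ON b.vendor_id = v.id
        GROUP BY v.id, v.display_name
    """,
}

def resolve_metric(model_choice: str) -> str:
    """Map the model's chosen metric name to its verified SQL.

    Unknown choices are rejected instead of executed, which is the whole
    point: no generated SQL ever reaches the database.
    """
    if model_choice not in METRICS:
        raise ValueError(f"Unknown metric: {model_choice!r}")
    return METRICS[model_choice]
```

The auditability claim falls out directly: the catalog is the complete list of what the system can compute, and each definition can be reviewed once rather than per query.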
References & Further Reading
The approach in this post draws from several areas of research and industry practice.
Benchmarks
The BIRD benchmark (Big Bench for Large-scale Database Grounded Text-to-SQL Evaluation) informed our testing methodology. BIRD focuses on realistic enterprise databases with dirty data, complex schemas, and domain-specific terminology. Their finding that schema context alone yields roughly 40% accuracy on complex queries aligns with what we observed.
The Spider benchmark is the older standard for text-to-SQL evaluation. It's cleaner than BIRD but less realistic for production use cases.
RAG for Text-to-SQL
The example-based approach (Approach 3) is a simplified version of what Vanna.ai calls "training" on your database. Their insight is that embedding-based retrieval of similar queries provides strong few-shot guidance without fine-tuning the model. We use OpenAI's text-embedding-3-small for similarity search, though any embedding model would work.
Semantic Layers
The idea of constraining LLM output to pre-defined metrics comes from the semantic layer pattern popularized by tools like Cube and dbt's Semantic Layer. Instead of generating SQL directly, the LLM selects from a catalog of business metrics, each with a verified SQL definition. This trades flexibility for reliability.
Multi-Agent Approaches
The MAC-SQL paper (Multi-Agent Collaborative Text-to-SQL) explores decomposing the problem into specialized agents: one for understanding the question, one for selecting tables, one for generating SQL, and one for validation. This is a natural next step beyond what we tested here, particularly for complex queries that require multi-step reasoning.
Academic Survey
For a comprehensive overview of the field, the ACM Computing Surveys paper on LLMs for Text-to-SQL covers prompting strategies, fine-tuning approaches, and evaluation methodologies in detail.