
How AI Generates SQL From Natural Language: A Technical Deep Dive

Sarah Chen
December 15, 2025
10 min read

Natural language to SQL (NL2SQL) is the process of converting plain English questions into executable database queries. At Skopx, our NL2SQL pipeline achieves 94.2% accuracy on production workloads across PostgreSQL and Supabase databases, with a median latency of 380ms from question to query result. This article breaks down the architecture, the hard trade-offs, and the techniques that make it work.

What Is Natural Language to SQL?

Natural language to SQL is a technique where an AI model parses a human question, like "What were our top 10 customers by revenue last quarter?", and generates a syntactically correct, semantically accurate SQL query against a specific database schema. Unlike generic text-to-SQL demos, production NL2SQL must handle schema ambiguity, access control, and query safety at scale.

How Does the Schema Understanding Layer Work?

The first stage of our pipeline is schema introspection. When a user connects a PostgreSQL or Supabase database, we run a metadata extraction process that captures table names, column types, foreign key relationships, indexes, and row count estimates. This metadata is compressed into a schema representation that fits within Claude's context window, typically 2,000-4,000 tokens for a 50-table database.
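As an illustration, the compression step might render each table as one compact, token-efficient line. The field names and output format below are assumptions for the sketch, not Skopx's actual internal representation:

```typescript
interface ColumnMeta {
  name: string;
  type: string;
  isPrimaryKey?: boolean;
}

interface TableSummary {
  name: string;
  columns: ColumnMeta[];
  foreignKeys: { column: string; references: string }[];
  rowEstimate: number;
}

// Render one table as a single compact line, e.g.
// "orders(id bigint pk, customer_id -> customers.id, total_amount numeric) ~1000 rows"
function compressTable(t: TableSummary): string {
  const cols = t.columns.map(c => {
    const fk = t.foreignKeys.find(f => f.column === c.name);
    if (fk) return `${c.name} -> ${fk.references}`;  // FK edges replace raw types
    return `${c.name} ${c.type}${c.isPrimaryKey ? " pk" : ""}`;
  });
  return `${t.name}(${cols.join(", ")}) ~${t.rowEstimate} rows`;
}
```

Inlining foreign keys as `->` edges keeps JOIN paths visible to the model without spending tokens on a separate relationship section.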

We use a tiered approach. For databases with fewer than 30 tables, we include the full schema in every prompt. For larger databases (we have users with 200+ tables), we use vector search over column descriptions and table comments to select only the 8-12 most relevant tables for each question. This relevance filtering alone reduced hallucinated table references by 73%.

// Schema relevance scoring
interface TableMeta {
  name: string;
  embedding: number[];          // precomputed at connection setup
  recentQueryFrequency: number; // normalized 0-1
  hasComment: boolean;
}

interface SchemaContext {
  tables: TableMeta[];
  relevanceScores: Map<string, number>;
  foreignKeys: ForeignKeyChain[];
}

async function selectRelevantSchema(
  question: string,
  fullSchema: SchemaContext,
  maxTables: number = 12
): Promise<SchemaContext> {
  const embedding = await embedQuestion(question);
  const scored = fullSchema.tables.map(t => ({
    table: t,
    score: cosineSimilarity(embedding, t.embedding)
      + (t.recentQueryFrequency * 0.2)  // boost frequently queried tables
      + (t.hasComment ? 0.1 : 0)        // boost documented tables
  }));
  const top = topK(scored, maxTables);  // highest-scoring tables first
  return {
    tables: top.map(s => s.table),
    relevanceScores: new Map(top.map(s => [s.table.name, s.score])),
    foreignKeys: fullSchema.foreignKeys,  // keep full FK graph for JOIN validation
  };
}

How Does the Query Generation Pipeline Work?

Once we have the relevant schema context, we construct a prompt for Claude that includes the schema, the user's question, and a set of safety constraints. The prompt engineering here is critical: small changes in instruction phrasing produce measurably different accuracy rates.

Our prompt includes three key sections: (1) the schema context with column types and relationships, (2) a set of SQL safety rules (no DROP, DELETE, or UPDATE unless explicitly permitted), and (3) example question-SQL pairs specific to that database. These few-shot examples come from our learning engine, which stores successful past queries with user feedback scores.
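A sketch of how those three sections might be assembled. The section headings, rule wording, and few-shot count here are illustrative assumptions, not the production prompt:

```typescript
interface FewShotExample {
  question: string;
  sql: string;
  score: number;  // user feedback score from the learning engine
}

// Assemble the three prompt sections: schema, safety rules, few-shot examples.
function buildPrompt(
  schemaContext: string,
  question: string,
  examples: FewShotExample[],
  allowWrites = false
): string {
  const rules = [
    "Generate a single SQL query answering the question.",
    allowWrites ? "" : "Never emit DROP, DELETE, UPDATE, or other mutations.",
    "Only reference tables and columns listed in the schema.",
  ].filter(Boolean);
  const shots = [...examples]
    .sort((a, b) => b.score - a.score)  // highest-rated examples first
    .slice(0, 3)
    .map(e => `Q: ${e.question}\nSQL: ${e.sql}`);
  return [
    `## Schema\n${schemaContext}`,
    `## Rules\n${rules.join("\n")}`,
    `## Examples\n${shots.join("\n\n")}`,
    `## Question\n${question}`,
  ].join("\n\n");
}
```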

The model generates a candidate SQL query, which then passes through a validation layer. This layer parses the SQL AST, checks that all referenced tables and columns exist in the schema, validates that JOIN conditions match foreign keys, and ensures the query does not violate row-level security policies. Queries that fail validation are retried with error context; this self-correction loop recovers about 31% of initially invalid queries.
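The retry loop can be sketched as follows. Here `generateSql` and `validateSql` stand in for the model call and the AST checks described above; both interfaces are assumptions, not the production code:

```typescript
interface ValidationResult {
  ok: boolean;
  error?: string;
}

// Generate SQL, validate it, and retry with error context on failure.
async function generateWithRetry(
  question: string,
  generateSql: (q: string, priorError?: string) => Promise<string>,
  validateSql: (sql: string) => ValidationResult,
  maxAttempts = 3
): Promise<string> {
  let lastError: string | undefined;
  for (let attempt = 0; attempt < maxAttempts; attempt++) {
    // On retries, feed the validator's error back so the model can self-correct.
    const sql = await generateSql(question, lastError);
    const result = validateSql(sql);
    if (result.ok) return sql;
    lastError = result.error;
  }
  throw new Error(`No valid SQL after ${maxAttempts} attempts: ${lastError}`);
}
```

Passing the validator's error message back into the prompt is what makes the loop self-correcting rather than a blind resample.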

What Makes Multi-Dialect Support Difficult?

Supporting multiple SQL dialects (PostgreSQL, MySQL, SQLite) introduces subtle complexity. PostgreSQL uses ILIKE for case-insensitive matching while MySQL uses LOWER() with LIKE. Date functions differ significantly: DATE_TRUNC('month', created_at) in PostgreSQL versus DATE_FORMAT(created_at, '%Y-%m-01') in MySQL.

We handle this by injecting dialect-specific rules into the prompt and maintaining a dialect translation layer that catches common cross-dialect mistakes. In practice, 89% of our production queries target PostgreSQL through Supabase, so that dialect receives the most optimization. Our PostgreSQL-specific accuracy is 96.1%, while MySQL sits at 91.3%.
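A minimal sketch of both mechanisms: per-dialect rules injected into the prompt, and a post-generation catch for one common cross-dialect mistake. The rule text and the single ILIKE rewrite are illustrative; the production rule set and translation layer cover far more cases:

```typescript
type Dialect = "postgresql" | "mysql" | "sqlite";

// Illustrative per-dialect prompt rules.
const DIALECT_RULES: Record<Dialect, string[]> = {
  postgresql: [
    "Use ILIKE for case-insensitive matching.",
    "Use DATE_TRUNC for date bucketing.",
  ],
  mysql: [
    "Use LOWER(col) LIKE LOWER(pattern); ILIKE does not exist.",
    "Use DATE_FORMAT for date bucketing; DATE_TRUNC does not exist.",
  ],
  sqlite: [
    "LIKE is case-insensitive for ASCII by default.",
    "Use strftime for date bucketing.",
  ],
};

// Catch ILIKE leaking into a non-PostgreSQL query after generation.
function fixCrossDialect(sql: string, dialect: Dialect): string {
  if (dialect !== "postgresql" && /\bILIKE\b/i.test(sql)) {
    return sql.replace(/(\S+)\s+ILIKE\s+(\S+)/gi, "LOWER($1) LIKE LOWER($2)");
  }
  return sql;
}
```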

How Do You Prevent Dangerous Queries?

Query safety is non-negotiable. Every generated query passes through three safety layers before execution. First, the SQL AST parser rejects any DDL statements (CREATE, ALTER, DROP) and DML mutations (INSERT, UPDATE, DELETE) unless the user has explicitly enabled write mode for that connection. Second, we inject a LIMIT 1000 clause on any query that lacks a LIMIT, preventing accidental full-table scans on tables with millions of rows. Third, we enforce Supabase Row Level Security policies by executing queries through the user's authenticated connection, never through a service role key.

-- Example: Generated query with safety constraints applied
SELECT
  c.name AS customer_name,
  SUM(o.total_amount) AS total_revenue
FROM customers c
JOIN orders o ON o.customer_id = c.id
WHERE o.created_at >= DATE_TRUNC('quarter', CURRENT_DATE - INTERVAL '3 months')
  AND o.created_at < DATE_TRUNC('quarter', CURRENT_DATE)
GROUP BY c.name
ORDER BY total_revenue DESC
LIMIT 10;
-- Execution: read-only connection, RLS enforced, 1.2s timeout
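The first two safety layers can be sketched as follows. The regex-based checks here are a deliberate simplification: as described above, the production pipeline operates on the parsed SQL AST, which avoids false matches inside string literals or subqueries:

```typescript
// Layer 1: reject DDL and DML mutations unless write mode is enabled.
const MUTATION = /^\s*(INSERT|UPDATE|DELETE|DROP|ALTER|CREATE|TRUNCATE)\b/i;

function assertReadOnly(sql: string, writeMode = false): void {
  if (!writeMode && MUTATION.test(sql)) {
    throw new Error("mutation blocked: write mode is disabled for this connection");
  }
}

// Layer 2: inject LIMIT 1000 on any query that lacks a trailing LIMIT,
// preventing accidental full-table scans.
function ensureLimit(sql: string, maxRows = 1000): string {
  const trimmed = sql.trim().replace(/;$/, "");
  if (/\bLIMIT\s+\d+\s*$/i.test(trimmed)) return `${trimmed};`;
  return `${trimmed} LIMIT ${maxRows};`;
}
```

The third layer, RLS enforcement, is a property of the connection itself rather than of the query text, which is why it cannot be bypassed by anything the model generates.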

How Does the Learning Loop Improve Accuracy Over Time?

Every query result is presented to the user with a thumbs-up/thumbs-down feedback mechanism. When a user confirms a query was correct, we store the question-SQL pair as a few-shot example for that specific database schema. When a user marks a query as wrong, we store the correction.

Over time, this creates a per-database knowledge base of proven query patterns. After 50+ feedback signals, we typically see accuracy improve from 94% to 97% for that specific database. The learning engine uses exponential moving averages with debiasing to weight recent feedback more heavily, preventing stale patterns from degrading quality.
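A debiased exponential moving average over +1/-1 feedback signals might look like the sketch below, using Adam-style bias correction so that early estimates are not pulled toward zero. The decay value 0.95 is illustrative, not Skopx's tuned parameter:

```typescript
// Debiased EMA over feedback signals: +1 for correct, -1 for wrong.
class FeedbackScore {
  private v = 0;  // raw (biased) moving average
  private t = 0;  // number of signals seen

  constructor(private beta = 0.95) {}

  update(signal: number): number {
    this.t += 1;
    this.v = this.beta * this.v + (1 - this.beta) * signal;
    // Bias correction: divide by (1 - beta^t) so early values are not
    // dragged toward the zero initialization.
    return this.v / (1 - Math.pow(this.beta, this.t));
  }
}
```

Without the correction term, the first few estimates would sit near zero regardless of feedback, exactly the "stale pattern" problem the weighting is meant to avoid.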

What Are the Latency Characteristics?

End-to-end latency breaks down as follows: schema retrieval (15ms with caching, 120ms cold), vector search for relevant tables (45ms), prompt construction (5ms), Claude API call (250-600ms depending on query complexity), SQL validation (10ms), query execution (variable, median 80ms). Total median latency is 380ms for cached schemas, 480ms for cold starts.

We optimize the critical path by pre-computing schema embeddings on connection setup and caching them in memory. The Claude API call is the dominant cost, and we use Claude 3.5 Haiku for simple queries (sub-200ms) and Claude 3.5 Sonnet for complex multi-join queries that need deeper reasoning.
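The routing decision between the two models might be sketched as a simple heuristic over the signals already available at prompt time. The thresholds, keyword list, and model identifiers here are illustrative assumptions:

```typescript
// Route simple lookups to the fast model and multi-join analytical
// questions to the stronger model.
function pickModel(question: string, relevantTableCount: number): string {
  const needsJoins = relevantTableCount > 2;
  const complexWording =
    /\b(compare|trend|cohort|percentile|ratio|versus|year over year)\b/i.test(question);
  return needsJoins || complexWording
    ? "claude-3-5-sonnet"  // deeper reasoning for multi-join analytics
    : "claude-3-5-haiku";  // low latency for simple lookups
}
```

Because the relevant-table count comes out of the vector search stage, routing adds no extra latency to the critical path.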

Key Takeaways

Building production NL2SQL requires solving schema understanding, query safety, dialect support, and continuous learning simultaneously. No single technique achieves high accuracy alone; it is the combination of vector-based schema selection, few-shot learning from user feedback, AST-based validation, and self-correction loops that brings accuracy above 94%. The hardest remaining problems are ambiguous column names (does "revenue" mean gross or net?) and implicit time ranges (does "recent" mean last week or last month?). We address these through business context extraction, which we cover in our live platform architecture post.

Sarah Chen

Contributing writer at Skopx
