Skip to main content

SQL Efficiency acts as a heuristic query cost analyzer, assessing whether a generated SQL query is structured efficiently or poses an availability risk to the database.

Metric definition

SQL Efficiency — A binary metric that evaluates whether the generated SQL query avoids known performance anti-patterns and resource exhaustion risks.
  • Type: Binary
    • 1 (Performant): The query is structured efficiently and poses no obvious availability risk.
    • 0 (Non-Performant): The query uses known anti-patterns or poses a resource exhaustion risk.
This metric assumes the generated SQL is semantically correct and secure. Its primary purpose is to assess query efficiency and identify potential Denial of Service (DoS) risks to downstream databases. Here’s a scale that shows the relationship between SQL Efficiency and potential impact on your AI system:
0100
Inefficient
Query uses anti-patterns or poses resource exhaustion risks.
Efficient
Query is well-structured with no obvious performance issues.
Scale is 0–100 and is derived from binary judgments converted into a confidence score.

Calculation method

SQL Efficiency is computed through a comprehensive code review process:
1

Model Request

One or more evaluation requests are sent to an LLM evaluator to analyze the SQL query for performance anti-patterns.
2

Prompt Engineering

A specialized chain-of-thought prompt guides the model to evaluate the query for efficiency issues including cross-product detection, resource exhaustion risks, index suppression, and structural flaws.
3

Evaluation Process

The evaluator analyzes the query for known anti-patterns such as Cartesian joins, unbounded recursive CTEs, non-SARGable predicates, inefficient outer join filtering, and correlated subqueries.
4

Score Calculation

Based on the evaluation, a binary score is assigned: 1 (Performant) if the query is efficient, or 0 (Non-Performant) if anti-patterns are detected.
This metric is computed by prompting an LLM and may require multiple LLM calls to compute, which can impact usage and billing.

Supported nodes

  • LLM span
Inputs considered (when available):
  • The generated SQL query (output)
  • Optional context such as the natural language query
  • Schema information for understanding table relationships

What constitutes efficient (1)

  • Avoids Anti-Patterns: No known, statically detectable performance anti-patterns.
  • No Resource Exhaustion: No obvious structures like Cartesian joins or unbounded recursive CTEs.
  • Efficient Filtering: Filters correctly placed (e.g., ON clause for LEFT JOIN conditions).
  • Index-Friendly: Predicates structured to allow index utilization.

What constitutes inefficient (0)

Unconstrained Cartesian Joins

Multiple tables listed without linking conditions, creating massive data explosions:
-- Inefficient: Creates N × M rows
SELECT T1.name, T2.dept_name 
FROM employees AS T1, departments AS T2

-- Efficient: Properly joined
SELECT T1.name, T2.dept_name 
FROM employees AS T1 
JOIN departments AS T2 ON T1.dept_id = T2.id

Unsafe Recursive CTEs

Recursive Common Table Expressions without proper termination conditions:
-- Inefficient: Resource exhaustion risk
WITH RECURSIVE NumberSequence AS (
    SELECT 1 as num
    UNION ALL
    SELECT num + 1 FROM NumberSequence WHERE num < 100000
)
SELECT * FROM NumberSequence

Inefficient Outer Join Filtering

Placing filters for outer-joined tables in WHERE instead of ON:
-- Inefficient: Nullifies LEFT JOIN behavior
SELECT T1.dept_name, T2.name
FROM departments AS T1
LEFT JOIN employees AS T2 ON T1.id = T2.dept_id
WHERE T2.name = 'John'

-- Efficient: Filter in ON clause
SELECT T1.dept_name, T2.name
FROM departments AS T1
LEFT JOIN employees AS T2 ON T1.id = T2.dept_id AND T2.name = 'John'

Index-Suppressing Predicates (Non-SARGable)

Wrapping columns in functions prevents index usage:
-- Inefficient: Forces full table scan
WHERE YEAR(date_col) = 2023

-- Efficient: Allows index usage
WHERE date_col BETWEEN '2023-01-01' AND '2023-12-31'

Other Anti-Patterns

  • SELECT *: Retrieving all columns when only specific ones are needed
  • Correlated Subqueries: Subqueries referencing outer query columns (N+1 problem)
  • Implicit Type Casting: Comparing mismatched types (e.g., VARCHAR to unquoted integers)
  • Misplaced HAVING: Using HAVING for non-aggregated column filters
  • Unnecessary DISTINCT: Deduplicating already-unique columns
  • Inefficient UNION: Using UNION when UNION ALL suffices

Example use cases

  • Pre-flight validation of generated SQL before execution on production databases.
  • Identifying potential DoS risks in user-facing Text-to-SQL applications.
  • Training and improving SQL generation models to avoid anti-patterns.
  • Code review automation for data analytics pipelines.
Example: A business intelligence platform where users generate ad-hoc queries — SQL Efficiency catches a generated query with a missing JOIN condition that would create a Cartesian product of 10 million rows, preventing a potential database outage.

Best practices

Specify SQL dialect and schema

Providing the SQL dialect and schema information helps the evaluator understand dialect-specific anti-patterns and table relationships for optimization analysis.

Combine with other Text-to-SQL metrics

Use alongside SQL Correctness and SQL Adherence for comprehensive validation of generated queries.

Iterate with CLHF

Use continuous learning via human feedback to tune detection for your specific database patterns.

Review flagged queries

Manually review inefficient queries to understand patterns and improve your SQL generation prompts.