SQL That Reads Between the Lines

LLM-powered aggregate functions for DuckDB — turning 2,392 Yelp reviews into operational intelligence

TL;DR

We analyzed 11 years of Yelp reviews for Commander's Palace (New Orleans) using two SQL aggregate functions: qf_llm_summarize reads each group of reviews and returns structured JSON with scores, issues, and action items; qf_llm_distill takes the messy category labels those analyses produce and builds a clean multi-level taxonomy. Both use Claude under the hood. The entire pipeline is SQL: no ML code, no manual labeling. Total cost: ~$3.

View the Dashboard → 🚜 Query.Farm

The Idea

What if you could write a SQL query that reads thousands of restaurant reviews and produces a structured operational dashboard — complete with severity-rated issues, trend analysis, and actionable recommendations? Not by counting keywords, but by actually understanding what customers are saying?

That's what 🚜 Query.Farm enables. Our VGI (Vector Gateway Interface) connects DuckDB to Python workers via Apache Arrow. We built qf_llm_summarize — an aggregate function that accumulates text during GROUP BY, then calls Claude to produce structured analysis in the finalize phase. It works with any GROUP BY, any data source, any prompt.

1 The Data

We used the Yelp Open Dataset — specifically, every review of Commander's Palace in New Orleans, one of America's most iconic restaurants. We sampled up to 20 reviews per month from January 2011 through January 2022.

133 months · 2,392 reviews · 11-year span · ~$3 total LLM cost

2 The Pipeline

Everything runs from a single ./run.sh script. Five stages, each building on the last:

1. Load & Sample Reviews: DuckDB reads the 5GB Yelp JSON directly, joins reviews to the business, and samples up to 20 random reviews per month. Outputs reviews.parquet.
2. Monthly Analysis (133 LLM calls): GROUP BY month with qf_llm_summarize(). Each month's reviews are accumulated during the UPDATE phase, then Claude analyzes them in FINALIZE. Returns JSON with a momentum score (-5 to +5), severity-rated strengths and issues, action items, and a summary. Calls run 10 at a time in parallel.
3. Category Distillation (pyramid normalization): The 133 monthly analyses produced ~700 unique category labels. We normalized them with qf_llm_summarize() itself: chunked into groups of 128, each chunk normalized by the LLM using pipe-delimited output, then a second pass consolidates them into ~27 canonical categories. A third pass groups those into 6 parent themes. All in SQL.
4. Apply Mappings: A Python script maps the raw LLM labels to canonical categories, applies the theme groupings, and exports the final dashboard JSON. Severity scores are preserved through the mapping: if two labels map to the same category, the max severity wins.
5. Build Dashboard: A Jinja2 template renders the final HTML from the JSON data. All charts use Chart.js or plain HTML/CSS. No build tools, no framework: one static file you can open in a browser.
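The "max severity wins" merge in the mapping stage can be sketched in a few lines of Python (function and label names are hypothetical, not the project's actual script):

```python
def apply_mapping(raw_issues, label_to_category):
    """Collapse raw LLM labels into canonical categories.

    raw_issues: list of (label, severity) pairs from one month's analysis.
    label_to_category: the distilled taxonomy map.
    When two labels land in the same category, the max severity wins.
    """
    merged = {}
    for label, severity in raw_issues:
        category = label_to_category.get(label, "other")
        merged[category] = max(merged.get(category, 0), severity)
    return merged

issues = [("weekend-brunch-slow-pacing", 3), ("brunch-service-delays", 4)]
mapping = {"weekend-brunch-slow-pacing": "service-timing",
           "brunch-service-delays": "service-timing"}
apply_mapping(issues, mapping)  # {"service-timing": 4}
```

Taking the max rather than the mean keeps a single severe incident visible even when milder mentions of the same category would otherwise dilute it.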

3 The SQL

The core of the pipeline is a single SQL pattern: GROUP BY + qf_llm_summarize(). The aggregate function accumulates review text, then calls Claude to analyze it.

-- Monthly analysis: one LLM call per month
SELECT month,
       qf_llm_summarize(
           review_text,
           'Analyze these reviews for the owner.
            Return JSON with momentum, strengths,
            issues, action_items, and summary.'
       ) AS analysis
FROM reviews
GROUP BY month;

4 Topic Distillation with qf_llm_distill

Here's the problem: 133 monthly LLM analyses produced ~700 unique category labels. The LLM called the same concept different things in different months: "weekend-brunch-slow-pacing", "brunch-service-delays", and "sunday-brunch-understaffed" are all the same issue. Without normalization, the dashboard would be an unreadable wall of one-off labels.

We built qf_llm_distill — an aggregate function that takes messy labels and produces a clean multi-level taxonomy. One SQL call replaces the entire normalization pipeline:

-- One call: 700 labels → 30 categories → 6 themes
SELECT qf_llm_distill(label, [30, 6]) AS taxonomy
FROM all_labels;

The output is a JSON map. Every input label gets a complete path through all levels:

{
  "weekend-brunch-slow-pacing": ["service-timing", "Service & Staff"],
  "bread-pudding-souffle-hit": ["desserts", "Food & Menu"],
  "host-stand-rude": ["reservations-hosting", "Operations"],
  ...
}
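Because every label carries its full path, downstream code can pivot the map however it likes. A minimal sketch, using the three entries above, grouping labels by their top-level theme:

```python
from collections import defaultdict

taxonomy = {
    "weekend-brunch-slow-pacing": ["service-timing", "Service & Staff"],
    "bread-pudding-souffle-hit": ["desserts", "Food & Menu"],
    "host-stand-rude": ["reservations-hosting", "Operations"],
}

# path[-1] is the top-level theme; path[0] is the mid-level category.
by_theme = defaultdict(list)
for label, path in taxonomy.items():
    by_theme[path[-1]].append(label)

sorted(by_theme)  # ['Food & Menu', 'Operations', 'Service & Staff']
```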

How It Works Internally

The [30, 6] parameter means: first reduce to ~30 categories, then reduce those to ~6 themes. The function runs a pyramid of LLM calls with validation at each level:

Level 1: Chunk & Normalize (700 → ~30)

Split the labels into groups of 128. Each group gets one LLM call that maps every label to a natural category. Different chunks may independently produce "service-pacing" and "service-timing"; that's expected, and the next level handles it.
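The chunking itself is plain list slicing. A sketch (helper name hypothetical):

```python
def chunk(labels, size=128):
    """Split labels into fixed-size groups, one LLM call per group."""
    return [labels[i:i + size] for i in range(0, len(labels), size)]

# 700 labels -> 6 chunks: 5 full chunks of 128 plus a final chunk of 60.
groups = chunk(list(range(700)))
```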

Level 2: Consolidate (~30 → ~6)

Feed the level-1 categories back into the LLM as a single batch. It merges near-duplicates ("service-pacing" + "service-timing" → "service-timing") and produces the final themes.

Validation & Repair

After each level, the function checks: did every input get mapped? Is the output within the target size? Missing items get re-submitted. If there are too many groups, a consolidation pass merges the smallest ones. After 3 repair attempts, any remaining unmapped items go to "other" with a warning. The guarantee: every input label has a complete path through all levels.
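The repair loop can be sketched like this, with `mapping_fn` standing in for one LLM mapping call (a hypothetical callable that takes labels and returns a possibly incomplete {label: category} dict):

```python
def validate_and_repair(inputs, mapping_fn, max_attempts=3):
    """Guarantee every input label ends up mapped.

    Missing labels are re-submitted up to max_attempts times; anything
    still unmapped afterwards falls through to "other".
    """
    mapped = mapping_fn(inputs)
    for _ in range(max_attempts):
        missing = [x for x in inputs if x not in mapped]
        if not missing:
            break
        mapped.update(mapping_fn(missing))  # re-submit only the gaps
    for x in inputs:
        mapped.setdefault(x, "other")       # last-resort fallback
    return mapped
```

The sketch omits the size check (merging the smallest groups when the output exceeds the target count), but the shape is the same: validate, re-submit, then fall back.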

A Simple Example

20 programming labels distilled into 5 categories and 2 themes:

SELECT qf_llm_distill(label, [5, 2])
FROM (VALUES
  ('python-web-framework'), ('javascript-react-hooks'),
  ('css-flexbox-layout'), ('python-data-science'),
  ('sql-query-optimization'), ('python-machine-learning'),
  ('css-grid-system'), ('javascript-vue-components'),
  ('html-semantic-markup'), ('sql-join-performance'),
  -- ... 20 labels total
) AS t(label);

Result:

THEME      CATEGORY              LABELS
backend    python-backend        python-web-framework, python-flask-api, python-django-orm
backend    python-data           python-data-science, python-machine-learning, python-pandas
backend    database              sql-query-optimization, sql-database-indexing, sql-join-performance, sql-stored-procedures
frontend   javascript-frontend   javascript-react-hooks, javascript-node-express, javascript-typescript, javascript-vue-components
frontend   web-styling           css-flexbox-layout, css-grid-system, html-semantic-markup, css-responsive-design, html-accessibility

20 labels → 5 categories → 2 themes. Every label mapped. One SQL call.

The Deeper Point

GROUP BY + LLM is a general-purpose pattern that works at any level of abstraction. qf_llm_summarize turns reviews into summaries. qf_llm_distill turns labels into taxonomies. The same mechanism — accumulate rows, let the LLM reason over them — applies to summarization, classification, normalization, and anything else where you need to take many things and produce fewer, more structured things.

SQL already knows how to group data. Now the LLM knows how to make sense of each group.

5 How qf_llm_summarize Works

Under the hood, qf_llm_summarize is a VGI aggregate function with three phases:

UPDATE — accumulate text

For each row in the group, append the text to the state. No LLM calls. This runs once per batch of input rows.

COMBINE — merge parallel states

When DuckDB merges hash tables from parallel workers, concatenate the accumulated text. If the combined text exceeds 50K characters, call Claude to compress it (one LLM call to condense, keeping key themes).
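The shape of that combine step, with `compress` standing in for the one LLM condensing call (a hypothetical helper, not the library's API):

```python
COMPRESS_THRESHOLD = 50_000  # characters

def combine(source_texts, target_texts, compress):
    """Merge two partial states; condense if the result grows too large."""
    merged = target_texts + "\n---\n" + source_texts
    if len(merged) > COMPRESS_THRESHOLD:
        merged = compress(merged)  # one LLM call, keeps key themes
    return merged
```

The threshold keeps accumulated state bounded no matter how many parallel workers feed into one group, at the cost of an occasional extra LLM call.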

FINALIZE — produce results

For each group, send the accumulated text + the user's prompt to Claude. Returns the LLM's response as a VARCHAR. All groups in a finalize batch run concurrently (up to 10 at a time) with automatic retry on rate limits.

The prompt is a regular column, not a constant. This means different groups can get different prompts — critical for hierarchical summarization where each level needs a different analytical lens.

The Python Behind It

These aren't black boxes — they're Python classes you can read and extend. Here's the essence of qf_llm_summarize, stripped to its core:

import anthropic
import pyarrow as pa

class SummarizeFunction(AggregateFunction[SummarizeState]):

    class Meta:
        name = "qf_llm_summarize"

    @classmethod
    def update(cls, states, group_ids, text, prompt):
        # Just accumulate text. No LLM calls.
        for i in range(len(group_ids)):
            gid = group_ids[i].as_py()
            s = states[gid]
            states[gid] = SummarizeState(
                texts=s.texts + "\n---\n" + text[i].as_py(),
                prompt=prompt[i].as_py(),
            )

    @classmethod
    def combine(cls, source, target, params):
        # Merge two partial states from parallel workers
        return SummarizeState(
            texts=target.texts + "\n---\n" + source.texts,
            prompt=target.prompt,
        )

    @classmethod
    def finalize(cls, group_ids, states, params):
        # Now call the LLM — once per group, in parallel
        results = []
        for gid in group_ids:
            s = states[gid.as_py()]
            response = anthropic.Anthropic().messages.create(
                model="claude-sonnet-4-20250514",
                messages=[{"role": "user",
                           "content": s.prompt + "\n\n" + s.texts}],
            )
            results.append(response.content[0].text)
        return pa.record_batch({"result": pa.array(results)})

That's a real DuckDB aggregate function. update() accumulates rows, combine() merges parallel workers, finalize() calls Claude. DuckDB handles the GROUP BY, parallelism, and memory management. Your Python just defines the logic.

6 What the Dashboard Shows

Cumulative Momentum

Each month the LLM assigns a momentum score (-5 to +5). We accumulate these with exponential decay (factor 0.85, ~4 month half-life). The line rises during good stretches and drops during bad ones, naturally fading back toward zero. It's a health indicator, not a permanent score.
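The decay arithmetic is a one-liner per month. A minimal sketch (function name hypothetical; note that 0.85 ** 4.3 ≈ 0.5, which is where the ~4-month half-life comes from):

```python
def cumulative_momentum(monthly_scores, decay=0.85):
    """Accumulate monthly momentum scores with exponential decay."""
    line, total = [], 0.0
    for score in monthly_scores:
        total = total * decay + score  # old signal fades, new signal adds
        line.append(round(total, 2))
    return line

# Two good months, then silence: the line climbs, then fades toward zero.
cumulative_momentum([3, 3, 0, 0, 0])
```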

Operational Themes Timeline

Each dot is a month where a category appeared as a strength (green) or issue (red). Dot size reflects severity (1-5). Categories are grouped under parent themes so you see the domain first, then the specific category. Severity-5 issue dots pulse gold: they're the crises that need immediate attention.

Category Balance

For each category, green bars show months as a strength, red bars show months as an issue. Grouped by theme. Instantly shows what's net-positive (desserts, atmosphere) vs net-negative (food quality, service timing).

Monthly Detail Table

Every month's full analysis: momentum badge, severity-rated strength and issue pills (gold for severity 4-5), the LLM's owner-focused summary, and specific action items. Searchable and sortable.

7 Why This Matters

Traditional review analysis counts stars and extracts keywords. This approach reads and understands the reviews. A 4-star review that says "great food but I got sick afterward" gets flagged as a severity-5 safety issue despite the high rating. A 2-star review that's really about parking doesn't tank the food score.

The entire pipeline is SQL-native. No Python ML pipelines, no separate ETL jobs, no dashboarding tools. DuckDB reads the data, VGI aggregate functions call Claude, and Jinja2 renders the HTML. The same qf_llm_summarize() function works for monthly analysis, category normalization, and theme grouping — it's just different prompts.

And it costs about $3 total to analyze 11 years of reviews for one restaurant. Scale it to a restaurant group with 50 locations and you're looking at $150 for comprehensive operational intelligence across every location, every month, with actionable specifics.

View the Dashboard →