TL;DR
We analyzed 11 years of Yelp reviews for Commander's Palace (New Orleans) using two
SQL aggregate functions: qf_llm_summarize reads reviews and returns structured JSON with
scores, issues, and action items. qf_llm_distill takes the messy labels and builds a
clean multi-level taxonomy. Both use Claude under the hood. The entire pipeline is SQL — no
ML code, no manual labeling. Total cost: ~$3.
The Idea
What if you could write a SQL query that reads thousands of restaurant reviews and produces a structured operational dashboard — complete with severity-rated issues, trend analysis, and actionable recommendations? Not by counting keywords, but by actually understanding what customers are saying?
That's what 🚜 Query.Farm enables. Our VGI (Vector Gateway Interface) connects
DuckDB to Python workers via Apache Arrow. We built qf_llm_summarize — an aggregate
function that accumulates text during GROUP BY, then calls Claude to produce structured analysis
in the finalize phase. It works with any GROUP BY, any data source, any prompt.
1 The Data
We used the Yelp Open Dataset — specifically, every review of Commander's Palace in New Orleans, one of America's most iconic restaurants. We sampled up to 20 reviews per month from January 2011 through January 2022.
2 The Pipeline
Everything runs from a single ./run.sh script. Five stages, each building on the last:
The sampled reviews land in reviews.parquet. Monthly analysis: GROUP BY month with qf_llm_summarize(). Each month's reviews are accumulated during the UPDATE phase, then Claude analyzes them in FINALIZE, returning JSON with a momentum score (-5 to +5), severity-rated strengths and issues, action items, and a summary. Calls run 10 at a time in parallel.
Category normalization: qf_llm_summarize() itself, with labels chunked into groups of 128. Each chunk is normalized by the LLM using pipe-delimited output, then a second pass consolidates into ~27 canonical categories. A third pass groups those into 6 parent themes. All in SQL.
3 The SQL
The core of the pipeline is a single SQL pattern: GROUP BY + qf_llm_summarize().
The aggregate function accumulates review text, then calls Claude to analyze it.
```sql
-- Monthly analysis: one LLM call per month
SELECT
    month,
    qf_llm_summarize(
        review_text,
        'Analyze these reviews for the owner. Return JSON with momentum, strengths, issues, action_items, and summary.'
    ) AS analysis
FROM reviews
GROUP BY month;
```
4 Topic Distillation with qf_llm_distill
Here's the problem: 133 monthly LLM analyses produced ~700 unique category labels. The LLM called the same concept different things in different months — "weekend-brunch-slow-pacing", "brunch-service-delays", "sunday-brunch-understaffed" are all the same issue. Without normalization, the dashboard would be an unreadable wall of one-off labels.
We built qf_llm_distill — an aggregate function that takes messy labels and
produces a clean multi-level taxonomy. One SQL call replaces the entire normalization pipeline:
```sql
-- One call: 700 labels → 30 categories → 6 themes
SELECT qf_llm_distill(label, [30, 6]) AS taxonomy
FROM all_labels;
```
The output is a JSON map. Every input label gets a complete path through all levels:
```json
{
  "weekend-brunch-slow-pacing": ["service-timing", "Service & Staff"],
  "bread-pudding-souffle-hit": ["desserts", "Food & Menu"],
  "host-stand-rude": ["reservations-hosting", "Operations"],
  ...
}
```
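A map shaped like this is easy to pivot in ordinary Python. Here is a minimal sketch that inverts the label → [category, theme] map into a theme → category → labels tree, using the example entries above (the helper name is ours, not part of the extension):

```python
from collections import defaultdict

# Example taxonomy map, shaped like qf_llm_distill's output: label -> [category, theme]
taxonomy = {
    "weekend-brunch-slow-pacing": ["service-timing", "Service & Staff"],
    "bread-pudding-souffle-hit": ["desserts", "Food & Menu"],
    "host-stand-rude": ["reservations-hosting", "Operations"],
}

def group_by_theme(taxonomy):
    """Invert label -> [category, theme] into theme -> category -> [labels]."""
    themes = defaultdict(lambda: defaultdict(list))
    for label, (category, theme) in taxonomy.items():
        themes[theme][category].append(label)
    return themes

themes = group_by_theme(taxonomy)
print(themes["Food & Menu"]["desserts"])  # ['bread-pudding-souffle-hit']
```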
How It Works Internally
The [30, 6] parameter means: first reduce to ~30 categories, then reduce those to ~6 themes.
The function runs a pyramid of LLM calls with validation at each level:
Level 1: Chunk & Normalize (700 → ~30)
Split labels into groups of 128. Each group gets one LLM call that maps every label to a natural category. Different chunks may independently produce "service-pacing" and "service-timing" — that's expected, the next level handles it.
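The chunking step itself is plain batching. A sketch with the chunk size from the text (the function name is ours):

```python
def chunk(labels, size=128):
    """Split the label list into groups of at most `size`, one LLM call per group."""
    return [labels[i:i + size] for i in range(0, len(labels), size)]

chunks = chunk([f"label-{i}" for i in range(700)])
print(len(chunks))      # 6 (five chunks of 128, one of 60)
print(len(chunks[-1]))  # 60
```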
Level 2: Consolidate (~30 → ~6)
Feed the level-1 categories back into the LLM as a single batch. It merges near-duplicates ("service-pacing" + "service-timing" → "service-timing") and produces the final themes.
Validation & Repair
After each level, the function checks: did every input get mapped? Is the output within the target size? Missing items get re-submitted. If there are too many groups, a consolidation pass merges the smallest ones. After 3 repair attempts, any remaining unmapped items go to "other" with a warning. The guarantee: every input label has a complete path through all levels.
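One way to picture the repair loop in plain Python. This is our sketch of the description above, not the extension's code: `mapper` stands in for an LLM call, and the merge-smallest-groups consolidation pass is omitted for brevity.

```python
def distill_level(labels, mapper, max_repairs=3):
    """Map every label to a category; re-submit misses, then fall back to 'other'."""
    mapping = {}
    pending = list(labels)
    for _ in range(max_repairs):
        if not pending:
            break
        mapping.update(mapper(pending))                     # one LLM call per attempt
        pending = [l for l in labels if l not in mapping]   # did every input get mapped?
    for l in pending:                                       # after 3 attempts: "other"
        mapping[l] = "other"
    return mapping

# A flaky mapper that drops every other label, to exercise the repair pass.
flaky = lambda ls: {l: l.split("-")[0] for i, l in enumerate(ls) if i % 2 == 0}
result = distill_level(["svc-slow", "svc-rude", "food-cold"], flaky)
assert set(result) == {"svc-slow", "svc-rude", "food-cold"}  # complete coverage
```

The invariant mirrors the guarantee in the text: the function never returns until every input label has some mapping, even if that mapping is "other".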
A Simple Example
20 programming labels distilled into 5 categories and 2 themes:
```sql
SELECT qf_llm_distill(label, [5, 2])
FROM (VALUES
    ('python-web-framework'),
    ('javascript-react-hooks'),
    ('css-flexbox-layout'),
    ('python-data-science'),
    ('sql-query-optimization'),
    ('python-machine-learning'),
    ('css-grid-system'),
    ('javascript-vue-components'),
    ('html-semantic-markup'),
    ('sql-join-performance'),
    -- ... 20 labels total
) AS t(label);
```
Result:
| THEME | CATEGORY | LABELS |
|---|---|---|
| backend | python-backend | python-web-framework, python-flask-api, python-django-orm |
| backend | python-data | python-data-science, python-machine-learning, python-pandas |
| backend | database | sql-query-optimization, sql-database-indexing, sql-join-performance, sql-stored-procedures |
| frontend | javascript-frontend | javascript-react-hooks, javascript-node-express, javascript-typescript, javascript-vue-components |
| frontend | web-styling | css-flexbox-layout, css-grid-system, html-semantic-markup, css-responsive-design, html-accessibility |
20 labels → 5 categories → 2 themes. Every label mapped. One SQL call.
The Deeper Point
GROUP BY + LLM is a general-purpose pattern that works at any level of abstraction.
qf_llm_summarize turns reviews into summaries. qf_llm_distill turns labels
into taxonomies. The same mechanism — accumulate rows, let the LLM reason over them —
applies to summarization, classification, normalization, and anything else where
you need to take many things and produce fewer, more structured things.
SQL already knows how to group data. Now the LLM knows how to make sense of each group.
5 How qf_llm_summarize Works
Under the hood, qf_llm_summarize is a VGI aggregate function with three phases:
UPDATE — accumulate text
For each row in the group, append the text to the state. No LLM calls. This runs once per batch of input rows.
COMBINE — merge parallel states
When DuckDB merges hash tables from parallel workers, concatenate the accumulated text. If the combined text exceeds 50K characters, call Claude to compress it (one LLM call to condense, keeping key themes).
FINALIZE — produce results
For each group, send the accumulated text + the user's prompt to Claude. Returns the LLM's response as a VARCHAR. All groups in a finalize batch run concurrently (up to 10 at a time) with automatic retry on rate limits.
The prompt is a regular column, not a constant. This means different groups can get different prompts — critical for hierarchical summarization where each level needs a different analytical lens.
The Python Behind It
These aren't black boxes — they're Python classes you can read and extend. Here's the
essence of qf_llm_summarize, stripped to its core:
```python
import anthropic
import pyarrow as pa

class SummarizeFunction(AggregateFunction[SummarizeState]):
    class Meta:
        name = "qf_llm_summarize"

    @classmethod
    def update(cls, states, group_ids, text, prompt):
        # Just accumulate text. No LLM calls.
        for i in range(len(group_ids)):
            gid = group_ids[i].as_py()
            s = states[gid]
            states[gid] = SummarizeState(
                texts=s.texts + "\n---\n" + text[i].as_py(),
                prompt=prompt[i].as_py(),
            )

    @classmethod
    def combine(cls, source, target, params):
        # Merge two partial states from parallel workers
        return SummarizeState(
            texts=target.texts + "\n---\n" + source.texts,
            prompt=target.prompt,
        )

    @classmethod
    def finalize(cls, group_ids, states, params):
        # Now call the LLM — once per group
        results = []
        for gid in group_ids:
            s = states[gid.as_py()]
            response = anthropic.Anthropic().messages.create(
                model="claude-sonnet-4-20250514",
                messages=[{"role": "user", "content": s.prompt + "\n\n" + s.texts}],
            )
            results.append(response.content[0].text)
        return pa.record_batch({"result": pa.array(results)})
```
That's a real DuckDB aggregate function. update() accumulates rows,
combine() merges parallel workers, finalize() calls Claude.
DuckDB handles the GROUP BY, parallelism, and memory management. Your Python just defines
the logic.
6 What the Dashboard Shows
Cumulative Momentum
Each month the LLM assigns a momentum score (-5 to +5). We accumulate these with exponential decay (factor 0.85, ~4 month half-life). The line rises during good stretches and drops during bad ones, naturally fading back toward zero. It's a health indicator, not a permanent score.
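The decay accumulation is one multiply-and-add per month. A sketch with the factor stated above (0.85, so a score loses roughly half its weight after four months, since 0.85^4 ≈ 0.52); the function name is ours:

```python
def cumulative_momentum(monthly_scores, decay=0.85):
    """Exponentially decayed running total: recent months dominate, old ones fade."""
    total, series = 0.0, []
    for score in monthly_scores:
        total = total * decay + score
        series.append(total)
    return series

# Three good months, then three quiet ones: the line rises, then fades toward zero.
print([round(x, 2) for x in cumulative_momentum([3, 3, 3, 0, 0, 0])])
```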
Operational Themes Timeline
Each dot is a month where a category appeared as a strength (green) or issue (red). Dot size = severity (1-5). Grouped under parent themes so you see the domain first, then the specific. Severity 5 issue dots pulse gold — they're the crises that need immediate attention.
Category Balance
For each category, green bars show months as a strength, red bars show months as an issue. Grouped by theme. Instantly shows what's net-positive (desserts, atmosphere) vs net-negative (food quality, service timing).
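Computing that balance from the monthly mentions is simple counting. A sketch (the data shape and names are ours, not the dashboard's):

```python
from collections import Counter

def category_balance(monthly_mentions):
    """Net score per category from (category, kind) pairs: +1 strength month, -1 issue month."""
    balance = Counter()
    for category, kind in monthly_mentions:
        balance[category] += 1 if kind == "strength" else -1
    return dict(balance)

mentions = [("desserts", "strength"), ("desserts", "strength"),
            ("service-timing", "issue"), ("desserts", "issue")]
print(category_balance(mentions))  # {'desserts': 1, 'service-timing': -1}
```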
Monthly Detail Table
Every month's full analysis: momentum badge, severity-rated strength and issue pills (gold for severity 4-5), the LLM's owner-focused summary, and specific action items. Searchable and sortable.
7 Why This Matters
Traditional review analysis counts stars and extracts keywords. This approach reads and understands the reviews. A 4-star review that says "great food but I got sick afterward" gets flagged as a severity-5 safety issue despite the high rating. A 2-star review that's really about parking doesn't tank the food score.
The entire pipeline is SQL-native. No Python ML pipelines, no separate ETL jobs, no
dashboarding tools. DuckDB reads the data, VGI aggregate functions call Claude, and Jinja2 renders the HTML.
The same qf_llm_summarize() function works for monthly analysis, category normalization, and
theme grouping — it's just different prompts.
And it costs about $3 total to analyze 11 years of reviews for one restaurant. Scale it to a restaurant group with 50 locations and you're looking at $150 for comprehensive operational intelligence across every location, every month, with actionable specifics.