Repeated runs on selected models

Three models (claude-sonnet-4-6, gpt-5.2-2025-12-11, and gpt-5.3-codex) were each run 20 times on the same 11 questions to measure not just average accuracy, but consistency: how much results fluctuate from run to run.
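The run-to-run variance described above boils down to simple summary statistics over per-run accuracies. A minimal sketch, assuming each run is recorded as a list of per-question correctness flags (the data layout and function names here are hypothetical, not the benchmark's actual harness):

```python
from statistics import mean, pstdev

def run_accuracy(results):
    """Fraction of questions answered correctly in one run."""
    return sum(results) / len(results)

def summarize(runs):
    """Mean accuracy and run-to-run standard deviation across runs."""
    accuracies = [run_accuracy(r) for r in runs]
    return mean(accuracies), pstdev(accuracies)

# Two hypothetical runs over 11 questions (1 = correct, 0 = wrong)
runs = [
    [1] * 9 + [0] * 2,   # 9/11 correct
    [1] * 10 + [0] * 1,  # 10/11 correct
]
avg, spread = summarize(runs)  # mean accuracy, standard deviation
```

A low `spread` at the same `avg` is the "consistency" the repeated runs are designed to surface: two models with identical average accuracy can differ sharply in how stable that accuracy is.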

This page compares two configurations:

  • Without modeling — SQL is generated directly against raw DDL, with no additional dbt models. The Semantic Layer works but cannot answer the 3 "too many hops" questions, which require joins it cannot express.
  • With modeling — Additional dbt models were created to resolve the hop limitations. This unlocks those 3 questions for the Semantic Layer and gives the SQL generator a richer schema to work with.

Without modeling

Without additional dbt models, the Semantic Layer cannot answer the 3 "too many hops" questions — it will always score 0% on those. Including them would unfairly drag down its overall accuracy. Use the filter below to explore all questions, or isolate the too-many-hops questions to see how each method handles an unanswerable request.
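The effect of the filter is purely arithmetic: excluding the 3 unanswerable questions shrinks the denominator from 11 to 8, so the Semantic Layer's guaranteed zeros stop counting against it. A minimal sketch, with hypothetical per-question scores:

```python
def overall_accuracy(scores, excluded=()):
    """Average score over the questions not in `excluded`."""
    kept = {q: s for q, s in scores.items() if q not in excluded}
    return sum(kept.values()) / len(kept)

# Hypothetical scores: q1-q8 answered correctly, q9-q11 are the
# "too many hops" questions the Semantic Layer always misses.
scores = {f"q{i}": 1 for i in range(1, 9)}
scores.update({f"q{i}": 0 for i in range(9, 12)})

acc_all = overall_accuracy(scores)                                 # 8/11
acc_filtered = overall_accuracy(scores, excluded={"q9", "q10", "q11"})  # 8/8
```

Here the same results read as roughly 73% or as 100% depending on whether the structurally unanswerable questions are included, which is why the filter matters for a fair comparison.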


With modeling

With additional dbt models in place, the "too many hops" questions are no longer a limitation — the Semantic Layer can now answer all 11 questions. There is no filter here because all questions are meaningful and excluding any of them would hide the key result: that modeling resolves the hop problem entirely.
