Repeated runs on selected models
Three models (claude-sonnet-4-6, gpt-5.2-2025-12-11, and gpt-5.3-codex) were each run 20 times on the same 11 questions. The goal is to measure consistency as well as accuracy: not just the average score, but how much results fluctuate from run to run.
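The consistency measurement above can be sketched as follows. The per-run accuracies below are placeholders, not the benchmark's actual results; only the model names, the 20-run count, and the 11-question denominator come from this page.

```python
import statistics

# Hypothetical results: for each model, 20 runs over the same 11 questions,
# each run recorded as the fraction of questions answered correctly.
# These numbers are illustrative, not the measured scores.
runs_per_model = {
    "claude-sonnet-4-6":  [9/11, 10/11, 9/11, 10/11, 10/11] * 4,
    "gpt-5.2-2025-12-11": [8/11, 9/11, 8/11, 8/11, 9/11] * 4,
    "gpt-5.3-codex":      [10/11, 10/11, 9/11, 10/11, 11/11] * 4,
}

for model, accuracies in runs_per_model.items():
    mean = statistics.mean(accuracies)
    spread = statistics.stdev(accuracies)  # sample stdev across the 20 runs
    print(f"{model}: mean={mean:.1%} stdev={spread:.1%}")
```

Reporting the standard deviation alongside the mean is what distinguishes "consistent" models from ones that merely average well: two models with the same mean can have very different run-to-run spread.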
This page compares two configurations:
- Without modeling: SQL is generated directly against the raw DDL, with no additional dbt models. The Semantic Layer works, but it cannot answer the 3 "too many hops" questions, which require joins it cannot express.
- With modeling: additional dbt models were created to resolve the hop limitation. This unlocks those 3 questions for the Semantic Layer and gives the SQL generator a richer schema to work with.
Without modeling
Without additional dbt models, the Semantic Layer cannot answer the 3 "too many hops" questions and will always score 0% on them, so including them would unfairly drag down its overall accuracy. Use the filter below to explore all questions, or to isolate the too-many-hops questions and see how each method handles an unanswerable request.
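The filtering logic behind that toggle can be sketched as below. The question IDs, the tagging scheme, and which questions pass are all hypothetical; only the split of 3 hop-limited questions out of 11 comes from this page.

```python
# Illustrative per-question results for one Semantic Layer run:
# (question_id, is_too_many_hops, answered_correctly).
# IDs and outcomes are made up; 3 of the 11 questions are hop-limited.
results = [
    ("q01", False, True),
    ("q02", False, True),
    ("q03", True,  False),  # hop-limited: unanswerable without modeling
    ("q04", False, True),
    ("q05", True,  False),  # hop-limited
    ("q06", False, False),
    ("q07", False, True),
    ("q08", True,  False),  # hop-limited
    ("q09", False, True),
    ("q10", False, True),
    ("q11", False, True),
]

def accuracy(rows, include_hop_questions=True):
    """Fraction correct, optionally excluding the 'too many hops' questions."""
    kept = [r for r in rows if include_hop_questions or not r[1]]
    return sum(r[2] for r in kept) / len(kept)

print(f"all 11 questions:      {accuracy(results):.0%}")
print(f"hop questions removed: {accuracy(results, include_hop_questions=False):.0%}")
```

Excluding the 3 hop-limited questions changes the denominator from 11 to 8, which is why the filtered view gives a fairer picture of the Semantic Layer's accuracy on questions it can actually express.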
With modeling
With additional dbt models in place, the "too many hops" questions are no longer a limitation — the Semantic Layer can now answer all 11 questions. There is no filter here because all questions are meaningful and excluding any of them would hide the key result: that modeling resolves the hop problem entirely.
