LLM Semantic Layer Benchmark

This benchmark tests whether LLMs can answer business questions more accurately using a Semantic Layer (MetricFlow) than by generating raw SQL directly. Each model is given 11 insurance-domain questions about claims, policies, and premiums, and answers every question both ways: once through the Semantic Layer and once with hand-written SQL. Each answer is then compared against a known correct answer.
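The comparison described above can be sketched as a per-method accuracy tally. This is a minimal sketch, not the benchmark's actual harness. The `Attempt` record shape, the method labels `"semantic_layer"` and `"raw_sql"`, and the use of exact equality to decide correctness are all assumptions made for illustration. A real harness might instead compare result sets or allow numeric tolerance.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class Attempt:
    """One model's answer to one question via one method (hypothetical record shape)."""
    question_id: str
    method: str  # hypothetical labels, e.g. "semantic_layer" or "raw_sql"
    answer: object


def accuracy_by_method(attempts: List[Attempt], expected: Dict[str, object]) -> Dict[str, float]:
    """Fraction of attempts per method whose answer exactly equals the known correct answer.

    Exact equality is an assumption of this sketch; attempts for questions
    missing from `expected` are counted but never scored as correct.
    """
    totals: Dict[str, int] = {}
    correct: Dict[str, int] = {}
    for a in attempts:
        totals[a.method] = totals.get(a.method, 0) + 1
        if a.question_id in expected and a.answer == expected[a.question_id]:
            correct[a.method] = correct.get(a.method, 0) + 1
    return {method: correct.get(method, 0) / n for method, n in totals.items()}
```

Scoring each method separately over the same question set keeps the two approaches directly comparable for a given model.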

The questions are an extract from the ACME Insurance benchmark, originally created by data.world.

Benchmark Questions


Three of the 11 questions are "too many hops": answering them requires joins that the Semantic Layer cannot express.
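One way to picture a "hop" is as a join between two entities, with a question's hop count being the length of the shortest join path between the entities it touches. The sketch below counts hops with a breadth-first search over an undirected join graph. The table names in the usage, the `build_graph`, `join_hops`, and `too_many_hops` helpers, and the `max_hops` threshold are all hypothetical. They are not MetricFlow's API, and `max_hops` is not MetricFlow's actual rule. A real semantic layer may also restrict join direction or join type, not just path length.

```python
from collections import defaultdict, deque
from typing import Dict, Iterable, Optional, Set, Tuple


def build_graph(edges: Iterable[Tuple[str, str]]) -> Dict[str, Set[str]]:
    """Undirected adjacency map from (table, table) join pairs (hypothetical schema)."""
    graph: Dict[str, Set[str]] = defaultdict(set)
    for a, b in edges:
        graph[a].add(b)
        graph[b].add(a)
    return dict(graph)


def join_hops(graph: Dict[str, Set[str]], start: str, target: str) -> Optional[int]:
    """Minimum number of joins from start to target via BFS, or None if unreachable."""
    if start == target:
        return 0
    seen = {start}
    queue = deque([(start, 0)])
    while queue:
        node, dist = queue.popleft()
        for nxt in graph.get(node, set()):
            if nxt == target:
                return dist + 1
            if nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, dist + 1))
    return None


def too_many_hops(graph: Dict[str, Set[str]], start: str, target: str, max_hops: int) -> bool:
    """True if target is unreachable or needs more joins than the (hypothetical) max_hops."""
    hops = join_hops(graph, start, target)
    return hops is None or hops > max_hops
```

For example, with hypothetical joins `claim`–`policy`–`policy_holder`–`address` and an illustrative `max_hops=2`, reaching `address` from `claim` takes three joins and would be flagged.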

A note on modeling

For some benchmark runs, additional dbt models were created specifically to work around the "too many hops" limitation. With those models in place, the Semantic Layer can answer all 11 questions, including the three it otherwise cannot reach. Pages that include these runs are clearly labelled "with modeling".