Show HN: A new benchmark for testing LLMs for deterministic outputs

(interfaze.ai)

60 points | by khurdula 6 days ago

29 comments

submarius 21 hours ago
Cool work — quick question: how should readers think about the fact that Interfaze-Beta is on the leaderboard you built? Not saying anything's wrong with the methodology, just curious how you'd recommend a third party verify the ranking is neutral to the choices you made (datasets, difficulty weights, reasoning-off default, etc.).
jumploops 6 days ago
I have anecdotal experience here, but I've found more success when solving the task first, and then returning it as JSON in a separate LLM call[0].
Running a single non-reasoning LLM call from source data (text/image/audio in your diagram) to structured JSON seems fragile with the current state of LLMs.
You're essentially asking the model to do two tasks in one pass: parse the input and then format the output. It's amazing it works a lot of the time, but reasonable to assume it won't all of the time.
(As a human, when I'm filling out a complex form, I'll often jump around the document)
Curious how the benchmarks change when you add an intermediary representation, either via reasoning or an additional LLM call. I'd also love to see a comparison with BAML[1].
[0]In my experience we were using structured outputs as part of an agentic state machine, where the JSON contained code snippets (html/js/py/etc.). In the cases where we first prompted the model for the code, and then wrapped it in JSON, we saw much higher quality/success than asking for JSON straightaway.
[1]https://boundaryml.com/
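For illustration, the two-call flow described above can be sketched roughly as follows. This is only a sketch under assumptions: `call_model` is a hypothetical stand-in for whatever LLM client is in use, and the prompts and the `{"code": ...}` shape are made up for the example rather than taken from the setup in [0].

```python
import json
from typing import Callable


def solve_then_format(task: str, call_model: Callable[[str], str]) -> dict:
    """Two-pass flow: get a free-form answer first, wrap it in JSON second.

    `call_model` is a hypothetical function (prompt in, completion text out);
    swap in a real client of your choice.
    """
    # Pass 1: let the model solve the task with no output-format constraints.
    draft = call_model(
        "Solve the following task. Answer in plain text.\n\n" + task
    )
    # Pass 2: a separate call whose only job is formatting the draft.
    wrapped = call_model(
        'Return the text below as a JSON object of the form {"code": "<text>"} '
        "and nothing else.\n\n" + draft
    )
    # Raises json.JSONDecodeError if the second call did not return valid JSON.
    return json.loads(wrapped)
```

The second call never sees the original task, only the draft, so a formatting failure can be retried without re-solving the task.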
stared 6 days ago
Thank you for sharing the benchmark. However, the results are selective.
Why no Opus 4.7? Why is Gemini 3.1 Pro missing?
If there is some other criterion (e.g. models within a certain time or budget), great - just make it explicit.
When I see "Top 5 at a glance" and it misses key frontier models, I am (at best) confused.
- khurdula 6 days ago
  Yeah, we selected the models that are most commonly integrated into developer workflows and used for structured output. Those models tend to be in the low-to-mid cost range, with no or low reasoning.
  For the benchmark, the reasoning setting was kept consistent across all models, and typically Opus and 3.1 Pro would be overkill and expensive even with reasoning off.
  Good point though, will add this to the blog too :)
  Also, the benchmark is open source, so anyone can run a model on it and open a PR; the leaderboard is dynamic and will add the result automatically.
  - staticshock 6 days ago
    The value of such a benchmark, to me, would be, "what is peak performance", not just "what is mid-tier performance". Also, possibly, "what's the per-dollar performance". Time and money permitting, I'd really want to see your benchmark extended to the large reasoning models.
  - stared 6 days ago
    Then the way to go is to use a Pareto frontier, e.g. https://quesma.com/benchmarks/binaryaudit/#cost
    If you want to avoid using Opus 4.7, then why include GPT-5.4 (unless with a disclaimer that it is at a low reasoning setting, or a check that on medium its price is comparable with Haiku/Flash)?
    Also, usually it is good to look at the newest model. Gemini 2.5 Flash is quite dated. Gemini 3.1 Flash Lite is the new one (https://openrouter.ai/google/gemini-3.1-flash-lite-preview).
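    As a rough sketch of the Pareto-frontier idea: a model stays on the frontier only if no other model is both cheaper (or equally cheap) and higher scoring. The function name, the `(cost, score)` layout, and the model names and numbers in the usage note are all placeholders for illustration, not benchmark results.

```python
def pareto_frontier(models: dict[str, tuple[float, float]]) -> list[str]:
    """Return names of models not dominated on (cost, score).

    `models` maps a name to (cost, score); lower cost and higher score
    are better. Names are returned in order of increasing cost.
    """
    frontier = []
    best_score = float("-inf")
    # Walk from cheapest to most expensive; at equal cost, try the
    # higher-scoring model first so it is the one that is kept.
    ordered = sorted(models.items(), key=lambda kv: (kv[1][0], -kv[1][1]))
    for name, (cost, score) in ordered:
        # Keep a model only if it beats every cheaper model's score.
        if score > best_score:
            frontier.append(name)
            best_score = score
    return frontier
```

    With made-up inputs such as `{"a": (1.0, 70.0), "b": (2.0, 65.0), "c": (3.0, 80.0)}`, model `b` drops out because `a` is both cheaper and better.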
- Flux159 6 days ago
  Agree that the choices are strange. Sonnet 4.6 was tested, but no Opus 4.6.
  Gemini 3.1 and GLM 5 came out around the same time as Sonnet 4.6 (~Feb 2026) so it's strange that they are missing, but Gemini 2.5 Flash, Gemini 3 Flash, and GLM 4.7 are there.
  - khurdula 4 days ago
    We've updated our leaderboard, having also evaluated the frontier models Gemini 3.1 Pro, Opus 4.6 and 4.7, GLM 5.1, DeepSeek V4, and Kimi K2.6.
ossianericson 6 days ago
Even when the JSON pass rate is at 97% the real challenge is that the accuracy gap is invisible at the record level. Nothing flags it without a baseline to check against. Parse error is rarely where it goes wrong in my experience. 'Valid' but incorrect data is what actually reaches production.
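To make the gap concrete, here is a minimal sketch that scores the same outputs two ways: whether each record parses and has the required keys, and whether its field values match a known baseline. The scoring rules and names here are assumptions for illustration; this is not the benchmark's actual metric.

```python
import json


def score(outputs: list[str], expected: list[dict], required_keys: set[str]) -> tuple[float, float]:
    """Return (json_pass_rate, value_accuracy) for raw model outputs.

    json_pass_rate: share of outputs that parse to an object with all
    required keys. value_accuracy: share of expected fields whose value
    matches exactly. Both definitions are simplified assumptions.
    """
    if not outputs or len(outputs) != len(expected):
        raise ValueError("outputs and expected must be non-empty and equal length")
    passed = 0
    correct_fields = 0
    total_fields = 0
    for raw, truth in zip(outputs, expected):
        total_fields += len(truth)
        try:
            record = json.loads(raw)
        except json.JSONDecodeError:
            continue  # unparseable output: fails both checks
        if not isinstance(record, dict) or not required_keys.issubset(record):
            continue  # parses, but does not have the required shape
        passed += 1
        # A record can pass the shape check and still carry wrong values.
        correct_fields += sum(1 for key, value in truth.items() if record.get(key) == value)
    return passed / len(outputs), correct_fields / total_fields
```

A record such as `{"name": "Ada", "year": 1851}` passes the shape check even if the baseline says `1815`; only the second number reflects that, and only because a baseline exists to compare against.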
zihotki 6 days ago
I wonder if this benchmark brings any value. Models are already quite capable and reach high scores in it.
- khurdula 6 days ago
  Check out the "The JSON-pass vs Value-Accuracy gap" section in the blog. That was an eye opener.
  While most models were great at producing JSON that passes the schema, they were pretty bad at producing accurate values.
  In the graph you'll see there is almost a 20-30% drop between the JSON schema pass rate and the value accuracy.
timxtokyo 6 days ago
Would it be possible to add models from other LLM providers, like glm5.1 and minimax2.1? Those latest models changed their parameters significantly compared to the previous generation.
- khurdula 5 days ago
  We're updating our leaderboard with these model scores, should be out soon :D
jadbox 6 days ago
Wow, Qwen3.5-35B is absolutely punching above its weight. Perhaps it's the best/cheapest model for just JSON operations?
- khurdula 5 days ago
  We do love Qwen! It can be an easy choice when confused looking at this leaderboard.
broyojo 6 days ago
hmm why can't structured decoding be used?
- khurdula 6 days ago
  We saw that structured decoding didn't make a difference in the quality of the output.
  Check out the paper section "6.3 Structured Decoding Ablation"
  Paper: https://arxiv.org/pdf/2604.25359
  We ran the comparison and saw no difference. Since some models don't support structured decoding, we used greedy decoding on all models to keep the bench consistent.
maxdo 6 days ago
GPT 5.5 seems to be the recent leader overall, so it makes sense to include it, just to see what you trade off for speed and open-source nature versus the cutting-edge leader.
- khurdula 4 days ago
  hey! we've evaluated gpt 5.5 as well along with other frontier models. gemini and gemma models outperform it across all three modalities.
  Open source models like glm 4.7 still compete closely with table toppers.
- khurdula 6 days ago
  Yep, we will be adding it soon as well.
iLoveOncall 6 days ago
This is just a hallucinations benchmark on a subset of outputs; not sure there's any value over general hallucinations benchmarks?
> Our goal is to be the best general model for deterministic tasks
I'm sorry but this simply doesn't make sense. If you want a deterministic output don't use an LLM.
- nemo1618 6 days ago
  LLMs are not inherently non-deterministic. This is a common misconception. You used to be able to set temp=0 and a fixed seed and get the same output every time. This broke when labs started implementing batching, and no one bothered fixing it because the benefits of batching vastly outweighed the demand for deterministic output.
  I am hopeful deterministic output will return, though; DeepSeek v4 claims to have implemented "bitwise batch-invariant and deterministic kernels," though I haven't tested it myself.
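  A toy illustration of the point above: greedy (temperature 0) decoding is a pure function of the logits, and seeded sampling is repeatable, so any run-to-run drift has to come from the logits themselves. One commonly cited mechanism, assumed here rather than stated in the comment, is that floating-point addition is not associative, so a kernel that sums in a different order at a different batch size can shift a logit slightly. The function names and numbers below are made up for the sketch.

```python
import math
import random


def greedy_pick(logits: list[float]) -> int:
    """Temperature-0 decoding: always take the highest-scoring token index."""
    return max(range(len(logits)), key=lambda i: logits[i])


def sample_pick(logits: list[float], temperature: float, seed: int) -> int:
    """Temperature sampling with a fixed seed: random, but repeatable."""
    rng = random.Random(seed)
    weights = [math.exp(x / temperature) for x in logits]
    return rng.choices(range(len(logits)), weights=weights, k=1)[0]


# The same three terms summed in two different orders, standing in for a
# reduction that is split differently depending on batch size.
logit_order_a = (0.1 + 0.2) + 0.3
logit_order_b = 0.1 + (0.2 + 0.3)

# A rival token with a nearly identical logit.
rival = 0.6

pick_a = greedy_pick([rival, logit_order_a])
pick_b = greedy_pick([rival, logit_order_b])

# The sums differ in the last bit, which is enough to flip the argmax.
print(logit_order_a == logit_order_b, pick_a, pick_b)
```

  The decoding rule is identical in both runs; only the summation order changed, which is the kind of difference batch-invariant kernels are meant to remove.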
  - iLoveOncall 6 days ago
    > LLMs are not inherently non-deterministic.
    Reproducible does not mean deterministic. You cannot determine in advance what a prompt will give as output, even with a temperature of 0 and a fixed seed, therefore they are not deterministic.
    - nemo1618 5 days ago
      Huh? I'm not aware of anyone else who defines "deterministic" that way. "Deterministic" comes from "determinism," as in "the effects are fully determined by the causes" -- not "determine" as in "deduce."
  - sroussey 6 days ago
    Thinking Machines Lab uses batch invariant kernels, btw.
- khurdula 6 days ago
  General hallucination benchmarks tend to be knowledge-specific, like GPQA or MMLU, but none specifically measures structured output end-to-end, which is one of the biggest use cases for LLMs.
  Many developer workflows use LLMs to produce structured artifacts because of their flexibility in consuming unstructured inputs.
  > "don't use an LLM"
  Partially agree, that's what we're building towards at interfaze.ai: a hybrid between transformers (LLMs) and traditional CNN/DNN architectures to solve this problem of "deterministic" output. This gives devs the flexibility of custom schema definitions and unstructured input while still getting high-quality structured output like you would get from a CNN model like EasyOCR.
  The industry is moving toward using LLMs for more and more deterministic tasks, so this benchmark lets us measure that now.
dalberto 6 days ago
A benchmark without Opus 4.6/4.7 feels incomplete.
- khurdula 4 days ago
  We've added Opus 4.6 and 4.7 to our leaderboard; they perform very close to Sonnet 4.6. Feel free to check out our updated blog again :D
- khurdula 6 days ago
  Due to high demand, we're adding it soon!