EsoLang-Bench: Evaluating Genuine Reasoning in LLMs via Esoteric Languages

(esolang-bench.vercel.app)

53 points | by matt_d 3 hours ago

18 comments

orthoxerox 2 hours ago
> Frontier models score ~90% on Python but only 3.8% on esoteric languages, exposing how current code generation relies on training data memorization rather than genuine programming reasoning.
I would probably score about the same. Does this prove I also rely on training data memorization rather than genuine programming reasoning?
Or does this simply show that esolangs are hard to reason in by design? A more honest approach would use a "real", but relatively unpopular, language. Make them use CoffeeScript or Ada or PL/I or Odin or that other systems programming language that that very opinionated guy is implementing on top of QBE.
- onoesworkacct 31 minutes ago
  Unlike AI, you aren't able to regurgitate entire programs and patterns you've seen before.
  AI's capacity for memorisation is unrivaled. I find it mind-blowing that you can download a tiny ~4 GB model and it will have vastly more general knowledge than an average human (considering that the human is more likely to be wrong if you ask it trivia about e.g. the Spanish Civil War).
  But the average human still has actual reasoning capabilities, which is still (I think?) a debated point with AI.
- IsTom an hour ago
  Just look at what kind of problems are in the easy task set (hello world, echo line, count vowels, etc.). With the best score being ~10% of the total in Brainfuck, this is 10 out of 20. You can google more solutions to these problems than that.
- andai an hour ago
  Yeah there seem to be two axes here.
  Esolang vs mainstream paradigm.
  Popular vs scarce training data.
  So you'd want to control for training data (e.g. brainfuck vs Odin?)
  And ideally you'd control by getting it down to 0, i.e. inventing new programming languages with various properties and testing the LLMs on those.
  I think that would be a useful benchmark for other reasons. It would measure the LLMs' ability to "learn" on the spot. From what I understand, this remains an underdeveloped area of their intelligence. (And may not be solvable with current architectures.)
- wavemode an hour ago
  > I would probably score about the same, does this prove I also rely on training data memorization rather than genuine programming reasoning?
  Setting aside whether this benchmark is meaningful or not - the argument you're making is faulty. There are indeed humans who can write complete programs in Brainfuck and these other esolangs. The fact that you personally can't is not logically relevant.
  - Groxx an hour ago
    particularly if you'd already read approximately all written material in existence about those languages. many humans are capable of learning a language from the documentation.
- iloveoof an hour ago
  Try MUMPS: widely used, but with little training data online. Probably less than some esolangs.
  - twoodfin 7 minutes ago
    Frontier models have gotten much better at ObjectScript (the InterSystems evolution of MUMPS/M).
    Palindrome:
    https://chatgpt.com/s/t_69bc8d8c116c8191a339a33f0fbcc935
    This is a noticeable improvement from a year ago.
    I wish it would use Return instead of Quit but that’s a stochastic parrot for you.
bwestergard 3 hours ago
I'm shocked to see how poorly these models, which I find useful day to day, do in solving virtually any of the problems in Unlambda.
Before looking at the results my guess was that scores would be higher for Unlambda than any of the others, because humans that learn Scheme don't find it all that hard to learn about the lambda calculus and combinatory logic.
But the model that did the best, Qwen-235B, got virtually every problem wrong.
- __alexs 3 hours ago
  They are also weirdly bad at Brainfuck, which is basically just a subset of C.
  - astrange 8 minutes ago
    BF involves a lot of repeated symbols, which is hard for tokenized models. Same problem as r's in strawberry.
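To make the "subset of C" remark above concrete: each of the eight Brainfuck commands has a conventional one-line C equivalent. The sketch below is only an illustration of that mapping, not anything the benchmark uses; the names `BF_TO_C`, `bf_to_c`, `tape`, and `p` are made up here, and the EOF behaviour of `,` is one of several conventions.

```python
# Conventional mapping from each Brainfuck command to a C statement.
# Assumes a byte array `tape` and a pointer `p` (illustrative names).
BF_TO_C = {
    ">": "++p;",
    "<": "--p;",
    "+": "++*p;",
    "-": "--*p;",
    ".": "putchar(*p);",
    ",": "*p = getchar();",  # EOF handling varies between implementations
    "[": "while (*p) {",
    "]": "}",
}


def bf_to_c(source: str, tape_size: int = 30000) -> str:
    """Translate Brainfuck source into C, skipping non-command characters."""
    lines = [
        "#include <stdio.h>",
        "",
        f"static unsigned char tape[{tape_size}];",
        "",
        "int main(void) {",
        "    unsigned char *p = tape;",
    ]
    depth = 1  # current indentation level inside main()
    for ch in source:
        stmt = BF_TO_C.get(ch)
        if stmt is None:
            continue  # anything else is a comment in Brainfuck
        if ch == "]":
            depth -= 1
            if depth < 1:
                raise ValueError("unmatched ']'")
        lines.append("    " * depth + stmt)
        if ch == "[":
            depth += 1
    if depth != 1:
        raise ValueError("unmatched '['")
    lines += ["    return 0;", "}"]
    return "\n".join(lines)


if __name__ == "__main__":
    # "+[-]" sets a cell to 1, then loops until it is 0 again.
    print(bf_to_c("+[-]"))
```

The translation is purely character by character, which is the sense in which the language is "C with almost everything removed"; it says nothing about how easy the resulting programs are to write.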
groar 15 minutes ago
I guess if you tell codex to build a transpiler from a subset of python to brainfuck, then solve in that subset of python, it would work much better. Would that be cheating?
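A toy version of the idea above, generating Brainfuck mechanically instead of writing it by hand. This is a sketch under heavy assumptions: it only handles the trivial case of printing a fixed ASCII string, whereas a real Python-subset-to-Brainfuck transpiler would also need variables, control flow, and a memory layout. `text_to_bf` and `run_bf` are hypothetical names, and the interpreter picks one common convention (8-bit wrapping cells, `,` stores 0 on EOF, no bounds checking on the tape).

```python
def text_to_bf(text: str) -> str:
    """Emit Brainfuck that prints `text`, adjusting a single cell by deltas."""
    out = []
    current = 0  # value the cell holds after the previous character
    for byte in text.encode("ascii"):
        delta = byte - current
        out.append(("+" if delta > 0 else "-") * abs(delta))
        out.append(".")
        current = byte
    return "".join(out)


def run_bf(program: str, stdin: bytes = b"", tape_size: int = 30000) -> bytes:
    """Minimal Brainfuck interpreter used to check the generated code."""
    # Precompute matching bracket positions.
    jumps, stack = {}, []
    for i, ch in enumerate(program):
        if ch == "[":
            stack.append(i)
        elif ch == "]":
            if not stack:
                raise ValueError("unmatched ']'")
            j = stack.pop()
            jumps[i], jumps[j] = j, i
    if stack:
        raise ValueError("unmatched '['")

    tape = bytearray(tape_size)
    ptr = pc = consumed = 0
    out = bytearray()
    while pc < len(program):
        ch = program[pc]
        if ch == ">":
            ptr += 1
        elif ch == "<":
            ptr -= 1
        elif ch == "+":
            tape[ptr] = (tape[ptr] + 1) % 256
        elif ch == "-":
            tape[ptr] = (tape[ptr] - 1) % 256
        elif ch == ".":
            out.append(tape[ptr])
        elif ch == ",":
            # Assumed convention: store 0 when input is exhausted.
            tape[ptr] = stdin[consumed] if consumed < len(stdin) else 0
            consumed += 1
        elif ch == "[" and tape[ptr] == 0:
            pc = jumps[pc]  # skip the loop body
        elif ch == "]" and tape[ptr] != 0:
            pc = jumps[pc]  # jump back to the loop start
        pc += 1
    return bytes(out)


if __name__ == "__main__":
    program = text_to_bf("Hi")
    print(len(program), run_bf(program))
```

Whether this counts as cheating depends on what the benchmark is meant to measure: the generator never reasons in Brainfuck at all, it only emits it.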
__alexs 3 hours ago
I had hoped we might finally be ushering in a bold new era of programming in Malbolge but apparently that was too optimistic.
simianwords 2 hours ago
I bet I can do better by allowing this: the LLM can pull documentation of the language from the web to understand how it works.
If the LLM has “skills” for that language, it will definitely increase accuracy.
rubyn00bie 24 minutes ago
I am not surprised by this, and am glad to see a test like this. One thing that keeps popping up for me when using LLMs is the lack of actual understanding. I write Elixir primarily and I can say without a doubt that none of the frontier models understand concurrency in OTP/BEAM. They look like they do, but they’ll often resort to weird code that doesn’t understand how “actors” work. It’s an imitation of understanding that is averaging all the concurrency code it has seen in training. The end result is a huge amount of noise when those averages aren’t enough: guarding against things that won’t happen, because they can’t… or actively introducing race conditions because they don’t understand how message passing works.
Current frontier models are really good at generating boilerplate, and really good at summarizing, but really lack the ability to actually comprehend and reason about what’s going on. I think this sort of test really highlights that. And it is a nice reminder that LLMs are only as good as their training data.
When an LLM or some other kind of model does start to score well on tests like this, I’d expect to see them discovering new results, solutions, and approaches to questions/problems. Compare that to how they work now, where they generally only seem to uncover answers that have been obfuscated but are present.
deklesen 3 hours ago
Mhh... my hunch is that part of this is that all Python keywords are 1 token, I assume. And for those very weird languages, tokenizing might make it harder to reason over those tokens.
Would love to see how the benchmark results change if the esoteric languages are changed a bit to make them have 1-token keywords only.
- chychiu 3 hours ago
  Considering that brainfuck only has 8 characters and models are scoring at 6.2% I don't think tokenization is the issue
  - altruios 2 hours ago
    The only issue. *
    Reasoning is hard, reasoning about colors while wearing glasses that obfuscate the real colors... even harder... but not the core issue if your brain is not wired correctly to reason.
    I suspect the way out of this is to separate knowledge from reason: to train reasoning with zero knowledge and zero language... and then to train language on top of a pre-trained-for-reasoning model.