This is a good way to benchmark models. We [the SWE-bench team] took the meta-version of this and implemented it as a new benchmark called CodeClash:
We have agents implement agents that play games against each other. So Claude isn't playing against GPT directly; an agent written by Claude plays poker against an agent written by GPT. This really tough task leads to very interesting findings on AI for coding.
https://codeclash.ai/
>this really tough task leads to very interesting findings on AI for coding
Are you going to share those with the class or?
Cool to see Core War! I feel it's mostly forgotten by now. My dad still plays it to this day, though, and even attends tournaments.
Leaderboard looks very outdated.
I'd really like to see them add a complex, open-world, fully physicalized game like Star Citizen (assuming the game itself is stable) with a single primary goal, like accumulating currency, as a measure of general autonomy and a proxy for how the model might behave in the real world given access to a bipedal robot.
My personal threshold for AGI is when an AI can 'sit down' - it doesn't need to have robotic hands, but it must use only visual and audio inputs to make its moves - and complete a modern RPG or FPS single-player game that it hasn't pre-trained on (it can train on older games).
https://arxiv.org/abs/2507.03793
If AI can program, why does it matter if it can play Chess using CoT when it can program a Chess Engine instead? This applies to other domains as well.
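To make that concrete, here's a minimal sketch of the "program an engine instead" route, using the python-chess library. The toy engine itself is my own illustration, not anything from the benchmark: material-only evaluation plus a one-move lookahead. The point is that the model writes code rather than reasoning through moves token by token.

    import chess

    # Toy engine (hypothetical): material-only evaluation, one-move lookahead.
    VALUES = {chess.PAWN: 1, chess.KNIGHT: 3, chess.BISHOP: 3,
              chess.ROOK: 5, chess.QUEEN: 9, chess.KING: 0}

    def material(board):
        # Positive means White is ahead in material.
        return sum(VALUES[p.piece_type] * (1 if p.color == chess.WHITE else -1)
                   for p in board.piece_map().values())

    def greedy_move(board):
        # Pick the legal move that maximizes material for the side to move.
        sign = 1 if board.turn == chess.WHITE else -1
        moves = list(board.legal_moves)  # materialize before mutating the board
        def value(move):
            board.push(move)
            v = sign * material(board)
            board.pop()
            return v
        return max(moves, key=value)

    board = chess.Board()
    # At the start every move ties on material, so this prints the first legal move.
    print(greedy_move(board))

Swap in a better evaluation or a deeper search and it only gets stronger, and none of that requires CoT at move time.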
> If AI can program, why does it matter if it can play Chess using CoT when it can program a Chess Engine instead?
Heh, we really did come full circle on this! When ChatGPT launched in Dec '22, one of the first things people noticed was that it sucked at math. Basic arithmetic like 12 + 35 would trip it up. Then people "discovered" tool use and added a calculator. And everyone was like, "well, that's cheating, of course it can use a calculator, but look, it can't do the simple addition logic"... And now here we are :)
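For anyone who missed that era, the "calculator tool" pattern was about this minimal. This is a sketch, not any vendor's actual API; the CALC(...) convention and the dispatch regex are made up for illustration:

    import ast, operator, re

    OPS = {ast.Add: operator.add, ast.Sub: operator.sub,
           ast.Mult: operator.mul, ast.Div: operator.truediv}

    def calc(expr):
        # Safely evaluate plain arithmetic and nothing else.
        def ev(node):
            if isinstance(node, ast.BinOp) and type(node.op) in OPS:
                return OPS[type(node.op)](ev(node.left), ev(node.right))
            if isinstance(node, ast.Constant) and isinstance(node.value, (int, float)):
                return node.value
            raise ValueError("not simple arithmetic")
        return ev(ast.parse(expr, mode="eval").body)

    # Intercept the made-up CALC(...) tool call in the model's output.
    output = "I should use a tool here: CALC(12 + 35)"
    expr = re.search(r"CALC\((.+?)\)", output).group(1)
    print(calc(expr))  # 47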
It's the same reason we are asked to write exams without using calculators even though the real world does have them.
How you work without a calculator is a proxy for real-world competency.
Funny, you picked probably the most useless form of benchmarking we apply to people as an example of measuring "competency" in the real world.
Are you in favour of children using calculators in exams?
This isn't my child. It is a program. I need it to get task X done, and I couldn't care less how it is done, whether strictly through CoT or with tools. There is no such thing as cheating in real work and no reason to handicap it. Just test the limits of what it can do with whatever means possible.
Trying to solve everything with CoT alone seems futile.
You are not understanding: it's a proxy for how well it does other things.
A lot of the insight in math comes from knowing how to do things efficiently. That's why the tests are timed. I don't know, this is pretty basic pedagogy that you are choosing to gripe about.
They should be allowed to! In fact, I think a better benchmark would be to invent new games and test the model's ability to allocate compute to minimax/AlphaZero the new games under compute constraints.
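Something like the following is the kind of code the model would have to produce: a minimax that respects a node budget. The Game interface (legal_moves, apply, done, score) and the TakeAway toy game are hypothetical, just enough to make the sketch self-contained:

    # Budgeted minimax sketch. score(s) is from the maximizer's view.
    def best_move(game, state, depth, node_budget):
        budget = [node_budget]  # shared, mutable node counter

        def minimax(s, d, maximizing):
            budget[0] -= 1
            if budget[0] <= 0 or d == 0 or game.done(s):
                return game.score(s)  # heuristic (or exact terminal) value
            values = [minimax(game.apply(s, m), d - 1, not maximizing)
                      for m in game.legal_moves(s)]
            return max(values) if maximizing else min(values)

        return max(game.legal_moves(state),
                   key=lambda m: minimax(game.apply(state, m), depth - 1, False))

    class TakeAway:
        # Made-up toy game: take 1-3 stones from a pile; taking the last wins.
        # State is (stones_left, maximizer_to_move).
        def legal_moves(self, s):
            return [m for m in (1, 2, 3) if m <= s[0]]
        def apply(self, s, m):
            return (s[0] - m, not s[1])
        def done(self, s):
            return s[0] == 0
        def score(self, s):
            if s[0] == 0:            # whoever just moved took the last stone
                return -1 if s[1] else 1
            return 0                 # no heuristic insight mid-game

    # From 10 stones the winning move leaves a multiple of 4, i.e. take 2.
    print(best_move(TakeAway(), (10, True), depth=10, node_budget=10_000))  # 2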
How about NetHack?
Curious why they decided to curate poker hands instead of playing normal poker.
Poker has very high variance; you'd need several hundred thousand hands to confidently say who's better. Also, you probably want to precompute the GTO-optimal play for benchmarking purposes.
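Back-of-envelope, in Python. The 5 bb/100 edge and 100 bb/100 standard deviation are assumptions of mine, ballpark figures for heads-up play, not numbers from the benchmark:

    # How many hands to show a win-rate edge at 95% confidence?
    z = 1.96      # two-sided 95% normal quantile
    edge = 5.0    # assumed true edge, big blinds per 100 hands
    sd = 100.0    # assumed std dev per 100 hands (heads-up ballpark)

    blocks = (z * sd / edge) ** 2           # 100-hand blocks needed
    print(f"~{int(blocks) * 100:,} hands")  # ~153,600 hands

Halve the edge and the required sample quadruples, since n scales with 1/edge^2, so for two closely matched models "several hundred thousand hands" is, if anything, optimistic.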
But can't computers play several hundred thousand poker hands easily in a couple of hours?
But now, because the curated hands are so strong, we don't see any folds.
Making models target a benchmark about being good at lying and getting away with it (Werewolf) is certainly an interesting choice.
Wow. I'm generally in the AI maximalist camp, but adding Werewolf feels dangerous to me. Anyone who's played knows that lying, deceit, and manipulation are often key to winning. Do we really want models climbing this benchmark?
Good question, but who's going to stop them?
AI already has a very creative imagination for role play, so this just adds another weapon to its arsenal.
Confidently and charismatically lying to clueless users has been one of the foundations of AI adoption.
Anecdotal data point, but recently I've found Gemini to perform better than ChatGPT when it comes to intent analysis.
Gemini tops all the benchmarks, but when it comes to real-world usage it is genuinely unusable.
It’s not that bad. I’ve been using 3 Pro for some time now and I’m quite happy with how it works. Best paired with Opus and Codex, like most models, but it’s solid as a full-stack buddy.