> However, this result made it clear that the reliability of state-of-the-art LLMs is fundamentally limited: If they need to complete every step correctly in order to solve a task, after a certain number of steps they will almost surely fail as a result of an underlying propensity to make errors, even when the answer should be obvious. While an error rate of 1-in-1,000 seems low, and would be great on a traditional LLM benchmark, on a task that requires successful execution of thousands of steps in a row, such a system results in inevitable failure.
What a relief to see an obvious problem actually acknowledged. I can't even guess how many times I've been shouted down about this exact topic in the reasoning debates on HN, or seen papers just kind of glossing over it as if it were a non-issue.
The next really natural question is.. if you're committed to decomposing a problem into tons of microsteps and voting.. why aren't we just embracing hybrid symbolic systems? The decomposition step kind of implies you're in a problem domain where variables separate out somewhat cleanly and that this should be doable. As far as I can tell the "voting" discussed in the paper is about candidate outputs, i.e. solutions to subproblems? If you switch to hybrid symbolic systems, then you can vote on candidate inputs to solvers and at least be damned sure that their output is always correct.
Also the success of chain-of-code compared with chain-of-thought approaches could actually imply that having no real solver is maybe not the obstacle you'd expect! Maybe you can invent a semiformal logic just in time that appears to be expressive enough to encapsulate the problem domain, and have the LLM emulate a nonexistent solver. If the error rate with this sort of approach is still too high, then at least you know concretely what solver or formal-language you need to implement in order to improve.
My own attempt at "chain-of-code with a Prolog DSL": https://news.ycombinator.com/item?id=45937480. Similarly to CodeAct the idea there is to turn natural language task descriptions into small programs. Some program steps are directly executed, some are handed over to an LLM. I haven't run any benchmarks yet, but there should be some classes of tasks where such an approach is more reliable than a "traditional" LLM/tool-calling loop.
Prolog seemed like a natural choice for this (at least to me :-), since it's a relatively simple language that makes it easy to build meta-interpreters and allows for fairly concise task/workflow representations.
Nice, I do like the direction. A Prolog dialect does seem like a natural choice if we must pick only one kind of intermediate representation, but ideally there could be multiple. For example, I saw your "legal reasoning" example.. did you know about https://catala-lang.org/ ? I think I'd like to see an LLM experiment that only outputs formal specifications, but still supports multiple targets (say Prolog, Z3, Storm, PRISM, Alloy and what have you). Once you can output these things you can use them in chain-of-code.
Anyway the basic point being.. it is no wonder LLM reasoning abilities suck when we have no decent intermediate representation for "thinking" in terms of set/probability primitives. And it is no wonder LLMs suck at larger code-gen tasks when we have no decent intermediate representation for "thinking" in terms of abstract specifications. The obsession with natural-language inputs/intermediates has been a surprise to me. LLMs are compilers, and we need to walk with various spec -> spec compilers first so that we can run with spec -> code compilers
Thank you, https://catala-lang.org/ looks very interesting. I've experimented a lot with LLMs producing formal representations of facts and rules. What I've observed is that the resulting systems usually lose a lot of the generalization capabilities offered by the current generation of LLMs (fine-tuning may help here, but is often impractical due to missing training data). Together with the usual closed-world assumption in e.g. Prolog, this leads to imho overly restrictive applications. So the approach I am taking is to allow the LLM to generate Prolog code that may contain predicates which are themselves interpreted by an LLM.
So one could e.g. have
is_a(dog, animal).
is_a(Item, Category) :- @("This predicate should be true if 'Item' is in the category 'Category'").
In this example, evaluation of the is_a predicate would first try the first clause and, if that fails, fall back to the second clause, which calls into the LLM. That way the system as a whole does not always fail when the formal knowledge representation is incomplete.
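Outside of Prolog, the same fallback idea could be sketched roughly like this (a minimal sketch, not the actual implementation; llm_yes_no() is just a placeholder for whatever model call the real system makes):

import typing

# Rule 1: the formal knowledge base (a plain set of facts here).
KNOWN_FACTS = {("dog", "animal"), ("oak", "tree")}

def llm_yes_no(question: str) -> bool:
    """Placeholder for an actual LLM call returning True/False."""
    raise NotImplementedError

def is_a(item: str, category: str) -> bool:
    # Try the formal rule first.
    if (item, category) in KNOWN_FACTS:
        return True
    # Rule 2: fall back to the LLM when the KB has no answer,
    # so an incomplete KB does not make the whole query fail.
    return llm_yes_no(
        f"Is '{item}' in the category '{category}'? Answer yes or no."
    )

The point of keeping the formal clause first is that the deterministic path always wins when it applies, and the LLM only fills gaps.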
I've also been thinking about the Spec->Spec compilation use case. So the original Spec could be turned into something like:
spec :- setup_env, create_scaffold, add_datamodel,...
I am honestly not sure where such an approach might ultimately be most valuable. "Anything-tools" like LLMs make it surprisingly hard to focus on an individual use case.
> While an error rate of 1-in-1,000 seems low, [...], on a task that requires successful execution of thousands of steps in a row, such a system results in inevitable failure.
This is also why (edit: non-LIDAR) FSD cars are an illusion.
FSD isn't, and never was, a sensor problem. It's an AI problem. Always was. Always will be.
Humans drive around with two mid-tier cameras on a pivot mount. Which means that any sufficiently advanced AI can do the same.
When a FSD car gets into an avoidable collision, you dump the blackbox data, and what do you see? You see that the cameras had it. All the information the car needed to avoid a collision was right there in the visual stream. The car had every bit of information it needed to make the right call, and didn't make the right call.
You can acknowledge that, and focus on building better AIs. Or you can neglect AI altogether, and have a car with 6 LIDARs drag a pedestrian, because it had all the sensor coverage but zero object permanence.
> Or you can neglect AI altogether, and have a car with 6 LIDARs drag a pedestrian, because it had all the sensor coverage but zero object permanence.
False dichotomy, much?
No, it's not false dichotomy. It's Cruise.
I am annoyed to no end by all the LIDAR wankery, because in practice, LIDAR systems don't provide much of an advantage over camera-only systems, and consistently hit the same limitations on the AI side of things.
Nonetheless, there is no shortage of people who, for some reason, think that LIDAR is somehow a silver bullet that solves literally everything.
So why do you think the only reliable FSD car out there is built around an expensive LIDAR system?
LIDAR may not solve everything, but the point is that it allows for greater safety margins. All the non-safety-critical parts can be done with AI, yes.
Reliable?
> Humans drive around with two mid-tier cameras on a pivot mount. Which means that any sufficiently advanced AI can do the same.
Yes, if we can get their error rate below 0.000001%. Until we get there, sensors + old school computer-vision provide safety.
What makes you think that "sensors + old school computer-vision" gives you an error rate better than "completely fucked"?
In simple terms, the LIDAR sensor will allow you to do "If object at X, don't go to X". But obviously, you need more than that. Old school Kalman filters for object tracking etc.
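As a rough sketch of that geometric gate (the numbers and structure are purely illustrative, not any production system):

import math

SAFETY_MARGIN_M = 1.5  # assumed clearance, purely illustrative

def path_is_clear(planned_path, obstacles, margin=SAFETY_MARGIN_M):
    """planned_path and obstacles are lists of (x, y) points in metres,
    e.g. tracked obstacle positions coming out of a LIDAR pipeline."""
    for px, py in planned_path:
        for ox, oy in obstacles:
            if math.hypot(px - ox, py - oy) < margin:
                return False
    return True

def control_step(planned_path, obstacles):
    if not path_is_clear(planned_path, obstacles):
        return "EMERGENCY_BRAKE"   # deterministic override
    return "FOLLOW_PLAN"           # hand back to the normal (AI) planner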
Raw dog LIDAR and "old school kalman filters" don't give you anywhere near good enough performance.
Want to know what poor performance looks like in practice? Like Tesla phantom braking, but ten times worse. And if you dial it down to avoid false positives, then it stops exerting any control over the AI, and you're back to having to get your AI to work well.
It's interesting that this level of tech reductionism is so common right now and isn't more openly challenged by engineers. Tough intractable problem? AI. How close are we? Soon! How soon? I don't know, I don't work on that problem.
Of course FSD is solvable with sufficiently advanced AI, and the same applies to all the other problems, but we don't yet have this level of AI and we don't know how far away we are from reaching it.
Companies that bet on assistive AI solutions (i.e. more sensors to plug AI gaps) will win and have the best chance at eventually reaching the level of AI where additional sensors are no longer needed.
Companies that go all-in on perfect AI have a very, very high chance of failure, not because they're not smart enough, or not driven enough, or are capital constrained, but because they don't fully understand the scope of the problem.
Also worth noting they are heavily incentivised to pump the AI bubble for existential reasons, and so their AI progress forecasts are not trustworthy.
"Reductionism" is right. If you could just always "plug the gaps with more sensors", then the car with 900 cameras and 400 LIDARs would have reached L4 autonomy back in year 2010.
It doesn't work like that. No amount of sensors can salvage piss poor driving AI. The gains from more and better sensors bottom out pretty fast. You completely saturate your ability to sink and fuse sensor data long before your driving actually gets good.
Waymo has driven a hundred million miles and is far safer than human drivers.
Yes because they use non-AI techniques (LIDAR) to prevent them from bumping into things.
I should have said non-LIDAR in my comment, yes.
LIDAR doesn’t stop them from bumping into things. LIDAR is a sensor; it doesn’t recognise anything or make decisions about steering, acceleration, or braking.
You need more than the sensor, obviously. The point is that you don't need any AI to make a system this way that is substantially safer than a system based only on camera feeds and AI.
Do you think Waymo using LIDAR means that Waymo aren’t using AI? Waymo are using AI.
That's not what I'm saying at all. Waymo uses LIDAR to ensure safety. They use AI for most of the rest.
> > While an error rate of 1-in-1,000 seems low, [...], on a task that requires successful execution of thousands of steps in a row, such a system results in inevitable failure.
> This is also why (edit: non-LIDAR) FSD cars are an illusion.
In this scenario, Waymo’s AI is executing thousands of steps in a row. The fact that it uses LIDAR for sensing doesn’t change that. It’s still AI driving you around no matter what its eyes are made of.
Waymo is a counterexample to the point you were making and their use of LIDAR doesn’t change that.
No because safety is guaranteed by the LIDAR and navigation is done by GPS+classical algorithms. Mistakes made by the AI can be overcome by those two non-AI approaches + reiterating the AI-based steps.
LIDAR cannot possibly guarantee safety. It is a sensor.
That's like saying an algorithm cannot guarantee safety because it is not an actuator.
An algorithm controls the actuator. The sensor does not control the algorithms or the actuators.
Look, it is a hell of a lot safer to use an input that does not hallucinate objects than an input that does.
I think you're still in the edit window, FYI. (At least for a few more minutes.)
Of course, we struggle to get humans to low error rates on large numbers of steps in sequence too, to the point where we devote vast amounts of resources to teaching discipline, using checklists, and doing audits and reviews to coax reliability out of an unreliable process.
So nobody should be surprised that this also applies to LLMs.
The issue is when people assume that a zero failure rate, or even close to zero, is necessary for utility, even though we don't need that from humans for humans to be useful for complex tasks.
For a whole lot of tasks, the acceptable error rate boils down to how costly it is to work around, and that is a function of the error rate, the consequence of an error that slips past, and the cost of a "reliable enough" detector to let us mitigate to whatever extent is cost effective by using one or more detection steps.
For a lot of uses, voting or putting the AI in a loop produces good enough results cheaply enough. For some it will require models with lower error rates first.
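For concreteness, something like this per-step loop is what I mean (a sketch only; generate() and check() are placeholders for the model call and whatever cheap detector you have, e.g. tests, a schema check, or a solver):

from collections import Counter

def generate(prompt: str) -> str:
    """Placeholder for sampling one candidate answer from an LLM."""
    raise NotImplementedError

def check(candidate: str) -> bool:
    """Placeholder for a cheap 'reliable enough' detector."""
    raise NotImplementedError

def solve_step(prompt: str, max_samples: int = 9) -> str:
    """Sample candidates, drop ones the detector rejects, majority-vote the rest."""
    accepted = []
    for _ in range(max_samples):
        candidate = generate(prompt)
        if check(candidate):
            accepted.append(candidate)
    if not accepted:
        raise RuntimeError("no candidate passed the detector")
    return Counter(accepted).most_common(1)[0][0]

Whether this is "cheap enough" is exactly the cost/error-rate tradeoff above.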
For some applications, sure, maybe solvers will be part of that, or in the mix. As will a lot of other tools. E.g. Claude likes to try to bisect when I ask it to fix a parser problem, and Claude is really bad at doing sensible bisection, so I had it write a dumb little bisection tool instead, and told it the steps to solve this type of problem, which include using that tool. So when we can have planning steps output "microsteps" that we can automate with more deterministic tools, then we absolutely should.
Heck, the models themselves "like" to write tools to automate things if you give them long lists of tedious little tasks to do, to the point that it takes effort to make them not do it, even when they have to write the tools themselves.
> The issue is when people assume that a zero failure rate, or even close to zero, is necessary for utility, even though we don't need that from humans for humans to be useful for complex tasks.
This argument doesn't carry because it is beside the point. Human vs. LLM utility parity isn't a sensible stop-goal for improvement. New technology isn't adopted for its legacy parity. Nor are there any specific technical barriers around human parity.
Fewer mistakes than humans, by definition, delivers unique value. People also want to spin up LLMs to handle tasks at scale in ways humans never could, where human level mistakes would be unacceptable.
So we very much do need LLMs (or whatever we call them tomorrow) to operate with lower error bars than humans. It is a reasonable demand. Lots of applications are waiting.
Given that demand, the value of avoiding any mistake, and the many people working on it, error rates will keep falling indefinitely.
> This argument doesn't carry because it is beside the point. Human vs. LLM utility parity isn't a sensible stop-goal for improvement. New technology isn't adopted for its legacy parity. Nor are there any specific technical barriers around human parity.
This is just utter nonsense. New technology is sometimes adopted because it is better, but just as often adopted even when the quality is strictly worse if it is cheaper.
But apart from that, you appear to be arguing against a point I never made, so it's not clear to me what the point of your response is.
> Fewer mistakes than humans, by definition, delivers unique value.
Yes, but that is entirely irrelevant to the argument I made.
> Given that demand, the value of avoiding any mistake, and the many people working on it, error rates will keep falling indefinitely.
And this is also entirely irrelevant to the point I made, and not something I've ever argued against.
> when the quality is strictly worse if it is cheaper
True. I stand corrected.
For a comprehensive rebuttal to this point of view, you may be interested in the works of W. Edwards Deming.
“No one knows the cost of a defective product - don't tell me you do. You know the cost of replacing it, but not the cost of a dissatisfied customer.” -Deming
No, I would not, as this argument is entirely irrelevant and doesn't address what I said.
> we struggle to get humans to low error rates on large number of steps in sequence too
Who said anything about AI vs humans? The contest in this context would be AI vs classical deterministic code, algorithms, solvers
> how costly it is to work around .. a function of the error rate, consequence of an error that slips past, the cost of a "reliable enough" detector.. produces good enough results cheaply enough.
I mean, you're right, but only sort of. Someone can use this same argument to justify the assertion that bogosort is really the pinnacle of engineering excellence. How would you respond?
> Who said anything about AI vs humans?
I did, because it is a relevant comparison.
> The contest in this context would be AI vs classical deterministic code, algorithms, solvers
No, it is not. In cases where we know how to solve things that way, we probably should, on the assumption that if they can deliver good enough results they are likely cheaper.
Those are not the things we generally are trying to use LLMs for.
> I mean, you're right, but only sort of. Someone can use this same argument to justify the assertion that bogosort is really the pinnacle of engineering excellence. How would you respond?
That it is an obviously specious argument, because we have clearly lower-cost sort algorithms, and so no, you can't use this same argument to justify that assertion.
Nice!
Briefly, the idea is to recursively decompose tasks into the simplest possible steps, recursively call (relatively small) LLMs as agents to execute one step at a time, and use a clever voting scheme to choose how to execute each step. The authors use this technique to get a relatively small LLM to solve Towers of Hanoi with 20 rings (1M steps). All of it using natural language.
The most obvious question is whether other tasks, more interesting -- less "rote" -- than Towers of Hanoi, can similarly be recursively decomposed into simple steps. I'm not sure that's always possible.
This works because the problem could be broken down into prompts that rarely hallucinate.
Most real world prompts can't be reduced to something so consistent and reliable.
Their key finding was that the number of votes grows linearly with the number of prompts you are trying to chain.
However, the issue is that the number of votes you need grows exponentially with the hallucination rate.
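Rough arithmetic to illustrate the compounding (my own numbers, not the paper's): with per-step error rate eps, the chance of surviving N steps is (1 - eps)^N, so the effective per-step error has to be pushed well below 1/N by voting/correction.

# Error compounding, assuming independent per-step errors.
def p_all_steps_ok(eps: float, n_steps: int) -> float:
    return (1.0 - eps) ** n_steps

print(p_all_steps_ok(0.001, 1_000))      # ~0.37: 1-in-1000 errors already hurt at 1k steps
print(p_all_steps_ok(0.001, 1_000_000))  # ~0.0:  hopeless at 1M steps without correction
print(p_all_steps_ok(1e-7, 1_000_000))   # ~0.90: per-step error must shrink with task length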
> into the simplest possible steps, recursively call (relatively small) LLMs as agents to execute one step at a time, and using a clever voting scheme to choose how to execute each step.
It's like humans! Everything old is new again :)
Why not? That's basically how NASA manages large projects.
One issue I often run into with this stuff is the tightly coupled nature of things in the real world. I’ll fashion an example:
Let’s say you break a job down into 3 tasks: A, B and C. Doing one of those tasks is too much for an LLM to accomplish in one turn (this is something you learn intuitively through experience), but an LLM could break each task into 3 subtasks. So you do that, and start by having the LLM break task A into subtasks A1, A2 and A3. And B into B1, B2 and B3. But when you break down task C, the LLM (which needs to start with a fresh context each time since each “breakdown” uses 60-70% of the context) doesn’t know the details of task A, and thus writes a prompt for C1 that is incompatible with “the world where A1 has been completed”.
This sort of “tunnel vision” is currently an issue with scaling 2025 agents. As useful context lengths get longer it’ll get easier, but figuring out how to pack exactly the right info into a context is tough, especially when the tool you’d reach for to automate it (LLMs) are the same tool that suffers from these context limitations.
None of this means big things aren’t possible, just that the fussiness of these systems increases with the size of the task, and that fussiness leads to more requirements for “human review” in the process.
I've been experimenting with this with a custom /plan slash command for claude code, available here: https://github.com/atomCAD/agents
Planning is definitely still something that requires a human in the loop, but I have been able to avoid the problem you are describing. It does require some trickery (not yet represented in the /plan command) when the overall plan exceeds reasonable context window size (~20k tokens). You basically have to start having the AI consider combinatorially many batches of the plan compared with each other, to discover and correct these dependency issues.
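A rough sketch of what that cross-checking could look like (not the actual /plan implementation; llm() is a placeholder for the model call):

from itertools import combinations

def llm(prompt: str) -> str:
    """Placeholder for a model call; returns a textual verdict."""
    raise NotImplementedError

def cross_check_plan(batches: list[str]) -> list[tuple[int, int, str]]:
    """Compare every pair of plan batches and collect reported conflicts.

    O(n^2) in the number of batches, which is why it only becomes necessary
    once the plan no longer fits in a single context window.
    """
    conflicts = []
    for (i, a), (j, b) in combinations(enumerate(batches), 2):
        verdict = llm(
            "Do these two parts of the same plan make incompatible assumptions "
            f"about the codebase or each other?\n\nPart {i}:\n{a}\n\nPart {j}:\n{b}\n"
            "Answer 'OK' or describe the conflict."
        )
        if verdict.strip().upper() != "OK":
            conflicts.append((i, j, verdict))
    return conflicts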
>the LLM (which needs to start with a fresh context each time since each “breakdown” uses 60-70% of the context) doesn’t know the details of task A, and thus writes a prompt for C1 that is incompatible with “the world where A1 has been completed”.
Can't that be solved with sub-agents? The main agent oversees and combines code, and calls sub-agents for each task.
Reasoning by analogy is great for intuition, but doesn’t guarantee real results hold. Consider “voltage is like water pressure in pipes, so if there’s a cut in my wire’s insulation, the device won’t get enough voltage” — clearly this is not true, even though it relies on an analogy that’s generally useful.
I really like that analogy, thank you for it. Also applies to “it’s overvoltage, so I just need to poke a little hole in it to let the excess bleed out”…
That one can work, briefly, depending on how conductive your tool is.
If air was highly conductive that analogy would totally hold.
"If there’s a cut in my wire’s insulation, the device won’t get enough voltage" doesn't follow from: "voltage is like water pressure in pipes"
So I don't really get your point.
> "If there’s a cut in my wire’s insulation, the device won’t get enough voltage" doesn't follow from: "voltage is like water pressure in pipes"
I absolutely agree! In the same way, "an LLM can solve complex problems if it breaks them into subtasks" doesn't follow from "NASA breaks large projects into smaller parts"
Well, corona losses are a thing, after all.
This is a really good analogy, because the complex interactions between multiple groups working independently while trying to collaborate in a hierarchy toward one large goal were one of the things that hid a lot of the problems that led to the Challenger disaster, according to Feynman.
I'm pretty sure the problem with the shuttle was that it had too many (possibly conflicting) goals instead of one large goal.
It's manned, even though most launches probably could be done without crew. The deadly Challenger launch was risking human crew for something as mundane as launching two satellites into space.
Because it's manned, it has to be able to land at airports, because retrieving astronauts at sea is an unreasonable complication for launching a satellite. Damage to the wings will cause loss of the entire aircraft, something that is unlikely to happen to a capsule.
Because it is a horizontal landing system, the aerodynamics favor putting the shuttle on the same level as the external fuel tank, which exposes the wing to debris from the top of the external fuel tank. If you try building a vertical shuttle in KSP, you will notice that the wings give you too much control authority during launch. Fins are best placed near the bottom of the rocket.
It's reusable, which means wear and tear can secretly accumulate without you noticing. This significantly increases the design requirements for the critical components, like the SRB that had a poor "tang" design, which, as it turns out, was definitively not fit for reuse.
It is also what made the space shuttle possible in the first place, so I'd be careful about generalizing too much from that observation.
The space shuttle’s design was also deeply flawed, to the point that it failed at its core objective: significantly lowering costs. Instead the core mission was sacrificed to meet some arbitrary design goals such as being able to de-orbit heavy objects.
That’s the core issue with decomposition of tasks, you aren’t communicating back up the chain and finding globally optimal solutions unless the task is simple enough to be completely understood.
"basically" is doing a lot of work in this sentence.
IBM tried that with CMM (the Capability Maturity Model), and it didn't work. The problem is that NASA knows what they're building: rockets and satellites don't have any grey areas, and everything is specified. Other things are less well defined, and the people specifying them aren't rocket scientists.
I could imagine that even a small task at NASA might involve more knowledge and logic than the smallest task in a Towers of Hanoi problem.
Depends on what is considered small enough for the LLM to resolve with high confidence.
NASA has done a lot of amazing things but I wouldn’t bet on them winning a Super Bowl.
They'd have a 50% chance of winning one on Mars, since it would just be NASA vs China
Every year NASA has a 50% chance of winning the Superbowl- even on Earth!
Either they win or don't. /s
Its LLMs all the way down :-)
This can't be scaled to more generalised tasks. If you solve that then you've solved the hallucination issue.
> All of it using natural language.
Combining this with those approaches that recursively reason in latent space would be interesting.
It seems like this could be implemented by any harness.
> The approach relies on an extreme decomposition of a task into subtasks, each of which can be tackled by focused microagents. The high level of modularity resulting from the decomposition allows error correction to be applied at each step through an efficient multi-agent voting scheme.
Big if that the decomposition and the voting happen accurately for anything other than toy problems
The approach in the paper specifically addresses the case where an LLM can usually solve a task when it requires few steps, but fails for the same kind of task with more steps because it randomly gets a step in the middle wrong and then derails. It can't do anything for tasks that the LLM can't solve even when there's just a few steps.
In other words, it compensates for random error, not systematic error.
Worth opening the pdf just for the graph on page 1.
A striking example of how not to present data. If the Cognizant AI team is here: please can you fix it in the next version of the paper?
Obviously, not the best plot to use according to Data Visualization theory and common practice, but I think it candidly conveys the point anyway.
As someone else points out, the data is the worrying aspect, as it points towards state-of-the-art models not being able to make more than effectively zero consecutive steps without errors.
I think it's a brilliant example of how to use data to make a point.
https://xkcd.com/1162/
Except on figure 1 they're all at 0, making it look like the authors didn't know how to use the models or deliberately made them do nothing.
I think it just looks that way because they used a linear x axis for comedic effect.
http://www.vibechart.net
I was just thinking "these guys will talk about this graph for the rest of their lives", it's the best graph you could ever hope to put into a paper. Loved it.
In case you want to know what’s going on in the left side of that chart, they gave a log-scale version in Appendix A. I was thinking it was silly to not just use that version up top, but I guess log scales make big differences ’feel’ smaller.
A log scale is actually appropriate in this context from a first-principles perspective. Per scaling laws (and also general behavior of epsilon-probability of failure multiplied N times), you would generally expect more vs. less effective techniques to have multiplicatively greater or fewer steps until failure, not additively greater/fewer. Figure 1 is comical, but the appendix figure is the more scientifically appropriate one.
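Concretely (my own illustration, not the paper's data): if failures are roughly independent with per-step probability eps, the expected number of steps before the first failure is about 1/eps, so improvements show up as multiplicative jumps that only a log axis keeps readable.

# Expected steps until the first error, assuming independent failures per step.
for eps in (1e-2, 1e-3, 1e-4, 1e-5):
    print(f"per-step error {eps:g} -> ~{1 / eps:,.0f} steps before the first failure")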
At that rate, they might as well have gone one step further and made the x axis exponential scale to make it feel even bigger.
The dashed lines on top of the data points and labels are making me wince
Really seems like the reason logarithmic scales were invented..
I dunno, even though the authors address its use, making the task Towers of Hanoi doesn't live up to the excitement of the title.
Especially since it's a recursive problem, so each step is naturally broken up into subtasks. And the algorithm for what subtasks to break it up into is public. This makes it much easier to get down to a case that the LLM can reliably solve.
I guess that the subtask decomposition of many (sub)problems is known and in the training distribution. How many real-world problems are resistant to divide-and-conquer? Presumably most/all of the unsolved mathematics conjectures. What else?
And yet the reverse paper was posted ad nauseam, covered by every news slop site, and overblown with really negative takes.
Hmm... The key is to successfully decompose a big, hard problem into easier atomic sub-problems. However, the decomposition process itself is difficult, and this paper is not about that. They decompose a task using a human-written prompt.
I have ADHD and the same approach works for me. (In fact, most days it is essential!)
do you have an algorithm for breaking down, organizing, and scheduling the small tasks, though? can it also be broken down?
Some real life problems cannot be decomposed or cannot be decomposed with ease by an LLM.
Also, if we decompose a big task into many tasks, some might be solved in a way that is incompatible with the rest of the tasks, and you cannot combine them.
Right, it’s kind of like solving systems of linear equations. Some can be solved just by reordering, but most need you to handle all the constraints at once.
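A tiny illustration of that distinction (my own example, using numpy):

import numpy as np

# Triangular system: each equation introduces one new unknown, so you can
# "decompose" it and solve one variable at a time by forward substitution.
L = np.array([[2.0, 0.0, 0.0],
              [1.0, 3.0, 0.0],
              [4.0, 1.0, 5.0]])
b = np.array([2.0, 5.0, 14.0])
x = np.zeros(3)
for i in range(3):
    x[i] = (b[i] - L[i, :i] @ x[:i]) / L[i, i]
print(x)  # ~[1.0, 1.33, 1.73], solved one variable at a time

# Fully coupled system: no ordering lets you solve variables independently;
# all constraints have to be handled at once.
A = np.array([[1.0, 2.0, 3.0],
              [2.0, 1.0, 1.0],
              [3.0, 1.0, 2.0]])
c = np.array([6.0, 4.0, 6.0])
print(np.linalg.solve(A, c))  # [1. 1. 1.]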
This has seemed to me to be the natural next step to turn LLMs into more deterministic tools. Pushing the frontier is nice, but I think LLMs have a whole different gear when they are able to self-decompose in a reliable way. Most of my success creating reusable LLM products came from determining where requirements/outputs need to be "hard" vs. "soft".
Here is the pseudocode of MAKER:
state = init_state()
while state is not complete:
    state = LLM("You are a helpful assistant. The rules and format of the game is [...]. The correct strategy to use at each step is [...]. The current state is [...]. Output the state after making the next move")
The problem is how to even define a task using the English language and make sure there is enough entropy to infer the detailed intent, so that it can later be split into zillions of small steps which can be executed over time by an LLM.
In English, that's hard, but there are programming languages ... specialized in breaking a complex task down for computers to understand.
...
One issue I see is when steps in a plan depend on one another: when you cannot know all the next steps exactly before seeing the results of the previous ones, and when you may have to backtrack sometimes.
This is actually good insight and worded in a simple way that clicked in my brain, thanks!
And you can decompose the proof of Fermat's last theorem into logical combinators.
The meat is in decomposing the difficult problem into steps
On the surface this is an interesting concept...
The paper however, meh...
No mention of MoE. One would think this is a logical evolution of that, but there's not a mention (that I saw). Its own rubric for the task, Towers of Hanoi, was admittedly weak.
LLM papers are starting to look like the last decade of JS frameworks and tools. Only with less code and more academics, and that's disappointing, because I think a lack of pragmatism and grounding is now holding the field back...