Author here. In short:
- We launched an AI bug scanner 6 months ago. We wanted to know what the bugs that get fixed have in common.
- We clustered 1,000 bugs from 99 codebases and grouped them by failure mechanism.
- 21 recurring mechanisms cover 70% of the bugs. These same mistakes show up again and again in completely unrelated products. We seem to be writing similar bugs over and over again with agents.
- A common denominator is silence. Bugs that make it to production today are not easily ‘legible’. They don’t crash, they pass CI, they look fine. You don’t see them unless you look for them.
If you’d like to know anything else about these bugs, let us know. Happy to analyze the dataset further.
Interesting data and results. It kind of looks like they are good at plain coding (no buffer overflows or such things), but as noted, application logic is the tricky part. Things that are difficult for humans (authentication, parallelism) seem to be tricky for them too.
I'd wonder whether they are putting the logic together badly from good instructions, or whether the prompt was missing pieces and the agent followed it correctly but produced broken code because of the missing requirements / details.
Good question. We don't collect the 'what led to this code' data, such as agent traces, so I can't say for sure.
From my own experience, I would attribute most of the bugs that I write to the agent not considering one of the many constraints it should consider. This could be the agent assuming a different shape of data, or not browsing the codebase thoroughly enough at some point during its work and missing a parallel, or not testing an edge case because it didn't think of it, and so on.
So, I would place this more in the latter bucket, 'missing requirements / details'. But it's not clear to me that I should be giving the requirements as a prompt. Like a good engineer, an ideal agent would spend a good amount of time understanding all the requirements before completing work. This could include browsing the codebase more extensively, running more tests, asking me for details it cannot find, and so on.
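To make the 'different shape of data' case concrete, here is a minimal hypothetical sketch. It is not taken from the dataset; the function names and the `price_cents` / `price` fields are made up for illustration. The first version assumes a field name and silently returns a plausible-looking number when the assumption is wrong, and the second fails loudly instead.

```python
from typing import Any


def total_cents_silent(order: dict[str, Any]) -> int:
    # Hypothetical example: assumes each item carries "price_cents".
    # If the real data uses another key, .get() falls back to 0,
    # so nothing crashes and the total is simply wrong.
    return sum(item.get("price_cents", 0) for item in order.get("items", []))


def total_cents_strict(order: dict[str, Any]) -> int:
    # Same calculation, but a shape mismatch raises instead of passing quietly.
    total = 0
    for index, item in enumerate(order["items"]):
        if "price_cents" not in item:
            raise ValueError(f"item {index} has no 'price_cents': {item!r}")
        total += item["price_cents"]
    return total


if __name__ == "__main__":
    # The (assumed) upstream data uses "price", not "price_cents".
    order = {"items": [{"price": 500}]}
    print(total_cents_silent(order))  # wrong total, no error raised
    try:
        total_cents_strict(order)
    except ValueError as exc:
        print(f"caught: {exc}")
```

The silent version would pass a test written with the same wrong assumption about the data, which is roughly the 'passes CI, looks fine' pattern described above.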