Booktest was built on the back of a two-decade career in data science. It has been used to support R&D on numerous LLM, ML, NLP, and information retrieval projects, as well as more traditional software engineering.
It was partly inspired by earlier examples (kudos to Ferenc), but especially by the real pain of asserting ML QA with regression testing, transparency, and iteration cycle speed.
So, in systems where correctness is fuzzy, evaluation is expensive, and changes have non-local effects, a failing test without diagnostics often raises more questions than it answers. This is a painful combination if left unsolved.
Booktest is now the 3rd or 4th iteration of the same idea, and as such it addresses the most common needs and problems in this space.
It is a review-driven regression testing approach that captures system behavior as readable artifacts, so humans can see, review, and reason about regressions instead of fighting tooling.
This approach has been used in production for testing ML/NLP systems processing large volumes of data, and we’ve now open-sourced it.
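To make the idea concrete, here is a rough sketch of what review-driven snapshot testing boils down to (illustrative only, not booktest's actual API): the test renders its behavior into a readable markdown artifact, and any change in that artifact surfaces as a diff for a human to review and accept.

    # Illustrative sketch of the review-driven idea; booktest's real API differs.
    import difflib
    from pathlib import Path

    def run_review_test(name: str, render) -> None:
        accepted = Path("books/accepted") / f"{name}.md"   # previously reviewed snapshot, kept in Git
        produced = Path("books/out") / f"{name}.md"
        produced.parent.mkdir(parents=True, exist_ok=True)

        produced.write_text(render())                      # capture behavior as readable prose/tables

        if not accepted.exists():
            raise AssertionError(f"No accepted snapshot for {name}; review {produced} and accept it")
        diff = list(difflib.unified_diff(
            accepted.read_text().splitlines(), produced.read_text().splitlines(),
            fromfile="accepted", tofile="produced", lineterm=""))
        if diff:
            # The diff itself is the diagnostic a human reviews before re-accepting.
            raise AssertionError("Behavior changed:\n" + "\n".join(diff))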
I'm curious whether this matches others’ experience, and how people handle this today.
One of the key techniques is snapshotting LLM (or any HTTP) requests. This means that if the inputs don't change, the LLM will not be called. It also snapshots/caches LLM verification steps.
This doesn't only save costs; the main goal was to force determinism and save time. Limited changes may require only the new or changed tests to be rerun against the LLM. CI typically has no LLM API keys and only reruns against snapshots, with zero cost and delay.
LLM operations tend to be notoriously slow, and at least on our side we are often more interested in how our code interacts with the LLM. Having the LLM fully snapshotted makes iterating on the code delightfully fast.
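As a rough illustration of the snapshotting idea (not necessarily how booktest implements it; the paths and the OPENAI_API_KEY variable are just placeholders): requests are hashed, replayed from disk when a snapshot exists, and only sent to the LLM when something actually changed.

    # Hedged sketch of request snapshotting, not booktest's real mechanism.
    import hashlib, json, os
    from pathlib import Path

    SNAP_DIR = Path("books/snapshots")

    def cached_llm_call(payload: dict, call_llm) -> dict:
        key = hashlib.sha256(json.dumps(payload, sort_keys=True).encode()).hexdigest()
        snap = SNAP_DIR / f"{key}.json"
        if snap.exists():
            return json.loads(snap.read_text())          # unchanged inputs: no API call, fully deterministic
        if not os.environ.get("OPENAI_API_KEY"):          # typical CI setup: no keys, replay only
            raise RuntimeError(f"No snapshot for request and no API key available")
        response = call_llm(payload)                       # only new/changed requests hit the LLM
        SNAP_DIR.mkdir(parents=True, exist_ok=True)
        snap.write_text(json.dumps(response, sort_keys=True))
        return response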
If you want to do sampling, it can be implemented in the test code. Booktest is a bit like pytest in the sense that the actual testing-logic heavy lifting is left to the developer. Many LLM test suites are more opinionated, but also more intrusive in that sense.
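For example, a sampling-style check could look roughly like this in plain test code (the run_assistant callable, the case format, and the threshold are placeholders for whatever you are testing):

    # Hedged sketch: score a set of sampled cases and assert an aggregate pass rate
    # instead of exact matches. run_assistant is whatever pipeline you are testing.
    def check_pass_rate(run_assistant, cases, threshold=0.9):
        # cases: list of (prompt, expected_substring) pairs
        scores = [
            1.0 if expected in run_assistant(prompt) else 0.0
            for prompt, expected in cases
        ]
        pass_rate = sum(scores) / len(scores)
        assert pass_rate >= threshold, f"pass rate dropped to {pass_rate:.2f}"
        return pass_rate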
Hmm... So, if I understand correctly, you are maintaining QA suites for 10k agents, their prompts, toolboxes, and some scenarios.
That sounds like an absolutely massive scale to do QA over. Running the test suite must cost a fortune and take ages.
That said, I don't think data storage is that big a problem. E.g. if you have 10-100 requests stored per agent, that's 100k-1M snapshots. Booktest normally stores the state in Git, but there is also some DVC support. If you need to recreate all snapshots regularly, e.g. because you change some system-wide properties often, that may or may not become a problem. Git and DVC can handle quite high scale, but Git PR reviews won't work with e.g. 1M changed files.
Our scale in a monorepo was maybe 10k LLM snapshots in several hundred files, which worked technically. Recreating all evaluations was a bit slow (e.g. 5-10 minutes), and merges often forced recreation of snapshots and re-reviews. This is of course a massively smaller scale than what you are dealing with. We did use booktest for an assistant, but it was just one assistant over dozens of use-case flows.
I guess it could be somewhat manageable if you avoid test-level manual review and only review aggregated results in the tool, but I cannot really promise anything. It may be worth a try.