We've found that methods like this substantially increase the quality and reliability of coding agent output. Being able to run code in a sandbox, drive an interactive session through a browser, API calls, or other apps, and visually confirm output via vision models plugs a big hole in the feedback loop for agents modifying a complex codebase.
We've had agents go as far as interactively testing how our product responds in video calls: launching our full stack in a set of Docker containers (app, api, db, queues, etc.), all inside a larger sandbox, populating test data, connecting the mock system to a real video call solution like Google Meet, and injecting audio and video to test the response. End-to-end, like a real user flow.
It's not perfect yet, but if you are skeptical of the ability of AI agents to productively modify a complex product, I'd highly encourage you to play with a setup like this before ossifying your conclusions.
Didn't Claude Fable do this? (And I think Codex and Claude Code in general.)
When Fable was around last week, I was smitten with it. I took an executable file from an old DOS application and told it to port it to the Mac. From that single prompt, it disassembled the executable and gathered as much information as it could, set up a test rig with DOSBox to run the original, and then continuously refined the ported application while testing it against the original file. 15 minutes later it had a 99% identical-looking and functioning application running natively on the Mac. Some final refinements got that to 100%.
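For anyone curious what "testing it against the original" can look like in practice, here is a minimal sketch of that kind of differential check. It is not what Fable actually ran: `reference_impl`, `ported_impl`, and `find_mismatches` are hypothetical names, and the two functions are toy stand-ins for the DOS original and the Mac port.

```python
from typing import Callable, Iterable, List, Tuple


def reference_impl(x: int) -> int:
    # Hypothetical stand-in for the behaviour of the original program.
    return x * x + 1


def ported_impl(x: int) -> int:
    # Hypothetical stand-in for the port being refined.
    return x ** 2 + 1


def find_mismatches(
    reference: Callable[[int], int],
    candidate: Callable[[int], int],
    inputs: Iterable[int],
) -> List[Tuple[int, int, int]]:
    """Return (input, expected, actual) for every input where the outputs differ."""
    mismatches = []
    for value in inputs:
        expected = reference(value)
        actual = candidate(value)
        if expected != actual:
            mismatches.append((value, expected, actual))
    return mismatches


# Run both implementations on the same inputs and collect any differences.
mismatches = find_mismatches(reference_impl, ported_impl, range(-5, 6))
```

An agent loop would keep editing the port until the mismatch list is empty, which is presumably how a "99% identical" result gets pushed to 100%.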
OOT, but this website design is fabulous. The only thing is the font size is just a bit too small. That sticky header effect on mobile is chef’s kiss.
How does this work when a project has many external dependencies, like an S3 bucket, a secret manager, a third-party API, etc.?
In my experience, you are still left with these annoying parts (i.e., figuring out how to give your agents appropriate access).
We're working on a way for you to expose creds safely into our sandbox. But for now, it's limited to mocked API calls, clicks around the UI, and unit tests.
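To illustrate the mocked-API-call case in general terms, here is a small sketch using Python's standard `unittest.mock`. It is an assumption about how such a test might look, not a description of the sandbox itself: `archive_report` is hypothetical application code, and the `put_object` call shape is only modelled on an S3-style client.

```python
from unittest.mock import Mock


def archive_report(storage, bucket: str, key: str, body: bytes) -> str:
    """Hypothetical app code: upload a report and return its location."""
    storage.put_object(Bucket=bucket, Key=key, Body=body)
    return f"s3://{bucket}/{key}"


# The Mock stands in for the real storage client, so no credentials are needed.
fake_storage = Mock()
location = archive_report(fake_storage, "reports", "2024/q1.txt", b"hello")

# Verify the code under test made exactly the call we expected.
fake_storage.put_object.assert_called_once_with(
    Bucket="reports", Key="2024/q1.txt", Body=b"hello"
)
```

This checks that the code makes the right call, but not that the real bucket, permissions, or network path work, which is the gap the credentials question is about.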
Are the "clicks around the UI" converted into end-to-end tests eventually, e.g. via Playwright?
Not yet - but we want to do this. The same is true for the ephemeral unit tests that Greptile writes.
I wonder how long it will take for someone to pwn this?
To be fair, LeetCode and HackerRank also run arbitrary code.