Link: https://github.com/pdufour/browser-use-wasm
One of the big constraints of browser-use models is that they require a server running a vision-language model (VLM) to process the screenshots and convert them into actions.
That means if, for instance, you are a site owner and you want to include an AI widget that lets users control the page they are on (e.g. ask the page to fill out this form), you would need a complicated server setup running a VLM.
I decided to build something different. We have had WebGPU and client-side models for a while, so I built a library that does the following:
[Live page (iframe)] ──► [SnapDOM screenshot] ──► [ShowUI VLA WASM worker] ──► [DOM action at [x, y]]
Essentially this creates a browser-use model that runs entirely in your browser (no servers). A couple of libraries make this possible:
- wllama lets you run any GGUF model in the browser, which means easy access to the VLA models on Hugging Face (I found ShowUI-2B to be the best so far, but I want to try Nvidia LocateAnything)
- SnapDOM - as shown in the diagram above, this renders your webpage to an SVG, which is then passed to the VLA
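To make the last arrow of the pipeline a bit more concrete, here is a minimal sketch of turning a model prediction into a point on the page. It assumes the model reports a click position normalized to 0..1 relative to the screenshot (that is my understanding of ShowUI's output, so treat it as an assumption), and that the screenshot was scaled down with its aspect ratio preserved before inference. `fitWithin`, `toPagePoint`, and the 1024px limit are hypothetical names and values for illustration; none of them come from wllama or SnapDOM.

```typescript
// Hypothetical helpers for illustration; not part of wllama or SnapDOM.
interface Size { width: number; height: number }
interface Point { x: number; y: number }

// Scale a screenshot down so its longest side fits within maxSide,
// preserving the aspect ratio. Never upscales.
function fitWithin(src: Size, maxSide: number): Size {
  const scale = Math.min(1, maxSide / Math.max(src.width, src.height));
  return {
    width: Math.round(src.width * scale),
    height: Math.round(src.height * scale),
  };
}

// Convert a normalized prediction (assumed to be in 0..1) into pixel
// coordinates of the captured page. Because the resize keeps the aspect
// ratio, normalized coordinates map straight back to the original size.
// Out-of-range predictions are clamped to the page edges.
function toPagePoint(norm: Point, page: Size): Point {
  const clamp = (v: number): number => Math.min(1, Math.max(0, v));
  return {
    x: Math.round(clamp(norm.x) * (page.width - 1)),
    y: Math.round(clamp(norm.y) * (page.height - 1)),
  };
}

// Example: a 1920x1080 capture squeezed into a 1024px budget.
const page: Size = { width: 1920, height: 1080 };
const modelInput = fitWithin(page, 1024);
const target = toPagePoint({ x: 0.5, y: 0.5 }, page);
console.log(modelInput, target);
```

In the browser, and assuming the screenshot covers the visible viewport of the iframe, the resulting point could be handed to `document.elementFromPoint(x, y)` on the iframe's document to find the element to act on.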
After creating the workflow with those libraries, the rest is cake (not).
Some difficulties I had and my solutions for them:
- SnapDOM had 1px rendering differences caused by inconsistencies in how HTML that uses a system font is rendered inside a foreignObject tag in an SVG - the fix is to use fonts from a CDN, which provide the font metrics needed for consistent leading values
- Image resizing - you have to resize the screenshot to fit everything into limited space - this involved trying many different resizing approaches
- Accuracy - figuring out what actually improved accuracy was quite hard at first, until I found evals such as MiniWoB++ (a web interaction test suite)
- Multi-step planning - my half-baked solution is to let the LLM generate multiple steps at once, but for it to be comprehensive I would need to capture the page, generate, capture the page, generate, and so on in a loop. I haven't done that yet
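The capture, generate, capture, generate loop from the last bullet could look roughly like this. To be clear, this is not implemented in the library: `runLoop`, the `Action` shape, the `capture` / `predict` / `act` callbacks, and the `maxSteps` cap are all hypothetical and only there to show the control flow. The callbacks are passed in so the loop does not depend on any particular SnapDOM or wllama call.

```typescript
// Hypothetical shapes for illustration; the library does not implement this loop yet.
type Action =
  | { kind: "click"; x: number; y: number }
  | { kind: "type"; text: string }
  | { kind: "done" };

interface Steps<Shot> {
  // Take a fresh snapshot of the page (e.g. a SnapDOM render).
  capture: () => Promise<Shot>;
  // Ask the model for the next action given the goal and what was done so far.
  predict: (shot: Shot, goal: string, history: Action[]) => Promise<Action>;
  // Apply the action to the DOM.
  act: (action: Action) => Promise<void>;
}

// Re-capture the page before every prediction so each step sees the
// result of the previous one. Stops when the model reports "done" or
// after maxSteps iterations, whichever comes first.
async function runLoop<Shot>(
  goal: string,
  steps: Steps<Shot>,
  maxSteps = 10,
): Promise<Action[]> {
  const history: Action[] = [];
  for (let i = 0; i < maxSteps; i++) {
    const shot = await steps.capture();
    const action = await steps.predict(shot, goal, history);
    if (action.kind === "done") break;
    await steps.act(action);
    history.push(action);
  }
  return history;
}
```

The step cap matters for a small local model: without it, a model that never emits a stop signal would keep clicking forever.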
I am very interested in the client-side LLM space, so let me know if you have any thoughts!