Today we're releasing MolmoWeb, an open-source agent that can navigate + complete tasks in a browser on your behalf.
Built on Molmo 2 in 4B & 8B sizes, it sets a new open-weight SOTA across four major web-agent benchmarks & even surpasses agents built on proprietary models. 🧵

MolmoWeb works by looking at the same screen you do.
Given a task and a live webpage, it views the screenshot, decides what to do next, and takes action: clicking, typing, scrolling, switching tabs, or returning information to you.
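A rough sketch of that observe, decide, act loop. This is not MolmoWeb's actual code: `Action`, `run_agent`, and the three callables (`get_screenshot`, `choose_action`, `execute`) are hypothetical names used only to illustrate the control flow, and the step limit is an assumption.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class Action:
    # Hypothetical action record; kinds mirror the ones named in the thread:
    # "click", "type", "scroll", "switch_tab", or "answer".
    kind: str
    argument: Optional[str] = None


def run_agent(
    task: str,
    get_screenshot: Callable[[], bytes],
    choose_action: Callable[[str, bytes, List[Action]], Action],
    execute: Callable[[Action], None],
    max_steps: int = 20,  # assumed cap, not a documented MolmoWeb setting
) -> Optional[str]:
    """Loop: look at the screen, pick the next action, apply it."""
    history: List[Action] = []
    for _ in range(max_steps):
        screenshot = get_screenshot()                     # observe the page
        action = choose_action(task, screenshot, history)  # model decides
        history.append(action)
        if action.kind == "answer":
            # The agent returns information to the user and stops.
            return action.argument
        execute(action)                                   # act in the browser
    return None  # step budget exhausted without an answer
```

In a real setup, `get_screenshot` and `execute` would wrap a browser-automation layer and `choose_action` would call the model; here they are left as plain callables so the loop can be exercised with stubs.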

MolmoWeb can handle a wide range of everyday tasks, including navigating websites, filling out forms, searching and filtering product listings, and finding information—all without needing specialized APIs for each site.
youtube.com/watch?v=rzkBE8…
MolmoWeb was trained on a mix of datasets including:
◎ Trajectories generated by an AxTree-based LLM agent
◎ Human demonstrations collected via a custom Chrome extension
◎ Data that teaches the model to read & interpret what's on screen

MolmoWeb outperforms all open-weight models on every benchmark we tested, and even surpasses visual agents built on much larger models like GPT-4o-based SoM Agents.
It also beats OpenAI CUA on 3 out of 4 benchmarks.


Performance improves further by scaling compute at inference time.
On both WebVoyager and Online-Mind2Web, MolmoWeb with 4 parallel attempts surpasses the best single-attempt performance of every model we evaluated, including agents powered by GPT-5 and Gemini CU Preview.
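To make "4 parallel attempts" concrete, here is a minimal sketch of one way such a number could be scored. The thread does not say how attempts are aggregated, so this assumes a pass@k-style rule (a task counts as solved if any of the k independent attempts succeeds). `solved_within_k`, `pass_at_k`, and the `attempt` callable are hypothetical names, not part of the MolmoWeb release.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Sequence


def solved_within_k(attempt: Callable[[str, int], bool], task: str, k: int = 4) -> bool:
    """Run k independent attempts in parallel; True if any one succeeds.

    `attempt(task, i)` is a hypothetical callable that runs the agent once
    (attempt index i) and reports whether the task was completed.
    """
    with ThreadPoolExecutor(max_workers=k) as pool:
        outcomes = list(pool.map(lambda i: attempt(task, i), range(k)))
    return any(outcomes)


def pass_at_k(attempt: Callable[[str, int], bool], tasks: Sequence[str], k: int = 4) -> float:
    """Fraction of tasks solved by at least one of k attempts (assumed metric)."""
    if not tasks:
        return 0.0
    solved = sum(solved_within_k(attempt, task, k) for task in tasks)
    return solved / len(tasks)
```

With k=1 this reduces to ordinary single-attempt success rate, which is the baseline the comparison above is made against.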
While leading the pack among open models, MolmoWeb has limitations.
It can misread text, lose track after a wrong action, & struggle with vague prompts. For safety reasons, it’s also not trained on tasks with logins/financial transactions. These remain active research areas.
We're also releasing MolmoWebMix, a dataset for training web agents. It includes 150K+ trajectories:
⁌ 30K+ human trajectories
⁌ 7M GUI grounding examples
⁌ 2.2M screenshot QA examples
Everything needed to inspect, reproduce, & fine-tune MolmoWeb is openly available.
The web is the world's largest software platform. Agents that can navigate it reliably could dramatically expand access to information and digital services.
MolmoWeb gives the community a strong open foundation to build on.
🤖 Models: huggingface.co/collections/al…
🎮 Demo: molmoweb.allen.ai
📊 Data: huggingface.co/collections/al…
💻 Code: github.com/allenai/molmow…
📝 Blog: allenai.org/blog/molmoweb
