I think open-ended simulation for agents will be a key component for training and planning, similar to how human dreams simulate different scenarios in our heads. The biggest challenge will be simulating more abstract and complex systems.
A few months ago I experimented with an open-ended world simulation for an AI agent, where the simulated world progressively built itself from each of the agent's actions. The idea was to give the agent unlimited freedom in tool calling: each tool call would be approved by an adjudicator, and the world state would change accordingly. The key issues with the PoC were:
- World decoherence (tried to solve that with a poor graph implementation)
- World flatness - the high level of abstraction did not account for small events that would compound in the real world
- Starting with an empty context made it really hard to get the agent to explore the world
Anyway, the project turned out to be really funny: you could watch the agent struggle desperately to perform real-world actions that would be impossible in the real world. The main observation was that showing the agent its current action budget modulated how creative and how desperate its actions were.
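A minimal sketch of the loop described above, under loud assumptions: the world state is a flat set of facts, and the adjudicator is a rule-based stand-in that only checks preconditions and the remaining budget. The thread does not show the PoC's actual adjudicator, so `ToolCall`, `World`, `adjudicate`, and `step` are all hypothetical names for illustration.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class ToolCall:
    name: str
    requires: frozenset = frozenset()  # facts that must already hold
    adds: frozenset = frozenset()      # facts the call would create


@dataclass
class World:
    facts: set = field(default_factory=set)
    budget: int = 3                    # remaining action budget shown to the agent
    log: list = field(default_factory=list)


def adjudicate(world: World, call: ToolCall) -> bool:
    """Rule-based stand-in for the adjudicator (an assumption): approve only
    if budget remains and every precondition is already a known fact."""
    return world.budget > 0 and call.requires <= world.facts


def step(world: World, call: ToolCall) -> bool:
    """Apply one proposed tool call; the world grows only on approval.
    Every attempt, approved or not, spends one unit of budget."""
    approved = adjudicate(world, call)
    if approved:
        world.facts |= call.adds
    if world.budget > 0:
        world.budget -= 1
    world.log.append((call.name, approved))
    return approved


world = World(facts={"door_exists"})
open_door = ToolCall("open_door", frozenset({"door_exists"}), frozenset({"door_open"}))
fly = ToolCall("fly", frozenset({"has_wings"}), frozenset({"airborne"}))

print(step(world, open_door))  # approved: precondition holds
print(step(world, fly))        # rejected: the world has no "has_wings" fact
print(world.facts, world.budget)
```

Rejecting calls whose preconditions are not in the world state is one crude way to limit the decoherence mentioned in the list; it does nothing for the flatness problem.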
I agree; after running out of data on the internet, and humans being too slow to generate data, simulation is the only frontier left for improving things (training, datasets, reasoning). And it's probably the most ethical one too.
If nothing else I'm glad to see "world models" that are actually modeling some kind of worlds, instead of the term being applied as a hype layer for video/splats diffusion.
Physical simulations seem like the next step, but I'm not sure how you simulate dynamics in complex systems. The stock market is a good example, with many people trying to simulate it, but in the end you have to make tradeoffs in the level of abstraction you simulate.
For social simulations I guess some kind of grounding in real examples will be needed, but the out-of-distribution cases will need another solution. As the rate of change in our civilization increases, out-of-distribution cases will become more and more prominent.
Is there much evidence we use dreams to pre-emptively simulate scenarios?
Dreaming seems much more likely to be neurological tidying and emotional reprocessing. Helpful for identifying and surfacing long term subconscious needs but not for planning.
My dreams would be precisely useless for making plans from, unless those plans were to involve being caught in public wrapped only in a towel. And even then, I'm not sure they'd be particularly helpful.
I agree, for me dreaming was always reprocessing. The part I mentioned about simulating scenarios may be an over-assumption, and it might be wrong. One thing I noticed is that I sometimes reprocess motor movements after martial arts lessons; that was my main clue.
Right — I mean, I think it's interesting what "dream" means colloquially.
Like, we "dream up" things, or we "have dreams" (underspecified broad ideals for our best life etc.)
I do wonder if sometimes reprocessing dreams has helped me have a better response for something when it reoccurs — like, how to better respond to being slighted or abused or sometimes even complimented.
But I don't know if those could be said to be "plans" on any level. It's a kind of training, though.
Dreaming does help you train for grief and loss, I think.
And sometimes for me it has encapsulated the wisdom or reassurance of someone I have lost; my father appears to be quite involved in my recovery from burnout and my imagining a better life for myself and he died several years ago.
There’s a hypothesis that we dream so we don’t lose visual processing neural connections. It is similar to what happens in blind people: visual processing neurons are recruited to other sensory tasks due to lack of stimulation. My educated guess is that dreaming probably serves multiple purposes.
There was an early visual neural network demonstration — strong feeling it was called "Yorick" or somesuch as it was built in a plaster skull — that had a square grid of red LEDs to show its output state as a simple picture; when its camera was unplugged the neural network appeared to "dream" in the sense that things it had seen would flicker and swirl in the output.
I saw this in a video in the early 90s and cannot remember where.
Out of curiosity would you be willing to share the full system prompt for the agent in question described in this test?
sure: https://github.com/Srakai/bench-evolve/blob/76677b5066bafbab...
Thanks, interesting. Two years ago I wrote a text adventure game that used an LLM. The system was very simple, but still interesting. A friend of mine, Ben Goertzel, has been interested in games/VR for a long while.
I wrote a book on the subject, but now really old material: AI Agents in Virtual Reality Worlds — J. Wiley, 1996
I'm a huge fan of Ben. I have been tracking the OpenCog initiative for some time and think that moving concepts from latent space to atomspace is the best way to efficiently merge our current digital information infrastructure with LLM knowledge.
Regarding your book, I'm shocked that the 'AI Agent' concept predates me. I have not read your book, but I think it would be interesting to compare your perspective with today's building blocks, which were not available at that time.
I understand what the model is doing. I am struggling to understand where this is going to fit in a workflow. I understand a big gap is that any LLM-based AI agent isn't aware of the consequences of its actions, because it barely understands the future state its actions will produce; hence this model, which can predict it.
So, is this like a bolt-on, where you have an agent powered by an LLM, then the world model reviews the action it wants to take, and the agent confirms this is the intention? Is this meant to augment an existing agent with additional capabilities?
I think the next movement is heading to multi model orchestration.
https://developer.nvidia.com/blog/train-small-orchestration-...
I thought in this day and age "world model" also includes robo arm training data and robot arm benchmarks
Never heard that.
A world model builds itself a model of the world in which it can simulate an outcome.
In the best case it doesn't depend on robotics; otherwise it would be quite limited in what you can use it for.
You can imagine what happens when you write your boss a very inappropriate email; you don't need robotic arms for that.
https://hugston.com/news/qwen-35b-agentworld-insights
https://hugston.com/models/hugston-qwen-agentworldq4-k-m
This might be pretty big. One of my biggest frustrations with smaller models (especially MoE) is their failure to track workflow state at a high level. I'm constantly reminding them what we decided on or asking them to revisit, and reminding them eats context.
Seems like this might make that a lot less painful. And if not off the bat, with some minimal tuning or even just good prompting.
I'm a fan of this direction. For me the most interesting use case for these world models isn't even training, it's verification. If this thing or some idealized version of it can actually reliably simulate state transitions, could you use it to verify an agent's execution path against hard constraints and replace/eclipse LLMs-as-a-judge?
Well if you can do this then you don't delegate execution path derivation to the agent. The benefit is a predictable coherent world state where you understand the impact of { current state } x { action } without having to enumerate that huge cartesian product.
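A rough sketch of the verification idea discussed above: roll an agent's proposed action sequence through a transition function and check hard constraints after every step, so only the states actually visited are evaluated instead of the whole { current state } x { action } product. Everything here is an assumption for illustration; `toy_transition` is a deterministic stand-in for a world model, not the Qwen model's interface.

```python
from typing import Callable, Optional, Sequence, Tuple

State = dict
Action = str
Transition = Callable[[State, Action], State]
Constraint = Callable[[State], bool]


def verify_path(
    start: State,
    actions: Sequence[Action],
    transition: Transition,
    constraints: Sequence[Constraint],
) -> Tuple[bool, Optional[int]]:
    """Return (ok, index_of_first_violating_action or None)."""
    state = start
    for i, action in enumerate(actions):
        state = transition(state, action)
        if not all(check(state) for check in constraints):
            return False, i
    return True, None


def toy_transition(state: State, action: Action) -> State:
    """Stand-in world model: actions look like 'deposit:10' or 'withdraw:30'."""
    kind, amount = action.split(":")
    delta = int(amount) if kind == "deposit" else -int(amount)
    return {**state, "balance": state["balance"] + delta}


def never_overdrawn(state: State) -> bool:
    """Hard constraint: the balance may never go negative."""
    return state["balance"] >= 0


print(verify_path({"balance": 20}, ["withdraw:10", "withdraw:30"],
                  toy_transition, [never_overdrawn]))  # (False, 1)
```

With a learned world model in place of `toy_transition`, the verdict is only as reliable as the model's predictions, which is the "if" in the question above.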
The smaller of the two models is open weights and available on Hugging Face:
https://huggingface.co/Qwen/Qwen-AgentWorld-35B-A3B
Give it a day or two and the 'unsloth' people will probably publish a Q6 and Q8 (maybe Q8XL?) quantization in GGUF format for llama-server and other users.
I tried to run it, but it seems it is either broken or does not work on dockerized llama.cpp:
0.01.865.326 E llama_model_load: error loading model: missing tensor 'blk.40.attn_norm.weight'
ELI5? How does this compare to a regular LLM assistant model like the base Qwen?
A regular LLM acts as a "policy," mapping a current state to a specific action (states → actions). Their new LLM acts as a "world model," mapping a current state and a chosen action to a predicted future state ((states, actions) → subsequent states). Instead of deciding "what to do," its explicit objective is to predict the exact environment observation that will result from the interaction history and the agent's current action.
I assumed at first that it was trained on synthetic data, but they actually went and deployed real physical hosts and virtual machines (e.g. Ubuntu, macOS, and Android) and browsers. They ran agentic systems on these continuously and recorded the actual, real-world interactions.
So it's an LLM that infers the next state, or outcome, as structured data, e.g. literal HTML code, UI view hierarchies, or accessibility trees.
So, if I'm reading this correctly, whereas a regular LLM would, given a prompt to edit a file, infer a sed call, this "world" model infers the resulting contents of the file.
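To make the policy vs. world model distinction concrete, here is a toy sketch of the two signatures using the file-edit example. Both functions are deterministic stand-ins with hypothetical names; in the setup described above each would be an LLM, and the world model would predict the resulting contents rather than actually run the edit.

```python
import re
from typing import NamedTuple


class Edit(NamedTuple):
    """A sed-like substitution, used here as the action type."""
    pattern: str
    replacement: str


def toy_policy(state: str) -> Edit:
    """Policy: state -> action. Decides what to do to the file."""
    return Edit(pattern=r"colour", replacement="color")


def toy_world_model(state: str, action: Edit) -> str:
    """World model: (state, action) -> next state.
    Returns the file contents that result from the edit."""
    return re.sub(action.pattern, action.replacement, state)


file_contents = "favourite colour\n"
action = toy_policy(file_contents)
next_contents = toy_world_model(file_contents, action)

print(action)               # the policy's output is an action
print(repr(next_contents))  # the world model's output is a state
```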
Here's the demo: https://docs.qwenlm.ai/resources/mlu56_demo.html
Here's the description of the world model prompt for the web domain: "A precise GUI state simulator — given the current screen (as HTML) and a user action, predicts the exact next screen as a complete, self-contained HTML document." (You can click the world model prompt box to expand it and see the full prompt.)
So the world model generates the current state (an HTML document), an agent tells it what action it wants to perform, and the world model generates the next state (another HTML document).
The other domains are similar, but w/ domain-specific nuance.
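The loop described above can be sketched as follows. This is only an illustration under assumptions: `stub_simulator` and `stub_agent` are hypothetical hard-coded stand-ins for model calls, and the action strings are an invented format, not the demo's actual prompt or action schema.

```python
from typing import Callable, List, Optional

LOGIN = "<html><body><button id='login'>Log in</button></body></html>"
HOME = "<html><body><h1>Welcome</h1></body></html>"


def stub_simulator(html: str, action: str) -> str:
    """Stand-in for the world model: (current screen, action) -> next screen."""
    if action == "click #login" and "id='login'" in html:
        return HOME
    return html  # unknown actions leave the screen unchanged


def stub_agent(html: str) -> Optional[str]:
    """Stand-in for the agent: current screen -> action, or None when done."""
    return "click #login" if "id='login'" in html else None


def rollout(
    initial_html: str,
    agent: Callable[[str], Optional[str]],
    simulator: Callable[[str, str], str],
    max_steps: int = 5,
) -> List[str]:
    """Alternate agent and simulator; return every screen that was visited."""
    history = [initial_html]
    html = initial_html
    for _ in range(max_steps):
        action = agent(html)
        if action is None:
            break
        html = simulator(html, action)
        history.append(html)
    return history


screens = rollout(LOGIN, stub_agent, stub_simulator)
print(len(screens))  # 2: the login screen, then the simulated home screen
```

No real browser is involved at any point, which is what makes this kind of rollout cheap to run at scale.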
Same thing, but qwen has decided to rebrand certain LLMs that were trained slightly differently as "world models". Despite the fact that "world model" typically means !LLM.
The benchmarks here are confusing at best. Am I reading correctly that this model is essentially as good or better than all frontier models right now?
I believe the benchmark listed is about simulating the environment for the various tasks, rather than doing them. It seems that the point of this model is to generate sim data to improve other models with.
Benchmarks in general are a little iffy, the whole industry is going off of vibes anyways. Can't decide before trying it out
Note this can run locally on a gaming card when quantized. I got it running on a 4090 (24GB) at 150 t/s with a Q4_K_M.
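A back-of-envelope check of why that plausibly fits in 24 GB. The assumptions are mine, not from the thread: a nominal 35B parameters and roughly 4.85 bits per weight on average for Q4_K_M (an approximate figure that varies by model). KV cache and runtime overhead are ignored.

```python
def weights_gb(params_billion: float, bits_per_weight: float) -> float:
    """Approximate size of the quantized weights in decimal gigabytes."""
    return params_billion * bits_per_weight / 8


# Assumed values: nominal parameter count and an approximate average bit width.
size = weights_gb(35, 4.85)
print(f"{size:.1f} GB of weights vs. 24 GB of VRAM")  # about 21.2 GB
```

That leaves only a few GB for context, so long prompts may still need a smaller quant or partial offloading.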
What if they did this using GLM 5.2? This looks like a new direction for AI.
10M trajectories, probably more of a data scale win than a world model breakthrough tbh
The labels of the very first chart (figure 1, bottom left) are obviously wrong, which casts doubt on the entire paper.
This label?
> Figure 1: Overview of Qwen-AgentWorld. Top: Qwen-AgentWorld is a unified native language world model across seven domains. Bottom: We explore two complementary strategies for applying world modeling to enhance language agents (mainly using the 35B-A3B model as agent): Decouple and Unify , where the world model serves as the environment simulator and agent foundation model, respectively.
Where is the mistake?
The deltas are wrong.
The bars above the label "Infinite Real-World Envs" show growth for example from approx 42 to 55 but the red label says "+7.1". It's wrong for all of them.
Ah I see. Yeah the graphics are probably AI-generated, and AIs do struggle with unit consistency in charts.
(For another example, the charts in the August 2025 GPT-5 presentation)
According to Table 6, it's supposed to be 47.9 to 55.
OK, then the labels (the numbers) are correct but the bars have the wrong length.
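The arithmetic behind that conclusion, using only the numbers quoted in this subthread (47.9 and 55 from Table 6, and roughly 42 as the starting height read off the drawn bar):

```python
def delta(before: float, after: float) -> float:
    """Improvement rounded to one decimal, matching the chart labels."""
    return round(after - before, 1)


print(delta(47.9, 55.0))  # 7.1, agrees with the red "+7.1" label
print(delta(42.0, 55.0))  # 13.0, what the drawn bar heights would imply
```

So the label matches the table, and the bar starting near 42 is the part that is off.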
35B model from the qwen-3.5 line
https://github.com/QwenLM/Qwen-AgentWorld
https://huggingface.co/Qwen/Qwen-AgentWorld-35B-A3B
unsloth, activate!