Tutorial on How to Build World Models
The tutorial will take place at the Indian Conference on Computer Vision, Graphics and Image Processing (ICVGIP) 2025, organised by IIT Mandi, from Dec 17-20.
Overview 🚀
What Is a World Model?
World models are generative AI models that understand the dynamics of the real world, including its physics and spatial properties. Given inputs such as text, images, video, and movement, they generate videos of how a scene will evolve. They acquire this understanding of real-world environments by learning to represent and predict dynamics such as motion, force, and spatial relationships from sensory data.
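To make this concrete, here is a minimal, hypothetical PyTorch sketch (not from the tutorial materials) of the training signal behind such models: encode the current frame, predict the next latent state given the action, and decode a predicted next frame. All sizes and architecture choices below are illustrative.

```python
# Hypothetical sketch of the core world-model objective: given the current
# frame and action, predict the next frame. Sizes are illustrative only.
import torch
import torch.nn as nn

encoder = nn.Sequential(nn.Flatten(), nn.Linear(3 * 64 * 64, 128), nn.ReLU())
predictor = nn.Linear(128 + 3, 128)           # (latent, action) -> next latent
decoder = nn.Linear(128, 3 * 64 * 64)         # next latent -> next frame

frames = torch.randn(16, 3, 64, 64)           # observations at time t
actions = torch.randn(16, 3)                  # actions taken at time t
next_frames = torch.randn(16, 3, 64, 64)      # observations at time t+1

z = encoder(frames)                               # represent the world compactly
z_next = predictor(torch.cat([z, actions], -1))   # predict how it evolves
pred = decoder(z_next).view_as(next_frames)       # render the predicted future
loss = nn.functional.mse_loss(pred, next_frames)  # learn dynamics from data
loss.backward()
```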
What Are the Real-World Applications of World Foundation Models (WFMs)?
World models, when used with 3D simulators, serve as virtual environments to safely streamline and scale training for autonomous machines. With the ability to generate, curate, and encode video data, developers can better train autonomous machines to sense, perceive, and interact with dynamic surroundings.
- Autonomous Driving: WFMs bring significant benefits to every stage of the autonomous vehicle (AV) pipeline. With pre-labeled, encoded video data, developers can curate and train the AV stack to recognize the behavior of vehicles, pedestrians, and objects more accurately. These models can also generate new scenarios, such as different traffic patterns, road conditions, weather, and lighting, to fill training gaps and expand testing coverage. Finally, they can create predictive video simulations from text and visual inputs, accelerating virtual training and testing.
- Robotics: WFMs generate photorealistic synthetic data and predictive world states to help robots develop spatial intelligence. Using virtual simulations powered by physical simulators, these models let robots practice tasks safely and efficiently, accelerating learning through rapid testing and training. They help robots adapt to new situations by learning from diverse data and experiences.
- Video Analytics: Trained with rich, multimodal data and advanced reasoning capabilities, WFMs can perform complex video analytics on massive amounts of recorded and live videos. These models enable natural language Q&A, automated summarization, object detection, event localization, and richer contextual understanding of visual content in videos—capabilities that surpass traditional computer vision methods.
What Are the Benefits of World Foundation Models?
Building a world model for a physical AI system, like a self-driving car, is resource- and time-intensive. First, gathering real-world datasets by driving around the globe in varied terrains and conditions yields petabytes of data and millions of hours of footage. Next, filtering and preparing this data demands thousands of hours of human effort. Finally, training these large models requires millions of dollars' worth of GPU compute.
WFMs aim to capture the underlying structure and dynamics of the world, enabling more sophisticated reasoning and planning capabilities. Trained on vast amounts of curated, high-quality, real-world data, these neural networks serve as visually, spatially, and physically aware synthetic data generators for physical AI systems.
WFMs allow developers to extend generative AI beyond the confines of 2D software and bring its capabilities into the real world while reducing the need for real-world trials. While AI’s power has traditionally been harnessed in digital domains, world models will unlock AI for tangible, real-world experiences.
This three-hour tutorial offers a comprehensive journey from the foundational theory of world models (e.g., the original World Models paper by Ha & Schmidhuber and the Dreamer architecture) to their modern implementation using state-of-the-art generative techniques. Attendees will gain the knowledge and practical skills needed to design, train, and apply these models in challenging computer vision and embodied AI tasks.
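As a taste of the Dreamer side of that journey, the sketch below illustrates "latent imagination": the policy is improved on rollouts generated entirely inside the learned dynamics model, with no real-environment steps. The modules `policy`, `dynamics`, and `reward_head` are hypothetical linear stand-ins for components that would be learned earlier in training.

```python
# Hedged sketch of Dreamer-style latent imagination. All modules below are
# hypothetical placeholders for networks learned earlier in training.
import torch
import torch.nn as nn

latent_dim, action_dim, horizon = 32, 3, 15
policy = nn.Linear(latent_dim, action_dim)                 # actor in latent space
dynamics = nn.Linear(latent_dim + action_dim, latent_dim)  # transition model
reward_head = nn.Linear(latent_dim, 1)                     # predicted reward

def imagine_rollout(z):
    """Roll the learned model forward in imagination; no environment steps."""
    rewards = []
    for _ in range(horizon):
        a = torch.tanh(policy(z))                 # act from the latent state alone
        z = dynamics(torch.cat([z, a], dim=-1))   # model predicts the next state
        rewards.append(reward_head(z))            # model predicts the reward too
    return torch.stack(rewards)

z0 = torch.randn(8, latent_dim)       # batch of starting latent states
imagined = imagine_rollout(z0)        # shape: (horizon, batch, 1)
actor_loss = -imagined.sum(0).mean()  # maximize the imagined return
actor_loss.backward()                 # gradients flow through the learned model
# (In Dreamer proper, the model weights are held fixed for this actor update.)
```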
By the end of this tutorial, you will be able to:
- Understand the Core Components: Detail the roles of the Vision Model (VAE/Vector-Quantized VAE), the Memory Model (Recurrent Dynamics), and the Controller (Policy Network); a minimal code sketch of this split appears after this list.
- Appreciate Scalability: Grasp the advancements in modern models that enable scaling World Models to high-fidelity, complex environments.
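As a concrete companion to the component list above, here is an illustrative PyTorch sketch of the Vision-Memory-Controller split from Ha & Schmidhuber's World Models. Dimensions and layer choices are placeholders: the original paper uses a VAE for V, an MDN-RNN for M, and a tiny linear controller for C trained with an evolution strategy (CMA-ES).

```python
# Illustrative V-M-C decomposition; sizes and layers are placeholder choices.
import torch
import torch.nn as nn

z_dim, a_dim, h_dim = 32, 3, 256

class V(nn.Module):  # Vision: frame -> compact latent code z
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(3, 32, 4, 2), nn.ReLU(),
            nn.Conv2d(32, 64, 4, 2), nn.ReLU(),
            nn.Flatten(), nn.LazyLinear(z_dim))
    def forward(self, frame):
        return self.net(frame)

class M(nn.Module):  # Memory: (z, a) -> updated recurrent state h
    def __init__(self):
        super().__init__()
        self.rnn = nn.GRUCell(z_dim + a_dim, h_dim)
    def forward(self, z, a, h):
        return self.rnn(torch.cat([z, a], -1), h)

# Controller: acts on [z, h]; kept deliberately tiny so it can be trained
# with simple black-box methods such as CMA-ES.
C = nn.Linear(z_dim + h_dim, a_dim)

v, m = V(), M()
frame, h = torch.randn(1, 3, 64, 64), torch.zeros(1, h_dim)
z = v(frame)                              # see the world
a = torch.tanh(C(torch.cat([z, h], -1)))  # pick an action
h = m(z, a, h)                            # remember what happened
```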
Speakers
We will update this section soon.
Prerequisites
To get the most out of the hands-on sessions, attendees should be comfortable with:
- Deep Learning Fundamentals: Working knowledge of CNNs, RNNs/Transformers, and loss functions.
- Python & PyTorch/TensorFlow: Ability to read and modify deep learning code.
- Basic RL Concepts: Familiarity with terms like state, action, reward, and policy.
Organisers
- Ankit Dhiman (PhD Student at Vision and AI Lab (VAL) at IISc, Bangalore)
- Badrinath Singhal (PhD Student at Vision and AI Lab (VAL) at IISc, Bangalore)
- AS Anudeep (Masters Student at IISc, Bangalore)
- Rishub Parihar (PhD Student at Vision and AI Lab (VAL) at IISc, Bangalore)
Questions: For any inquiries regarding the tutorial content, please contact us at badrinaths@iisc.ac.in with the email subject "Tutorial at ICVGIP 2025: Building World Models".
References
- Ha, D., & Schmidhuber, J. (2018). World Models. arXiv:1803.10122.
- Hafner, D., Lillicrap, T., Ba, J., & Norouzi, M. (2020). Dream to Control: Learning Behaviors by Latent Imagination. ICLR 2020.