This post outlines the findings from my recent paper, “Do Vision-Language-Models show human-like logical problem-solving capability in point and click puzzle games?”, co-authored with Maximilian Triebel and Dominik Helfenstein. The study investigates whether current Vision-Language Models (VLMs) can replicate the physical reasoning and precise interactions required in complex puzzle environments.
Overview
To assess these capabilities, we introduced VLATIM (Vision-Language Against The Incredible Machine), a novel benchmark built around the classic physics puzzle game The Incredible Machine 2. In this environment, players must arrange diverse mechanical components into physical cause-and-effect chain reactions that satisfy specific goals.
While recent research has applied Vision-Language-Action (VLA) models to open-world games, point-and-click physics puzzles offer a unique challenge. Unlike existing benchmarks, VLATIM places a strict focus on the human-likeness of the evaluated capabilities, systematically testing the entire spectrum of cognitive processing from low-level visual perception up to high-level planning and execution.
Key Concepts
1. Testing the Cognitive Spectrum
To isolate where models succeed and fail, VLATIM breaks down problem-solving into progressive cognitive stages:
- Visual Grounding: Identifying and accurately locating specific objects within the scene.
- Domain Understanding: Grasping the physical properties, mechanics, and rules governing the objects.
- Multi-step Manipulation: Planning and executing sequential interactions.
- Full Puzzle Solving: Combining spatial reasoning, planning, and execution to complete complex contraptions.
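To make the staged breakdown concrete, here is a minimal sketch of how per-stage results could be tallied so that failures show up at the stage where they occur. It is an illustration only, not the benchmark's actual harness: the `Stage` enum and the `success_rate_by_stage` helper are hypothetical names, and the assumption that each task yields a single pass/fail outcome is mine.

```python
from collections import defaultdict
from enum import Enum


class Stage(Enum):
    """The four stages listed above (enum and member names are hypothetical)."""
    VISUAL_GROUNDING = 1
    DOMAIN_UNDERSTANDING = 2
    MULTI_STEP_MANIPULATION = 3
    FULL_PUZZLE_SOLVING = 4


def success_rate_by_stage(results):
    """Map each stage that has at least one result to its fraction of passed tasks.

    `results` is an iterable of (Stage, bool) pairs, one per attempted task.
    Assumption: every task has a binary pass/fail outcome.
    """
    totals = defaultdict(int)
    passed = defaultdict(int)
    for stage, ok in results:
        totals[stage] += 1
        passed[stage] += int(ok)
    return {stage: passed[stage] / totals[stage] for stage in totals}
```

Reporting one rate per stage, rather than a single aggregate score, is what lets a reader see whether a model breaks down at perception, at domain knowledge, or only once everything has to be combined.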
2. The Disparity Between Reasoning and Execution
Our evaluations revealed a distinct bottleneck in model performance. Large proprietary models are capable high-level strategic planners, but a clear gap remains between reasoning and execution. When required to translate abstract plans into precise visual grounding and exact mouse interactions within a continuous action space, the models fall short of human-like problem-solving.
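One way to picture this gap: a plan can name the right object and still fail if the click lands a few pixels outside it. The sketch below checks whether a predicted click hits a target's bounding box and how far it lies from the box center. The `Box` class, the pixel coordinates, and the hit-test criterion are assumptions made for illustration; the paper's own scoring may differ.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class Box:
    """Axis-aligned bounding box in screen pixels (hypothetical representation)."""
    left: float
    top: float
    right: float
    bottom: float

    def contains(self, x, y):
        """True if the point (x, y) lies inside the box, edges included."""
        return self.left <= x <= self.right and self.top <= y <= self.bottom

    def center(self):
        """Return the (x, y) midpoint of the box."""
        return ((self.left + self.right) / 2, (self.top + self.bottom) / 2)


def click_error(box, x, y):
    """Euclidean distance in pixels from a click to the center of the target box."""
    cx, cy = box.center()
    return math.hypot(x - cx, y - cy)
```

For a target occupying `Box(100, 200, 140, 240)`, a click at (120, 220) passes the hit test while a click at (150, 220) misses, even though both could follow from the same correct high-level plan.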
3. Implications for Robotics and World Models
We propose this benchmark as a critical challenge for VLM-based world models. The implications extend directly into robotics, where real-world trial and error is often unsafe, inefficient, or prohibitively expensive. Mastering physical cause-and-effect reasoning in a simulated, reactive environment like VLATIM is a prerequisite for deploying VLMs as reliable physical agents in the real world.
Conclusion
The primary takeaway from our research is that while the theoretical planning capabilities of leading VLMs are advancing rapidly, their practical utility in interactive, physics-based tasks is currently bottlenecked by execution and visual grounding errors. Bridging this gap is essential for the development of truly autonomous agents capable of interacting with the physical world.
The full paper is available on arXiv (2605.11223).