Even the best AI models fail at visual tasks toddlers handle easily

2026-01-19

Summary

A recent study finds that current AI systems struggle with basic visual tasks that toddlers perform easily. The top-performing model, Gemini-3-Pro-Preview, scored only 49.7% on these tasks, while human adults scored 94.1%, a gap of 44.4 percentage points. The researchers attribute the failures to a "verbalization bottleneck": visual information is converted into language, and critical details are lost in the conversion. The study suggests that unified multimodal models that integrate visual processing could improve AI's visual reasoning abilities.

Why This Matters

The study highlights a gap in AI development: even advanced models cannot yet match basic abilities of young children. Despite their impressive performance on language tasks, AI systems remain significantly limited in understanding and processing visual information. Recognizing these limitations is essential for guiding future development toward more robust and capable systems.

How You Can Use This Info

Professionals in fields such as education, healthcare, and user experience design can use these findings to temper expectations of AI's current capabilities on visual tasks. When designing AI-driven solutions, account for these limitations and consider alternative approaches, such as unified multimodal models, to improve visual processing. The findings can also guide investment in AI research and development aimed at closing these capability gaps.
