Researchers have systematically evaluated multimodal large language models' ability to understand social norms from both text and images, revealing a significant performance gap between modalities and identifying key challenges in complex reasoning. This work is a critical step toward developing more socially aware autonomous agents capable of operating in human-centric environments, bridging the long-standing divide between symbolic AI approaches in normative systems and modern data-driven models.
Key Takeaways
- Researchers evaluated five MLLMs (GPT-4o, Claude-3.5-Sonnet, Gemini-1.5-Pro, Qwen-2.5VL, and Llama-3.2-11B-Vision) on their norm reasoning competence using 60 stories (30 text, 30 image).
- GPT-4o achieved the highest performance in both text and image modalities, with the open-source Qwen-2.5VL ranking second, pointing to an accessible option for integration into Multi-Agent Systems (MAS).
- All models performed significantly better on text-based norm questions than on image-based ones, highlighting a major modality gap in social reasoning.
- The study found that all evaluated models struggled with reasoning about complex norms, such as those involving multiple interacting rules or nuanced social contexts.
- The human evaluation baseline confirmed that while top models like GPT-4o approach human-like understanding in text, there remains a substantial gap in visual norm reasoning.
Evaluating Social Intelligence in Multimodal AI
The study, detailed in arXiv preprint 2603.03590v1, directly addresses a core limitation in the field of Normative Multi-Agent Systems (NorMAS). Traditionally, NorMAS has relied on symbolic approaches like formal logic for norm representation, which are precise but brittle and difficult to scale to the messy, open-ended nature of real-world social interactions. The researchers posited that Multimodal Large Language Models (MLLMs) could offer a more flexible, scalable pathway for robots and software agents to identify and reason about social norms.
To test this, the team constructed a benchmark of sixty stories, half in text and half in image format, each depicting a social situation governed by norms, the rules that define acceptable behavior within a group. The models were then prompted to answer questions about these scenarios, assessing their ability to identify norm violations, understand sanctions, and grasp norm dynamics. The five models evaluated represent the current vanguard of publicly accessible multimodal AI: OpenAI's GPT-4o, Anthropic's Claude-3.5-Sonnet, Google's Gemini-1.5-Pro, Alibaba's open-source Qwen-2.5VL, and Meta's Llama-3.2-11B-Vision.
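The preprint does not publish its evaluation harness, but a minimal sketch of how such a text-versus-image query might look, assuming the OpenAI Chat Completions Python SDK and entirely hypothetical story text, file names, and question wording, is:

```python
import base64
from openai import OpenAI  # assumes the openai>=1.0 Python SDK

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# Hypothetical question wording; the benchmark's actual prompts are not reproduced here.
QUESTION = (
    "Does anyone in this scenario violate a social norm? "
    "If so, name the norm and the sanction you would expect."
)

def ask_text_story(story: str) -> str:
    """Query the model about a purely textual social scenario."""
    resp = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": f"{story}\n\n{QUESTION}"}],
    )
    return resp.choices[0].message.content

def ask_image_story(image_path: str) -> str:
    """Ask the same kind of question about a scenario depicted in an image."""
    b64 = base64.b64encode(open(image_path, "rb").read()).decode()
    resp = client.chat.completions.create(
        model="gpt-4o",
        messages=[{
            "role": "user",
            "content": [
                {"type": "text", "text": QUESTION},
                {"type": "image_url",
                 "image_url": {"url": f"data:image/png;base64,{b64}"}},
            ],
        }],
    )
    return resp.choices[0].message.content

# Hypothetical inputs for illustration only.
print(ask_text_story("Dana takes a phone call in the middle of a funeral service."))
print(ask_image_story("story_07.png"))
```

Comparing answers across the two functions for matched scenarios is, in spirit, how a text-versus-image modality gap can be measured.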
The results were clear and stratified. GPT-4o emerged as the top performer across the board. The open-source model Qwen-2.5VL secured a notable second place, a significant finding for developers seeking capable, free-to-use models for integration. However, a stark performance chasm was observed between modalities. Every model, including the leader, demonstrated superior competence when reasoning from textual descriptions compared to visual scenes. Furthermore, all models faltered when presented with complex norms, suggesting current architectures lack the deep, compositional reasoning required for intricate social calculus.
Industry Context & Analysis
This research sits at the convergence of two major AI trajectories: the long-standing academic pursuit of verifiable, logic-based agent systems and the rapid commercial deployment of data-driven, generative foundation models. The symbolic approaches dominant in NorMAS, while offering formal guarantees, have largely failed to leave the laboratory, precisely because they cannot cope with the ambiguity and complexity of embodied social situations. This study demonstrates that MLLMs, trained on vast corpora of human culture and interaction, have internalized a surprising degree of normative knowledge, effectively learning "social common sense" from data.
The modality gap identified here, with text performance far outpacing image performance, is a critical technical insight. It mirrors findings in other benchmarks where vision-language models lag behind their text-only counterparts on tasks requiring deep reasoning rather than mere description. On the challenging MMMU (Massive Multi-discipline Multi-modal Understanding) benchmark, for instance, which tests college-level subject knowledge, even leading models have struggled to push far past 60% accuracy, indicating that coupling visual perception with conceptual reasoning remains a fundamental hurdle. This suggests that for social AI, simply describing a scene is insufficient: the model must build a rich mental model of actors, intentions, and unspoken rules from pixels, a task far harder than parsing the same information from explicit text.
The strong showing of Qwen-2.5VL is particularly noteworthy in the current competitive landscape. As an open-source model whose repositories have drawn over 10,000 GitHub stars across its iterations, its competitive performance against closed, proprietary models like GPT-4o and Gemini-1.5-Pro offers a viable path for academic researchers and startups operating under budget constraints. This echoes the dynamic in the text-LLM space, where models like Meta's Llama 3 have pressured the closed API ecosystem by offering high-quality, downloadable alternatives, accelerating innovation and customization across the field.
Furthermore, the struggle with complex norms points to a limitation in current transformer-based architectures. While excellent at pattern recognition and interpolation, they may lack the structured, iterative reasoning needed to handle multi-step normative dilemmas (e.g., a rule that applies only if two other conditions are met, unless a third party intervenes). This is an area where hybrid approaches, perhaps combining the statistical power of MLLMs with the structured reasoning engines from symbolic AI, are likely to be explored next.
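To make concrete what such a multi-step norm involves, here is a toy sketch with entirely hypothetical conditions and field names; the paper does not formalize norms this way, and the example exists only to show the kind of compositional structure that trips models up: a prohibition that applies only when two preconditions hold, unless an exception overrides it.

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class ConditionalNorm:
    """Toy norm: applies when all preconditions hold and no exception does."""
    preconditions: list[Callable[[dict], bool]]
    exception: Callable[[dict], bool]
    sanction: str

    def is_violated(self, situation: dict, action_taken: bool) -> bool:
        applies = (all(p(situation) for p in self.preconditions)
                   and not self.exception(situation))
        return applies and action_taken

# Hypothetical example: "Phones must be silenced when indoors while a talk is
# in progress, unless the person is on-call medical staff."
quiet_norm = ConditionalNorm(
    preconditions=[lambda s: s["indoors"], lambda s: s["talk_in_progress"]],
    exception=lambda s: s.get("on_call_medic", False),
    sanction="verbal reprimand",
)

# Taking the action (using the phone) in a covered situation counts as a violation.
print(quiet_norm.is_violated(
    {"indoors": True, "talk_in_progress": True}, action_taken=True))  # True
```

Even this toy structure requires tracking which conditions are active and whether an exception overrides them, exactly the kind of compositional bookkeeping the evaluated MLLMs appear to lack when the norm is described implicitly in a story or an image.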
What This Means Going Forward
The immediate beneficiaries of this research are developers and researchers working on social robotics, human-AI collaboration, and interactive NPCs for gaming and simulation. The benchmark itself provides a valuable tool for measuring progress in a crucial dimension of AI safety and alignment: social alignment. For companies like Boston Dynamics, Figure AI, or Tesla developing humanoid robots, integrating norm-reasoning MLLMs could be essential for ensuring these machines behave in socially appropriate ways in homes, factories, and public spaces, mitigating public fear and regulatory pushback.
The path forward will involve closing the modality gap. This will require advances in vision-language pretraining, potentially using more sophisticated architectures that go beyond simply fusing image and text embeddings. Techniques like chain-of-thought prompting or the development of internal "scratchpad" reasoning for visual inputs may be necessary to boost performance on image-based norm questions. Additionally, the creation of larger, more diverse datasets of annotated social interactions—both textual and visual—will be crucial for training the next generation of models.
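As an illustration of what such scaffolded "scratchpad" reasoning might look like for image inputs, here is a hypothetical chain-of-thought prompt template; the wording is illustrative and not drawn from the paper's prompts.

```python
# Hypothetical chain-of-thought template for image-based norm questions.
COT_VISUAL_NORM_PROMPT = """Reason step by step before answering.
1. List the people in the image and what each one is doing.
2. Identify the setting and any relevant social roles (host, guest, official, ...).
3. State the social norms that plausibly apply in this setting.
4. Check whether any depicted action violates one of those norms.
5. Only then give a final answer: which norm, if any, is violated, and what
   sanction, if any, would be expected."""
```

Whether this kind of explicit decomposition actually closes the visual reasoning gap is an open empirical question, but it is the most direct way to test the "describe first, then reason" hypothesis the modality gap suggests.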
Watch for increased activity in this niche. We can expect follow-up studies that pit these MLLMs against traditional symbolic NorMAS in controlled environments, or that integrate a model like GPT-4o or Qwen-2.5VL directly into a robot control loop to test real-time social decision-making. The ultimate goal is the development of agents that don't just follow pre-programmed rules but can dynamically understand, negotiate, and adapt to the social fabric of their environment—a capability that will define the next frontier of autonomous systems.