Researchers have systematically evaluated multimodal large language models (MLLMs) for their ability to understand and reason about social norms, a critical capability for developing more sophisticated, socially aware multi-agent and robotic systems. The study reveals a significant performance gap between text-based and image-based norm reasoning, highlighting both the promise and the current limitations of using foundation models for complex social intelligence tasks.
Key Takeaways
- Five leading MLLMs, including GPT-4o and Qwen-2.5VL, were evaluated on their norm reasoning competence using 60 total scenarios (30 text, 30 image).
- All models demonstrated superior performance in text-based norm reasoning compared to image-based scenarios, indicating a modality-specific challenge.
- GPT-4o achieved the best overall performance across both modalities, making it the most promising candidate for integration into normative multi-agent systems (NorMAS).
- The open-source model Qwen-2.5VL from Alibaba performed second-best, offering a capable free alternative.
- A key finding is that all evaluated models, including the top performers, struggled with reasoning about complex norms, pointing to a frontier for model improvement.
Evaluating Social Intelligence in Multimodal AI
The research paper, hosted on arXiv (ID 2603.03590v1), addresses a core challenge in artificial intelligence: equipping systems with social intelligence. In Normative Multi-Agent Systems (NorMAS), which study how rules (norms) emerge and are enforced in agent societies, traditional approaches have relied on symbolic methods like formal logic. While precise, these methods are brittle and difficult to scale to the messy, open-ended complexity of real-world social situations depicted in text and imagery.
The study posits that Multimodal Large Language Models (MLLMs) could bridge this gap. To test this, researchers constructed a benchmark of sixty stories—half textual, half visual—that embed social norms. The models were then asked norm-related questions about these scenarios, and their responses were compared against human answers to gauge competence. The five evaluated models represent a mix of proprietary and open-source leaders: GPT-4o (OpenAI), Gemini 1.5 Pro (Google), Claude 3.5 Sonnet (Anthropic), Qwen-2.5VL (Alibaba), and Llama-3.2-11B-Vision (Meta).
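To make the evaluation protocol concrete, here is a minimal sketch of what such a benchmarking loop might look like. The scenario schema, the `model_fn` wrapper, and the `judge_fn` scoring rule are illustrative assumptions, not the paper's actual pipeline; they simply capture the described setup of posing norm questions per scenario and comparing answers against human references, split by modality.

```python
from dataclasses import dataclass
from typing import Callable, List

# Hypothetical scenario record; the paper's actual data format is not given in this summary.
@dataclass
class Scenario:
    modality: str          # "text" or "image"
    content: str           # story text, or a path/URL to an image
    question: str          # norm-related question posed to the model
    human_answer: str      # reference answer collected from human respondents

def evaluate(model_fn: Callable[[Scenario], str],
             scenarios: List[Scenario],
             judge_fn: Callable[[str, str], bool]) -> dict:
    """Score a model on the benchmark and report accuracy per modality.

    model_fn: wraps an MLLM call (e.g. GPT-4o or Qwen-2.5VL) and returns its answer.
    judge_fn: decides whether the model answer matches the human reference
              (exact match, a rubric, or an LLM-as-judge comparison).
    """
    totals = {"text": 0, "image": 0}
    correct = {"text": 0, "image": 0}
    for s in scenarios:
        answer = model_fn(s)
        totals[s.modality] += 1
        if judge_fn(answer, s.human_answer):
            correct[s.modality] += 1
    return {m: correct[m] / totals[m] for m in totals if totals[m]}
```

Splitting the score by modality is what surfaces the text-versus-image gap the study reports.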
The results were clear-cut. While MLLMs showed competence, their performance was not uniform. GPT-4o emerged as the top performer, suggesting its underlying architecture and training data currently offer the most robust foundation for social reasoning. Notably, Qwen-2.5VL, a freely available model, secured a strong second place, highlighting the rapid advancement of open-source vision-language models. The universal struggle with complex norms, however, underscores that social reasoning remains a substantial challenge beyond simple rule recognition.
Industry Context & Analysis
This research sits at the convergence of two major industry trends: the push toward embodied AI for robotics and the development of agentic AI systems that can perform complex, multi-step tasks. Companies like OpenAI, Google DeepMind, and Figure AI are actively working on models that can power robots to understand and navigate human environments, where social norms are as critical as physical laws. A robot that cannot infer it should queue in line or understand a prohibitive gesture is fundamentally limited.
The performance gap between text and image modalities is analytically significant. It suggests that while MLLMs have ingested vast corpora of textual narratives describing social situations, their ability to ground those concepts in visual scenes is less mature. This aligns with known benchmark patterns; for instance, models often score higher on text-only reasoning tasks like MMLU (Massive Multitask Language Understanding) than on multimodal benchmarks requiring deep visual comprehension. The finding implies that, for immediate applications, NorMAS integrating LLMs might rely more on textual environment descriptions, while truly vision-based social robots will require further model refinement.
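A hedged sketch of what that near-term workaround could look like: a NorMAS agent first converts the visual scene into a textual description, then reasons about norms over the text. The function name, the `captioner` and `llm` callables, and the prompt wording are all hypothetical illustrations of the pattern, not an API from the paper.

```python
def norm_query_via_text(scene_image_path: str,
                        captioner,   # any image-to-text model, e.g. an MLLM captioning call
                        llm) -> str:
    """Two-step pipeline: caption the scene, then reason about norms over the caption.

    This mirrors the suggestion above that, given the text/image gap, near-term
    NorMAS integrations may reason over textual environment descriptions
    rather than over raw images.
    """
    description = captioner(scene_image_path)   # e.g. "Several people wait in a line at a counter."
    prompt = (
        "Given this scene description, list the social norms that apply "
        "and state whether any are being violated:\n" + description
    )
    return llm(prompt)
```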
The strong showing of Qwen-2.5VL against established giants like Gemini and Claude is a notable data point in the open-source vs. proprietary model race. With over 1.3 million downloads on Hugging Face for the Qwen2.5 series, its competitive performance in a specialized task like norm reasoning validates its architectural choices and training. It provides a viable, customizable pathway for academic and commercial teams who cannot access closed API models, potentially accelerating research in socially aware AI.
Furthermore, the struggle with "complex norms" reveals a limitation of current benchmark evaluations. Most common AI benchmarks test factual knowledge, coding (HumanEval), or mathematical reasoning. There is no widely adopted benchmark for nuanced social or ethical reasoning, which often involves context, cultural specificity, and conflicting principles. This study's custom evaluation highlights a gap in the standard model report card, suggesting that as AI integration deepens, new benchmarks for practical ethics and social-context understanding will become essential.
What This Means Going Forward
The immediate beneficiaries of this research are roboticists and multi-agent system designers. The study provides an empirical foundation for model selection: for high-stakes or prototype development where cost is less of a concern, GPT-4o is the leading choice; for budget-conscious projects, open-source experimentation, or on-premise deployment, Qwen-2.5VL presents a compelling alternative. This could influence procurement and research directions across academic and industry labs.
We should expect a surge in focused efforts to improve the visual grounding of social concepts in MLLMs. This will likely involve creating larger, more nuanced datasets of annotated social interactions in video and images, moving beyond static scenes to dynamic narratives. The research community may also develop specialized fine-tuning techniques or auxiliary neural modules dedicated to norm inference, pushing models beyond purely associative capabilities toward more structured reasoning.
Finally, this work underscores a critical evolution in AI safety and alignment. As models move from chatbots to actors in shared environments (digital or physical), their alignment must encompass social alignment—behaving in ways that conform to group expectations and welfare. The ability to identify norms is the first step toward this. The next, more complex step—deciding when to follow, challenge, or help evolve a norm—will define the next generation of autonomous systems. Watch for follow-up research that connects norm recognition to action planning and for industry leaders to begin highlighting social reasoning capabilities in their model releases, much as they currently tout coding or mathematical prowess.