Researchers have systematically evaluated multimodal large language models (MLLMs) on their ability to understand and reason about social norms, a critical capability for future multi-agent and robotic systems. The study reveals a significant performance gap between text and image-based reasoning, highlighting both the promise and current limitations of using foundation models for complex social intelligence in embodied environments.
Key Takeaways
- Five leading MLLMs were evaluated on 60 norm-based stories (30 text, 30 image), with GPT-4o achieving the highest overall performance, followed by the open-source model Qwen-2.5VL.
- All models demonstrated significantly stronger norm reasoning competence in text-based scenarios compared to image-based scenarios, indicating a modality-specific challenge.
- The models universally struggled with reasoning about complex norms, such as those involving multiple actors, conflicting obligations, or nuanced social contexts.
- The research positions MLLMs as a potential successor to the symbolic approaches (e.g., formal logic) traditionally used in Normative Multi-Agent Systems (NorMAS), which are precise but confined to simplified environments.
- Human performance was used as a benchmark, with models showing competitive but not superior reasoning in text, while lagging in visual comprehension of social situations.
Evaluating Social Intelligence in Multimodal AI
The paper, "Assessing Norm Reasoning in Multimodal Large Language Models," presents a novel evaluation framework for a core component of social intelligence. The researchers tested five MLLMs—GPT-4o, Claude-3.5 Sonnet, Gemini-1.5 Pro, Qwen-2.5VL, and LLaVA-NeXT—on their ability to answer questions about social norms embedded within short stories. The 60 curated scenarios covered norms like queuing, privacy, ownership, and helping behavior.
Each model was prompted to identify the relevant norm, the characters involved, any violations, and appropriate sanctions. Performance was measured by comparing model outputs to human-generated answers. The results showed a clear hierarchy: GPT-4o was the top performer across both modalities, with the freely available Qwen-2.5VL from Alibaba often securing second place, demonstrating the rapid advancement of open-source vision-language models.
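The paper's exact prompts and scoring rubric are not reproduced here, but the shape of the procedure is easy to picture. The sketch below is a hypothetical illustration of such an evaluation loop using the OpenAI Python SDK; the story, question set, and keyword-matching scorer are placeholders, not the authors' actual materials.

```python
# Hypothetical sketch of the evaluation loop; prompts, paths, and the
# scoring rule are illustrative placeholders, not the paper's materials.
import base64
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

QUESTIONS = [
    "Which social norm is relevant in this scenario?",
    "Which characters are involved?",
    "Is the norm violated, and by whom?",
    "What sanction, if any, would be appropriate?",
]

def ask_about_text_story(story: str, question: str) -> str:
    """Query the model about a norm embedded in a text-only story."""
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": f"{story}\n\n{question}"}],
    )
    return response.choices[0].message.content

def ask_about_image_story(image_path: str, question: str) -> str:
    """Query the model about a norm that must be inferred from an image."""
    with open(image_path, "rb") as f:
        b64 = base64.b64encode(f.read()).decode()
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[{
            "role": "user",
            "content": [
                {"type": "text", "text": question},
                {"type": "image_url",
                 "image_url": {"url": f"data:image/png;base64,{b64}"}},
            ],
        }],
    )
    return response.choices[0].message.content

def score(model_answer: str, human_answer: str) -> bool:
    """Placeholder scoring: the paper compares model outputs to
    human-generated answers; a real rubric would be more nuanced."""
    return human_answer.lower() in model_answer.lower()

# Illustrative usage for a single text story.
story = "Maya cuts in front of an elderly man who has been waiting in line."
answers = {q: ask_about_text_story(story, q) for q in QUESTIONS}
```

Running the same questions over the 30 text and 30 image stories and averaging per model would reproduce the structure of the comparison reported in the paper.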
Critically, the evaluation exposed a stark modality gap. While models could reason about norms described in text with relative proficiency, their performance dropped markedly when the same social situation had to be inferred from an image. This suggests that current MLLMs' "understanding" of visual social cues is not as deeply integrated as their textual reasoning, a significant hurdle for applications in robotics and real-world interaction.
Industry Context & Analysis
This research sits at the convergence of two major AI trends: the push for agentic AI capable of autonomous action and the long-standing challenge of encoding social commonsense into machines. Traditionally, Normative MAS (NorMAS) relied on symbolic logic, which is precise but brittle and fails to scale to the open-ended complexity of human social environments. The study directly positions MLLMs as a scalable, data-driven alternative, leveraging their broad pretraining on human culture and interaction.
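To see why the symbolic approach is brittle, consider a toy queuing norm written as a hand-coded rule. The snippet below is a simplified illustration, not taken from the paper or any NorMAS framework: the rule only works for the exact state representation it was written against, whereas an MLLM can in principle be asked the same question about any description or photo of a queue.

```python
# Toy illustration (not from the paper): a hand-coded queuing norm.
# It only applies when the world is already encoded as ordered lists of
# agents; any messier situation (a crowded bar, a deli counter with
# numbered tickets) falls outside the rule entirely.
from dataclasses import dataclass

@dataclass
class QueueState:
    order: list[str]          # agents front-to-back as they stand now
    arrival_order: list[str]  # agents in the order they arrived

def queue_jumping_violations(state: QueueState) -> list[str]:
    """Flag agents standing ahead of someone who arrived before them."""
    violators = []
    for i, agent in enumerate(state.order):
        earlier_arrivals = state.arrival_order[:state.arrival_order.index(agent)]
        if any(other in state.order[i + 1:] for other in earlier_arrivals):
            violators.append(agent)
    return violators

# The MLLM alternative needs no hand-built state at all:
# "Here is a photo of people waiting at a counter. Is anyone violating
#  a social norm? If so, who, and what should happen next?"
```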
The performance hierarchy revealed is consistent with broader industry benchmarks. GPT-4o's lead aligns with its top-tier scores on multimodal benchmarks such as MMMU (Massive Multi-discipline Multimodal Understanding) and its architectural focus on native, integrated multimodal processing. The strong showing of the open-weight Qwen-2.5VL against closed frontier models such as Claude-3.5 Sonnet and Gemini 1.5 Pro, whose parameter counts are undisclosed, is notable. It reflects the intense competition in the open-source vision-language space, where models from Qwen, LLaVA, and InternVL are rapidly closing the gap with proprietary APIs, as evidenced by their climbing scores on public leaderboards hosted by OpenCompass and Hugging Face.
The identified modality gap has profound technical implications. It suggests that for many models, the vision encoder and the LLM are not as deeply fused as assumed; the system may be describing an image and then reasoning about the description, rather than reasoning directly from visual concepts. This "detour" through text can lose nuanced spatial, relational, and contextual information crucial for social judgment. Furthermore, the struggle with complex norms mirrors known limitations in LLM reasoning on tasks requiring multi-step inference or theory of mind, as seen in weaker performance on harder subsets of benchmarks like MMLU (Massive Multitask Language Understanding) or dedicated theory-of-mind evaluations.
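One way to picture that detour is to contrast two toy pipelines. The sketch below uses trivial stubs in place of real model components, so its behavior is illustrative only; the point is that whatever the captioner drops never reaches the reasoning step, while a fused model keeps the full scene in context.

```python
# Schematic contrast between the two pipelines; every component is a
# trivial stub so the file runs, standing in for real model components.

def caption_image(scene: str) -> str:
    # Stub for a vision encoder + captioner: keeps only the coarsest detail.
    return scene.split(".")[0]

def language_model(prompt: str) -> str:
    # Stub for the LLM: here it simply echoes what it was given.
    return f"[reasoning over]: {prompt}"

def caption_then_reason(scene: str, question: str) -> str:
    """The 'detour': the LLM only ever sees a lossy text summary,
    so relational cues dropped by the caption are gone for good."""
    return language_model(f"{caption_image(scene)}\n{question}")

def fused_reasoning(scene: str, question: str) -> str:
    """Stands in for a natively multimodal model: visual and text
    'tokens' share one context, so nothing is discarded up front."""
    return language_model(f"{scene}\n{question}")

scene = ("A person reaches toward a wallet on a cafe table. "
         "The wallet sits next to a stranger's coffee cup; the stranger is looking away.")
question = "Is a social norm about ownership at risk of being violated?"
print(caption_then_reason(scene, question))  # loses who owns the wallet
print(fused_reasoning(scene, question))      # keeps the ownership cue
```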
This follows a pattern of the industry using LLMs to overcome the limitations of classical AI. Just as LLMs replaced hand-crafted grammars in NLP and learned representations replaced symbolic planning in some game-playing AI, they are now being probed as a solution for the notoriously difficult problem of social reasoning. The market driver is clear: companies like Figure AI (valued at roughly $2.6B in its 2024 funding round), 1X Technologies, and Boston Dynamics are racing to develop embodied robots that can operate in human spaces, creating immense demand for robust, flexible social AI.
What This Means Going Forward
The immediate beneficiaries of this research are developers working on collaborative robotics and multi-agent simulations. For teams building warehouse robots, domestic assistants, or virtual training environments, the study provides a concrete benchmark and confirms that leading MLLMs, particularly GPT-4o, are viable starting points for norm-aware systems. The strong performance of Qwen-2.5VL offers a cost-effective, customizable open-source alternative for prototyping and deployment.
The field will likely see a surge in specialized datasets and benchmarks for social AI. Following the template of this paper, future benchmarks will need more granular tasks, such as distinguishing a social faux pas from a serious ethical violation, or reasoning about how norms vary across cultures. We can expect these evaluations to become standard components of model cards for agentic AI systems, much as safety evaluations like CyberSecEval or TruthfulQA are now reported for base LLMs.
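One possible shape for such granular benchmark items is a richer record schema. The dataclass below is a speculative sketch; the field names and severity categories are illustrative, not drawn from any existing dataset.

```python
# Speculative sketch of a more granular norm-benchmark record;
# all field names and categories are illustrative.
from dataclasses import dataclass, field
from enum import Enum

class Severity(Enum):
    FAUX_PAS = "faux_pas"         # mildly awkward, no real harm
    NORM_VIOLATION = "violation"  # clear breach with social cost
    ETHICAL_BREACH = "ethical"    # serious harm or rights at stake

@dataclass
class NormScenario:
    scenario_id: str
    modality: str                 # "text" or "image"
    stimulus: str                 # story text or image path
    norm: str                     # e.g. "queuing", "privacy", "ownership"
    severity: Severity
    cultural_context: str         # the same act can carry different weight
    actors: list[str] = field(default_factory=list)
    conflicting_obligations: bool = False  # the hard cases the paper flags
    gold_violation: bool = False
    gold_sanction: str = ""       # human-annotated appropriate response
```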
Technologically, the modality gap points to the next frontier: embodied multimodal training. To truly excel, models may need training not just on static image-text pairs, but on video data of social interactions or even simulated embodied experience where actions have social consequences. This aligns with efforts like Google's Robotics Transformer models and research in embodied AI. Furthermore, the challenge with complex norms will drive architectural innovations, such as improved chain-of-thought prompting for social reasoning or hybrid systems that use an MLLM for perception and a more structured symbolic engine for complex normative logic.
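A minimal sketch of that hybrid idea, assuming the MLLM is used only to extract structured facts from a scene while a small hand-written rule applies the normative logic, might look like the following. The extraction prompt, fact schema, and ownership rule are illustrative assumptions, not a published architecture.

```python
# Minimal sketch of a hybrid pipeline (illustrative assumptions throughout):
# an MLLM turns a scene into structured facts, then a tiny rule engine
# applies the normative logic deterministically.
import json
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set

EXTRACTION_PROMPT = (
    "Describe this scene as JSON with keys 'agents' and 'actions', where each "
    "action has 'actor', 'verb', 'target', and 'target_owner' (or null)."
)

def extract_facts(scene_description: str) -> dict:
    """Perception step: the MLLM produces structured facts, nothing more."""
    response = client.chat.completions.create(
        model="gpt-4o",
        response_format={"type": "json_object"},
        messages=[{"role": "user",
                   "content": f"{EXTRACTION_PROMPT}\n\nScene: {scene_description}"}],
    )
    return json.loads(response.choices[0].message.content)

def check_ownership_norm(facts: dict) -> list[str]:
    """Normative step: a hand-written rule flags taking someone else's property."""
    violations = []
    for action in facts.get("actions", []):
        owner = action.get("target_owner")
        if action.get("verb") == "takes" and owner and owner != action.get("actor"):
            violations.append(
                f"{action['actor']} took {action['target']} belonging to {owner}")
    return violations
```

The appeal of this split is that the normative logic stays auditable and editable, while the hard perceptual and linguistic work is delegated to the foundation model.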
Watch for two key developments next. First, the integration of these norm-reasoning evaluations into the development cycles of humanoid robots from companies like Tesla, Figure, and Sanctuary AI. Second, the potential emergence of "norm" or "constitution" datasets used to fine-tune models for specific social environments (e.g., workplace conduct, elder care etiquette), creating a new niche for AI data vendors and a critical component for ensuring the safe and effective deployment of autonomous agents in society.