TL;DR: A research paper introduces the MhAIM Dataset and T-Lens, an LLM-based agent system, to study and predict human responses to multimodal AI-generated content. The findings show that people identify AI content more reliably when text and visuals are inconsistent, and that well-crafted AI content can remain persuasive even to generally skeptical readers. T-Lens, powered by the HR-MCP module, predicts human-centered attributes such as belief, trustworthiness, and impact, offering a human-centric approach to combating misinformation.
As artificial intelligence continues to advance at a rapid pace, AI-generated content (AIGC) is becoming increasingly common across various platforms, from journalism to social media. While AIGC offers many benefits, it also brings a significant risk: misinformation. Traditional research has largely focused on simply identifying whether content is authentic. However, a new study shifts this focus, exploring how AI-generated content actually influences human perception and behavior.
A recent research paper, titled “Modeling Human Responses to Multimodal AI Content,” introduces a human-centered approach to this challenge. The authors, Zhiqi Shen, Shaojing Fan, Danni Xu, Terence Sim, and Mohan Kankanhalli, highlight that in fields like the stock market, predicting how people will react to a news post – for instance, whether it will go viral – can be more crucial than just verifying its factual accuracy. This work aims to bridge the gap in understanding how people react to and are influenced by multimodal AI-generated content.
To facilitate this large-scale analysis, the researchers developed the MhAIM Dataset. This extensive dataset contains 154,552 online posts, with a significant portion (111,153) being AI-generated. It includes both human-crafted and AI-generated content across various modalities, such as news, social media posts, and even phishing messages. The creation of this dataset involved integrating data from existing multimodal misinformation and authentic sources, and generating new AI content using tools like Stable Diffusion, LangChain, ChatGPT, and the Gemini API.
A key part of the research involved a human study conducted with 765 participants across the US, India, and Singapore. Participants viewed various posts and answered questionnaires about their emotional responses, behavioral tendencies (like belief and sharing intent), perceived AI origin, and consistency between different modalities. The study revealed several important insights into human sensitivity and receptivity towards AIGC.
One significant finding was that people are generally better at identifying AI content when posts include both text and visuals, especially when the two are inconsistent. For news posts, the perceived truthfulness of the content played a vital role in its identification as AI-generated. Interestingly, participants tended to classify content as human-crafted, indicating a general bias toward assuming human authorship. While AI-generated text was better received than AI-generated visuals, human-crafted content consistently garnered higher receptivity than AI-generated content, regardless of its authenticity.
The study also introduced three metrics to quantify how users judge and engage with online content: trustworthiness (belief minus perceived AI likelihood), impact (belief plus dissemination propensity, capturing the effect on individuals and the community), and openness (willingness to engage with content perceived as AI-generated). Although people were generally less likely to believe or share content they suspected was AI-generated, well-crafted AIGC could still be highly persuasive and gain traction even when its AI origin was recognized. This highlights the limitations of relying solely on detection methods.
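The first two metrics have explicit arithmetic forms (a difference and a sum), which can be sketched as simple functions. This is a minimal illustration only: the paper's exact scales and normalization are not stated here, so the code assumes all participant ratings are mapped to a common [0, 1] range, and openness is omitted because no formula is given for it.

```python
def trustworthiness(belief: float, ai_likelihood: float) -> float:
    """User trust in a post: belief minus perceived AI likelihood.

    Positive values mean the user believes the post more strongly
    than they suspect it of being AI-generated.
    """
    return belief - ai_likelihood


def impact(belief: float, dissemination: float) -> float:
    """Effect on individuals and community: belief plus
    the user's propensity to share the post."""
    return belief + dissemination


# A believable post the user barely suspects is AI-generated:
t = trustworthiness(belief=0.8, ai_likelihood=0.3)

# A moderately believed post with some sharing intent:
i = impact(belief=0.6, dissemination=0.2)
```

Keeping the metrics as plain differences and sums makes them easy to interpret: a heavily shared but disbelieved post and a believed but unshared post can score the same impact, which matches the paper's framing of impact as a combined individual-and-community effect.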
Building on these empirical insights, the researchers propose T-Lens, an LLM-based agent system designed to answer user queries by incorporating predicted human responses to multimodal information. At its core is HR-MCP (Human Response Model Context Protocol), a specialized module that estimates human-centered attributes like trustworthiness, impact, and openness. HR-MCP is built on the standardized Model Context Protocol (MCP), allowing for seamless integration with any large language model (LLM).
The T-Lens framework operates as a ReAct-style agent, reasoning over content and human perception. It uses a CLIP-style architecture with a Vision Transformer (ViT) for images and a BERT-like text encoder to process multimodal inputs. It also includes a sentiment module to assess emotional consistency between text and visuals, which was found to be a crucial factor in human perception. By combining semantic and sentiment embeddings, HR-MCP predicts human responses, enhancing the LLM’s ability to align with human reactions and provide interpretable explanations.
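The fusion step described above can be sketched in miniature. In T-Lens the semantic vectors would come from a ViT image encoder and a BERT-like text encoder, and the sentiment vectors from the sentiment module; the toy lists below merely stand in for those outputs, and the concatenation-plus-consistency-score fusion is an illustrative assumption, not the paper's exact architecture.

```python
import math


def cosine(u: list[float], v: list[float]) -> float:
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def fuse(text_sem, img_sem, text_sent, img_sent):
    """Build a joint feature vector for a multimodal post.

    Concatenates the text and image semantic embeddings and appends
    a text-image sentiment-consistency score, since emotional
    consistency between modalities was found to shape perception.
    """
    consistency = cosine(text_sent, img_sent)
    return text_sem + img_sem + [consistency]


# Toy post whose text and image sentiments largely agree,
# so the trailing consistency feature is close to 1.
fused = fuse(
    text_sem=[0.1, 0.9],
    img_sem=[0.2, 0.8],
    text_sent=[1.0, 0.0],
    img_sent=[0.9, 0.1],
)
```

A downstream predictor (in T-Lens, the HR-MCP response model) would then map such fused vectors to the human-response attributes; exposing the consistency score as an explicit feature is also what makes explanations like "the image's mood contradicts the text" possible.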
Experimental results demonstrate that T-Lens significantly outperforms conventional misinformation detectors and even powerful multimodal LLMs like GPT-4o and Gemini-2.5-Flash in predicting human responses such as AI likelihood, belief, trustworthiness, and impact. While predicting dissemination remains complex, the model shows strong predictive power. This success underscores the value of T-Lens’s human-centric approach and its ability to integrate multimodal information and affective dimensions.
This work provides both empirical insights and practical tools for equipping LLMs with human-awareness capabilities. By untangling the interplay among AI, human cognition, and information reception, the findings suggest actionable strategies for mitigating the risks of AI-driven misinformation. For more detail, refer to the full research paper, "Modeling Human Responses to Multimodal AI Content."


