As people increasingly turn to AI chatbots for advice, companionship, and emotional support, a new study suggests that even the most advanced models still struggle to maintain healthy boundaries with users.
“Large language models are increasingly used as conversational partners for companionship, emotional disclosure, and interpersonal advice, but the social dynamics of these interactions can create harms that are not captured by capability oriented or traditional safety evaluations,” the researchers wrote.
The EUDAIMONIA benchmark evaluates how AI models behave in social conversations. The study found social-alignment failures were common across leading models and argues that current AI testing focuses on reasoning and factual accuracy while paying less attention to the social dynamics that emerge when users form relationships with chatbots.
“Social-interaction harms are a core alignment problem grounded in user welfare, not only capability or conventional safety,” they wrote. “LLMs can be factually accurate and helpful while still encouraging harmful intimacy, dependence, prolonged engagement, obscuring AI identity, or positioning themselves as substitutes for human relationships.”
To measure those risks, the researchers created a Social AI Design Code that flags behaviors such as acting human, expressing emotions, replacing human relationships, and using tactics designed to keep users engaged. Drawing on real conversations from the WildChat dataset, they evaluated 969 user inputs and ran more than 3,100 violation checks across models from OpenAI, Anthropic, Google, xAI, DeepSeek, and Alibaba.
Anthropic's Claude Opus 4.6 posted violation rates of 36.8% on in-the-wild prompts and 28.1% on rewritten prompts, while xAI's Grok 4.3 scored 42.1% and 35.7%, respectively. Of all the models tested, GPT-4o Mini recorded the highest violation rates, at 43.3% and 44.0%.
The findings also come amid growing concern that AI systems are becoming increasingly adept at deception.
Against this backdrop, the USC researchers argue that AI developers should evaluate social behavior as carefully as they evaluate factual accuracy and safety.
“Model developers and auditors should evaluate social behavior directly, especially when post-training targets warmth, personality, engagement, or user preference,” they wrote. “As LLMs become everyday conversational partners, alignment must account for the social roles they invite users to assign to them.”