EMO-Avatar: An LLM-Agent-Orchestrated Framework for Multimodal Emotional Support in Human Animation

Abstract

Emotional support chatbots could unlock substantial value by providing scalable, low-cost, and personalized emotional support, overcoming the accessibility barriers inherent in traditional counseling. However, current text-based chatbots fall short of conveying the multimodal empathy that is crucial in counseling: humans naturally prefer face-to-face communication with peers when sharing feelings, in which spoken tone, micro-expressions, and body language all convey empathy. To bridge this gap, we propose EMO-Avatar, an LLM-agent-orchestrated framework that integrates emotional reasoning with multimodal expression for counseling. Our approach introduces two innovations: (1) a Multimodal Emotional Support Agent: EMO-Avatar follows adaptive instructions across TTS, pose, micro-expressions, and body actions, producing highly expressive human animations; (2) a Comforting-Exploration-Action support strategy: EMO-Avatar systematically integrates Hill's three-stage counseling theory into its emotional reasoning. Guided by the LLM's reasoning, this strategy informs response generation and exhibits stage-specific preferences for speech, body language, and expressions. As a result, EMO-Avatar provides deeper emotional support and more therapeutic, human-like interactions. Experimental validation in the AvaMERG Challenge demonstrates EMO-Avatar's strong performance, achieving a top-2 ranking among 20 participants on response appropriateness, multimodal consistency, naturalness, and emotional expressiveness metrics.
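For concreteness, the stage-conditioned instruction routing described above can be sketched as follows. This is a minimal illustrative sketch: the class, function, stage labels, and preset strings are our assumptions for exposition, not the actual EMO-Avatar implementation.

```python
from dataclasses import dataclass

# Hill's three counseling stages, mapped to the Comforting-Exploration-Action strategy.
STAGES = ("comforting", "exploration", "action")

@dataclass
class MultimodalInstruction:
    """One stage-conditioned instruction bundle for the animation pipeline."""
    stage: str           # counseling stage selected by the LLM
    response_text: str   # empathetic reply to be spoken via TTS
    tts_style: str       # speech-tone cue, e.g. "soft, slow pace"
    expression: str      # micro-expression cue, e.g. "gentle smile"
    body_action: str     # body-language cue, e.g. "lean forward"

def route_stage(llm_reasoning: dict) -> MultimodalInstruction:
    """Map the LLM's stage decision to stage-specific multimodal preferences.

    `llm_reasoning` is assumed to carry a 'stage' label and a 'reply' string
    produced by the orchestrating LLM agent.
    """
    stage = llm_reasoning["stage"]
    assert stage in STAGES, f"unknown stage: {stage}"
    # Hypothetical stage-specific presets for tone, expression, and body language.
    presets = {
        "comforting":  ("soft, slow pace",   "gentle smile",   "slight nod"),
        "exploration": ("neutral, curious",  "raised brows",   "lean forward"),
        "action":      ("firm, encouraging", "confident gaze", "open palms"),
    }
    tts, expr, body = presets[stage]
    return MultimodalInstruction(stage, llm_reasoning["reply"], tts, expr, body)
```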

Team AI4AI

EMO-Avatar: Multimodal Emotional Support Video - Top-2 Solution in the ACMMM25 Challenge

[Baidu Netdisk] | [Homepage] | [ACMMM25 Challenge Link]

Comparison with Baselines

Frames generated by other models: we use the same 22-second audio clip to drive all baseline models and sample frames at 3-second intervals. The result generated by EchoMimic exhibits noticeable distortion, which we attribute to its lack of training data for half-body human driving. Our method demonstrates superior generation quality and stability, particularly in long-duration, high-resolution video synthesis.
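For reproducibility, the frame-sampling step can be implemented as in the following sketch, assuming OpenCV is available; the file names are placeholders, not paths from this repository.

```python
import cv2

def sample_frames(video_path: str, interval_s: float = 3.0):
    """Grab one frame every `interval_s` seconds from a generated video."""
    cap = cv2.VideoCapture(video_path)
    fps = cap.get(cv2.CAP_PROP_FPS) or 25.0   # fall back if FPS metadata is missing
    step = int(round(fps * interval_s))       # frames per sampling interval
    frames, idx = [], 0
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        if idx % step == 0:                   # keep every step-th frame
            frames.append(frame)
        idx += 1
    cap.release()
    return frames

# Example: extract comparison frames from a 22-second clip at 3-second spacing.
frames = sample_frames("emo_avatar_output.mp4")
for i, f in enumerate(frames):
    cv2.imwrite(f"frame_{i:02d}.png", f)
```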

You can access more of our videos through Baidu Netdisk (4 files: audio, video, json, README).
🔗 Link: https://pan.baidu.com/s/1hmVOY2ISejRsaRfpiNbkUA?pwd=AI4A
🔑 Access code: AI4A

Network quality may affect video playback. Please check your connection to YouTube; note that a YouTube login is required.

Gallery (YouTube Videos)

Dialogue between two people

Short conversations

Counseling conversations with paralanguage

Long videos

An example of EMO-Avatar's input and output in the AvaMERG Challenge. In this case, EMO-Avatar generates the fourth-round emotional support animation from the conversation history (videos) of the previous three rounds, adopting the comforting and exploration strategies.
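The round structure in this example can be pictured with the following hypothetical request/response shapes; all field names are illustrative assumptions, not the official AvaMERG schema.

```python
# Hypothetical shape of a fourth-round request to EMO-Avatar.
request = {
    "dialogue_history": [
        # Each prior round pairs the user's turn with the avatar's video reply.
        {"round": 1, "user_text": "...", "avatar_video": "round1.mp4"},
        {"round": 2, "user_text": "...", "avatar_video": "round2.mp4"},
        {"round": 3, "user_text": "...", "avatar_video": "round3.mp4"},
    ],
    "current_user_turn": "I still feel anxious about the interview.",
}

# Hypothetical shape of EMO-Avatar's output for this round.
response = {
    "strategies": ["comforting", "exploration"],  # stages chosen by the LLM
    "response_text": "That sounds stressful. What part worries you most?",
    "animation": "round4.mp4",                    # generated talking-avatar video
}
```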

🏆 Awards 🏆

Bronze Medal

🥉 Subtask 1 - 3rd Place

Multimodal-Aware Empathetic Response Generation.

Silver Medal

🥈 Subtask 2 - 2nd Place

Multimodal Empathetic Response Generation.

AvaMERG@MM2025 Grand Challenge - Avatar-based Multimodal Empathetic Response Generation https://avamerg.github.io/MM25-challenge/