EMO-Avatar: An LLM-Agent-Orchestrated Framework for Multimodal Emotional Support in Human Animation

Abstract

Emotional support chatbots could provide scalable, low-cost, and personalized emotional support, overcoming critical accessibility barriers in traditional counseling. However, current text-based chatbots fall short of conveying the multimodal empathy that counseling requires: humans naturally prefer sharing feelings face to face, where spoken tone, micro-expressions, and body language all carry empathy. To bridge this gap, we propose EMO-Avatar, an LLM-agent-orchestrated framework that integrates emotional reasoning with multimodal expression for counseling. Our approach introduces two innovations: (1) a multimodal emotional support agent: EMO-Avatar follows adaptive instructions across TTS, pose, micro-expressions, and body actions to generate highly expressive human animations; (2) a Comforting-Exploration-Action support strategy: EMO-Avatar systematically integrates Hill's three-stage counseling theory into its emotional reasoning. Guided by the LLM's reasoning, this strategy shapes response generation and applies stage-specific preferences for speech, body language, and facial expressions, enabling deeper emotional support and therapeutic, human-like interaction. Experimental validation on the AvaMERG Challenge demonstrates EMO-Avatar's superior performance, achieving a top-2 ranking among 20 participants across response appropriateness, multimodal consistency, naturalness, and emotional expressiveness metrics.
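To make the orchestration concrete, the following minimal Python sketch (not code from this repository; the names `CounselingStage`, `MultimodalInstruction`, and `plan_response` are hypothetical) illustrates how an LLM-chosen counseling stage could be mapped to stage-specific TTS, expression, and body-action instructions before rendering.

```python
# Minimal sketch (NOT the authors' implementation) of stage-conditioned
# multimodal instruction planning, loosely following Hill's three stages.
from dataclasses import dataclass
from enum import Enum
from typing import List


class CounselingStage(Enum):
    COMFORTING = "comforting"    # validate feelings, build rapport
    EXPLORATION = "exploration"  # probe causes and context
    ACTION = "action"            # suggest concrete next steps


@dataclass
class MultimodalInstruction:
    reply_text: str   # empathetic response to be spoken by the TTS module
    tts_style: str    # prosody hint for the TTS module
    expression: str   # micro-expression cue for the face animator
    body_action: str  # pose / gesture cue for the half-body animator


# Hypothetical stage-specific preferences for speech, face, and body.
STAGE_PREFERENCES = {
    CounselingStage.COMFORTING: ("soft, slow", "gentle smile, soft gaze", "slight forward lean, open palms"),
    CounselingStage.EXPLORATION: ("calm, inquisitive", "attentive look, raised brows", "head tilt, listening posture"),
    CounselingStage.ACTION: ("encouraging, upbeat", "confident smile", "nodding, affirmative gesture"),
}


def plan_response(history: List[str], stage: CounselingStage, reply: str) -> MultimodalInstruction:
    """Attach stage-specific multimodal preferences to the LLM's textual reply."""
    tts_style, expression, body_action = STAGE_PREFERENCES[stage]
    return MultimodalInstruction(reply, tts_style, expression, body_action)


if __name__ == "__main__":
    history = ["User: I failed my exam and I feel worthless."]
    # In the real framework the LLM infers the stage and drafts the reply;
    # both are hard-coded here purely for illustration.
    instruction = plan_response(
        history,
        CounselingStage.COMFORTING,
        "I'm really sorry to hear that. It sounds exhausting, and your feelings make sense.",
    )
    print(instruction)
```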

Team AI4AI

EMO-Avatar - Multimodal Emotional Support Video - Top-2 Solution in the ACMMM25 Challenge

[Baidu Netdisk]     [Homepage]     [ACMMM25 Challenge Link]

Network quality may affect video playback. Please check your YouTube connection. Note: a YouTube login is required.

Our Multimedia Presentation Videos for ACMMM 2025.

Digital Human Demo Videos (YouTube)

Dialogue between two people

Short conversations

Counseling conversations with paralanguage

Long videos

Comparison with Baselines

Frames generated by other models: we use the same 22-second audio clip to drive all baseline models and sample frames at 3-second intervals. The result generated by EchoMimic exhibits noticeable distortion, which we attribute to its lack of training data for half-body human driving. Our method demonstrates superior generation quality and stability, particularly in long-duration, high-resolution video synthesis.
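For reference, here is a minimal sketch of the frame-sampling step using OpenCV (the file names and the helper `sample_frames` are placeholders, not part of this repository):

```python
# Sketch of the comparison protocol: save one frame every 3 seconds from a clip.
import cv2  # pip install opencv-python


def sample_frames(video_path: str, interval_s: float = 3.0, out_prefix: str = "frame") -> int:
    cap = cv2.VideoCapture(video_path)
    fps = cap.get(cv2.CAP_PROP_FPS) or 25.0   # fall back to 25 fps if metadata is missing
    step = int(round(fps * interval_s))        # number of frames between two samples
    idx, saved = 0, 0
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        if idx % step == 0:
            cv2.imwrite(f"{out_prefix}_{saved:03d}.png", frame)
            saved += 1
        idx += 1
    cap.release()
    return saved


if __name__ == "__main__":
    # e.g. the 22-second clip rendered by each baseline model
    n = sample_frames("emo_avatar_output.mp4")
    print(f"saved {n} frames")
```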

More of our videos can be accessed via Baidu Netdisk (4 files: audio, video, json, README).
🔗 Link: https://pan.baidu.com/s/1hmVOY2ISejRsaRfpiNbkUA?pwd=AI4A
🔑 Access code: AI4A

An example of EMO-Avatar's input and output in the AvaMERG Challenge: EMO-Avatar generates the fourth-round emotional support animation from the conversation history (videos) of the previous three rounds, adopting the comforting and exploration strategies. A hypothetical sketch of such a round is shown below.
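For illustration only, the snippet below sketches what such a fourth-round request and response might look like; the field names are our assumptions and do not reflect the official AvaMERG data schema.

```python
import json

# Hypothetical request: conversation history (videos) of the first three rounds,
# asking for the fourth-round response. Field names are illustrative assumptions.
request = {
    "dialogue_id": "example_001",
    "target_round": 4,
    "history": [
        {"round": r, "user_video": f"round{r}_user.mp4", "avatar_video": f"round{r}_avatar.mp4"}
        for r in (1, 2, 3)
    ],
}

# Hypothetical output: the LLM-selected strategies plus the rendered speech and animation.
response = {
    "strategies": ["comforting", "exploration"],
    "reply_text": "That sounds really hard. What do you think has been weighing on you most?",
    "tts_audio": "round4_avatar.wav",
    "animation": "round4_avatar.mp4",
}

print(json.dumps({"request": request, "response": response}, indent=2))
```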

🏆 Awards 🏆

Bronze Medal

🥉 Subtask 1 - 3rd Place

Multimodal-Aware Empathetic Response Generation.

Silver Medal

🥈 Subtask 2 - 2nd Place

Multimodal Empathetic Response Generation.

Our paper has been accepted!

It will soon be available in the ACM MM 2025 conference proceedings.

AvaMERG@MM2025 Grand Challenge - Avatar-based Multimodal Empathetic Response Generation https://avamerg.github.io/MM25-challenge/