Advancing multimodal emotion recognition in big data through prompt engineering and deep adaptive learning

Document Type

Article

Publication Date

Summer 8-26-2025

Abstract

Emotion recognition in dynamic and real-world environments presents significant challenges due to the complexity and variability of multimodal data. This paper introduces an innovative Multimodal Emotion Recognition (MER) framework that seamlessly integrates text, audio, video, and motion data using advanced machine learning techniques. To address challenges such as class imbalance, the framework employs Generative Adversarial Networks (GANs) for synthetic sample generation and Dynamic Prompt Engineering (DPE) for enhanced feature extraction across modalities. Text features are processed with Mistral-7B, audio with HuBERT, video with TimeSformer and LLaVA, and motion with MediaPipe Pose. The system efficiently fuses these inputs using Hierarchical Attention-based Graph Neural Networks (HAN-GNN) and Cross Modality Transformer Fusion (XMTF), further improved by contrastive learning with Prototypical Networks to enhance class separation. The framework demonstrates exceptional performance, achieving training accuracies of 99.92% on IEMOCAP and 99.95% on MELD, with testing accuracies of 99.82% and 99.81%, respectively. High precision, recall, and specificity further highlight the robustness of the model. Although trained on batch-processed datasets, the framework has been optimized for real-time applications: training completes in just 5 min and inference takes under 0.4 ms per sample, which makes the system well suited for real-time emotion recognition tasks. It also generalizes effectively to noisy and multilingual settings, achieving strong results on SAVEE and CMU-MOSEAS, thereby confirming its resilience in diverse real-world scenarios. This research advances the field of MER, offering a scalable and efficient solution for affective computing.
The findings emphasize the importance of refining these systems for real-world applications, particularly in complex, multimodal big data environments.
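
The abstract names Prototypical Networks as the mechanism for improving class separation but gives no implementation details. As a rough illustration only, the sketch below shows the standard prototype-based classification rule: each class is represented by the mean of its embeddings, and a query is assigned by a softmax over negative squared distances to those means. The two-dimensional "fused" embeddings, the emotion labels, and all function names are hypothetical and are not taken from the paper.

```python
import math


def class_prototypes(embeddings, labels):
    """Return one prototype per class: the mean of that class's embeddings."""
    sums, counts = {}, {}
    for vec, label in zip(embeddings, labels):
        if label not in sums:
            sums[label] = [0.0] * len(vec)
            counts[label] = 0
        sums[label] = [s + v for s, v in zip(sums[label], vec)]
        counts[label] += 1
    return {label: [s / counts[label] for s in sums[label]] for label in sums}


def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def predict_proba(query, prototypes):
    """Softmax over negative squared distances from the query to each prototype."""
    logits = {
        label: -squared_distance(query, proto)
        for label, proto in prototypes.items()
    }
    peak = max(logits.values())  # subtract the max for numerical stability
    exps = {label: math.exp(value - peak) for label, value in logits.items()}
    total = sum(exps.values())
    return {label: e / total for label, e in exps.items()}


def predict(query, prototypes):
    """Return the class whose prototype is most probable for the query."""
    probs = predict_proba(query, prototypes)
    return max(probs, key=probs.get)


# Hypothetical fused embeddings standing in for the output of a fusion stage.
support = [[0.9, 0.1], [1.1, -0.1], [-1.0, 0.2], [-0.8, -0.2]]
support_labels = ["happy", "happy", "sad", "sad"]

prototypes = class_prototypes(support, support_labels)
print(predict([0.7, 0.0], prototypes))
```

In a trained system the embeddings would come from the learned encoders and fusion layers, and the same distance-based softmax would serve as the training loss; here the vectors are fixed by hand purely to show the decision rule.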
