Thesis of Ziqian Zhang
Subject:
Start date: 16/10/2025
End date (estimated): 16/10/2028
Advisor: Liming Chen
Summary:
The main objective of this Ph.D. research is to develop a novel framework, inspired by existing multimodal frameworks, for more comprehensive and accurate subtle emotion recognition through the fusion of multimodal data, including visual, auditory, textual, and physiological signals. Emotion recognition, the automatic understanding and identification of emotional states by intelligent systems, plays a pivotal role in human-computer interaction, human-robot interaction, and the detection of physiological conditions. Traditional emotion recognition systems primarily focus on basic emotions such as happiness, anger, and sadness. However, in real-world contexts, human emotions are often more subtle, mixed, or socially masked (e.g., suppressed frustration, hidden admiration). These complex emotions are conveyed through a combination of channels, including facial expressions, voice, body language, and physiological responses. Hence, understanding human emotions in a more comprehensive and accurate way is essential for applications such as human-robot interaction.
Subtle emotion recognition, which in this research refers to recognizing emotional states at a fine-grained level, remains an underexplored and challenging area. Two key obstacles contribute to this challenge: on the one hand, effectively synchronizing and fusing data from multiple modalities is complex; on the other hand, obtaining training data for rare or novel emotions is costly and time-consuming. Collecting and annotating large volumes of emotional stimuli is especially difficult, particularly as the range of personal, fine-grained emotional annotations keeps expanding. Therefore, this research intends to develop a novel framework for subtle emotion recognition that integrates multiple modalities into a shared embedding space, facilitating a more nuanced understanding of human emotions. Given the difficulty of collecting large-scale annotated emotional datasets, self-supervised learning methods and large language models can be leveraged to reduce the dependency on extensive, labor-intensive data annotation. The proposed framework seeks to achieve a more comprehensive understanding of human emotions in naturalistic scenarios, enabling applications ranging from mental health monitoring to more advanced human-robot interaction.
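To make the shared-embedding idea concrete, the following minimal sketch (in PyTorch; the modality names, feature dimensions, and InfoNCE-style alignment loss are illustrative assumptions rather than the thesis design) projects features from several modalities into a common space and aligns paired samples without emotion labels, in the spirit of self-supervised learning:

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class SharedEmbedding(nn.Module):
        # Projects per-modality features (e.g. visual, audio, text) into one shared space.
        def __init__(self, dims, d_shared=256):
            super().__init__()
            self.heads = nn.ModuleDict({m: nn.Linear(d, d_shared) for m, d in dims.items()})

        def forward(self, features):
            # features: dict mapping modality name -> (batch, dim) tensor
            return {m: F.normalize(self.heads[m](x), dim=-1) for m, x in features.items()}

    def contrastive_alignment(z_a, z_b, temperature=0.07):
        # Symmetric InfoNCE-style loss pulling together embeddings of the same sample
        # observed through two different modalities (a self-supervised signal, no labels).
        logits = z_a @ z_b.t() / temperature
        targets = torch.arange(z_a.size(0), device=z_a.device)
        return 0.5 * (F.cross_entropy(logits, targets) + F.cross_entropy(logits.t(), targets))

    # Illustrative usage with made-up feature dimensions:
    dims = {"visual": 512, "audio": 128, "text": 768}
    model = SharedEmbedding(dims)
    feats = {m: torch.randn(8, d) for m, d in dims.items()}
    z = model(feats)
    loss = contrastive_alignment(z["visual"], z["audio"]) + contrastive_alignment(z["visual"], z["text"])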
This research aims to:
• Investigate the challenges posed by multimodal subtle emotion recognition, including data collection and annotation, data integration from diverse sources, model sensitivity to subtle emotions, and generalization across different cultures.
• Design a multimodal subtle emotion recognition framework that integrates visual, auditory, textual, and potentially physiological cues.
• Develop algorithms and techniques that reduce the need for large-scale annotated data and adapt to sparse or even missing modalities while still providing accurate emotion recognition (see the sketch after this list).
• Evaluate the proposed framework on benchmark datasets and compare its performance with state-of-the-art emotion recognition approaches.
• Investigate the generalization of the proposed framework across different individuals and cultural contexts.
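As a companion to the objective on sparse or missing modalities, the sketch below (again hypothetical PyTorch code, reusing the shared embedding dimension assumed above; the number of emotion classes is a placeholder) fuses whichever modality embeddings are present for a sample by averaging them, so recognition can degrade gracefully when a channel is unavailable:

    import torch
    import torch.nn as nn

    class MaskedFusionClassifier(nn.Module):
        # Averages the available modality embeddings and classifies the result.
        def __init__(self, d_shared=256, n_classes=12):
            super().__init__()
            self.classifier = nn.Linear(d_shared, n_classes)

        def forward(self, embeddings):
            # embeddings: dict of modality -> (batch, d_shared); missing modalities are simply absent.
            stacked = torch.stack(list(embeddings.values()), dim=0)  # (n_present, batch, d_shared)
            fused = stacked.mean(dim=0)                              # average over present modalities
            return self.classifier(fused)

    # Illustrative usage: the audio channel is missing for this batch.
    clf = MaskedFusionClassifier()
    z = {"visual": torch.randn(8, 256), "text": torch.randn(8, 256)}
    logits = clf(z)  # (8, 12) emotion scores from the remaining modalities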