Abstract

The diverse emotions expressed in instructional behavior strongly influence teaching effectiveness and learners' cognitive state. An emotion recognition model can analyze the feedback contained in teaching behavior data and thereby help improve pedagogical effectiveness. However, conventional emotion recognition models fail to capture the intricate emotional features and subtle nuances of teaching behavior, which limits the accuracy of emotion classification. We therefore propose a multimodal emotion recognition model for teaching behavior based on the Long Short-Term Memory (LSTM) network and the Multi-Scale Convolutional Neural Network (MSCNN). Our approach extracts low-level and high-level local features from the text, audio, and image modalities using LSTM and MSCNN, respectively. A Transformer encoder then fuses the extracted features, which are passed to a fully connected layer for emotion recognition. On a self-curated dataset, the proposed model achieves an accuracy of 84.5% and an F1 score of 84.1%, surpassing the comparison models and demonstrating the effectiveness of the approach.

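To make the described architecture concrete, the following is a minimal PyTorch sketch of the pipeline the abstract outlines: a per-modality encoder combining an LSTM with parallel multi-scale Conv1d branches, a Transformer encoder fusing the three modality representations, and a fully connected classification layer. All names, feature dimensions, kernel sizes, and the number of emotion classes are illustrative assumptions, not the authors' published configuration.

```python
# Hypothetical sketch of the LSTM + MSCNN multimodal model described in the
# abstract. Dimensions and hyperparameters are assumptions for illustration.
import torch
import torch.nn as nn


class ModalityEncoder(nn.Module):
    """Encodes one modality: an LSTM captures sequential features, and a
    multi-scale CNN (parallel Conv1d branches with different kernel sizes)
    captures local features at several scales."""

    def __init__(self, input_dim, hidden_dim=128, kernel_sizes=(3, 5, 7)):
        super().__init__()
        self.lstm = nn.LSTM(input_dim, hidden_dim, batch_first=True)
        self.branches = nn.ModuleList([
            nn.Sequential(
                nn.Conv1d(input_dim, hidden_dim, k, padding=k // 2),
                nn.ReLU(),
                nn.AdaptiveMaxPool1d(1),
            )
            for k in kernel_sizes
        ])
        self.proj = nn.Linear(hidden_dim * (1 + len(kernel_sizes)), hidden_dim)

    def forward(self, x):                      # x: (batch, time, input_dim)
        _, (h, _) = self.lstm(x)               # h: (1, batch, hidden_dim)
        lstm_feat = h[-1]                      # final hidden state
        conv_in = x.transpose(1, 2)            # (batch, input_dim, time)
        conv_feats = [b(conv_in).squeeze(-1) for b in self.branches]
        fused = torch.cat([lstm_feat] + conv_feats, dim=-1)
        return self.proj(fused)                # (batch, hidden_dim)


class MultimodalEmotionModel(nn.Module):
    """Per-modality LSTM + MSCNN encoders, a Transformer encoder to fuse the
    text, audio, and image representations, and a fully connected classifier."""

    def __init__(self, text_dim, audio_dim, image_dim,
                 hidden_dim=128, num_classes=6):  # num_classes is assumed
        super().__init__()
        self.encoders = nn.ModuleDict({
            "text": ModalityEncoder(text_dim, hidden_dim),
            "audio": ModalityEncoder(audio_dim, hidden_dim),
            "image": ModalityEncoder(image_dim, hidden_dim),
        })
        layer = nn.TransformerEncoderLayer(
            d_model=hidden_dim, nhead=4, batch_first=True)
        self.fusion = nn.TransformerEncoder(layer, num_layers=2)
        self.classifier = nn.Linear(hidden_dim, num_classes)

    def forward(self, text, audio, image):
        tokens = torch.stack(
            [self.encoders["text"](text),
             self.encoders["audio"](audio),
             self.encoders["image"](image)], dim=1)  # (batch, 3, hidden_dim)
        fused = self.fusion(tokens).mean(dim=1)      # pool modality tokens
        return self.classifier(fused)                # emotion logits


if __name__ == "__main__":
    # Example with assumed per-modality feature sizes and sequence length 20.
    model = MultimodalEmotionModel(text_dim=300, audio_dim=74, image_dim=512)
    logits = model(torch.randn(2, 20, 300),
                   torch.randn(2, 20, 74),
                   torch.randn(2, 20, 512))
    print(logits.shape)  # torch.Size([2, 6])
```

In this sketch, each modality contributes a single token to the Transformer encoder; other fusion granularities (e.g., frame-level tokens) would also be consistent with the abstract's description.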
Full text