Multimodal Learning for Facial Expression Recognition
Wei Zhang1, Youmei Zhang1, Lin Ma2, Jingwei Guan2, and Shijie Gong1
1School of Control Science and Engineering, Shandong University, Jinan, China
2Huawei Noah's Ark Lab, Hong Kong
In this paper, multimodal learning for facial expression recognition (FER) is proposed. The method makes the first attempt to learn a joint representation from the texture and landmark modalities of facial images, which are complementary to each other. In order to learn the representation of each modality as well as the correlations and interactions between modalities, structured regularization (SR) is employed to enforce modality-specific sparsity and density, respectively. By introducing SR, the comprehensiveness of the facial expression is fully taken into consideration, so the model can not only handle subtle expressions but also perform robustly across different facial image inputs. With the proposed multimodal learning network, the joint representation learned from multimodal inputs is better suited to FER.
The contributions of this work:
The databases contain both the texture and landmark modalities for each facial image. These two modalities reflect different properties of the facial expression and should be considered together for FER. The texture and landmark modalities of each facial image are first processed separately before being fed into the multimodal FER system.
Fig. 1. The structure of our approach
Fig. 2. The structure of the network
Multimodal FER:
The proposed learning architecture is illustrated in Fig. 2, which takes different numbers and types of modalities as inputs. The output will be the joint representation, which not only considers each modality property but also accounts for the interactions of different modalities.
For the texture modality, image patches are extracted around the eyes and mouth of each frame, since these regions contain the facial features most relevant to expressions. For the landmark modality, we compute the difference between the landmark coordinates of the current frame and those of the previous frame in the X and Y directions, treating these differences as the landmark movements.
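The landmark-motion features described above can be sketched as follows. This is an illustrative NumPy snippet under our own assumptions (the paper does not specify the exact feature ordering or any normalization); `landmark_motion` is a hypothetical helper name.

```python
import numpy as np

def landmark_motion(prev_pts, curr_pts):
    """Per-landmark displacement between consecutive frames.

    prev_pts, curr_pts: (N, 2) arrays of (x, y) landmark coordinates.
    Returns a flat feature vector: all X movements, then all Y movements.
    (Sketch only; the paper's exact ordering/normalization is unspecified.)
    """
    delta = curr_pts - prev_pts      # (N, 2): [dx, dy] for each landmark
    return delta.T.reshape(-1)       # dX values first, then dY values

# Toy usage: two consecutive frames with 3 landmarks each
prev = np.array([[10., 20.], [30., 40.], [50., 60.]])
curr = np.array([[11., 19.], [30., 42.], [49., 60.]])
feat = landmark_motion(prev, curr)
# feat = [1., 0., -1., -1., 2., 0.]
```

The resulting motion vector is what would be fed to the landmark branch of the network, analogous to the cropped eye/mouth patches on the texture branch.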
Fig. 3. The structure of the AE (a) without SR and (b) with SR
The network is pre-trained with an auto-encoder (AE). To account for each modality's properties and for the interactions between modalities, structured regularization (SR) is attached to the AE. The structures of the AE without and with SR are shown in Fig. 3.
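One way to realize the structured regularization described above is to partition the hidden units into texture-specific, landmark-specific, and shared groups, and penalize only the cross-modality weight blocks. The sketch below is our own minimal interpretation, not the paper's implementation; the block layout, `sr_penalty` name, and L1 penalty choice are all assumptions.

```python
import numpy as np

def sr_penalty(W, n_tex, n_lm, tex_units, lm_units, lam=0.1):
    """Structured-regularization term on an AE encoder weight matrix W.

    Assumed (hypothetical) layout: W has shape (inputs, hidden); the first
    n_tex input rows are texture features, the next n_lm rows are landmark
    features; the first tex_units hidden columns are texture-specific, the
    next lm_units are landmark-specific, and the rest are shared.

    An L1 penalty drives the cross-modality blocks toward zero
    (modality-specific sparsity) while leaving the within-modality and
    shared blocks unpenalized (density), mirroring the idea in the text.
    """
    cross1 = W[:n_tex, tex_units:tex_units + lm_units]  # texture inputs -> landmark units
    cross2 = W[n_tex:n_tex + n_lm, :tex_units]          # landmark inputs -> texture units
    return lam * (np.abs(cross1).sum() + np.abs(cross2).sum())

# Toy usage: 2 texture + 2 landmark inputs; 1 texture, 1 landmark, 1 shared unit
W = np.ones((4, 3))
penalty = sr_penalty(W, n_tex=2, n_lm=2, tex_units=1, lm_units=1, lam=0.1)
# penalty = 0.1 * (2 + 2) = 0.4
```

During pre-training, this term would be added to the AE's reconstruction loss, so gradient descent prunes cross-modality connections while the shared units remain free to model the interactions between modalities.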
Comparison with prior studies on the CK+ database (first 6 frames)
Comparison with prior studies on the CK+ database (first-last frames)
Comparison of the algorithms with and without AE
The recognition result with respect to the number of (a) hidden layers, (b) hidden units
Experimental results on spontaneous facial expression database
Wu et al. — Wu T, Bartlett M S, Movellan J R. Facial expression recognition using gabor motion energy filters. In: Computer Vision and Pattern Recognition Workshops (CVPRW), 2010 IEEE Computer Society Conference on. IEEE, 2010: 42-47.
Long et al. — Long F, Wu T, Movellan J R, et al. Learning spatiotemporal features by using independent component analysis with application to facial expression recognition. Neurocomputing, 2012, 93: 126-132.
Jeni et al. — Jeni L A, Girard J M, Cohn J F, et al. Continuous AU intensity estimation using localized, sparse facial feature space. In: Automatic Face and Gesture Recognition (FG), 2013 10th IEEE International Conference and Workshops on. IEEE, 2013: 1-7.
Lorincz et al. — Lorincz A, Jeni L, Szabo Z, et al. Emotional expression classification using time-series kernels. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops. 2013: 889-895.
Yang et al. — Yang P, Liu Q, Cui X, et al. Facial expression recognition using encoded dynamic features. In: Computer Vision and Pattern Recognition, 2008. CVPR 2008. IEEE Conference on. IEEE, 2008: 1-8.
Wang et al. — Wang Z, Simoncelli E P, Bovik A C. Multiscale structural similarity for image quality assessment. In: Proc. Asilomar Conference on Signals, Systems, and Computers, 2003.
He et al. — He S, Wang S, Lv Y. Spontaneous facial expression recognition based on feature point tracking. In: Image and Graphics (ICIG), 2011 Sixth International Conference on. IEEE, 2011: 760-765.
Update: Apr. 27, 2016