[MMML] Multimodal Deep Learning (ICML2011)

https://people.csail.mit.edu/khosla/papers/icml2011_ngiam.pdf

 

0. Abstract 

  • THIS paper
    • proposes a series of tasks for multimodal learning and shows how to train deep networks for them
    • cross modality feature learning: better features for one modality can be learned when multiple modalities are present at feature-learning time
    • shows how to learn a shared representation between modalities

1. Introduction 

  • start of MMML: audio-visual speech recognition, motivated by the McGurk effect (visual lip movements change the perceived sound, e.g. an audio /ba/ paired with a visual /ga/ is often heard as /da/)
  • THIS paper
    • focus on modeling "mid-level" relationships between modalities
    • task: audio-visual speech classification (learning representations for speech audio that are coupled with videos of the lips)
  • phases
    • 1. feature learning
    • 2. supervised training
    • 3. testing
  • learning settings (summarized in the sketch after this list)
    • 1. multimodal fusion: data from all modalities is available at all phases
    • 2. cross modality learning
      • data from multiple modalities is available only during feature learning
      • during the supervised training and testing phases, only data from a single modality is provided
    • 3. shared representation learning
      • data from one modality is used for supervised training, while a different modality is presented at testing time
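
A minimal sketch of the three settings, assuming an audio-visual task with "audio" and "video" as the two modalities; the dictionary layout and the choice of which modality goes where in settings 2 and 3 are illustrative, not from the paper's code:

```python
# Which modalities are available in each phase, per learning setting.
# (Illustrative: the audio/video assignment in settings 2 and 3 is just
# one possible configuration.)
LEARNING_SETTINGS = {
    "multimodal_fusion": {
        "feature_learning":    {"audio", "video"},
        "supervised_training": {"audio", "video"},
        "testing":             {"audio", "video"},
    },
    "cross_modality_learning": {
        "feature_learning":    {"audio", "video"},
        "supervised_training": {"video"},   # single modality, e.g. video only
        "testing":             {"video"},
    },
    "shared_representation_learning": {
        "feature_learning":    {"audio", "video"},
        "supervised_training": {"audio"},   # train on one modality ...
        "testing":             {"video"},   # ... test on the other
    },
}
```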

2. Background

  • THIS paper
    • sparse RBM --> used as layer-wise building block

2.1. Sparse restricted Boltzmann machines

 

Related post: [ML] RBM & sparse RBM (judingstat.tistory.com)
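
Since the RBM details are deferred to that separate post, here is a minimal, self-contained sketch of a binary sparse RBM trained with one-step contrastive divergence (CD-1). The penalty that nudges the mean hidden activation toward a small target is one common formulation; the class name, hyperparameters, and exact penalty are assumptions for illustration, not the paper's implementation.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

class SparseRBM:
    """Binary RBM trained with CD-1 plus a simple sparsity penalty that
    pushes the mean hidden activation toward `sparsity_target`."""

    def __init__(self, n_visible, n_hidden, lr=0.01,
                 sparsity_target=0.05, sparsity_cost=0.1, seed=0):
        self.rng = np.random.default_rng(seed)
        self.W = 0.01 * self.rng.standard_normal((n_visible, n_hidden))
        self.b_v = np.zeros(n_visible)   # visible biases
        self.b_h = np.zeros(n_hidden)    # hidden biases
        self.lr = lr
        self.sparsity_target = sparsity_target
        self.sparsity_cost = sparsity_cost

    def hidden_probs(self, v):
        return sigmoid(v @ self.W + self.b_h)

    def visible_probs(self, h):
        return sigmoid(h @ self.W.T + self.b_v)

    def cd1_step(self, v0):
        # Positive phase: hidden probabilities and a sample given the data.
        h0 = self.hidden_probs(v0)
        h0_sample = (self.rng.random(h0.shape) < h0).astype(float)
        # Negative phase: one Gibbs step back to the visibles, then hiddens.
        v1 = self.visible_probs(h0_sample)
        h1 = self.hidden_probs(v1)
        n = v0.shape[0]
        # CD-1 gradient estimates.
        dW = (v0.T @ h0 - v1.T @ h1) / n
        db_v = (v0 - v1).mean(axis=0)
        db_h = (h0 - h1).mean(axis=0)
        # Sparsity penalty: move the mean hidden activation toward the target.
        db_h += self.sparsity_cost * (self.sparsity_target - h0.mean(axis=0))
        self.W += self.lr * dW
        self.b_v += self.lr * db_v
        self.b_h += self.lr * db_h
```

For real-valued audio/video features one would typically use Gaussian visible units rather than the binary visibles sketched here.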

3. Learning architectures

1) Greedily training an RBM over the pre-trained layers of each modality (bimodal DBN)

 

  • WHY? 
    • just (jointly) concatenating the modalities --> limited: it amounts to learning a shallow bimodal RBM
    • hidden units end up strongly connected to an individual modality
    • only a few units connect across the modalities
  • HOW?
    • use the posteriors of the first-layer hidden variables as the training data for the new (joint) layer
    • --> easier for the model to learn higher-order correlations across modalities (see the sketch below)
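
A minimal sketch of this greedy construction, using scikit-learn's BernoulliRBM as a stand-in for the paper's sparse RBMs; the random data, layer sizes, and hyperparameters are illustrative assumptions:

```python
import numpy as np
from sklearn.neural_network import BernoulliRBM

# Placeholder data: rows are examples, columns are preprocessed features.
rng = np.random.default_rng(0)
X_audio = rng.random((500, 100))   # e.g. spectrogram-derived audio features
X_video = rng.random((500, 60))    # e.g. mouth-region video features

# 1) Pre-train one first-layer RBM per modality.
rbm_audio = BernoulliRBM(n_components=64, learning_rate=0.05, n_iter=20, random_state=0)
rbm_video = BernoulliRBM(n_components=64, learning_rate=0.05, n_iter=20, random_state=0)
rbm_audio.fit(X_audio)
rbm_video.fit(X_video)

# 2) The posteriors of the first-layer hidden variables become the
#    training data for the new layer: concatenate them across modalities.
H_audio = rbm_audio.transform(X_audio)   # P(h = 1 | v) for audio
H_video = rbm_video.transform(X_video)   # P(h = 1 | v) for video
H_joint_input = np.hstack([H_audio, H_video])

# 3) Greedily train a bimodal RBM on top of the concatenated posteriors.
rbm_bimodal = BernoulliRBM(n_components=128, learning_rate=0.05, n_iter=20, random_state=0)
rbm_bimodal.fit(H_joint_input)

# Joint representation for downstream supervised training.
shared = rbm_bimodal.transform(H_joint_input)
```

Stacking the per-modality layers and the joint layer in this way yields the bimodal DBN whose weights later initialize the deep autoencoder.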

 

2) Deep autoencoder

 

  • WHY?
    • no explicit objective for the models to discover correlations across the modalities
      • it is possible for the model to find representations in which some hidden units are tuned only to audio while others are tuned only to video
    • we want to use the models in a cross modality learning setting where only one modality is present during supervised training and testing
      • with only a single modality present, one would need to integrate out the unobserved visible variables to perform inference
  • HOW?
    • train the deep autoencoder to reconstruct both modalities when given only one modality as input
    • --> it is forced to discover correlations across the modalities (a sketch follows below)
    • initialize its weights with the bimodal DBN weights from 1)
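
A rough sketch of this idea in PyTorch: the network sees only video but is trained to reconstruct both audio and video. Layer sizes, names, and the squared-error loss are assumptions for illustration, and the DBN-based weight initialization mentioned above is omitted:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class VideoOnlyDeepAutoencoder(nn.Module):
    """Video is the only input; the decoder reconstructs BOTH modalities,
    so the shared layer must capture audio-video correlations."""

    def __init__(self, audio_dim=100, video_dim=60, hidden_dim=64, shared_dim=128):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Linear(video_dim, hidden_dim), nn.Sigmoid(),
            nn.Linear(hidden_dim, shared_dim), nn.Sigmoid(),   # shared representation
        )
        self.dec_audio = nn.Sequential(
            nn.Linear(shared_dim, hidden_dim), nn.Sigmoid(),
            nn.Linear(hidden_dim, audio_dim),
        )
        self.dec_video = nn.Sequential(
            nn.Linear(shared_dim, hidden_dim), nn.Sigmoid(),
            nn.Linear(hidden_dim, video_dim),
        )

    def forward(self, video):
        shared = self.encoder(video)
        return self.dec_audio(shared), self.dec_video(shared)

# Toy training step: reconstruct both modalities from video alone.
model = VideoOnlyDeepAutoencoder()
video = torch.rand(32, 60)
audio = torch.rand(32, 100)          # used only as a reconstruction target
audio_hat, video_hat = model(video)
loss = F.mse_loss(audio_hat, audio) + F.mse_loss(video_hat, video)
loss.backward()
```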

 

3) Training the bimodal deep autoencoder on an augmented but noisy dataset with additional examples that have only a single modality as input (inspired by denoising autoencoders)

 

  • WHY?
    • when multiple modalities are present --> how should the deep autoencoder be used?
    • just use tied decoding weights? --> does not scale well
  • HOW?
    • add training examples that have zero values for one of the input modalities (single-modality examples), while still requiring reconstruction of both modalities
    • ex. one-third of the training data has only video as input, another one-third has only audio, and the last one-third has both audio and video (see the sketch below)
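
A small sketch of this augmentation (function and variable names are assumptions): one third of the examples keep both modalities, one third zero out the audio input, and one third zero out the video input, while the reconstruction targets always remain the original audio and video:

```python
import numpy as np

def make_augmented_dataset(X_audio, X_video, seed=0):
    """Build inputs/targets for the bimodal deep autoencoder: zero out one
    modality for two thirds of the examples, but keep the original features
    of both modalities as reconstruction targets."""
    rng = np.random.default_rng(seed)
    n = X_audio.shape[0]
    video_only, audio_only, both = np.array_split(rng.permutation(n), 3)

    audio_in = X_audio.copy()
    video_in = X_video.copy()
    audio_in[video_only] = 0.0   # video-only examples: audio input zeroed
    video_in[audio_only] = 0.0   # audio-only examples: video input zeroed
    # `both`: left untouched, both modalities present

    inputs = np.hstack([audio_in, video_in])   # augmented, "noisy" inputs
    targets = np.hstack([X_audio, X_video])    # always reconstruct both
    return inputs, targets

# Example usage with random placeholder features.
X_audio = np.random.rand(300, 100)
X_video = np.random.rand(300, 60)
inputs, targets = make_augmented_dataset(X_audio, X_video)
```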

 

4. Experiments and Results

4.1. Data Preprocessing

4.2. Datasets and Task

4.3. Cross Modality Learning

4.4. Multimodal Fusion Results

4.5. McGurk effect

4.6. Shared Representation Learning

4.7. Additional Control Experiments