[MMML] Multimodal Deep Learning (ICML2011)

https://people.csail.mit.edu/khosla/papers/icml2011_ngiam.pdf

 

0. Abstract 

  • THIS paper
    • proposes a series of tasks for multimodal learning and shows how to train deep networks for them
    • cross modality feature learning: better features for one modality can be learned when multiple modalities are present at feature-learning time
    • shows how to learn a shared representation between modalities

1. Introduction 

  • start of MMML: audio-visual speech recognition, motivated by the McGurk effect (visual lip movements change the perceived sound, e.g. an audio /ba/ paired with a visual /ga/ is often heard as /da/)
  • THIS paper
    • focus on modeling "mid-level" relationships between modalities
    • task: audio-visual speech classification (learning representations for speech audio that are coupled with videos of the lips)
  • phases
    • 1. feature learning
    • 2. supervised training
    • 3. testing
  • learning settings (summarized in the sketch after this list)
    • 1. multimodal fusion: data from all modalities is available at all phases
    • 2. cross modality learning
      • data from multiple modalities is available only during feature learning
      • during the supervised training and testing phases, only data from a single modality is provided
    • 3. shared representation learning
      • data from one modality is used for supervised training, while a different modality is presented at testing time
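
A minimal sketch of the three settings, assuming an audio-visual task with "audio" and "video" as the two modalities; the dictionary layout and the choice of which modality goes where in settings 2 and 3 are illustrative, not from the paper's code:

```python
# Which modalities are available in each phase, per learning setting.
# (Illustrative: the audio/video assignment in settings 2 and 3 is just
# one possible configuration.)
LEARNING_SETTINGS = {
    "multimodal_fusion": {
        "feature_learning":    {"audio", "video"},
        "supervised_training": {"audio", "video"},
        "testing":             {"audio", "video"},
    },
    "cross_modality_learning": {
        "feature_learning":    {"audio", "video"},
        "supervised_training": {"video"},   # single modality, e.g. video only
        "testing":             {"video"},
    },
    "shared_representation_learning": {
        "feature_learning":    {"audio", "video"},
        "supervised_training": {"audio"},   # train on one modality ...
        "testing":             {"video"},   # ... test on the other
    },
}
```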

2. Background

  • THIS paper
    • sparse RBM --> used as layer-wise building block

2.1. Sparse restricted Boltzmann machines

 

Related post: [ML] RBM & sparse RBM (judingstat.tistory.com)
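
Since the RBM details are deferred to that separate post, here is a minimal, self-contained sketch of a binary sparse RBM trained with one-step contrastive divergence (CD-1). The penalty that nudges the mean hidden activation toward a small target is one common formulation; the class name, hyperparameters, and exact penalty are assumptions for illustration, not the paper's implementation.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

class SparseRBM:
    """Binary RBM trained with CD-1 plus a simple sparsity penalty that
    pushes the mean hidden activation toward `sparsity_target`."""

    def __init__(self, n_visible, n_hidden, lr=0.01,
                 sparsity_target=0.05, sparsity_cost=0.1, seed=0):
        self.rng = np.random.default_rng(seed)
        self.W = 0.01 * self.rng.standard_normal((n_visible, n_hidden))
        self.b_v = np.zeros(n_visible)   # visible biases
        self.b_h = np.zeros(n_hidden)    # hidden biases
        self.lr = lr
        self.sparsity_target = sparsity_target
        self.sparsity_cost = sparsity_cost

    def hidden_probs(self, v):
        return sigmoid(v @ self.W + self.b_h)

    def visible_probs(self, h):
        return sigmoid(h @ self.W.T + self.b_v)

    def cd1_step(self, v0):
        # Positive phase: hidden probabilities and a sample given the data.
        h0 = self.hidden_probs(v0)
        h0_sample = (self.rng.random(h0.shape) < h0).astype(float)
        # Negative phase: one Gibbs step back to the visibles, then hiddens.
        v1 = self.visible_probs(h0_sample)
        h1 = self.hidden_probs(v1)
        n = v0.shape[0]
        # CD-1 gradient estimates.
        dW = (v0.T @ h0 - v1.T @ h1) / n
        db_v = (v0 - v1).mean(axis=0)
        db_h = (h0 - h1).mean(axis=0)
        # Sparsity penalty: move the mean hidden activation toward the target.
        db_h += self.sparsity_cost * (self.sparsity_target - h0.mean(axis=0))
        self.W += self.lr * dW
        self.b_v += self.lr * db_v
        self.b_h += self.lr * db_h
```

For real-valued audio/video features one would typically use Gaussian visible units rather than the binary visibles sketched here.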

3. Learning architectures

1) Greedily training an RBM over the pre-trained layers of each modality (bimodal DBN)

 

  • WHY? 
    • just (jointly) concatenating the modalities --> limited: it amounts to learning a shallow bimodal RBM
    • hidden units end up strongly connected to an individual modality
    • only a few units connect across the modalities
  • HOW?
    • use the posteriors of the first-layer hidden variables as the training data for the new (joint) layer
    • --> easier for the model to learn higher-order correlations across modalities (see the sketch below)
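
A minimal sketch of this greedy construction, using scikit-learn's BernoulliRBM as a stand-in for the paper's sparse RBMs; the random data, layer sizes, and hyperparameters are illustrative assumptions:

```python
import numpy as np
from sklearn.neural_network import BernoulliRBM

# Placeholder data: rows are examples, columns are preprocessed features.
rng = np.random.default_rng(0)
X_audio = rng.random((500, 100))   # e.g. spectrogram-derived audio features
X_video = rng.random((500, 60))    # e.g. mouth-region video features

# 1) Pre-train one first-layer RBM per modality.
rbm_audio = BernoulliRBM(n_components=64, learning_rate=0.05, n_iter=20, random_state=0)
rbm_video = BernoulliRBM(n_components=64, learning_rate=0.05, n_iter=20, random_state=0)
rbm_audio.fit(X_audio)
rbm_video.fit(X_video)

# 2) The posteriors of the first-layer hidden variables become the
#    training data for the new layer: concatenate them across modalities.
H_audio = rbm_audio.transform(X_audio)   # P(h = 1 | v) for audio
H_video = rbm_video.transform(X_video)   # P(h = 1 | v) for video
H_joint_input = np.hstack([H_audio, H_video])

# 3) Greedily train a bimodal RBM on top of the concatenated posteriors.
rbm_bimodal = BernoulliRBM(n_components=128, learning_rate=0.05, n_iter=20, random_state=0)
rbm_bimodal.fit(H_joint_input)

# Joint representation for downstream supervised training.
shared = rbm_bimodal.transform(H_joint_input)
```

Stacking the per-modality layers and the joint layer in this way yields the bimodal DBN whose weights later initialize the deep autoencoder.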

 

2) Deep autoencoder

 

  • WHY?
    • no explicit objective for the models to discover correlations across the modalities
      • it is possible for the model to find representations in which some hidden units are tuned only to audio while others are tuned only to video
    • we want to use the models in a cross modality learning setting where only one modality is present during supervised training and testing
      • with only a single modality present, one would need to integrate out the unobserved visible variables to perform inference
  • HOW?
    • train the deep autoencoder to reconstruct both modalities when given only one modality as input
    • --> it is forced to discover correlations across the modalities (a sketch follows below)
    • initialize its weights with the bimodal DBN weights from 1)
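
A rough sketch of this idea in PyTorch: the network sees only video but is trained to reconstruct both audio and video. Layer sizes, names, and the squared-error loss are assumptions for illustration, and the DBN-based weight initialization mentioned above is omitted:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class VideoOnlyDeepAutoencoder(nn.Module):
    """Video is the only input; the decoder reconstructs BOTH modalities,
    so the shared layer must capture audio-video correlations."""

    def __init__(self, audio_dim=100, video_dim=60, hidden_dim=64, shared_dim=128):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Linear(video_dim, hidden_dim), nn.Sigmoid(),
            nn.Linear(hidden_dim, shared_dim), nn.Sigmoid(),   # shared representation
        )
        self.dec_audio = nn.Sequential(
            nn.Linear(shared_dim, hidden_dim), nn.Sigmoid(),
            nn.Linear(hidden_dim, audio_dim),
        )
        self.dec_video = nn.Sequential(
            nn.Linear(shared_dim, hidden_dim), nn.Sigmoid(),
            nn.Linear(hidden_dim, video_dim),
        )

    def forward(self, video):
        shared = self.encoder(video)
        return self.dec_audio(shared), self.dec_video(shared)

# Toy training step: reconstruct both modalities from video alone.
model = VideoOnlyDeepAutoencoder()
video = torch.rand(32, 60)
audio = torch.rand(32, 100)          # used only as a reconstruction target
audio_hat, video_hat = model(video)
loss = F.mse_loss(audio_hat, audio) + F.mse_loss(video_hat, video)
loss.backward()
```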

 

3) Training the bimodal deep autoencoder on an augmented but noisy dataset with additional examples that have only a single modality as input (inspired by denoising autoencoders)

 

  • WHY?
    • when multiple modalities are present --> how should the deep autoencoder be used?
    • just use tied decoding weights? --> does not scale well
  • HOW?
    • add training examples that have zero values for one of the input modalities (single-modality examples), while still requiring reconstruction of both modalities
    • ex. one-third of the training data has only video as input, another one-third has only audio, and the last one-third has both audio and video (see the sketch below)
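
A small sketch of this augmentation (function and variable names are assumptions): one third of the examples keep both modalities, one third zero out the audio input, and one third zero out the video input, while the reconstruction targets always remain the original audio and video:

```python
import numpy as np

def make_augmented_dataset(X_audio, X_video, seed=0):
    """Build inputs/targets for the bimodal deep autoencoder: zero out one
    modality for two thirds of the examples, but keep the original features
    of both modalities as reconstruction targets."""
    rng = np.random.default_rng(seed)
    n = X_audio.shape[0]
    video_only, audio_only, both = np.array_split(rng.permutation(n), 3)

    audio_in = X_audio.copy()
    video_in = X_video.copy()
    audio_in[video_only] = 0.0   # video-only examples: audio input zeroed
    video_in[audio_only] = 0.0   # audio-only examples: video input zeroed
    # `both`: left untouched, both modalities present

    inputs = np.hstack([audio_in, video_in])   # augmented, "noisy" inputs
    targets = np.hstack([X_audio, X_video])    # always reconstruct both
    return inputs, targets

# Example usage with random placeholder features.
X_audio = np.random.rand(300, 100)
X_video = np.random.rand(300, 60)
inputs, targets = make_augmented_dataset(X_audio, X_video)
```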

 

4. Experiments and Results

4.1. Data Preprocessing

4.2. Datasets and Task

4.3. Cross Modality Learning

4.4. Multimodal Fusion Results

4.5. McGurk effect

4.6. Shared Representation Learning

4.7. Additional Control Experiments