
Paper/Continual Learning

[CL] Lifelong Learning with Dynamically Expandable Networks (ICLR 2018)

https://openreview.net/pdf?id=Sk7KsfW0- 

0. Abstract

  • Dynamically Expandable Network (DEN) 
    • dynamically decide its network capacity as it trains on a sequence of tasks
    • trained in an online manner by performing selective retraining

1 Introduction

  • lifelong learning → considered as online / incremental learning over a sequence of tasks
    • [ Strategy 1 ]  Fine-tuning : retrain the existing network on the new task → performance on old tasks degenerates (catastrophic forgetting)
  • how can we ensure that the knowledge sharing through the network is beneficial for all tasks?
    • [ Strategy 2 ] Regularization : prevents the parameters from changing drastically
  • [ Our Strategy ]  
    • retrain the network at each task t so that each new task utilizes and changes only the relevant part of the previously trained network
    • while still allowing the network capacity to expand when necessary
    • Challenges
      • 1) Achieving scalability and efficiency in training 
      • 2) Deciding when to expand the network, and how many neurons to add
      • 3) Preventing semantic drift, or catastrophic forgetting

2 Related Work

(1) Lifelong learning

  • also called continual learning

(2) Preventing catastrophic forgetting

  • catastrophic forgetting : retraining the network for new tasks makes it forget what was learned for previous tasks
    • Solution
      • regularizer (e.g. l2 regularizer)
      • Elastic Weight Consolidation (EWC) : regularizes the model parameters at each learning step with a quadratic penalty that keeps them close to the previous task's solution (see the sketch below)
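A minimal PyTorch sketch of an EWC-style penalty, shown only to illustrate this regularization approach (it is not part of DEN). `old_params` and `fisher` are assumed to be dicts of per-parameter tensors saved after the previous task; all names are placeholders.

```python
import torch

def ewc_penalty(model, old_params, fisher, lam=1.0):
    """EWC-style penalty: (lam/2) * sum_i F_i * (theta_i - theta*_i)^2,
    where theta* are the parameters learned on the previous task and F is
    a diagonal Fisher information estimate (both stored beforehand)."""
    penalty = 0.0
    for name, p in model.named_parameters():
        penalty = penalty + (fisher[name] * (p - old_params[name]) ** 2).sum()
    return 0.5 * lam * penalty
```

The new task's total loss would then be `task_loss + ewc_penalty(model, old_params, fisher, lam)`.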

(3) Dynamic network expansion

  • neural networks that can dynamically increase their capacity during training
    • incrementally train a denoising autoencoder
    • nonparametric NN model → also finds the minimum dimensionality of each layer that can reduce the loss
  • the multi-task (lifelong) setting → not considered in these prior works

3 Incremental Learning of a Dynamically Expandable Network

  • DEN consists of three modules : selective retraining / dynamic network expansion / network split/duplication

  • goal : to learn models for a sequence of T tasks ( T : unbounded )

(1) Algorithm 1 Incremental Learning of a Dynamically Expandable Network

  • lifelong learning agent at time t : aims to minimize the loss

$$\underset{\mathcal{W}^{t}}{\text{minimize}}\;\; \mathcal{L}\big(\mathcal{W}^{t};\, \mathcal{W}^{t-1},\, \mathcal{D}_{t}\big) \;+\; \lambda\, \Omega(\mathcal{W}^{t}), \qquad t = 1, \dots$$

    • $\mathcal{L}$ : task-specific loss function / $\mathcal{W}^{t}$ : weight tensor at task t / $\Omega(\mathcal{W}^{t})$ : regularizer (e.g. element-wise l2-norm)

[Figure: DEN's incremental learning process (selective retraining, dynamic network expansion, network split/duplication)]

  • let the network maximally utilize the knowledge obtained from previous tasks, and dynamically expand its capacity when the accumulated knowledge alone cannot sufficiently explain the new task (the overall loop is sketched below)
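A structural sketch of this incremental loop (Algorithm 1), where the four callbacks stand in for first-task training and Algorithms 2-4; every name, signature, and threshold here is an illustrative placeholder rather than the authors' implementation.

```python
def den_incremental_learning(tasks, net, train_l1, selective_retrain,
                             dynamic_expand, split_duplicate, tau, sigma):
    """Sketch of Algorithm 1. `tasks` yields the datasets D_1..D_T in order;
    `tau` is the loss threshold for expansion, `sigma` the drift threshold."""
    for t, data in enumerate(tasks, start=1):
        if t == 1:
            train_l1(net, data)                  # first task: l1-sparse training
            continue
        loss = selective_retrain(net, data, t)   # Algorithm 2
        if loss > tau:                           # capacity insufficient
            dynamic_expand(net, data, t)         # Algorithm 3: add / prune units
        split_duplicate(net, data, t, sigma)     # Algorithm 4: handle drift
```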

Modules

(2) Algorithm 2 Selective Retraining

  • most naive method : retraining the entire model every time  → costly
  • DEN : selective retraining (retraining only the weights that are affected by the new task)
  • Step1 
    • l1-regularization for sparsity in the weights (each neuron is connected to only a few neurons)

$$\underset{W^{t=1}}{\text{minimize}}\;\; \mathcal{L}\big(W^{t=1};\, \mathcal{D}_{1}\big) \;+\; \mu \sum_{l=1}^{L} \big\|W_{l}^{t=1}\big\|_{1}$$

    • $l$ : layer index ($1 \le l \le L$)

  • Step2
    • fit a sparse linear model predicting task t from the topmost hidden units of the network (all other parameters fixed), by solving the following problem:

$$\underset{W_{L,t}^{t}}{\text{minimize}}\;\; \mathcal{L}\big(W_{L,t}^{t};\, W_{1:L-1}^{t-1},\, \mathcal{D}_{t}\big) \;+\; \mu \big\|W_{L,t}^{t}\big\|_{1}$$

  • Step 3
    • BFS on the network starting from the selected units → identify all units that have paths connecting them to the task-t output
    • train only the weights of the selected subnetwork S (see the sketch below)
    • element-wise l2 regularizer instead of l1 (the network is already sparse)
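A minimal NumPy sketch of this Step 3 search, assuming each layer's weights are stored as a dense (fan_in, fan_out) matrix; the function and variable names are illustrative, not the authors' code.

```python
import numpy as np

def select_subnetwork(weights, selected_top):
    """Walk down from the selected topmost units (Step 3 of Algorithm 2) and
    collect, per layer, every unit with a nonzero path to the new output.

    weights: list of (fan_in, fan_out) matrices, weights[l] connecting
             layer l to layer l+1.
    selected_top: unit indices in the top hidden layer chosen by the
                  sparse linear fit of Step 2.
    """
    selected = [set() for _ in range(len(weights) + 1)]
    selected[-1] = set(selected_top)
    for l in range(len(weights) - 1, -1, -1):
        for j in selected[l + 1]:
            # unit i in layer l is affected if its weight to unit j is nonzero
            for i in np.nonzero(weights[l][:, j])[0]:
                selected[l].add(int(i))
    return selected
```

Only the weights incident to the returned units form the subnetwork S that gets retrained, so the cost scales with the subnetwork rather than the full model.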

 

(3) Algorithm 3 Dynamic Network Expansion

  • if the new task is highly relevant to the old ones, the aggregated partial knowledge obtained from each task is sufficient to explain it
    • if the new task cannot be accurately represented by the existing features → new neurons are needed
  • group sparse regularization → to dynamically decide how many neurons to add w/o repeated retraining 

  • after selective retraining
    • check if the loss is above a certain threshold τ → if so, expand the capacity by k units per layer
    • train the expanded network with group sparse regularization on the added units
    • unnecessary hidden units (whose weight groups are driven to zero) are dropped altogether
    • → expect the model to capture new features that were not represented at task t-1 (see the sketch below)
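A PyTorch sketch of the group-sparsity idea, assuming the k newly added units' incoming weights are a (fan_in, k) slice `new_w`; the helper names and the pruning threshold `eps` are illustrative.

```python
import torch

def group_sparsity_penalty(new_w, gamma=1.0):
    """Group lasso over the added units: one group per new unit (column),
    so a useless unit's whole incoming weight vector is driven to zero."""
    return gamma * new_w.norm(p=2, dim=0).sum()

def surviving_units(new_w, eps=1e-6):
    """After training with the penalty, keep only units whose incoming
    weights did not collapse to (near) zero; the rest are dropped."""
    return (new_w.norm(p=2, dim=0) > eps).nonzero(as_tuple=True)[0]
```

Adding `group_sparsity_penalty(new_w)` to the task loss during the expansion phase shrinks useless columns to zero, and `surviving_units` then tells which of the k candidate units to keep, without repeated retraining.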

(4) Algorithm 4 Network Split/Duplication

  • a crucial problem in continual learning
    • semantic drift (= catastrophic forgetting)
$$\underset{W^{t}}{\text{minimize}}\;\; \mathcal{L}\big(W^{t};\, \mathcal{D}_{t}\big) \;+\; \lambda \big\|W^{t} - W^{t-1}\big\|_{2}^{2}$$

    • the l2 regularization enforces the solution W^t to be found close to W^(t-1)
    • a high λ tries to preserve the knowledge learned at previous tasks

  • + Split / Duplicate
    • measure the semantic drift ρ_i^t of each hidden unit i (l2 distance between its incoming weights before and after task t) → if ρ_i^t > σ, split the unit into two copies, so that the network keeps features that are optimal for the two different tasks
    • can be performed for all hidden units in parallel
    • after split/duplication, the weights need to be trained again (see the sketch below)
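A simplified PyTorch sketch of the drift test and duplication, assuming incoming weights are columns of a (fan_in, n_units) matrix; in the paper the network is retrained after the split, which this sketch omits.

```python
import torch

def split_drifted_units(W_t, W_prev, sigma):
    """For each hidden unit i, measure semantic drift rho_i as the l2
    distance between its incoming weights before/after task t; units with
    rho_i > sigma are duplicated: the drifted copy serves the new task and
    the restored original preserves the old tasks' feature."""
    rho = (W_t - W_prev).norm(p=2, dim=0)           # drift per unit (column)
    split_idx = (rho > sigma).nonzero(as_tuple=True)[0]
    restored = W_prev[:, split_idx]                 # original features
    W_new = torch.cat([W_t, restored], dim=1)       # append duplicated units
    return W_new, split_idx
```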

(5) Timestamped Inference

  • timestamp each newly added unit j with the stage t at which it was added (z_j = t)
  • at inference for task t, use only the units with timestamp ≤ t → units added for later tasks cannot drift earlier tasks' predictions (see the sketch below)
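A tiny sketch of the timestamp filter, where `timestamps[j]` is assumed to hold the stage at which unit j was added.

```python
def active_units(timestamps, t):
    """Unit j participates in task t's prediction only if it was introduced
    no later than stage t, so units added for later tasks cannot change
    (drift) the predictions of earlier tasks."""
    return [j for j, z_j in enumerate(timestamps) if z_j <= t]

# e.g. units added at stages [1, 1, 2, 3] -> task 2 uses units 0, 1, 2
assert active_units([1, 1, 2, 3], t=2) == [0, 1, 2]
```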

4 Experiment

(1) Baselines and our model

  • 1) Feedforward networks
  • 2) Convolutional networks

(2) Base network settings

(3) Datasets

  • 1) MNIST-Variation
  • 2) CIFAR-100
  • 3) AWA (Animals with Attributes)

4.1 Quantitative Evaluation

(1) Effect of selective retraining

(2) Effect of network expansion

(3) Effect of network split/duplication and timestamped inference

5 Conclusion

6 References
