[Theory] [Deep Learning] ch5

Training Techniques

Momentum

The parameter update combines a gradient term with a momentum term:
$$
\Delta\theta^t = \beta\Delta\theta^{t-1}-\gamma^t\nabla L(t;\theta)|_{\theta=\theta^t}
$$
$\beta \in [0,1)$ is the momentum factor and $\gamma^t$ is the learning rate in the $t$-th iteration; the parameters are then updated as $\theta^{t+1} = \theta^t + \Delta\theta^t$.

  • reduces gradient noise in stochastic gradient descent
  • reduces oscillation and accelerates convergence when the loss has ill-conditioned (elongated) contour lines

Nesterov momentum instead evaluates the gradient at the look-ahead point $\theta^t + \beta\Delta\theta^{t-1}$:
$$
\Delta\theta^t = \beta\Delta\theta^{t-1}-\gamma^t\nabla L(t;\theta)|_{\theta=\theta^t+\beta\Delta\theta^{t-1}}
$$
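
Below is a minimal NumPy sketch of both updates on a toy quadratic loss; the function name `momentum_step` and the hyperparameter values are illustrative assumptions, not from the text.

```python
import numpy as np

def momentum_step(theta, delta_prev, grad, lr, beta=0.9, nesterov=False):
    """One momentum / Nesterov-momentum update; grad(theta) returns the gradient of L."""
    if nesterov:
        # evaluate the gradient at the look-ahead point theta + beta * delta_prev
        g = grad(theta + beta * delta_prev)
    else:
        g = grad(theta)
    delta = beta * delta_prev - lr * g       # momentum part + gradient part
    return theta + delta, delta              # theta^{t+1} = theta^t + delta^t

# Toy usage on L(theta) = ||theta||^2 / 2, whose gradient is theta itself
theta, delta = np.array([5.0, -3.0]), np.zeros(2)
for t in range(100):
    theta, delta = momentum_step(theta, delta, grad=lambda th: th, lr=0.1, nesterov=True)
print(theta)  # close to the minimum at the origin
```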

Adaptive schedules

  • RMSprop
  • Adam (a sketch of its update follows this list)
  • AdaGrad
  • AdaDelta
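
As one concrete example of an adaptive schedule, here is a minimal NumPy sketch of the Adam update (first- and second-moment estimates with bias correction). The hyperparameter defaults are the commonly used values; the toy usage and learning rate are illustrative assumptions.

```python
import numpy as np

def adam_step(theta, g, m, v, t, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update for gradient g at iteration t (t starts at 1)."""
    m = beta1 * m + (1 - beta1) * g          # first-moment (mean) estimate
    v = beta2 * v + (1 - beta2) * g * g      # second-moment (uncentered variance) estimate
    m_hat = m / (1 - beta1 ** t)             # bias correction for the zero initialization
    v_hat = v / (1 - beta2 ** t)
    theta = theta - lr * m_hat / (np.sqrt(v_hat) + eps)
    return theta, m, v

# Toy usage on L(theta) = ||theta||^2 / 2, whose gradient is theta itself
theta = np.array([5.0, -3.0])
m, v = np.zeros_like(theta), np.zeros_like(theta)
for t in range(1, 501):
    theta, m, v = adam_step(theta, g=theta, m=m, v=v, t=t, lr=0.1)
print(theta)  # near the minimum at the origin
```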

Batch Normalization

In layer $l$, with $x_{l-1} \rightarrow a_{l-1} \overset{\phi}{\rightarrow} x_l$, normalize the pre-activations over each mini-batch and add learnable scale and shift parameters $\gamma_l$ and $\beta_l$, to avoid covariate shift across layers and over time during training.
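
A minimal sketch of the batch-norm transform for one mini-batch (training-mode statistics only; the running averages used at test time are omitted). The function name and shapes are illustrative assumptions.

```python
import numpy as np

def batch_norm(a, gamma, beta, eps=1e-5):
    """Normalize a mini-batch a of shape (batch, features), then scale and shift.

    gamma and beta are the learnable per-feature parameters gamma_l and beta_l.
    """
    mu = a.mean(axis=0)                      # per-feature mean over the batch
    var = a.var(axis=0)                      # per-feature variance over the batch
    a_hat = (a - mu) / np.sqrt(var + eps)    # normalized pre-activations
    return gamma * a_hat + beta              # learnable scale and shift

# Toy usage: 4 samples, 3 features
a = np.random.randn(4, 3) * 10 + 5
out = batch_norm(a, gamma=np.ones(3), beta=np.zeros(3))
print(out.mean(axis=0), out.std(axis=0))     # roughly zero mean, unit std per feature
```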

Parameter Initialization

  • bias vectors: initialize to zero
  • weight matrices: random initialization, e.g. He initialization (see the sketch after this list)
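
A minimal sketch of He initialization for a fully connected layer, assuming a ReLU activation $\phi$; the layer sizes here are illustrative.

```python
import numpy as np

def he_init(fan_in, fan_out, rng=None):
    """He initialization: zero-mean Gaussian weights with variance 2 / fan_in."""
    rng = rng or np.random.default_rng()
    W = rng.normal(loc=0.0, scale=np.sqrt(2.0 / fan_in), size=(fan_out, fan_in))
    b = np.zeros(fan_out)                    # bias vector initialized to zero
    return W, b

W, b = he_init(fan_in=256, fan_out=128)
print(W.std())                               # roughly sqrt(2 / 256) ~= 0.088
```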

Shortcut

Skip connections (shortcuts) are the building block of residual networks (ResNet); a minimal sketch follows the list below.

  • forward pass: makes low-level features directly available to deeper layers
  • backward pass: provides a short gradient path, avoiding vanishing gradients
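
A minimal sketch of a residual (shortcut) connection around two fully connected layers, assuming matching input and output dimensions and a ReLU activation; the layer structure is an illustrative assumption, not a specific ResNet variant.

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

def residual_block(x, W1, b1, W2, b2):
    """y = phi(F(x) + x): the shortcut adds the input x back onto the block output F(x)."""
    h = relu(x @ W1.T + b1)                  # first layer
    f = h @ W2.T + b2                        # second layer (pre-activation)
    return relu(f + x)                       # skip connection: add the input back

# Toy usage with matching dimensions (required for the identity shortcut)
rng = np.random.default_rng(0)
d = 8
x = rng.normal(size=(4, d))
W1, b1 = rng.normal(scale=np.sqrt(2.0 / d), size=(d, d)), np.zeros(d)
W2, b2 = rng.normal(scale=np.sqrt(2.0 / d), size=(d, d)), np.zeros(d)
print(residual_block(x, W1, b1, W2, b2).shape)   # (4, 8)
```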