Training Techniques
Momentum
Update the parameters with a gradient term plus a momentum term:
$$
\Delta\theta^t = \beta\Delta\theta^{t-1}-\gamma^t\nabla L(t;\theta)|_{\theta=\theta^t}
$$
$\beta \in [0,1)$ is the momentum factor and $\gamma^t$ is the learning rate in the $t$-th iteration; the parameters are then updated as $\theta^{t+1} = \theta^t + \Delta\theta^t$ (a code sketch follows the list below).
- reduces noise in stochastic gradient descent (SGD)
- reduces oscillation and accelerates convergence when the loss contours are ill-conditioned
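A minimal NumPy sketch of this update rule; the function and argument names (`sgd_momentum`, `grad_fn`, `lr`) are illustrative, not part of the notes:

```python
import numpy as np

def sgd_momentum(theta, grad_fn, lr=0.01, beta=0.9, steps=100):
    """Momentum SGD: delta^t = beta * delta^{t-1} - lr * grad L(theta^t)."""
    delta = np.zeros_like(theta)
    for _ in range(steps):
        g = grad_fn(theta)               # gradient evaluated at the current theta^t
        delta = beta * delta - lr * g    # momentum term plus gradient term
        theta = theta + delta            # theta^{t+1} = theta^t + delta^t
    return theta

# Example: minimize L(theta) = ||theta||^2, whose gradient is 2 * theta.
theta_star = sgd_momentum(np.array([3.0, -2.0]), lambda th: 2.0 * th)
```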
Nesterov momentum:
$$
\Delta\theta^t = \beta\Delta\theta^{t-1}-\gamma^t\nabla L(t;\theta)|_{\theta=\theta^t+\beta\Delta\theta^{t-1}}
$$
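The same sketch with the only change Nesterov introduces, namely evaluating the gradient at the lookahead point $\theta^t + \beta\Delta\theta^{t-1}$:

```python
import numpy as np

def sgd_nesterov(theta, grad_fn, lr=0.01, beta=0.9, steps=100):
    """Nesterov momentum: the gradient is taken at the lookahead point
    theta^t + beta * delta^{t-1} instead of at theta^t."""
    delta = np.zeros_like(theta)
    for _ in range(steps):
        g = grad_fn(theta + beta * delta)  # lookahead gradient
        delta = beta * delta - lr * g
        theta = theta + delta
    return theta
```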
Adaptive learning-rate schedules
- RMSprop
- Adam
- AdaGrad
- AdaDelta
- …
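These optimizers keep per-parameter statistics of past gradients to scale the learning rate and are usually used off the shelf. A minimal usage sketch with PyTorch's `torch.optim` (PyTorch is an assumption here, not something the notes prescribe):

```python
import torch

model = torch.nn.Linear(10, 1)  # any model; a single linear layer as a stand-in
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
# Alternatives: torch.optim.RMSprop, torch.optim.Adagrad, torch.optim.Adadelta

x, y = torch.randn(32, 10), torch.randn(32, 1)
loss = torch.nn.functional.mse_loss(model(x), y)
loss.backward()
opt.step()        # update with adaptive per-parameter learning rates
opt.zero_grad()
```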
Batch Normalization
In layer $l$, $x_{l-1} \rightarrow a_{l-1} \overset{\phi}{\rightarrow} x_l$: normalize the pre-activations $a_{l-1}$ over the mini-batch (typically before the nonlinearity $\phi$), then apply the learnable scale $\gamma_l$ and shift $\beta_l$, to reduce covariate shift across layers and over the course of training.
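A minimal NumPy sketch of the training-time transform for one layer; at test time, running averages of the batch statistics are used instead (variable names and `eps` are illustrative):

```python
import numpy as np

def batch_norm_train(a, gamma, beta, eps=1e-5):
    """Normalize pre-activations a (shape: batch x features) with mini-batch
    statistics, then apply the learnable scale gamma and shift beta."""
    mu = a.mean(axis=0)                    # per-feature batch mean
    var = a.var(axis=0)                    # per-feature batch variance
    a_hat = (a - mu) / np.sqrt(var + eps)  # zero mean, unit variance
    return gamma * a_hat + beta            # learnable affine transform
```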


Parameter Initialization
- bias vectors: initialize to zero
- weight matrices: small random values or He initialization (designed for ReLU activations)
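A minimal NumPy sketch of He initialization for a fully connected layer, drawing weights from N(0, 2/fan_in) and zeroing the bias (the function name and use of `default_rng` are illustrative):

```python
import numpy as np

def he_init(fan_in, fan_out, rng=None):
    """He initialization: W ~ N(0, 2 / fan_in), bias set to zero."""
    rng = rng or np.random.default_rng()
    W = rng.normal(0.0, np.sqrt(2.0 / fan_in), size=(fan_out, fan_in))
    b = np.zeros(fan_out)
    return W, b
```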
Shortcut
skip connection (shortcut): add a layer's input directly to its output, as in residual networks (ResNet)
- forward: makes low-level features available to deeper layers
- backward: mitigates vanishing gradients, since the identity path lets gradients flow back directly (see the sketch after this list)
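A minimal NumPy sketch of a two-layer residual block, assuming the block's output dimension matches the input so the identity shortcut can be added directly (names are illustrative):

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

def residual_block(x, W1, b1, W2, b2):
    """Two-layer block with a shortcut: relu(F(x) + x). The identity path
    carries low-level features forward and lets gradients flow back directly;
    if dimensions differ, a projection of x would be needed instead."""
    h = relu(x @ W1.T + b1)   # first layer
    f = h @ W2.T + b2         # second layer, pre-activation
    return relu(f + x)        # add the shortcut before the final nonlinearity
```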