When training a neural network, a crucial step after defining the model architecture is to properly initialize the weights. Good initialization is essential for stable and efficient training: it helps prevent issues such as exploding or vanishing gradients, which can significantly hinder learning.

It turns out that if you do it wrong, it can lead to exploding or vanishing weights and gradients: the values either blow up towards infinity or shrink towards 0 (literally, because floating point numbers have limited precision), which makes training deep neural networks very challenging.

When initializing a neural network, there are a few properties we would like to have:

  • First, the variance of the input should be propagated through the model to the last layer, so that we have a similar standard deviation for the output neurons.
  • The second property we look out for in initialization techniques is a gradient distribution with equal variance across layers. If the first layer receives much smaller gradients than the last layer, we will have difficulties in choosing an appropriate learning rate.

In this blog we will first analyze the different initialization methods on a neural network without any activation function. This lets us compare the techniques independently of the specific activation function used in the network.

Constant Initialization

The first initialization we can consider is to set all the weights to the same constant value. What would be the consequences? Initializing the network with zeros leads every neuron to learn the same feature during training, and in fact any constant initialization scheme will perform very poorly.

Consider a neural network with two hidden units, and assume we initialize all the biases to 0 and the weights with some constant α. If we forward propagate an input (x1,x2) in this network, the output of both hidden units will be relu(αx1+αx2). Thus, both hidden units will have identical influence on the cost, which will lead to identical gradients.

This means that during backpropagation, the weights will be updated in the same way, causing the hidden units to evolve symmetrically throughout training. As a result, the network cannot learn diverse features, significantly limiting its capacity to model complex patterns in the data.
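Before running the full experiment, a tiny standalone sketch can make this symmetry argument concrete. The two-unit network, the constant α = 0.5, and the dummy input below are purely illustrative; only plain PyTorch is assumed.

import torch
import torch.nn as nn

# Two-input, two-hidden-unit network with every weight set to the same constant
# alpha and all biases set to zero.
net = nn.Sequential(nn.Linear(2, 2), nn.ReLU(), nn.Linear(2, 1))
alpha = 0.5
for p in net.parameters():
    if p.dim() > 1:
        nn.init.constant_(p, alpha)   # weight matrices
    else:
        nn.init.zeros_(p)             # biases

x = torch.tensor([[1.0, 2.0]])
net(x).sum().backward()

# Both rows of the first layer's gradient are identical, so the two hidden
# units receive the same update and stay symmetric throughout training.
print(net[0].weight.grad)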

Let's visualize the weights and gradients with the following code.

The referenced code is from the course https://uvadlc-notebooks.readthedocs.io/; the notebook at that link contains the full source code.

The neural network model used for this experiment is defined with the following code.

import torch.nn as nn

class BaseNetwork(nn.Module):

    def __init__(self, act_fn, input_size=784, num_classes=10, hidden_sizes=[512, 256, 256, 128]):
        """
        Inputs:
            act_fn - Object of the activation function that should be used as non-linearity in the network.
            input_size - Size of the input images in pixels
            num_classes - Number of classes we want to predict
            hidden_sizes - A list of integers specifying the hidden layer sizes in the NN
        """
        super().__init__()

        # Create the network based on the specified hidden sizes
        layers = []
        layer_sizes = [input_size] + hidden_sizes
        for layer_index in range(1, len(layer_sizes)):
            layers += [nn.Linear(layer_sizes[layer_index-1], layer_sizes[layer_index]),
                       act_fn]
        layers += [nn.Linear(layer_sizes[-1], num_classes)]
        self.layers = nn.ModuleList(layers)  # A module list registers a list of modules as submodules (e.g. for parameters)

        self.config = {"act_fn": act_fn.__class__.__name__, "input_size": input_size, "num_classes": num_classes, "hidden_sizes": hidden_sizes}

    def forward(self, x):
        x = x.view(x.size(0), -1)
        for l in self.layers:
            x = l(x)
        return x

Gradient Distribution in Zero-Initialized Networks

When weights are initialized to zero:

$$ W^{(l)} = 0 \quad \text{for all } l $$

Since all neurons in a layer start with the same weights (zero), they will produce identical outputs and receive identical gradients during backpropagation.

In the observed gradient distribution, the first layer shows relatively diverse gradients, since its gradient is computed directly from the input data, which has non-zero variance. In the layers that follow, the gradients are highly concentrated at or near 0. This happens because, with zero-initialized weights, all neurons in these layers produce the same outputs during forward propagation.

Consequently:

  • Gradients are identical for all weights in the layer during backpropagation.

  • This leads to the collapse of parameter updates, making the network effectively useless for learning.
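To check this claim numerically, here is a minimal sketch that zero-initializes the BaseNetwork defined above, runs one backward pass on a random batch (standing in for the real data used in the linked notebook), and counts the distinct gradient values inside each linear layer.

import torch
import torch.nn as nn
import torch.nn.functional as F

model = BaseNetwork(act_fn=nn.ReLU())
for param in model.parameters():
    param.data.fill_(0)                          # zero-initialize all weights and biases

x = torch.randn(64, 784)                         # dummy input batch
y = torch.randint(0, 10, (64,))                  # dummy labels
F.cross_entropy(model(x), y).backward()

for idx, layer in enumerate(model.layers):
    if isinstance(layer, nn.Linear):
        # Every weight gradient in the layer collapses to a single value
        # (here exactly 0), so all neurons would receive the same update.
        print(f"Layer {idx}: distinct gradient values = {layer.weight.grad.unique().numel()}")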


Activation Distribution in Zero-Initialized Networks

When W(l) = 0, all neurons in the same layer receive the same inputs and have the same bias (assuming biases are also initialized to zero). This results in identical activations for all neurons in a layer and causes a loss of representation power: since every neuron in a layer produces the same output, the neurons cannot capture different features of the input, which essentially collapses the network's expressive power.
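The same collapse can be seen in the forward pass. A short sketch (again assuming the BaseNetwork above and a random dummy batch) measures the spread of activations across neurons after each linear layer:

import torch
import torch.nn as nn

model = BaseNetwork(act_fn=nn.ReLU())
for param in model.parameters():
    param.data.fill_(0)

h = torch.randn(8, 784)
for idx, layer in enumerate(model.layers):
    h = layer(h)
    if isinstance(layer, nn.Linear):
        # For every input, all neurons in the layer produce the same value
        # (here exactly 0), so the spread across neurons is zero.
        print(f"Layer {idx}: std across neurons = {h.std(dim=1).mean().item():.4f}")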


Constant initialization also runs into scale problems:

  • A too-large initialization leads to exploding gradients
  • A too-small initialization leads to vanishing gradients

Constant Variance

The question remains: how do we find appropriate initialization values?

If we try to initialize parameters by randomly sampling from a distribution like a Gaussian, the most intuitive way would be to choose one variance that is used for all layers in the network.

def var_init(model, std=0.01):
    for name, param in model.named_parameters():
        param.data.normal_(std=std)

var_init(model, std=0.01)
visualize_activations(model, print_variance=True)

The obtained output is:


Layer 0 - Variance: 0.0832359567284584
Layer 2 - Variance: 0.003468977753072977
Layer 4 - Variance: 0.00021434960945043713
Layer 6 - Variance: 0.00011077219096478075
Layer 8 - Variance: 0.00011390676081646234

The variance keeps diminishing as we move across the layers and almost vanishes in the last layer.

If we instead use a larger standard deviation, the activations are likely to explode. So it is actually hard to find a single value that gives us a good activation distribution across layers, and the optimal value also depends on the network architecture.


As a next step, we will try to find the optimal initialization from the perspective of the activation distribution. For this, we state two requirements:

  1. The mean of the activations should be zero
  2. The variance of the activations should stay the same across every layer

Under these two assumptions, the backpropagated gradient signal should not be multiplied by values too small or too large in any layer. It should travel to the input layer without exploding or vanishing.

To achieve these requirements, we should initialize the weight distribution with a variance equal to the inverse of the input dimension, i.e. Var(W) = 1/d_x.

For the mathematical derivation visit: https://www.deeplearning.ai/ai-notes/initialization/index.html
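In condensed form, the argument goes as follows (a sketch assuming zero-mean inputs and weights that are independent of each other, and ignoring the activation function as above):

$$ \text{Var}(y_j) = \text{Var}\Big(\sum_{i=1}^{d_x} W_{ji} x_i\Big) = d_x \cdot \text{Var}(W_{ji}) \cdot \text{Var}(x_i) $$

Requiring Var(y_j) = Var(x_i) immediately gives Var(W_{ji}) = 1/d_x.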

Implementation:

import math

def equal_var_init(model):
    for name, param in model.named_parameters():
        if name.endswith(".bias"):
            param.data.fill_(0)
        else:
            param.data.normal_(std=1.0/math.sqrt(param.shape[1]))

equal_var_init(model)
visualize_weight_distribution(model)
visualize_activations(model, print_variance=True)


Now we can see that the variance stays roughly constant as we move across the layers. This helps stabilise the learning process of the neural network.

Besides the variance of the activations, another variance we would like to stabilize is that of the gradients, which ensures stable optimization for deep networks. It turns out that we can do the same calculation as above, starting from Δx = WΔy, and come to the conclusion that we should initialize our layers with a variance of 1/d_y, where d_y is the number of output neurons.

As a compromise between both constraints, Glorot and Bengio (2010) proposed to use the harmonic mean of the two values 1/d_x and 1/d_y, which is 2/(d_x + d_y). This leads us to the well-known Xavier initialization:

Xavier initialization uses

$$ Var(W_{ji}) = \frac{2}{d_x + d_y} $$

where d_x is the number of input neurons and d_y the number of output neurons of the layer. This keeps the variance of both the forward and the backward pass balanced, maintaining the activation and gradient distributions across layers. Observing the gradient distribution with Xavier initialization, we see that it indeed balances the variance of gradients and activations.

  • With Var(W_ji) = 1/d_x:
    • Gradient histograms show a sharp, narrow peak around 0, indicating smaller gradient magnitudes, especially in deeper layers.
  • With Xavier Initialization:
    • Gradient histograms have a broader, more balanced distribution, showing larger gradients compared to the 1/dx initialization.
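In code, Xavier initialization can be applied to our model with a small sketch in the same style as equal_var_init above (visualize_activations is the helper from the linked notebook; PyTorch also provides torch.nn.init.xavier_uniform_ and xavier_normal_ for the same purpose):

import math

def xavier_init(model):
    for name, param in model.named_parameters():
        if name.endswith(".bias"):
            param.data.fill_(0)
        else:
            # For nn.Linear, param.shape is (d_y, d_x): fan_out x fan_in
            fan_out, fan_in = param.shape[0], param.shape[1]
            param.data.normal_(std=math.sqrt(2.0 / (fan_in + fan_out)))

xavier_init(model)
visualize_activations(model, print_variance=True)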

Performance of Xavier Initialization with different activation functions:

Xavier initialization works perfectly with linear activations because there is no non-linearity to distort the variance of activations or gradients. The weight initialization ensures that the variance is preserved across layers, resulting in smooth gradient flow without the risk of vanishing or exploding gradients. This makes it ideal for networks using linear activations.

Tanh Activation (tanh(x))

Xavier initialization is effective for tanh activations, especially in shallow to moderately deep networks. It keeps the input to the tanh function centered around zero, where the gradient is steepest, helping avoid vanishing gradients. However, for very deep networks, tanh still suffers from the issue of vanishing gradients because of its squashing nature, making it less effective in deeper layers.

ReLU Activation (max(0, x))

ReLU activation does not benefit as much from Xavier initialization. Since ReLU sets half of the inputs to zero, it doesn’t maintain a mean of zero for its activations, violating the assumptions of Xavier initialization. This results in sparse activations, leading to reduced variance as the network deepens. As a result, ReLU networks may experience poor gradient flow in deeper layers when using Xavier.

He initialization, which uses a weight variance of 2/d_x, is a better choice for ReLU, as it accounts for this sparsity and ensures better gradient propagation throughout the network. This is also known as Kaiming initialization (see He, K. et al., 2015).
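A corresponding sketch for Kaiming/He initialization only changes the variance to 2/d_x (again, visualize_activations comes from the linked notebook, and torch.nn.init.kaiming_normal_ offers a built-in alternative):

import math

def kaiming_init(model):
    for name, param in model.named_parameters():
        if name.endswith(".bias"):
            param.data.fill_(0)
        else:
            # Var(W) = 2 / d_x compensates for ReLU zeroing out half of the activations
            param.data.normal_(std=math.sqrt(2.0 / param.shape[1]))

kaiming_init(model)
visualize_activations(model, print_variance=True)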

Conclusion

In deep learning, the role of weight initialization cannot be overstated. Proper initialization ensures that the network learns effectively by maintaining healthy gradient flow during both forward and backward passes. As we’ve discussed, different activation functions require different strategies for initialization to prevent issues like vanishing or exploding gradients.

Ultimately, choosing the right initialization method is crucial for optimizing the training process and ensuring that a network converges efficiently. Understanding how initialization interacts with activation functions helps in designing better-performing deep neural networks.

References

  1. UVa Deep Learning Course Notebooks - Optimization and Initialization

    Available at: https://uvadlc-notebooks.readthedocs.io/en/latest/tutorial_notebooks/tutorial4/Optimization_and_Initialization.html

  2. DeepLearning.AI - AI Notes: Initialization

    Available at: https://www.deeplearning.ai/ai-notes/initialization/index.html

  3. Blog Post by Pouannes: Initialization

    Available at: https://pouannes.github.io/blog/initialization/#mjx-eqn-eqfwd_K