Building Better Neural Networks: ReLU and Dropout in a Three-Layer Architecture

A three-layer neural network first uses ReLU to introduce non-linearity (solving the limited expressiveness problem), then applies Dropout for regularization (solving the overfitting problem).


1. The Problem with Linear Stacking

A neural network that only chains linear transformations is, mathematically, still just a single linear function — no matter how many layers you add:

\[f(x) = W_3 \cdot (W_2 \cdot (W_1 \cdot x + b_1) + b_2) + b_3 = W' \cdot x + b'\]

This means a purely linear multi-layer network has no more representational power than a single-layer one. To learn complex, real-world patterns we need non-linearity.


2. ReLU — Introducing Non-Linearity

What is ReLU?

ReLU (Rectified Linear Unit) is the most widely used activation function in modern deep learning:

\[\text{ReLU}(x) = \max(0, x)\]

It keeps positive values as-is and maps all negative values to zero.

Why ReLU?

Advantage Explanation
Computational efficiency Only a threshold comparison — much faster than sigmoid or tanh
Mitigates vanishing gradients Gradient is 1 for positive inputs, avoiding the saturation zones of sigmoid/tanh
Sparse activation Zeroing out negative values leads to sparse representations, which can improve generalization
Proven empirically Default choice in CNNs, Transformers, and most feedforward networks

A Caveat: Dying ReLU

If a neuron’s weights shift so that the input to ReLU is always negative, the gradient is perpetually zero and the neuron stops learning. Variants like Leaky ReLU (max(αx, x) with a small α) or GELU address this.


3. Dropout — Introducing Regularization

The Overfitting Problem

A network with sufficient parameters can memorize the training set perfectly — and then fail on unseen data. This is overfitting.

What is Dropout?

During each training step, Dropout randomly sets a fraction of neuron outputs to zero with probability $p$ (commonly 0.5 for hidden layers).

Training forward pass (p = 0.5):

Layer output: [0.8, 1.2, 0.3, 0.9, 0.5]
Dropout mask:  [ 1,   0,   1,   0,   1 ]   ← randomly generated
Result:        [0.8, 0.0, 0.3, 0.0, 0.5]

At inference time, Dropout is turned off, and outputs are scaled by $(1 - p)$ to compensate (or, equivalently, training outputs are scaled up by $\frac{1}{1-p}$ — called inverted dropout).

Why Does It Work?

  • Prevents co-adaptation: neurons cannot rely on specific partners, so each must learn robust features independently.
  • Implicit ensemble: each training step uses a different random sub-network. The final model is effectively an average of exponentially many thinned networks.
  • Lightweight: no extra parameters, minimal compute overhead.

4. Putting It Together: A Three-Layer Network

Below is a minimal PyTorch implementation demonstrating both ideas:

import torch
import torch.nn as nn

class ThreeLayerNet(nn.Module):
    def __init__(self, in_dim, hidden_dim, out_dim, dropout_p=0.5):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(in_dim, hidden_dim),
            nn.ReLU(),               # non‑linearity
            nn.Dropout(p=dropout_p), # regularization

            nn.Linear(hidden_dim, hidden_dim),
            nn.ReLU(),
            nn.Dropout(p=dropout_p),

            nn.Linear(hidden_dim, out_dim),
        )

    def forward(self, x):
        return self.net(x)

Layer-by-layer Walkthrough

# Layer Purpose
1 LinearReLUDropout Project input into hidden space, add non-linearity, regularize
2 LinearReLUDropout Further transform features with the same strategy
3 Linear Final projection to output logits (no activation — loss function handles it)

Note: The last layer intentionally has no ReLU or Dropout. A ReLU would clip negative logits (harmful for classification), and Dropout at test time is turned off anyway — but applying it to the output layer during training can destabilize learning.


5. Key Takeaways

  1. ReLU solves expressiveness: without non-linear activations, deep networks collapse into shallow linear models.
  2. Dropout solves overfitting: by randomly silencing neurons during training, it forces the network to learn redundant, generalizable representations.
  3. Order matters: the typical pattern is Linear → Activation → Dropout, repeated per hidden layer.
  4. These two techniques are complementary — one expands what the model can learn; the other constrains it to learn only what generalizes.

References

  • Nair & Hinton, Rectified Linear Units Improve Restricted Boltzmann Machines, ICML 2010
  • Srivastava et al., Dropout: A Simple Way to Prevent Neural Networks from Overfitting, JMLR 2014
  • Goodfellow, Bengio, & Courville, Deep Learning, Chapter 6–7, MIT Press 2016