Building Better Neural Networks: ReLU and Dropout in a Three-Layer Architecture
Building Better Neural Networks: ReLU and Dropout in a Three-Layer Architecture
A three-layer neural network first uses ReLU to introduce non-linearity (solving the limited expressiveness problem), then applies Dropout for regularization (solving the overfitting problem).
1. The Problem with Linear Stacking
A neural network that only chains linear transformations is, mathematically, still just a single linear function — no matter how many layers you add:
\[f(x) = W_3 \cdot (W_2 \cdot (W_1 \cdot x + b_1) + b_2) + b_3 = W' \cdot x + b'\]This means a purely linear multi-layer network has no more representational power than a single-layer one. To learn complex, real-world patterns we need non-linearity.
2. ReLU — Introducing Non-Linearity
What is ReLU?
ReLU (Rectified Linear Unit) is the most widely used activation function in modern deep learning:
\[\text{ReLU}(x) = \max(0, x)\]It keeps positive values as-is and maps all negative values to zero.
Why ReLU?
| Advantage | Explanation |
|---|---|
| Computational efficiency | Only a threshold comparison — much faster than sigmoid or tanh |
| Mitigates vanishing gradients | Gradient is 1 for positive inputs, avoiding the saturation zones of sigmoid/tanh |
| Sparse activation | Zeroing out negative values leads to sparse representations, which can improve generalization |
| Proven empirically | Default choice in CNNs, Transformers, and most feedforward networks |
A Caveat: Dying ReLU
If a neuron’s weights shift so that the input to ReLU is always negative, the gradient is perpetually zero and the neuron stops learning. Variants like Leaky ReLU (max(αx, x) with a small α) or GELU address this.
3. Dropout — Introducing Regularization
The Overfitting Problem
A network with sufficient parameters can memorize the training set perfectly — and then fail on unseen data. This is overfitting.
What is Dropout?
During each training step, Dropout randomly sets a fraction of neuron outputs to zero with probability $p$ (commonly 0.5 for hidden layers).
Training forward pass (p = 0.5):
Layer output: [0.8, 1.2, 0.3, 0.9, 0.5]
Dropout mask: [ 1, 0, 1, 0, 1 ] ← randomly generated
Result: [0.8, 0.0, 0.3, 0.0, 0.5]
At inference time, Dropout is turned off, and outputs are scaled by $(1 - p)$ to compensate (or, equivalently, training outputs are scaled up by $\frac{1}{1-p}$ — called inverted dropout).
Why Does It Work?
- Prevents co-adaptation: neurons cannot rely on specific partners, so each must learn robust features independently.
- Implicit ensemble: each training step uses a different random sub-network. The final model is effectively an average of exponentially many thinned networks.
- Lightweight: no extra parameters, minimal compute overhead.
4. Putting It Together: A Three-Layer Network
Below is a minimal PyTorch implementation demonstrating both ideas:
import torch
import torch.nn as nn
class ThreeLayerNet(nn.Module):
def __init__(self, in_dim, hidden_dim, out_dim, dropout_p=0.5):
super().__init__()
self.net = nn.Sequential(
nn.Linear(in_dim, hidden_dim),
nn.ReLU(), # non‑linearity
nn.Dropout(p=dropout_p), # regularization
nn.Linear(hidden_dim, hidden_dim),
nn.ReLU(),
nn.Dropout(p=dropout_p),
nn.Linear(hidden_dim, out_dim),
)
def forward(self, x):
return self.net(x)
Layer-by-layer Walkthrough
| # | Layer | Purpose |
|---|---|---|
| 1 | Linear → ReLU → Dropout |
Project input into hidden space, add non-linearity, regularize |
| 2 | Linear → ReLU → Dropout |
Further transform features with the same strategy |
| 3 | Linear |
Final projection to output logits (no activation — loss function handles it) |
Note: The last layer intentionally has no ReLU or Dropout. A ReLU would clip negative logits (harmful for classification), and Dropout at test time is turned off anyway — but applying it to the output layer during training can destabilize learning.
5. Key Takeaways
- ReLU solves expressiveness: without non-linear activations, deep networks collapse into shallow linear models.
- Dropout solves overfitting: by randomly silencing neurons during training, it forces the network to learn redundant, generalizable representations.
- Order matters: the typical pattern is
Linear → Activation → Dropout, repeated per hidden layer. - These two techniques are complementary — one expands what the model can learn; the other constrains it to learn only what generalizes.
References
- Nair & Hinton, Rectified Linear Units Improve Restricted Boltzmann Machines, ICML 2010
- Srivastava et al., Dropout: A Simple Way to Prevent Neural Networks from Overfitting, JMLR 2014
- Goodfellow, Bengio, & Courville, Deep Learning, Chapter 6–7, MIT Press 2016