Understanding PyTorch transforms.Normalize and ToTensor: Common Pitfalls
A walkthrough of three frequently misunderstood behaviors in torchvision.transforms.
1. Why Does Normalize(mean=[0.5, 0.5], std=[0.5, 0.5]) Fail While Normalize(mean=[0.5], std=[0.5]) Works?
The Rule
transforms.Normalize normalizes each channel independently using the formula:
\[\text{output}[c] = \frac{\text{input}[c] - \text{mean}[c]}{\text{std}[c]}\]
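The length of mean and std should match the number of channels C in the input tensor. A single value is the one exception: it broadcasts across all channels. Any other length mismatch raises an error.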
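To make the per-channel arithmetic concrete, here is a minimal NumPy sketch of the same computation. It is not torchvision's implementation; the function name `normalize_per_channel` and the sample array are made up for illustration.

```python
import numpy as np


def normalize_per_channel(x, mean, std):
    """Apply (x[c] - mean[c]) / std[c] to a [C, H, W] array (illustrative helper)."""
    x = np.asarray(x, dtype=np.float64)
    # Reshape to [C, 1, 1] so each statistic lines up with its own channel.
    mean = np.asarray(mean, dtype=np.float64).reshape(-1, 1, 1)
    std = np.asarray(std, dtype=np.float64).reshape(-1, 1, 1)
    return (x - mean) / std


# A made-up 3-channel "image" of shape [3, 1, 2].
x = np.array([[[0.0, 1.0]], [[0.2, 0.8]], [[0.5, 0.5]]])
y = normalize_per_channel(x, mean=[0.5, 0.5, 0.5], std=[0.5, 0.25, 1.0])
print(y[:, 0, :])
```

Each channel is shifted and scaled by its own pair of constants, which is why the number of constants has to line up with the number of channels.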
Reproducing the Error
import numpy as np
from torchvision import transforms
a = transforms.ToTensor()(np.array([[1., 2], [3, 4]]))
print(a.shape) # torch.Size([1, 2, 2]) -> C=1
# Works: mean/std length == C
transforms.Normalize(mean=[0.5], std=[0.5])(a)
# Fails: mean/std length (2) != C (1)
transforms.Normalize(mean=[0.5, 0.5], std=[0.5, 0.5])(a)
The input is a 2-D grayscale array, so ToTensor() adds a channel dimension and the shape becomes [1, 2, 2], i.e. 1 channel. Two values in mean/std imply 2 channels, which cannot be broadcast onto a 1-channel tensor, so the call raises a RuntimeError about mismatched shapes.
Rule of Thumb
| Image type | Expected mean/std length |
|---|---|
| Grayscale | [v] — 1 value |
| RGB | [v, v, v] — 3 values |
Always check tensor.shape and match the number of values accordingly.
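A small guard can turn the mismatch into a readable error before Normalize is ever called. The helper below is hypothetical (the name `check_channel_stats` is not part of torchvision) and assumes a [C, H, W] shape with one value per channel.

```python
def check_channel_stats(shape, mean, std):
    """Raise ValueError unless mean and std each hold one value per channel.

    `shape` is a [C, H, W] shape such as tuple(tensor.shape).
    Illustrative helper, not a torchvision function.
    """
    if len(shape) != 3:
        raise ValueError(f"expected a [C, H, W] shape, got {tuple(shape)}")
    channels = shape[0]
    for name, values in (("mean", mean), ("std", std)):
        if len(values) != channels:
            raise ValueError(
                f"{name} has {len(values)} value(s) but the input has "
                f"{channels} channel(s)"
            )


check_channel_stats((1, 2, 2), mean=[0.5], std=[0.5])  # passes silently

try:
    check_channel_stats((1, 2, 2), mean=[0.5, 0.5], std=[0.5, 0.5])
except ValueError as err:
    message = str(err)
    print(message)
```

Calling it with `tuple(tensor.shape)` right before building the transform keeps the failure close to the line that caused it.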
2. Why Is mean=0.5, std=0.5 Hardcoded Instead of Being Computed from the Data?
Two Different Meanings of “Normalization”
There is an important distinction between statistical normalization and the fixed linear rescaling that transforms.Normalize performs.
Statistical normalization (computed from data):
\[z = \frac{x - \mu}{\sigma}, \quad \mu = \text{mean}(X),\ \sigma = \text{std}(X)\]
transforms.Normalize(mean, std) (fixed constants you supply):
\[y = \frac{x - \text{mean}}{\text{std}}\]
The transform does not compute anything from your data. It applies exactly the constants you pass in.
Why 0.5, 0.5 Is So Common
After ToTensor(), pixel values are typically in $[0, 1]$. Substituting mean=0.5, std=0.5:
\[y = \frac{x - 0.5}{0.5} = 2x - 1\]
This linearly maps $[0, 1] \to [-1, 1]$, a range that many training pipelines (especially GANs) prefer. It is a convenient choice, not a statistically derived one.
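A quick worked check of this mapping and its inverse, in plain Python. The helper names `normalize` and `denormalize` are illustrative only; the inverse is handy when a normalized image needs to be displayed again.

```python
def normalize(x, mean=0.5, std=0.5):
    """Same arithmetic as Normalize for a single value: (x - mean) / std."""
    return (x - mean) / std


def denormalize(y, mean=0.5, std=0.5):
    """Invert the mapping: x = y * std + mean."""
    return y * std + mean


# Walk a few pixel values through the mapping and back.
for x in (0.0, 0.25, 0.5, 1.0):
    y = normalize(x)
    print(f"{x:.2f} -> {y:+.2f} -> {denormalize(y):.2f}")
```

The endpoints 0 and 1 land on -1 and 1, and the midpoint 0.5 lands on 0.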
Using Real Dataset Statistics
To perform true statistical normalization you must compute mean and std offline over the entire training set and supply those values. Well-known precomputed constants:
| Dataset | mean | std |
|---|---|---|
| MNIST | [0.1307] | [0.3081] |
| CIFAR-10 | [0.4914, 0.4822, 0.4465] | [0.2470, 0.2435, 0.2616] |
| ImageNet | [0.485, 0.456, 0.406] | [0.229, 0.224, 0.225] |
Computing mean/std from a DataLoader
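import torch
from torch.utils.data import DataLoader
from torchvision import datasets, transforms

loader = DataLoader(
    datasets.MNIST("data", train=True, download=True,
                   transform=transforms.ToTensor()),
    batch_size=512, shuffle=False,
)

# Accumulate per-channel sums over every pixel in the dataset.
# float64 accumulators limit rounding error over millions of pixels.
channel_sum = torch.zeros(1, dtype=torch.float64)
channel_sq_sum = torch.zeros(1, dtype=torch.float64)
n_pixels = 0
for images, _ in loader:
    # images: [B, C, H, W]
    images = images.double()
    channel_sum += images.sum(dim=(0, 2, 3))
    channel_sq_sum += (images ** 2).sum(dim=(0, 2, 3))
    n_pixels += images.size(0) * images.size(2) * images.size(3)

mean = channel_sum / n_pixels
# Variance over all pixels: E[x^2] - E[x]^2
std = torch.sqrt(channel_sq_sum / n_pixels - mean ** 2)
print(f"mean={mean.item():.4f}, std={std.item():.4f}")
# mean=0.1307, std=0.3081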
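One subtlety when computing statistics batch by batch: the square root of the averaged per-image variances is generally not the standard deviation over all pixels, because it ignores how much the image means differ from each other. The NumPy sketch below uses made-up data (`images` is synthetic, not MNIST) to compare both against a streaming sum and sum-of-squares accumulation.

```python
import numpy as np

rng = np.random.default_rng(0)
# Synthetic single-channel "dataset": 100 images of 8x8 pixels whose
# brightness varies from image to image (made-up data for illustration).
brightness = rng.uniform(0.1, 0.9, size=(100, 1, 1))
images = np.clip(brightness + rng.normal(0.0, 0.05, size=(100, 8, 8)), 0.0, 1.0)

# Reference: statistics over every pixel at once.
global_mean = images.mean()
global_std = images.std()

# Streaming version: accumulate sum and sum of squares per batch.
total, total_sq, count = 0.0, 0.0, 0
for start in range(0, len(images), 32):
    batch = images[start:start + 32]
    total += batch.sum()
    total_sq += (batch ** 2).sum()
    count += batch.size
stream_mean = total / count
stream_std = np.sqrt(total_sq / count - stream_mean ** 2)

# Averaging per-image variances drops the between-image spread.
per_image_std = np.sqrt(images.var(axis=(1, 2)).mean())

print(f"global:    mean={global_mean:.4f}, std={global_std:.4f}")
print(f"streaming: mean={stream_mean:.4f}, std={stream_std:.4f}")
print(f"per-image: std={per_image_std:.4f}")
```

The streaming result matches the all-at-once reference, while the per-image figure comes out smaller, so it is worth checking which quantity a published constant actually refers to.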
3. Does ToTensor() Always Scale Values to [0, 1]?
Short Answer
No. The scaling behavior depends on the dtype of the input, not whether the variable “looks like” an image.
What ToTensor() Actually Does
a = transforms.ToTensor()(np.array([[0., 0, 0, 0, 0, 6, 7, 8, 9, 10]] * 10))
print(a)
# tensor([[[ 0., 0., 0., 0., 0., 6., 7., 8., 9., 10.],
# ...]], dtype=torch.float64)
# Values are still 0, 6, 7, 8, 9, 10: NOT rescaled
Because np.array([...]) defaults to float64, which is not uint8, ToTensor() only:
- Reorders dimensions from [H, W] or [H, W, C] to [C, H, W]
- Wraps the result as a tensor with the same dtype (here torch.float64; it is not cast to torch.float32)
It does not divide by 255.
When Does ToTensor() Scale to [0, 1]?
The /255 rescaling happens only when the input is uint8 data: a NumPy array with dtype uint8, or a standard 8-bit PIL image (for example mode L or RGB).
# Scaled to [0, 1]: uint8 input
arr_uint8 = np.array([[0, 128, 255]], dtype=np.uint8)
t = transforms.ToTensor()(arr_uint8)
print(t) # tensor([[[0.0000, 0.5020, 1.0000]]])
# NOT scaled: float input
arr_float = np.array([[0., 128., 255.]])
t = transforms.ToTensor()(arr_float)
print(t) # tensor([[[  0., 128., 255.]]], dtype=torch.float64)
Summary
| Input dtype | ToTensor() behavior |
|---|---|
| uint8 (PIL image, np.uint8) | Divides by 255, output in [0, 1] |
| float32 / float64 (NumPy array) | Dimension reorder only; dtype kept, no scaling |
Forcing Rescaling When You Need It
# arr: a float NumPy array holding pixel values in [0, 255]
# Option 1: cast to uint8 first (drops any fractional part)
t = transforms.ToTensor()(arr.astype(np.uint8))
# Option 2: manually divide before converting
t = transforms.ToTensor()(arr.astype(np.float32) / 255.0)
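The two options are not equivalent when the array holds fractional values: casting to uint8 truncates them first, while dividing keeps them. A NumPy-only sketch of the arithmetic, where `arr` is a made-up example with values in [0, 255]:

```python
import numpy as np

# Made-up float data in the [0, 255] range.
arr = np.array([[0.0, 127.9, 255.0]])

# Option 1 as arithmetic: truncate to uint8, then divide by 255
# (the division is what ToTensor applies to uint8 input).
via_uint8 = arr.astype(np.uint8).astype(np.float32) / 255.0

# Option 2: divide directly, keeping the fractional part.
via_divide = arr.astype(np.float32) / 255.0

print(via_uint8)
print(via_divide)
```

If the float array already holds exact integers the two agree; otherwise Option 2 preserves more information.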
Key Takeaways
- Normalize is per-channel: the length of mean/std must equal the number of channels C.
- Normalize does not compute statistics: you supply fixed constants.
- 0.5 is just a convenient rescaling from [0, 1] to [-1, 1], not a dataset statistic.
- ToTensor scales to [0, 1] only for uint8 inputs; float arrays are passed through without rescaling.