Understanding Conv2d: How Multi-Channel Convolution Really Works
A deep dive into PyTorch's nn.Conv2d: parameter semantics, multi-channel mechanics, and why convolution outputs look like grayscale images.
Conv2d Parameters Explained
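nn.Conv2d is the workhorse of convolutional neural networks. Its constructor signature, abbreviated here to the most commonly used arguments (the full signature also includes dilation, groups, and padding_mode, so pass bias by keyword):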
nn.Conv2d(in_channels, out_channels, kernel_size, stride=1, padding=0, bias=True)
| Parameter | Meaning |
|---|---|
| in_channels | Number of channels in the input feature map (e.g., 3 for an RGB image) |
| out_channels | Number of channels in the output; equivalently, the number of independent convolutional filters |
| kernel_size | Spatial size of each filter (e.g., 3 means 3×3) |
| stride | Step size of the sliding window; controls the spatial resolution of the output |
| padding | Number of zero-padded pixels added to each edge of the input; commonly used to preserve spatial dimensions |
| bias | Whether to add a learnable scalar bias per output channel (default: True) |
Walkthrough: 3-Channel Input → 6-Channel Output with a 3×3 Kernel
Consider a concrete example:
conv = nn.Conv2d(in_channels=3, out_channels=6, kernel_size=3, stride=1, padding=0)
Input shape: (batch, 3, 32, 32) — a batch of 32×32 RGB images.
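As a quick sanity check, the layer above can be run on random data. This is a minimal sketch: the batch size of 8 and the random tensor `x` are arbitrary stand-ins for real images, not part of the original example.

```python
import torch
import torch.nn as nn

conv = nn.Conv2d(in_channels=3, out_channels=6, kernel_size=3, stride=1, padding=0)

# Random stand-in for a batch of 8 RGB images, 32x32 each (batch size is arbitrary).
x = torch.randn(8, 3, 32, 32)
y = conv(x)

print(y.shape)  # torch.Size([8, 6, 30, 30])
```

The channel dimension goes from 3 to 6, and each spatial dimension shrinks from 32 to 30 because no padding is used.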
Shape of the Weight Tensor
Each filter must span all input channels, so a single filter has shape:
\[\text{filter shape} = (\text{in\_channels},\ k_H,\ k_W) = (3, 3, 3)\]
Since we want 6 output channels, there are 6 such filters. The full weight tensor is:
conv.weight.shape = (6,      3,     3,  3)
                     ↑       ↑      ↑   ↑
                     out_ch  in_ch  kH  kW
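The layout can be confirmed by inspecting the layer's parameters directly. A small sketch, assuming the default `bias=True`; the variable name `first_filter` is only illustrative.

```python
import torch.nn as nn

conv = nn.Conv2d(3, 6, kernel_size=3)

print(tuple(conv.weight.shape))   # (6, 3, 3, 3) -> (out_ch, in_ch, kH, kW)
print(tuple(conv.bias.shape))     # (6,)         -> one scalar per output channel

# Indexing the first axis selects one filter, which spans all 3 input channels.
first_filter = conv.weight[0]
print(tuple(first_filter.shape))  # (3, 3, 3)
```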
How a Single Filter Produces One Output Channel
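A single filter (shape 3 × 3 × 3) slides across the input and, at every spatial position $(i, j)$, multiplies the 3 × 3 × 3 input window by its weights element-wise, sums all 27 products, and adds the bias:
\[y_{i,j} = \sum_{c=0}^{2} \sum_{u=0}^{2} \sum_{v=0}^{2} w_{c,u,v}\, x_{c,\,i+u,\,j+v} + b\]
Here $x$ is the input, $w$ the filter's weights, $b$ its bias, and $y$ the resulting output map.
Visually: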
Input (3, 32, 32)
├─ Channel 0 (32×32) × filter[0] (3×3) → partial sum₀
├─ Channel 1 (32×32) × filter[1] (3×3) → partial sum₁
└─ Channel 2 (32×32) × filter[2] (3×3) → partial sum₂
│
element-wise sum + bias
↓
1 output feature map (30×30)
The three per-channel partial results are summed element-wise into a single 2D map. That is why one filter, regardless of how many input channels there are, always produces exactly one output channel.
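To make the sliding-window sum concrete, here is a deliberately slow sketch that computes one output map with explicit loops and compares it with `torch.nn.functional.conv2d`. The random input, filter, and bias are made-up values for illustration; PyTorch's convolution does not flip the kernel, so the loop does not either.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
x = torch.randn(3, 32, 32)   # one 3-channel input, no batch dimension
filt = torch.randn(3, 3, 3)  # one filter: (in_channels, kH, kW)
bias = torch.randn(())       # one scalar bias for this filter

# Slide the window by hand: multiply across all channels, sum everything, add bias.
out = torch.empty(30, 30)
for i in range(30):
    for j in range(30):
        patch = x[:, i:i + 3, j:j + 3]  # (3, 3, 3) window spanning every channel
        out[i, j] = (patch * filt).sum() + bias

# Reference: the same computation through PyTorch's functional API.
ref = F.conv2d(x.unsqueeze(0), filt.unsqueeze(0), bias.reshape(1))[0, 0]

print(out.shape)                            # torch.Size([30, 30])
print(torch.allclose(out, ref, atol=1e-4))  # agreement up to float rounding
```

The loop never keeps the three channels apart: each output value is a single scalar built from all 27 products.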
Six Filters → Six Output Channels
Each of the 6 filters independently repeats the process above, yielding 6 feature maps that are stacked along the channel dimension:
Input: (batch, 3, 32, 32)
│
│ 6 filters, each (3, 3, 3)
▼
Conv2d(3, 6, 3, stride=1, padding=0)
│
▼
Output: (batch, 6, 30, 30)
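The independence of the filters can be checked by applying them one at a time and concatenating the results. This is a sketch with random weights and a random batch of 2 inputs, both arbitrary choices.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)
conv = nn.Conv2d(3, 6, kernel_size=3)
x = torch.randn(2, 3, 32, 32)

with torch.no_grad():
    full = conv(x)  # (2, 6, 30, 30), all filters at once

    # Apply each filter on its own, together with its matching bias entry.
    maps = [
        F.conv2d(x, conv.weight[k:k + 1], conv.bias[k:k + 1])  # (2, 1, 30, 30)
        for k in range(6)
    ]
    stacked = torch.cat(maps, dim=1)  # stack along the channel dimension

print(stacked.shape)                             # torch.Size([2, 6, 30, 30])
print(torch.allclose(full, stacked, atol=1e-5))  # agreement up to float rounding
```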
Output Spatial Size Formula
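\[H_{out} = \left\lfloor \frac{H_{in} - k + 2p}{s} \right\rfloor + 1\]
The floor only matters when the stride does not divide the numerator evenly; the same formula applies to the width. With our parameters ($H_{in}=32,\ k=3,\ p=0,\ s=1$):
\[H_{out} = \frac{32 - 3 + 0}{1} + 1 = 30\]
Setting padding=1 would keep the spatial size unchanged:
\[H_{out} = \frac{32 - 3 + 2}{1} + 1 = 32\]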
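The size formula is easy to wrap in a helper and compare against what PyTorch actually produces. The function name `conv_out_size` is hypothetical, and the sketch assumes dilation of 1 and integer (floor) division for strides that do not divide evenly.

```python
import torch
import torch.nn as nn

def conv_out_size(h_in: int, k: int, p: int = 0, s: int = 1) -> int:
    """Output length along one spatial axis (dilation assumed to be 1)."""
    return (h_in - k + 2 * p) // s + 1

print(conv_out_size(32, k=3, p=0, s=1))  # 30
print(conv_out_size(32, k=3, p=1, s=1))  # 32
print(conv_out_size(32, k=3, p=1, s=2))  # 16

# Cross-check one case against a real layer.
x = torch.randn(1, 3, 32, 32)
y = nn.Conv2d(3, 6, kernel_size=3, stride=2, padding=1)(x)
print(y.shape[-1] == conv_out_size(32, k=3, p=1, s=2))  # True
```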
Parameter Count
\[\text{params} = \underbrace{6 \times 3 \times 3 \times 3}_{\text{weights} = 162} + \underbrace{6}_{\text{biases}} = 168\]
Why Does a Convolution Output Look Like a Grayscale Image?
After convolution, each output channel is a single-channel 2D map — it is inherently grayscale. This often surprises people who expect to see color after convolving an RGB image.
Root Cause
A convolution filter sums across all input channels at every spatial location. The R, G, and B information is fused into one scalar value per position. The result no longer separates color — it encodes a learned feature response.
Do 6 Output Channels Equal Two RGB Images?
No. The 6 channels are not two groups of (R, G, B). Each channel is an independent feature detector trained to respond to different patterns:
Input (3 channels = R, G, B) → human-interpretable color
Output (6 channels = f₀, f₁, …, f₅) → abstract features with no color semantics
For example, after training, individual channels may respond to horizontal edges, vertical edges, color gradients, textures, and so on.
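To illustrate what "responding to horizontal edges" means, the sketch below uses hand-set weights rather than trained ones: a Sobel-style kernel copied to every input channel, applied to a synthetic image with a single horizontal edge. Both the kernel and the test image are assumptions chosen for the demonstration; a trained network would arrive at its own filters.

```python
import torch
import torch.nn.functional as F

# Synthetic 3-channel image: dark top half, bright bottom half (one horizontal edge).
img = torch.zeros(1, 3, 32, 32)
img[:, :, 16:, :] = 1.0

# Hand-set weights (not learned): a Sobel-style kernel copied to each input channel.
sobel = torch.tensor([[-1., -2., -1.],
                      [ 0.,  0.,  0.],
                      [ 1.,  2.,  1.]])
weight = sobel.expand(1, 3, 3, 3).contiguous()  # (out_ch=1, in_ch=3, kH, kW)

response = F.conv2d(img, weight)[0, 0]    # (30, 30) feature map
row_strength = response.abs().sum(dim=1)  # total response per output row

# Output rows where the filter fires: only the windows straddling the edge.
print(row_strength.nonzero().flatten().tolist())  # [14, 15]
```

Everywhere the image is flat, the positive and negative weights cancel and the response is zero; the map lights up only where the window covers the edge.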
How to Visualize Convolution Outputs
| Method | Description |
|---|---|
| Per-channel grayscale | Display each of the 6 channels as a separate grayscale image (most common) |
| Pseudo-color heatmap | Apply a colormap like jet or viridis to a single channel to show activation intensity |
| Pick 3 channels as RGB | Map any 3 channels to R/G/B for a false-color composite (colors have no physical meaning) |
import matplotlib.pyplot as plt

# output: feature maps for one image, shape (6, 30, 30), e.g. output = conv(x)[0]
fig, axes = plt.subplots(1, 6, figsize=(18, 3))
for i in range(6):
    # Each channel is a 2D map, so it is drawn as its own grayscale image.
    axes[i].imshow(output[i].detach().numpy(), cmap='gray')
    axes[i].set_title(f'Channel {i}')
    axes[i].axis('off')
plt.tight_layout()
plt.show()
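The other two rows of the table, the pseudo-color heatmap and the false-color composite, can be sketched in a similar way. The random `output` tensor stands in for real feature maps, the helper `to_unit_range` and the output filename are hypothetical, and the choice of channels 0, 1, 2 for R, G, B is arbitrary.

```python
import numpy as np
import torch
import matplotlib.pyplot as plt

torch.manual_seed(0)
output = torch.randn(6, 30, 30)  # stand-in for one image's conv output
maps = output.detach().numpy()

def to_unit_range(a):
    """Rescale an array to [0, 1] so imshow can treat it as color intensity."""
    lo, hi = a.min(), a.max()
    return (a - lo) / (hi - lo + 1e-8)

fig, (ax_heat, ax_rgb) = plt.subplots(1, 2, figsize=(8, 4))

# Pseudo-color heatmap of a single channel.
im = ax_heat.imshow(maps[0], cmap="viridis")
fig.colorbar(im, ax=ax_heat)
ax_heat.set_title("Channel 0 (viridis)")

# False-color composite: three channels mapped to R, G, B (colors carry no physical meaning).
composite = np.stack([to_unit_range(maps[c]) for c in (0, 1, 2)], axis=-1)  # (30, 30, 3)
ax_rgb.imshow(composite)
ax_rgb.set_title("Channels 0/1/2 as RGB")

for ax in (ax_heat, ax_rgb):
    ax.axis("off")
plt.tight_layout()
fig.savefig("feature_maps_color.png")  # arbitrary filename; use plt.show() interactively
```

Each channel is rescaled independently before compositing, so relative brightness between channels is not preserved; that is one more reason to read the composite as an analysis aid rather than as an image.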
Key Takeaways
- One filter spans all input channels: its depth always equals in_channels.
- out_channels = number of filters; each produces one output feature map.
- Cross-channel summation is why the output is single-valued (grayscale) per channel; color information is encoded in weights, not preserved visually.
- padding = kernel_size // 2 (with stride=1 and an odd kernel_size) is the common recipe to keep spatial dimensions unchanged.
- Visualize feature maps one channel at a time; stacking them into RGB is misleading unless done intentionally for analysis.