PyTorch Gradient Pitfalls: Lessons from Building a Word2Vec Training Loop
PyTorch Gradient Pitfalls: Lessons from Building a Word2Vec Training Loop
A practical debugging journey through common PyTorch gradient issues, from
NoneTypeerrors to vanishing leaf nodes.
Background
While implementing a Word2Vec-style embedding model from scratch using raw PyTorch (without nn.Module), several non-obvious gradient-related bugs surfaced. This post documents each issue, explains the root cause, and provides the correct fix.
The training loop in question looked like this:
for rev_i, review in enumerate(input_dataset * iterations):
for target_i in range(len(review)):
target_samples = [review[target_i]] + concatenated[(torch.rand(negative) * len(concatenated)).long().tolist()].tolist()
left_context = review[max(0, target_i-windows):target_i]
right_context = review[target_i+1:min(len(review),target_i+windows)]
input_context = left_context + right_context
embed_target = torch.mean(embed[input_context], dim=0)
l2 = torch.sigmoid(embed_target.matmul(W1[target_samples].T))
loss = (l2 - layer_2_target).pow(2).sum()
loss.backward()
with torch.no_grad():
embed -= alpha * embed.grad
W1 -= alpha * W1.grad
embed.grad.zero_()
W1.grad.zero_()
if rev_i % 100 == 0:
print(f"Iteration {rev_i} completed.")
print("Similar to 'terrible':", similar("terrible"))
Bug 1: TypeError: unsupported operand type(s) for *: 'float' and 'NoneType'
Error
embed -= alpha * embed.grad
TypeError: unsupported operand type(s) for *: 'float' and 'NoneType'
Root Cause
embed.grad is None because embed was created without requires_grad=True. Without this flag, PyTorch never tracks operations on the tensor, so no gradient is ever computed during loss.backward().
Fix
# ❌ Before — no gradient tracking
embed = torch.randn(vocab_size, embed_size)
W1 = torch.randn(vocab_size, embed_size)
# ✅ After — gradients will be computed
embed = torch.randn(vocab_size, embed_size, requires_grad=True)
W1 = torch.randn(vocab_size, embed_size, requires_grad=True)
Correct Update Pattern
loss.backward() # must be OUTSIDE no_grad
with torch.no_grad():
embed -= alpha * embed.grad # safe in-place update
W1 -= alpha * W1.grad
embed.grad.zero_() # clear for next iteration
W1.grad.zero_()
Bug 2: Why Does (torch.rand(..., requires_grad=True) - 0.5) * 0.2 Lose Its Gradient?
The Surprising Behavior
w1 = (torch.rand((5, 5), requires_grad=True) - 0.5) * 0.2
w2 = torch.rand((5, 5), requires_grad=True)
print(w1.requires_grad) # True — misleading!
print(w1.is_leaf) # False ← the real problem
print(w2.is_leaf) # True ✅
Even though w1.requires_grad is True, PyTorch only retains gradients for leaf tensors by default. w1 is the result of arithmetic operations, making it an intermediate node in the computation graph — not a leaf.
Why Only Leaf Tensors?
PyTorch’s autograd engine is designed to save memory. Intermediate activations in a deep network would consume enormous amounts of memory if gradients were retained for all of them. Only leaf tensors (parameters you actually want to update) retain .grad automatically.
You can force gradient retention on non-leaf tensors with .retain_grad(), but that is rarely what you want for weight initialization.
Initialization Approaches Compared
| Approach | is_leaf |
.grad available |
|---|---|---|
torch.rand((5,5), requires_grad=True) |
✅ True | ✅ |
(torch.rand((5,5), requires_grad=True) - 0.5) * 0.2 |
❌ False | ❌ |
((torch.rand((5,5)) - 0.5) * 0.2).requires_grad_(True) |
✅ True | ✅ |
torch.nn.Parameter((torch.rand((5,5)) - 0.5) * 0.2) |
✅ True | ✅ |
Correct Ways to Initialize with Scaling
# Method 1: compute first, then attach gradient tracking
embed = ((torch.rand(vocab_size, embed_size) - 0.5) * 0.2)
embed.requires_grad_(True) # now a leaf node ✅
# Method 2: use detach() to sever from computation graph
embed = ((torch.rand(vocab_size, embed_size) - 0.5) * 0.2).detach().requires_grad_(True)
# Method 3: wrap with nn.Parameter (preferred in nn.Module)
embed = torch.nn.Parameter((torch.rand(vocab_size, embed_size) - 0.5) * 0.2)
Bug 3: Empty input_context Causes a Runtime Error
When target_i is at the very start or end of a review, both left_context and right_context can be empty lists, making input_context = []. Indexing an embedding matrix with an empty list crashes:
embed_target = torch.mean(embed[input_context], dim=0) # ❌ if input_context == []
Fix
input_context = left_context + right_context
if len(input_context) == 0:
continue
Bug 4: layer_2_target Dimension Must Match target_samples
loss = (l2 - layer_2_target).pow(2).sum()
l2 has shape (1 + negative,) — one positive sample plus negative noise samples. layer_2_target must be the same shape with the label 1 at index 0 and 0 elsewhere:
layer_2_target = torch.zeros(1 + negative)
layer_2_target[0] = 1.0 # positive sample label
Getting this wrong produces a silent broadcasting error or a shape mismatch crash.
Summary Checklist
✅ Tensors you want to differentiate must be created with requires_grad=True
✅ Apply scaling BEFORE setting requires_grad, or use requires_grad_() after
✅ Check is_leaf=True before assuming .grad will be populated
✅ Guard against empty context windows with an early continue
✅ Ensure layer_2_target shape matches the number of (positive + negative) samples
✅ Call loss.backward() outside torch.no_grad()
✅ Perform weight updates and grad.zero_() inside torch.no_grad()
Key Takeaway
The single most important rule when working with raw PyTorch autograd:
Only leaf tensors retain gradients. A tensor produced by any arithmetic operation is not a leaf, even if
requires_grad=True.
When in doubt, print tensor.is_leaf alongside tensor.requires_grad — they tell very different stories.