Troubleshooting: “Failed to initialize NVML: Driver/library version mismatch”
The Problem
When running nvidia-smi, you may encounter the following error:
Failed to initialize NVML: Driver/library version mismatch
NVML library version: 580.126
This means the NVIDIA kernel module loaded in memory and the user-space NVML library installed on disk are at different versions. The two must match for nvidia-smi and any GPU workload to function.
Why Does This Happen?
| Cause | Details |
|---|---|
| Driver updated without reboot | The new library is on disk, but the old kernel module is still loaded in memory. |
| apt upgrade silently updated the driver | Package managers can pull in a newer driver as a dependency without an explicit reboot prompt. |
| Manual .run installer conflicts with package manager | Two installation methods leave mismatched files on the system. |
| Kernel update broke DKMS rebuild | A new Linux kernel was installed but DKMS failed to recompile the NVIDIA kernel module for it. |
Quick Diagnosis
# Check the kernel module version currently loaded
cat /proc/driver/nvidia/version
# Check the user-space library version on disk
ls -la /usr/lib/x86_64-linux-gnu/libnvidia-ml.so*
If the two versions differ, that confirms the mismatch.
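The comparison above can also be scripted. The following is a minimal sketch, not a definitive tool: the function names (`loaded_version`, `library_version`, `check_mismatch`) are made up for illustration, and the default library directory assumes the Debian/Ubuntu path used in the command above.

```shell
# Sketch only: function names and the default library directory are assumptions.

# Read the text of /proc/driver/nvidia/version on stdin and print the driver version.
loaded_version() {
  awk '/^NVRM version:/ {
    for (i = 1; i <= NF; i++)
      if ($i ~ /^[0-9]+\.[0-9]+(\.[0-9]+)?$/) { print $i; exit }
  }'
}

# Print the version suffix of a library file name such as libnvidia-ml.so.<version>.
library_version() {
  basename "$1" | sed 's/^libnvidia-ml\.so\.//'
}

# Print both versions; exit status is 0 on match, non-zero on mismatch.
# The body runs in a subshell so its variables do not leak into the caller.
check_mismatch() (
  lib_dir=${1:-/usr/lib/x86_64-linux-gnu}
  loaded=$(loaded_version < /proc/driver/nvidia/version)
  # libnvidia-ml.so.1 is normally a symlink to the fully versioned file.
  on_disk=$(library_version "$(readlink -f "$lib_dir/libnvidia-ml.so.1")")
  echo "loaded module: $loaded"
  echo "NVML library:  $on_disk"
  [ "$loaded" = "$on_disk" ]
)
```

After sourcing the file, running `check_mismatch` (optionally with a different library directory as its argument) prints both versions and returns a non-zero status when they differ.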
Fix 1 — Reboot (Simplest)
sudo reboot
A reboot discards the stale module in memory and loads the newly installed one from disk, which matches the on-disk library. This resolves the issue in the vast majority of cases.
Fix 2 — Reload the Kernel Module Without Rebooting
Use this on production servers where a reboot is not immediately possible.
# 1. List processes using the GPU
sudo lsof /dev/nvidia*
# 2. Stop services that hold the GPU (examples)
sudo systemctl stop docker
sudo systemctl stop gdm # display manager, if running a desktop
# 3. Unload the NVIDIA kernel modules in dependency order
sudo rmmod nvidia_uvm
sudo rmmod nvidia_drm
sudo rmmod nvidia_modeset
sudo rmmod nvidia
# 4. Reload the matching module
sudo modprobe nvidia
# 5. Verify
nvidia-smi
Note: If rmmod fails with “Module is in use”, a process is still holding the device. Stop or kill it first.
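Steps 3 to 5 can be wrapped so that only modules that are actually loaded get unloaded, and the sequence stops at the first failure. This is a hedged sketch: `nvidia_unload_order` and `reload_nvidia` are hypothetical helper names, and only the four modules listed above are handled (other NVIDIA modules, if present on your system, would need to be added).

```shell
# Sketch only: helper names are illustrative, not part of any NVIDIA tooling.

# Read `lsmod` output on stdin; print the loaded NVIDIA modules, dependents first.
nvidia_unload_order() {
  awk '
    NR > 1 { loaded[$1] = 1 }   # skip the header line, remember module names
    END {
      n = split("nvidia_uvm nvidia_drm nvidia_modeset nvidia", order, " ")
      for (i = 1; i <= n; i++)
        if (order[i] in loaded) print order[i]
    }'
}

# Unload whatever is loaded, stop at the first failure, then reload and verify.
reload_nvidia() {
  for mod in $(lsmod | nvidia_unload_order); do
    sudo rmmod "$mod" || {
      echo "rmmod $mod failed; a process may still hold the GPU" >&2
      return 1
    }
  done
  sudo modprobe nvidia && nvidia-smi
}
```

Stop the services holding the GPU first (step 2), then call `reload_nvidia`. Keeping the ordering logic in a function that only reads text makes it possible to check the order against saved `lsmod` output before touching a production host.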
Fix 3 — Reinstall the Driver
If the module and library are both corrupt or partially installed, a clean reinstall is the safest path.
# Remove existing packages
sudo apt-get purge 'nvidia-*' # quoted so the shell does not expand the pattern against local files
# If the driver was installed via a .run file instead:
# sudo /usr/bin/nvidia-uninstall
# Install the desired version
sudo apt-get update
sudo apt-get install nvidia-driver-580
# Reboot to load the new module
sudo reboot
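One way to spot a partial install before or after the reinstall is to look for packages from more than one driver branch. The sketch below assumes Ubuntu-style package names that end in a three-digit branch number (such as nvidia-driver-580, optionally followed by a suffix like -open); `nvidia_branches` is a hypothetical helper name.

```shell
# Sketch only: assumes package names like nvidia-driver-580 or libnvidia-compute-580.

# Read lines of "<status> <package>" on stdin and print the distinct driver
# branches among packages whose dpkg state is installed (second status letter "i").
# Intended input:
#   dpkg-query -W -f='${db:Status-Abbrev} ${Package}\n'
nvidia_branches() {
  awk '$1 ~ /^.i/ { print $2 }' |
    sed -nE 's/^(lib)?nvidia-.*-([0-9]{3})(-[a-z]+)?$/\2/p' |
    sort -u
}
```

If the pipeline `dpkg-query -W -f='${db:Status-Abbrev} ${Package}\n' | nvidia_branches` prints more than one line, packages from different branches are installed side by side, which is consistent with the partially installed state described above.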
Fix 4 — Rebuild the DKMS Module
When a kernel update leaves the NVIDIA module uncompiled:
# Check DKMS status
dkms status
# Rebuild all modules for the running kernel
sudo dkms autoinstall
sudo reboot
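To confirm the rebuild worked before rebooting, the `dkms status` output can be checked for the kernel you are about to boot. This is a sketch under assumptions: `nvidia_dkms_installed` is a made-up helper, and it expects status lines in either the `nvidia/<version>, <kernel>, <arch>: installed` or the older `nvidia, <version>, <kernel>, <arch>: installed` form. Systems that use prebuilt NVIDIA modules instead of DKMS will show nothing here.

```shell
# Sketch only: the helper name and the accepted line formats are assumptions.

# Read `dkms status` output on stdin; succeed if an nvidia module is reported
# as installed for the kernel release given as $1.
nvidia_dkms_installed() {
  awk -v k="$1" '
    /^nvidia/ && index($0, ", " k ",") && /: installed/ { found = 1 }
    END { exit !found }'
}
```

A typical call is `dkms status | nvidia_dkms_installed "$(uname -r)"`, followed by a check of the exit status. Pass a different kernel release if the machine will boot into a newer kernel than the one currently running.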
Prevention
Lock the driver package so routine upgrades do not silently change the version:
sudo apt-mark hold nvidia-driver-580
This keeps apt upgrade from touching the driver until you explicitly unhold it:
sudo apt-mark unhold nvidia-driver-580
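To verify that the hold is in place, the output of `apt-mark showhold` can be checked for the package name. A minimal sketch, with `is_held` as a hypothetical helper name:

```shell
# Sketch only: the helper name is illustrative.

# Read `apt-mark showhold` output on stdin; succeed if package $1 is listed.
# -x matches the whole line and -F treats the name as a literal string,
# so nvidia-driver-580 does not match nvidia-driver-580-open.
is_held() {
  grep -qxF "$1"
}
```

For example, `apt-mark showhold | is_held nvidia-driver-580 || echo "driver is not held"` prints a warning when the hold is missing.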
Key Takeaway
The error is almost always caused by an updated library with a stale kernel module. A simple sudo reboot fixes it in most cases. For environments that cannot reboot, manually unloading and reloading the kernel module (rmmod / modprobe) is the next best option.