[Solved] runtimeerror: cuda error: invalid device ordinal

The ‘runtimeerror: cuda error: invalid device ordinal’ error happens when a program asks the CUDA runtime for a GPU index (device ordinal) that does not exist or is not available to the process. Read on to see why this error occurs and how you can prevent it on your system.

Reasons for the error

You can get the “runtimeerror: cuda error: invalid device ordinal” error for several reasons:

Your system might not have a GPU, or the GPU may not be configured correctly. CUDA needs a CUDA-compatible NVIDIA GPU, so check this first.
The GPU ID that you are trying to access is not valid. IDs start at 0, so a system with N GPUs accepts only the IDs 0 through N-1.
Your NVIDIA driver might be outdated. An outdated driver may fail to recognize the GPU in your machine.
Your CUDA environment may not be set up properly. Check that the CUDA_HOME environment variable points to your CUDA installation directory.
If two conflicting libraries try to access the CUDA device simultaneously, the error may be thrown.
If your code requests more GPUs than are available to you, you may get this error.

Methods for resolving the error

Go through the following pointers to resolve the “runtimeerror: cuda error: invalid device ordinal” error.

Check whether a GPU actually exists on your system. For this, use the nvidia-smi command-line tool. It lists the GPUs on your system along with their device IDs, status, and temperature. If it lists no GPUs, you have found the reason for your error. Note that nvidia-smi itself only works when physical NVIDIA GPU hardware and its driver are installed.
Your NVIDIA driver may need an update. Go to the NVIDIA website, https://www.nvidia.com/download/index.aspx, to download the latest driver. You can choose your operating system and GPU model on the site itself.
The CUDA_HOME environment variable should contain the correct CUDA installation directory. Use the following command to check this:

echo $CUDA_HOME

If you change CUDA configurations, driver installations, or environment variables in any way, restart the system once so that the changes take effect.
If you don't know which GPU to name in your code, you can simply write cuda, which selects the current default GPU. Otherwise, indexing starts from 0, so count from there and verify that the index you use exists.

torch.device('cuda')
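
The environment checks from the list above can also be scripted. The following is a minimal sketch that uses only the Python standard library. The function name check_cuda_environment is hypothetical, and the helper only reports what it finds; it does not fix anything.

```python
import os
import shutil


def check_cuda_environment(environ=None):
    """Collect basic facts about the CUDA setup without importing PyTorch.

    check_cuda_environment is an illustrative helper, not part of any library.
    """
    environ = os.environ if environ is None else environ
    cuda_home = environ.get("CUDA_HOME")
    return {
        # Full path of the nvidia-smi executable, or None if it is not on PATH.
        "nvidia_smi_path": shutil.which("nvidia-smi"),
        # Value of CUDA_HOME, or None if the variable is not set.
        "cuda_home": cuda_home,
        # True only if CUDA_HOME is set and points to an existing directory.
        "cuda_home_exists": bool(cuda_home) and os.path.isdir(cuda_home),
    }


if __name__ == "__main__":
    for key, value in check_cuda_environment().items():
        print(f"{key}: {value}")
```

If nvidia_smi_path is None, the driver tools are probably not installed, so start with the driver step before looking at the GPU ID.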

Check GPU Availability

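This code uses the nvidia_smi library to check whether a GPU is present on your system. If no GPU is found, it raises an error.

import nvidia_smi  # NVML bindings, for example from the nvidia-ml-py3 package

nvidia_smi.nvmlInit()  # fails with an NVML error if no NVIDIA driver is loaded
device_count = nvidia_smi.nvmlDeviceGetCount()
nvidia_smi.nvmlShutdown()
if device_count == 0:
    raise RuntimeError("No GPUs detected")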
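
If the nvidia_smi Python package is not installed, the nvidia-smi command-line tool can answer the same question. The sketch below runs nvidia-smi -L, which lists the GPUs one per line, and counts the entries. The helper names are hypothetical, and the assumption that every GPU line starts with "GPU " reflects the tool's usual output and may need adjusting on your system.

```python
import shutil
import subprocess


def count_gpu_lines(listing):
    """Count lines that look like GPU entries in `nvidia-smi -L` output.

    Assumption: each GPU is reported on its own line starting with "GPU ".
    """
    return sum(1 for line in listing.splitlines() if line.startswith("GPU "))


def count_gpus_with_nvidia_smi():
    """Return the number of GPUs reported by nvidia-smi, or 0 if it cannot run."""
    if shutil.which("nvidia-smi") is None:
        return 0  # the tool is not on PATH, so no count can be obtained
    result = subprocess.run(
        ["nvidia-smi", "-L"], capture_output=True, text=True, check=False
    )
    if result.returncode != 0:
        return 0  # the tool ran but reported a problem
    return count_gpu_lines(result.stdout)


if __name__ == "__main__":
    print(f"GPUs reported by nvidia-smi: {count_gpus_with_nvidia_smi()}")
```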

Set Device Ordinal / GPU ID

You can set the device ordinal yourself. For this purpose, PyTorch provides the torch.cuda.set_device() function. Find the ID of the GPU that you want to use and put it in the given code.

import torch

gpu_id = 0  # Replace with the desired GPU ordinal
torch.cuda.set_device(gpu_id)

Some libraries also let you pass the GPU ID explicitly as an argument, for example:

var = EmotionRecognition(device='gpu', gpu_id=1)
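
Whichever way the ID is passed, it helps to check it before PyTorch hands it to the CUDA runtime. The following is a minimal sketch of such a check; validate_ordinal and select_gpu are hypothetical helper names, not PyTorch functions.

```python
def validate_ordinal(gpu_id, device_count):
    """Return gpu_id if it is a valid ordinal for device_count GPUs, else raise."""
    if device_count <= 0:
        raise ValueError("No CUDA devices are visible to this process")
    if not 0 <= gpu_id < device_count:
        raise ValueError(
            f"GPU ordinal {gpu_id} is invalid; "
            f"valid ordinals are 0 to {device_count - 1}"
        )
    return gpu_id


def select_gpu(gpu_id):
    """Validate gpu_id against PyTorch's device count, then make it current."""
    import torch  # imported here so validate_ordinal works without PyTorch

    validate_ordinal(gpu_id, torch.cuda.device_count())
    torch.cuda.set_device(gpu_id)
    return torch.device("cuda", gpu_id)
```

With this in place, select_gpu(1) on a single-GPU machine fails with a readable ValueError that names the valid range instead of the CUDA error.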

Check CUDA availability

PyTorch can use a GPU only when a working CUDA environment is present. The torch.cuda.is_available() function lets you check whether CUDA is available.

import torch

# Stop early with a clear message if PyTorch cannot use CUDA
if not torch.cuda.is_available():
    raise SystemExit("CUDA is not available")
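
Exiting is not the only option. When the code can also run on the CPU, a common pattern is to fall back to it instead of stopping. A minimal sketch, with hypothetical helper names:

```python
def device_string(cuda_available, gpu_id=0):
    """Return "cuda:<gpu_id>" when CUDA is usable, otherwise "cpu"."""
    return f"cuda:{gpu_id}" if cuda_available else "cpu"


def pick_device(gpu_id=0):
    """Build a torch.device, falling back to the CPU when CUDA is unavailable."""
    import torch  # imported lazily so device_string can be used without PyTorch

    return torch.device(device_string(torch.cuda.is_available(), gpu_id))
```

Note that the fallback only covers a missing CUDA environment. If CUDA is available, gpu_id must still be a valid ordinal.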

GPU Usage check

Check whether your system has as many GPUs as your code expects. The torch.cuda.device_count() function returns the number of GPUs visible to PyTorch, which you can compare with the number of GPUs you want to use.

import torch

desired_gpus = 2  # Replace with the number of GPUs your code needs
num_gpus = torch.cuda.device_count()
if num_gpus < desired_gpus:
    raise SystemExit(f"Not enough GPUs available: need {desired_gpus}, found {num_gpus}")

runtimeerror: cuda error: invalid device ordinal PyTorch

The GPU ID is counted from 0, so make sure you pass a valid ID. If the machine has only one GPU, the only valid ID is 0, which refers to the first GPU. If you don't want to specify an ID at all, use torch.device('cuda') directly.

Apart from this, check the CUDA_VISIBLE_DEVICES environment variable, which restricts the GPUs your process is allowed to see.
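
The GPUs listed in CUDA_VISIBLE_DEVICES are renumbered from 0 inside the process. For example, with CUDA_VISIBLE_DEVICES=2,3 the process sees two GPUs, addressed as cuda:0 and cuda:1, and asking for cuda:2 gives an invalid device ordinal. The sketch below illustrates this mapping in plain Python. The function names are hypothetical, and it is a simplification: it handles only integer indices (not GPU UUIDs) and stops parsing at the first token that is not one.

```python
import os


def visible_physical_ids(value):
    """Parse a CUDA_VISIBLE_DEVICES value into a list of GPU indices.

    Returns None when the variable is unset (no restriction). Simplification:
    only integer indices are handled; parsing stops at the first other token.
    """
    if value is None:
        return None
    ids = []
    for token in value.split(","):
        token = token.strip()
        if not token.isdigit():
            break
        ids.append(int(token))
    return ids


def physical_id(logical_ordinal, value):
    """Map the ordinal used in code (cuda:N) to the index the GPU would have
    if CUDA_VISIBLE_DEVICES were unset."""
    ids = visible_physical_ids(value)
    if ids is None:
        return logical_ordinal  # no remapping when the variable is unset
    if not 0 <= logical_ordinal < len(ids):
        raise ValueError(
            f"ordinal {logical_ordinal} is invalid: only {len(ids)} GPU(s) visible"
        )
    return ids[logical_ordinal]


if __name__ == "__main__":
    print(visible_physical_ids(os.environ.get("CUDA_VISIBLE_DEVICES")))
```

An empty value hides every GPU, so an empty CUDA_VISIBLE_DEVICES makes even ordinal 0 invalid.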

runtimeerror: cuda error: invalid device ordinal torchrun

While using torchrun, if you read the local rank from os.environ["LOCAL_RANK"] and use it as the GPU ID, check that the assignment is right: every local rank must be smaller than the number of GPUs visible on that machine.
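
torchrun sets the LOCAL_RANK environment variable for each worker process it starts. The following is a minimal sketch of reading it and checking it against the number of visible GPUs before using it as a device ID. The helper names are hypothetical, and the default of 0 for a missing LOCAL_RANK is an assumption made here for single-process runs.

```python
import os


def local_rank_to_gpu_id(environ, device_count):
    """Return LOCAL_RANK as a GPU ID after checking that it is in range."""
    # Assumption: treat a missing LOCAL_RANK as 0 (plain single-process run).
    local_rank = int(environ.get("LOCAL_RANK", "0"))
    if not 0 <= local_rank < device_count:
        raise ValueError(
            f"LOCAL_RANK={local_rank} but only {device_count} GPU(s) are visible; "
            "start fewer processes per node or make more GPUs visible"
        )
    return local_rank


def bind_process_to_gpu():
    """Make the GPU that matches this worker's LOCAL_RANK the current device."""
    import torch  # imported lazily so the check above works without PyTorch

    gpu_id = local_rank_to_gpu_id(os.environ, torch.cuda.device_count())
    torch.cuda.set_device(gpu_id)
    return gpu_id
```

A typical way to hit the error is to start more processes per node (the --nproc_per_node option of torchrun) than the node has GPUs, so the highest local ranks have no matching device.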

Other tips while working with CUDA and GPU

Set the device ordinal yourself if more than one GPU exists.
Work with try-except blocks for exception handling.
nvidia-smi is an excellent tool to monitor your GPU memory usage.
If you plan to work on deep learning and scientific computing projects, CUDA libraries like cuDNN (CUDA Deep Neural Networks) and cuSPARSE (CUDA Sparse Matrix Library) are quite efficient.
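
As an illustration of the try-except tip, the sketch below catches the error and re-raises it with a hint. The helper names and the hint wording are made up for this example. It assumes that PyTorch reports the problem as a RuntimeError whose text contains "invalid device ordinal", as in the message this article is about.

```python
def explain_cuda_error(error):
    """Return a hint for an invalid-device-ordinal error, or None for other errors."""
    if "invalid device ordinal" in str(error).lower():
        return (
            "The requested GPU ID does not exist for this process. "
            "Check torch.cuda.device_count() and CUDA_VISIBLE_DEVICES."
        )
    return None


def set_device_with_hint(gpu_id):
    """Call torch.cuda.set_device and re-raise ordinal errors with a hint."""
    import torch  # imported lazily so explain_cuda_error works without PyTorch

    try:
        torch.cuda.set_device(gpu_id)
    except RuntimeError as error:
        hint = explain_cuda_error(error)
        if hint is None:
            raise  # unrelated error: leave it untouched
        raise RuntimeError(f"{error}\n{hint}") from error
```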

FAQs

How do I update my CUDA drivers?

NVIDIA's website lists the available drivers. Choose your GPU model and operating system there, then install the driver that suits your device.

Conclusion

This article covers the runtimeerror: cuda error: invalid device ordinal error in Python. It explains different reasons for the error and how you can resolve it.