CUDA error: an illegal memory access was encountered with reproduction

Oliver (Oliver) October 31, 2021, 2:00pm #1

I have been getting this error. It suggested I could report an issue if the following code gets the same error, which it does:

import torch
torch.backends.cuda.matmul.allow_tf32 = True
torch.backends.cudnn.benchmark = False
torch.backends.cudnn.deterministic = False
torch.backends.cudnn.allow_tf32 = True
data = torch.randn([25, 128, 63, 63], dtype=torch.float, device='cuda', requires_grad=True)
net = torch.nn.Conv2d(128, 128, kernel_size=[2, 2], padding=[1, 1], stride=[1, 1], dilation=[1, 1], groups=1)
net = net.cuda().float()
out = net(data)
out.backward(torch.randn_like(out))
torch.cuda.synchronize()

What does this mean? How can I fix this? Thanks!!

tom (Thomas V) October 31, 2021, 2:35pm #2

What PyTorch and CUDA versions and what hardware are you using? On my self-compiled recent git checkout (1.11.0a0+git0a07488) with CUDA 11.3 I don't get any error on an RTX 3090.

Best regards
Thomas

Oliver (Oliver) October 31, 2021, 4:21pm #3

Hi Thomas, I'm actually glad to hear that; it's probably an issue with my code then.

nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2020 NVIDIA Corporation
Built on Wed_Jul_22_19:09:09_PDT_2020
Cuda compilation tools, release 11.0, V11.0.221
Build cuda_11.0_bu.TC445_37.28845127_0

PyTorch version: 1.9.1

Would you know what this error actually means? Thanks!

Oliver (Oliver) October 31, 2021, 4:34pm #4

Hmm, I just tried and I get the same error on Colab (previously I was on Kaggle). Funky…

tom (Thomas V) October 31, 2021, 5:16pm #5

Try to use the latest PyTorch (1.10). The error indicates an out-of-bounds memory access, similar to a segfault on the CPU, i.e. something like an indexing error in the low-level code. Do you reproduce this with the exact code you posted?

Best regards
Thomas

Oliver (Oliver) October 31, 2021, 5:27pm #6

Yes. I think I at least now know what is causing this, though the error seems misleading. When I train a model (a GAN) with a batch size larger than 4, I get this error. Unless I hard reset (factory reset the runtime), anything that touches the GPU keeps producing this error, even after restarting the kernel and clearing all caches. It's not just the code above; just calling .to('cuda') is enough. It's funky, but I'm also only just learning PyTorch now. Maybe it's because I keep the image arrays in memory (about 700 MB total) instead of loading each image in batches?

tom (Thomas V) October 31, 2021, 5:38pm #7

These CUDA errors should be fatal for the Python process they happen in, but restarting the kernel should then make things work again. Perhaps the most typical way to trigger them in a working PyTorch setup is to pass classification targets to CrossEntropyLoss that exceed the number of classes implied by the size of the logit tensor. But something seems funny with your system, then. At this point, maybe @ptrblck knows more…

Best regards
Thomas

Oliver (Oliver) October 31, 2021, 6:00pm #8

It's weird, though at this point it's two separate systems, Kaggle and Colab. At least it works now when running with smaller batch sizes; in the next few days I may try to just load from disk.
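For reference, the CrossEntropyLoss failure mode Thomas mentions in post #7 can be reproduced with a few lines along these lines (a minimal sketch, not from the thread; the shapes and the out-of-range value are purely illustrative):

import torch
import torch.nn as nn

logits = torch.randn(8, 10, device='cuda')           # 8 samples, 10 classes
targets = torch.randint(0, 10, (8,), device='cuda')
targets[0] = 42                                       # class index outside the valid range [0, 10)
# The out-of-range target trips a device-side assert inside the loss kernel;
# depending on the version this surfaces as "device-side assert triggered" or
# "an illegal memory access was encountered", and afterwards the CUDA context
# stays corrupted until the Python process is restarted.
loss = nn.CrossEntropyLoss()(logits, targets)
torch.cuda.synchronize()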
Oliver (Oliver) October 31, 2021, 6:00pm #9

Either way, thanks for looking into it with me, it's very much appreciated!

ptrblck October 31, 2021, 8:15pm #10

Oliver: "When I train a model (a GAN) with a batch size larger than 4, I get this error."

Do you see the illegal memory access when running the original GAN training from a clean and working environment? If so, could you post an executable code snippet which reproduces the issue, and post the output of python -m torch.utils.collect_env, please?

Oliver: "It suggested I could report an issue if the following code gets the same error, which it does:"

Does this minimal conv/cuDNN code snippet also yield the illegal memory access in a clean environment, or only after you've already hit the previous error in the GAN training?

Oliver: "Unless I hard reset (factory reset the runtime)"

Could you explain how you are performing the "factory reset"?

tom: "These CUDA errors should be fatal for the Python process they happen in, but restarting the kernel should then make things work again."

Yes, that's correct, and if you are running into asserts (such as in the nn.CrossEntropyLoss use case) the CUDA context would be corrupted and restarting the Python kernel should work.

Oliver (Oliver) November 1, 2021, 5:31am #11

ptrblck: "python -m torch.utils.collect_env"

Hi, thanks!

To 1:

Collecting environment information...
PyTorch version: 1.9.0+cu111
Is debug build: False
CUDA used to build PyTorch: 11.1
ROCM used to build PyTorch: N/A

OS: Ubuntu 18.04.5 LTS (x86_64)
GCC version: (Ubuntu 7.5.0-3ubuntu1~18.04) 7.5.0
Clang version: 6.0.0-1ubuntu2 (tags/RELEASE_600/final)
CMake version: version 3.12.0
Libc version: glibc-2.26

Python version: 3.7 (64-bit runtime)
Python platform: Linux-5.4.104+-x86_64-with-Ubuntu-18.04-bionic
Is CUDA available: True
CUDA runtime version: 11.1.105
GPU models and configuration: GPU 0: Tesla P100-PCIE-16GB
Nvidia driver version: 460.32.03
cuDNN version: Probably one of the following:
/usr/lib/x86_64-linux-gnu/libcudnn.so.7.6.5
/usr/lib/x86_64-linux-gnu/libcudnn.so.8.0.5
/usr/lib/x86_64-linux-gnu/libcudnn_adv_infer.so.8.0.5
/usr/lib/x86_64-linux-gnu/libcudnn_adv_train.so.8.0.5
/usr/lib/x86_64-linux-gnu/libcudnn_cnn_infer.so.8.0.5
/usr/lib/x86_64-linux-gnu/libcudnn_cnn_train.so.8.0.5
/usr/lib/x86_64-linux-gnu/libcudnn_ops_infer.so.8.0.5
/usr/lib/x86_64-linux-gnu/libcudnn_ops_train.so.8.0.5
HIP runtime version: N/A
MIOpen runtime version: N/A

Versions of relevant libraries:
[pip3] numpy==1.19.5
[pip3] torch==1.9.0+cu111
[pip3] torchsummary==1.5.1
[pip3] torchtext==0.10.0
[pip3] torchvision==0.10.0+cu111
[conda] Could not collect

To 2 and 3: This is confusing to me; restarting the kernel, etc., should, I believe, result in a clean environment. Maybe that is not the case on Kaggle/Colab, or my knowledge is just too poor. However, after restarting, the error persists on both platforms. On Kaggle there is no fix other than completely shutting the session off and turning it on again (I'm guessing this switches the GPU too); restarting the kernel/runtime doesn't help, and from then on anything that calls the GPU results in this error. On Colab there is an option under Runtime called Factory Reset Runtime. It clears everything, and this works. So yes, I do encounter it in a clean environment, but only after I hit the previous error in a previous session, even after restarting. If I start with batch size <= 4, all is fine.
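One quick way to check whether a given kernel still has a usable CUDA context, rather than a sticky error left over from an earlier crash, is to run a trivial GPU operation before any training code (a small sketch along the lines of the .to('cuda') test Oliver describes above):

import torch

# In a clean environment this just prints a tensor and the device name.
# If a previous illegal memory access corrupted the CUDA context and the
# process was not truly restarted, even this trivial transfer raises the
# same RuntimeError.
x = torch.ones(1).to('cuda')
torch.cuda.synchronize()
print(x, torch.cuda.get_device_name(0))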
ptrblck November 1, 2021, 8:19am #12

Thanks for the follow-up. I'm not familiar with how Kaggle kernels work, but based on your description it doesn't seem to be sufficient to restart the kernel. In any case, once you've "completely shut off" the Kaggle runtime, what kind of error are you seeing when running the code the first time?

Oliver (Oliver) November 1, 2021, 8:41am #13

Thanks for helping! Sorry, I was unclear. If I completely shut off and restart, I likely get a new GPU assigned and there is no error, just like with Colab's factory reset. Could the issue be that I'm storing the whole array in memory? That's the only major thing that is different this time around; I repurposed old code of mine that worked fine but loaded images from disk. I'll try that next with this dataset too.

Edit/update: I get the same error when reading from disk. Weird. I think I have an issue in my code. The traceback points at:

allow_unreachable=True, accumulate_grad=True)  # allow_unreachable flag

Something similar in TensorFlow would mean the graph is disconnected. What I can't make out is how this relates to batch_size.

Edit/update 2: Actually I can go up to batch_size 6.

Oliver (Oliver) November 4, 2021, 2:08pm #14

Update 3: I still get the error. Also, batch_size 6 sometimes triggers it (it works for a few epochs, I think, and eventually fails, or maybe it really is just intermittent). This is so weird. To circle back to my initial suspicion: TensorFlow has a similar error message, and there it would mean that the graph is disconnected. Do you think this could be the case here? And how would batch size fit into that? This is where it happens:

<ipython-input-21-5f2cf50439c6> in train(save_model)
     34     gen_loss = get_gen_loss(gen, disc, mask, image, adv_criterion, recon_criterion, 1000)
     35     gen_loss.backward()
---> 36     gen_opt.step()
     37
     38     # Keep track of the average discriminator loss

/usr/local/lib/python3.7/dist-packages/torch/optim/optimizer.py in wrapper(*args, **kwargs)
     86         profile_name = "Optimizer.step#{}.step".format(obj.__class__.__name__)
     87         with torch.autograd.profiler.record_function(profile_name):
---> 88             return func(*args, **kwargs)
     89     return wrapper
     90

/usr/local/lib/python3.7/dist-packages/torch/autograd/grad_mode.py in decorate_context(*args, **kwargs)
     26     def decorate_context(*args, **kwargs):
     27         with self.__class__():
---> 28             return func(*args, **kwargs)
     29     return cast(F, decorate_context)
     30

/usr/local/lib/python3.7/dist-packages/torch/optim/adam.py in step(self, closure)
    116                 lr=group['lr'],
    117                 weight_decay=group['weight_decay'],
--> 118                 eps=group['eps'])
    119     return loss

/usr/local/lib/python3.7/dist-packages/torch/optim/_functional.py in adam(params, grads, exp_avgs, exp_avg_sqs, max_exp_avg_sqs, state_steps, amsgrad, beta1, beta2, lr, weight_decay, eps)
     85     # Decay the first and second moment running average coefficient
     86     exp_avg.mul_(beta1).add_(grad, alpha=1 - beta1)
---> 87     exp_avg_sq.mul_(beta2).addcmul_(grad, grad, value=1 - beta2)
     88     if amsgrad:
     89         # Maintains the maximum of all 2nd moment running avg. till now

Thanks a lot, guys. It works with batch_size 4, but not knowing what is causing this drives me nuts.

ptrblck November 5, 2021, 6:55am #15

Oliver: "Do you think this could be the case?"

No, I don't think so. Based on your description it rather sounds as if you are running out of memory and then getting misleading follow-up errors due to the sticky error. However, I'm still unsure which error message you get the first time after running your code in a clean and working environment.
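Because CUDA kernels launch asynchronously, a stack trace like the one above (pointing into Adam's addcmul_) may not show where the illegal access actually happened. Forcing synchronous launches usually gives a more accurate trace; a sketch (the variable must be set before CUDA is initialised, so setting it in the shell or at the very top of the notebook is safest):

import os
# Must be set before the first CUDA call (ideally before importing torch),
# otherwise it has no effect for this process.
os.environ['CUDA_LAUNCH_BLOCKING'] = '1'

import torch
# ... re-run the failing training step here; the error should now be
# reported at the kernel that actually performed the illegal access.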
Oliver (Oliver) November 5, 2021, 7:59am #16

Oh, I'm sorry, I seem to have misunderstood what you were asking about earlier. The first time I run it with a batch_size > 4, it gives me this error:

RuntimeError: CUDA error: an illegal memory access was encountered
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.

It stays the same for all GPU-related calls, even when restarting the kernel, unless I factory-reset the runtime in Colab.

ptrblck November 5, 2021, 8:35am #17

OK, thanks. Could you post an executable code snippet to reproduce the issue, please? In case you are using a custom dataset, please post the input shapes that would be needed to execute the training and run into the issue.

tanweer-mahdi (Shah Mahdi Hasan) March 29, 2022, 9:29pm #19

I am getting the same error within a Kaggle kernel. I was able to train the model successfully; however, during inference I receive this error. I have noticed that substantially lowering the number of epochs does not throw the error, but I cannot proceed with this approach because the model has far too high a bias with that few epochs. I used torch.cuda.empty_cache(), thinking that the error possibly originates from excessive memory usage. This is the result of !nvidia-smi after that:

Tue Mar 29 11:33:01 2022
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 450.119.04   Driver Version: 450.119.04   CUDA Version: 11.0     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  Tesla P100-PCIE...  Off  | 00000000:00:04.0 Off |                    0 |
| N/A   40C    P0    37W / 250W |    997MiB / 16280MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
+-----------------------------------------------------------------------------+

Is there any solution to this error? I am participating in a competition but have not been able to submit any prediction in days due to this error. Many thanks.

ptrblck March 29, 2022, 10:57pm #20

An illegal memory access error won't be raised if you are running out of memory. One recommended approach is to update PyTorch to the latest release with the latest library stack and to check if this was a known and already fixed issue. Currently that would be the nightly binaries with the CUDA 11.5 runtime.
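If you want to confirm which stack a given Kaggle/Colab session is actually running before deciding whether an upgrade is needed, a few introspection calls are enough (a small sketch using standard PyTorch APIs):

import torch

# Report the versions that matter when checking whether a bug is already
# fixed in a newer PyTorch / CUDA / cuDNN stack.
print('PyTorch:', torch.__version__)
print('CUDA   :', torch.version.cuda)
print('cuDNN  :', torch.backends.cudnn.version())
print('GPU    :', torch.cuda.get_device_name(0) if torch.cuda.is_available() else 'none')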