CUDA error: an illegal memory access was encountered with reproduction

Oliver (Oliver) October 31, 2021, 2:00pm #1

I have been getting this error. It suggested I could report an issue if the following code gets the same error, which it does:

import torch
torch.backends.cuda.matmul.allow_tf32 = True
torch.backends.cudnn.benchmark = False
torch.backends.cudnn.deterministic = False
torch.backends.cudnn.allow_tf32 = True
data = torch.randn([25, 128, 63, 63], dtype=torch.float, device='cuda', requires_grad=True)
net = torch.nn.Conv2d(128, 128, kernel_size=[2, 2], padding=[1, 1], stride=[1, 1], dilation=[1, 1], groups=1)
net = net.cuda().float()
out = net(data)
out.backward(torch.randn_like(out))
torch.cuda.synchronize()

What does this mean? How can I fix this? Thanks!!

tom (Thomas V) October 31, 2021, 2:35pm #2

What PyTorch and CUDA versions and what hardware are you using? On my self-compiled recent git checkout (1.11.0a0+git0a07488) with CUDA 11.3 I don't get any error on an RTX 3090.

Best regards
Thomas

Oliver (Oliver) October 31, 2021, 4:21pm #3

Hi Thomas, I'm actually glad to hear that; it's probably an issue with my code then.

nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2020 NVIDIA Corporation
Built on Wed_Jul_22_19:09:09_PDT_2020
Cuda compilation tools, release 11.0, V11.0.221
Build cuda_11.0_bu.TC445_37.28845127_0

PyTorch version: 1.9.1

Would you know what this error actually means? Thanks!

Oliver (Oliver) October 31, 2021, 4:34pm #4

Hmm, I just tried and I get the same error on Colab (previously I was on Kaggle). Funky…

tom (Thomas V) October 31, 2021, 5:16pm #5

Try to use the latest PyTorch (1.10). The error indicates an out-of-bounds memory access, similar to a segfault on the CPU, i.e. something like an indexing error in the low-level code. Do you reproduce this with the exact code you posted?

Best regards
Thomas

Oliver (Oliver) October 31, 2021, 5:27pm #6

Yes. I think I at least now know what is causing this, though the error seems misleading. When I train a model (a GAN) with a batch size larger than 4, I get this error. Unless I hard reset (factory reset the runtime), anything that touches the GPU keeps producing this error, even after restarting the kernel and clearing all caches. It's not just the code above; just calling .to('cuda') is enough. It's funky, but I'm also only just learning PyTorch now. Maybe it's because I keep the image arrays in memory (about 700 MB total) instead of loading each image in batches?

tom (Thomas V) October 31, 2021, 5:38pm #7

These CUDA errors should be fatal for the Python process they happen in, but restarting the kernel should then make things work again. Perhaps the most typical way to trigger them in a working PyTorch setup is to pass classification targets to CrossEntropyLoss that exceed the number of classes implied by the size of the logit tensor. But something seems funny with your system, then. At this point, maybe @ptrblck knows more…

Best regards
Thomas

Oliver (Oliver) October 31, 2021, 6:00pm #8

It's weird, though at this point it's two separate systems, Kaggle and Colab. At least it works now when running with smaller batch sizes; in the next few days I may try to just load from disk.
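For reference, the CrossEntropyLoss failure mode Thomas mentions in post #7 can be reproduced with a few lines along these lines (a minimal sketch, not from the thread; the shapes and the out-of-range value are purely illustrative):

import torch
import torch.nn as nn

logits = torch.randn(8, 10, device='cuda')           # 8 samples, 10 classes
targets = torch.randint(0, 10, (8,), device='cuda')
targets[0] = 42                                       # class index outside the valid range [0, 10)
# The out-of-range target trips a device-side assert inside the loss kernel;
# depending on the version this surfaces as "device-side assert triggered" or
# "an illegal memory access was encountered", and afterwards the CUDA context
# stays corrupted until the Python process is restarted.
loss = nn.CrossEntropyLoss()(logits, targets)
torch.cuda.synchronize()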
Oliver (Oliver) October 31, 2021, 6:00pm #9

Either way, thanks for looking into it with me, it's very much appreciated!

ptrblck October 31, 2021, 8:15pm #10

Oliver: "When I train a model (a GAN) with a batch size larger than 4, I get this error."

Do you see the illegal memory access when running the original GAN training from a clean and working environment? If so, could you post an executable code snippet which reproduces the issue, and post the output of python -m torch.utils.collect_env, please?

Oliver: "It suggested I could report an issue if the following code gets the same error, which it does:"

Does this minimal conv/cuDNN code snippet also yield the illegal memory access in a clean environment, or only after you've already hit the previous error in the GAN training?

Oliver: "Unless I hard reset (factory reset the runtime)"

Could you explain how you are performing the "factory reset"?

tom: "These CUDA errors should be fatal for the Python process they happen in, but restarting the kernel should then make things work again."

Yes, that's correct, and if you are running into asserts (such as in the nn.CrossEntropyLoss use case) the CUDA context would be corrupted and restarting the Python kernel should work.

Oliver (Oliver) November 1, 2021, 5:31am #11

ptrblck: "python -m torch.utils.collect_env"

Hi, thanks!

To 1:

Collecting environment information...
PyTorch version: 1.9.0+cu111
Is debug build: False
CUDA used to build PyTorch: 11.1
ROCM used to build PyTorch: N/A

OS: Ubuntu 18.04.5 LTS (x86_64)
GCC version: (Ubuntu 7.5.0-3ubuntu1~18.04) 7.5.0
Clang version: 6.0.0-1ubuntu2 (tags/RELEASE_600/final)
CMake version: version 3.12.0
Libc version: glibc-2.26

Python version: 3.7 (64-bit runtime)
Python platform: Linux-5.4.104+-x86_64-with-Ubuntu-18.04-bionic
Is CUDA available: True
CUDA runtime version: 11.1.105
GPU models and configuration: GPU 0: Tesla P100-PCIE-16GB
Nvidia driver version: 460.32.03
cuDNN version: Probably one of the following:
/usr/lib/x86_64-linux-gnu/libcudnn.so.7.6.5
/usr/lib/x86_64-linux-gnu/libcudnn.so.8.0.5
/usr/lib/x86_64-linux-gnu/libcudnn_adv_infer.so.8.0.5
/usr/lib/x86_64-linux-gnu/libcudnn_adv_train.so.8.0.5
/usr/lib/x86_64-linux-gnu/libcudnn_cnn_infer.so.8.0.5
/usr/lib/x86_64-linux-gnu/libcudnn_cnn_train.so.8.0.5
/usr/lib/x86_64-linux-gnu/libcudnn_ops_infer.so.8.0.5
/usr/lib/x86_64-linux-gnu/libcudnn_ops_train.so.8.0.5
HIP runtime version: N/A
MIOpen runtime version: N/A

Versions of relevant libraries:
[pip3] numpy==1.19.5
[pip3] torch==1.9.0+cu111
[pip3] torchsummary==1.5.1
[pip3] torchtext==0.10.0
[pip3] torchvision==0.10.0+cu111
[conda] Could not collect

To 2 and 3: This is confusing to me; restarting the kernel, etc., should, I believe, result in a clean environment. Maybe that is not the case on Kaggle/Colab, or my knowledge is just too poor. However, after restarting, the error persists on both platforms. On Kaggle there is no fix other than completely shutting the session off and turning it on again (I'm guessing this switches the GPU too); restarting the kernel/runtime doesn't help, and from then on anything that calls the GPU results in this error. On Colab there is an option under Runtime called Factory Reset Runtime. It clears everything, and this works. So yes, I do encounter it in a clean environment, but only after I hit the previous error in a previous session, even after restarting. If I start with batch size <= 4, all is fine.
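One quick way to check whether a given kernel still has a usable CUDA context, rather than a sticky error left over from an earlier crash, is to run a trivial GPU operation before any training code (a small sketch along the lines of the .to('cuda') test Oliver describes above):

import torch

# In a clean environment this just prints a tensor and the device name.
# If a previous illegal memory access corrupted the CUDA context and the
# process was not truly restarted, even this trivial transfer raises the
# same RuntimeError.
x = torch.ones(1).to('cuda')
torch.cuda.synchronize()
print(x, torch.cuda.get_device_name(0))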
ptrblck November 1, 2021, 8:19am #12

Thanks for the follow-up. I'm not familiar with how Kaggle kernels work, but based on your description it doesn't seem to be sufficient to restart the kernel. In any case, once you've "completely shut off" the Kaggle runtime, what kind of error are you seeing when running the code the first time?

Oliver (Oliver) November 1, 2021, 8:41am #13

Thanks for helping! Sorry, I was unclear. If I completely shut off and restart, I likely get a new GPU assigned and there is no error, just like with Colab's factory reset. Could the issue be that I'm storing the whole array in memory? That's the only major thing that is different this time around; I repurposed old code of mine that worked fine but loaded images from disk. I'll try that next with this dataset too.

Edit/update: I get the same error when reading from disk. Weird. I think I have an issue in my code. The traceback points at:

allow_unreachable=True, accumulate_grad=True)  # allow_unreachable flag

Something similar in TensorFlow would mean the graph is disconnected. What I can't make out is how this relates to batch_size.

Edit/update 2: Actually I can go up to batch_size 6.

Oliver (Oliver) November 4, 2021, 2:08pm #14

Update 3: I still get the error. Also, batch_size 6 sometimes triggers it (it works for a few epochs, I think, and eventually fails, or maybe it really is just intermittent). This is so weird. To circle back to my initial suspicion: TensorFlow has a similar error message, and there it would mean that the graph is disconnected. Do you think this could be the case here? And how would batch size fit into that? This is where it happens:

<ipython-input-21-5f2cf50439c6> in train(save_model)
     34     gen_loss = get_gen_loss(gen, disc, mask, image, adv_criterion, recon_criterion, 1000)
     35     gen_loss.backward()
---> 36     gen_opt.step()
     37
     38     # Keep track of the average discriminator loss

/usr/local/lib/python3.7/dist-packages/torch/optim/optimizer.py in wrapper(*args, **kwargs)
     86         profile_name = "Optimizer.step#{}.step".format(obj.__class__.__name__)
     87         with torch.autograd.profiler.record_function(profile_name):
---> 88             return func(*args, **kwargs)
     89     return wrapper
     90

/usr/local/lib/python3.7/dist-packages/torch/autograd/grad_mode.py in decorate_context(*args, **kwargs)
     26     def decorate_context(*args, **kwargs):
     27         with self.__class__():
---> 28             return func(*args, **kwargs)
     29     return cast(F, decorate_context)
     30

/usr/local/lib/python3.7/dist-packages/torch/optim/adam.py in step(self, closure)
    116                 lr=group['lr'],
    117                 weight_decay=group['weight_decay'],
--> 118                 eps=group['eps'])
    119     return loss

/usr/local/lib/python3.7/dist-packages/torch/optim/_functional.py in adam(params, grads, exp_avgs, exp_avg_sqs, max_exp_avg_sqs, state_steps, amsgrad, beta1, beta2, lr, weight_decay, eps)
     85     # Decay the first and second moment running average coefficient
     86     exp_avg.mul_(beta1).add_(grad, alpha=1 - beta1)
---> 87     exp_avg_sq.mul_(beta2).addcmul_(grad, grad, value=1 - beta2)
     88     if amsgrad:
     89         # Maintains the maximum of all 2nd moment running avg. till now

Thanks a lot, guys. It works with batch_size 4, but not knowing what is causing this drives me nuts.

ptrblck November 5, 2021, 6:55am #15

Oliver: "Do you think this could be the case?"

No, I don't think so. Based on your description it rather sounds as if you are running out of memory and then getting misleading follow-up errors due to the sticky error. However, I'm still unsure which error message you get the first time after running your code in a clean and working environment.
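Because CUDA kernels launch asynchronously, a stack trace like the one above (pointing into Adam's addcmul_) may not show where the illegal access actually happened. Forcing synchronous launches usually gives a more accurate trace; a sketch (the variable must be set before CUDA is initialised, so setting it in the shell or at the very top of the notebook is safest):

import os
# Must be set before the first CUDA call (ideally before importing torch),
# otherwise it has no effect for this process.
os.environ['CUDA_LAUNCH_BLOCKING'] = '1'

import torch
# ... re-run the failing training step here; the error should now be
# reported at the kernel that actually performed the illegal access.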
Oliver (Oliver) November 5, 2021, 7:59am #16

Oh, I'm sorry, I seem to have misunderstood what you were asking about earlier. The first time I run it with a batch_size > 4, it gives me this error:

RuntimeError: CUDA error: an illegal memory access was encountered
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.

It stays the same for all GPU-related calls, even when restarting the kernel, unless I factory-reset the runtime in Colab.

ptrblck November 5, 2021, 8:35am #17

OK, thanks. Could you post an executable code snippet to reproduce the issue, please? In case you are using a custom dataset, please post the input shapes that would be needed to execute the training and run into the issue.

tanweer-mahdi (Shah Mahdi Hasan) March 29, 2022, 9:29pm #19

I am getting the same error within a Kaggle kernel. I was able to train the model successfully; however, during inference I receive this error. I have noticed that substantially lowering the number of epochs does not throw the error, but I cannot proceed with this approach because the model has far too high a bias with that few epochs. I used torch.cuda.empty_cache(), thinking that the error possibly originates from excessive memory usage. This is the result of !nvidia-smi after that:

Tue Mar 29 11:33:01 2022
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 450.119.04   Driver Version: 450.119.04   CUDA Version: 11.0     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  Tesla P100-PCIE...  Off  | 00000000:00:04.0 Off |                    0 |
| N/A   40C    P0    37W / 250W |    997MiB / 16280MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
+-----------------------------------------------------------------------------+

Is there any solution to this error? I am participating in a competition but have not been able to submit any prediction in days due to this error. Many thanks.

ptrblck March 29, 2022, 10:57pm #20

An illegal memory access error won't be raised if you are running out of memory. One recommended approach is to update PyTorch to the latest release with the latest library stack and to check if this was a known and already fixed issue. Currently that would be the nightly binaries with the CUDA 11.5 runtime.
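If you want to confirm which stack a given Kaggle/Colab session is actually running before deciding whether an upgrade is needed, a few introspection calls are enough (a small sketch using standard PyTorch APIs):

import torch

# Report the versions that matter when checking whether a bug is already
# fixed in a newer PyTorch / CUDA / cuDNN stack.
print('PyTorch:', torch.__version__)
print('CUDA   :', torch.version.cuda)
print('cuDNN  :', torch.backends.cudnn.version())
print('GPU    :', torch.cuda.get_device_name(0) if torch.cuda.is_available() else 'none')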