CUDA runtime error (700): an illegal memory access was encountered
🐛 Bug
I perform some torch.chunk and torch.cat operations on tensors (Conv2D feature maps) during the training of my object detection network, and the forward pass constantly crashes there. The related code is:
```python
centernesses = [centerness.sigmoid() for centerness in centernesses]
centernesses = [centerness.permute(0, 2, 3, 1).reshape(-1) for centerness in centernesses]
centernesses = [torch.chunk(centerness, num_imgs, dim=0) for centerness in centernesses]
#print('centernesses (in forward_nms):')
#print(centernesses)
centernesses = [
    torch.cat(list(map(lambda x: x[i], centernesses)), dim=0)
    for i in range(num_imgs)
]
```
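For context, my understanding of the intended shapes is as follows (a minimal CPU sketch; `num_imgs`, `H`, and `W` below are illustrative values, not my real configuration):

```python
import torch

num_imgs, H, W = 2, 4, 6
level = torch.randn(num_imgs, 1, H, W)                  # one centerness map: [N, 1, H, W]
flat = level.sigmoid().permute(0, 2, 3, 1).reshape(-1)  # flattened to [N * H * W]
chunks = torch.chunk(flat, num_imgs, dim=0)             # num_imgs pieces, one per image
assert all(c.numel() == H * W for c in chunks)          # each piece holds one image's values
```

So the final `torch.cat` is meant to collect, for each image `i`, the `i`-th chunk from every FPN level into a single 1-D tensor.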
The error messages are:

```
Traceback (most recent call last):
  File "./tools/train.py", line 109, in <module>
    main()
  File "./tools/train.py", line 104, in main
    logger=logger)
  File "/project/jhli/rzh/mmdetection_old/mmdet/apis/train.py", line 58, in train_detector
    _dist_train(model, dataset, cfg, validate=validate)
  File "/project/jhli/rzh/mmdetection_old/mmdet/apis/train.py", line 186, in _dist_train
    runner.run(data_loaders, cfg.workflow, cfg.total_epochs)
  File "/home/jhli/project/.conda/envs/open-mmlab2/lib/python3.7/site-packages/mmcv/runner/runner.py", line 358, in run
    epoch_runner(data_loaders[i], **kwargs)
  File "/home/jhli/project/.conda/envs/open-mmlab2/lib/python3.7/site-packages/mmcv/runner/runner.py", line 264, in train
    self.model, data_batch, train_mode=True, **kwargs)
  File "/project/jhli/rzh/mmdetection_old/mmdet/apis/train.py", line 38, in batch_processor
    losses = model(**data)
  File "/home/jhli/project/.conda/envs/open-mmlab2/lib/python3.7/site-packages/torch/nn/modules/module.py", line 532, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/jhli/project/.conda/envs/open-mmlab2/lib/python3.7/site-packages/mmcv/parallel/distributed.py", line 50, in forward
    return self.module(*inputs[0], **kwargs[0])
  File "/home/jhli/project/.conda/envs/open-mmlab2/lib/python3.7/site-packages/torch/nn/modules/module.py", line 532, in __call__
    result = self.forward(*input, **kwargs)
  File "/project/jhli/rzh/mmdetection_old/mmdet/core/fp16/decorators.py", line 49, in new_func
    return old_func(*args, **kwargs)
  File "/project/jhli/rzh/mmdetection_old/mmdet/models/detectors/base.py", line 86, in forward
    return self.forward_train(img, img_meta, **kwargs)
  File "/project/jhli/rzh/mmdetection_old/mmdet/models/detectors/single_stage_nms.py", line 69, in forward_train
    self.test_cfg)
  File "/project/jhli/rzh/mmdetection_old/mmdet/models/anchor_heads/fcos_head_nms.py", line 280, in forward_nms
    print(centernesses)
  File "/home/jhli/project/.conda/envs/open-mmlab2/lib/python3.7/site-packages/torch/tensor.py", line 159, in __repr__
    return torch._tensor_str._str(self)
  File "/home/jhli/project/.conda/envs/open-mmlab2/lib/python3.7/site-packages/torch/_tensor_str.py", line 311, in _str
    tensor_str = _tensor_str(self, indent)
  File "/home/jhli/project/.conda/envs/open-mmlab2/lib/python3.7/site-packages/torch/_tensor_str.py", line 209, in _tensor_str
    formatter = _Formatter(get_summarized_data(self) if summarize else self)
  File "/home/jhli/project/.conda/envs/open-mmlab2/lib/python3.7/site-packages/torch/_tensor_str.py", line 235, in get_summarized_data
    return torch.cat((self[:PRINT_OPTS.edgeitems], self[-PRINT_OPTS.edgeitems:]))
RuntimeError: cuda runtime error (700) : an illegal memory access was encountered at /pytorch/aten/src/THC/THCCachingHostAllocator.cpp:278
terminate called after throwing an instance of 'c10::Error'
  what():  CUDA error: an illegal memory access was encountered (insert_events at /pytorch/c10/cuda/CUDACachingAllocator.cpp:764)
```

The traceback points to `print(centernesses)` when that line is uncommented; otherwise it points to the next line:
```python
centernesses = [
    torch.cat(list(map(lambda x: x[i], centernesses)), dim=0)
    for i in range(num_imgs)
]
```

To Reproduce
Steps to reproduce the behavior: unknown. I experimentally created some tensors on the GPU and ran similar operations, but could not trigger the error that way. However, training my object detection network constantly crashes here; a minimal sketch of such a standalone attempt is below.
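To illustrate what I mean by a standalone attempt (this is only a sketch; the sizes, `num_imgs`, and the `regroup` helper name are made up and not taken from my actual config), something like the following runs the same sequence of ops on random CUDA tensors:

```python
import torch

def regroup(centernesses, num_imgs):
    # Same sequence of ops as in the failing code path above.
    centernesses = [c.sigmoid() for c in centernesses]
    centernesses = [c.permute(0, 2, 3, 1).reshape(-1) for c in centernesses]
    centernesses = [torch.chunk(c, num_imgs, dim=0) for c in centernesses]
    return [torch.cat([level[i] for level in centernesses], dim=0)
            for i in range(num_imgs)]

if __name__ == "__main__":
    num_imgs = 2
    # One random "centerness" map per FPN level: [N, 1, H, W] (sizes are illustrative)
    sizes = [(num_imgs, 1, 100, 152), (num_imgs, 1, 50, 76), (num_imgs, 1, 25, 38)]
    maps = [torch.randn(s, device="cuda") for s in sizes]
    out = regroup(maps, num_imgs)
    torch.cuda.synchronize()  # force any pending asynchronous CUDA error to surface here
    print([o.shape for o in out])
```

As far as I understand, CUDA errors are reported asynchronously, so the line shown in the traceback (the `print`, or the `cat`) may only be the first point that synchronizes with the GPU rather than the operation that actually faulted; re-running with `CUDA_LAUNCH_BLOCKING=1` should make the traceback point closer to the real source.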
There is also a more detailed call stack record, which seems to me to indicate that the error is related to some garbage-collection / resource-release mechanism:
```
frame #0: c10::Error::Error(c10::SourceLocation, std::string const&) + 0x33 (0x2b584b18f193 in /home/jhli/project/.conda/envs/open-mmlab2/lib/python3.7/site-packages/torch/lib/libc10.so)
frame #1: <unknown function> + 0x17f66 (0x2b584af4af66 in /home/jhli/project/.conda/envs/open-mmlab2/lib/python3.7/site-packages/torch/lib/libc10_cuda.so)
frame #2: <unknown function> + 0x19cbd (0x2b584af4ccbd in /home/jhli/project/.conda/envs/open-mmlab2/lib/python3.7/site-packages/torch/lib/libc10_cuda.so)
frame #3: c10::TensorImpl::release_resources() + 0x4d (0x2b584b17f63d in /home/jhli/project/.conda/envs/open-mmlab2/lib/python3.7/site-packages/torch/lib/libc10.so)
frame #4: <unknown function> + 0x67bac2 (0x2b57ffe69ac2 in /home/jhli/project/.conda/envs/open-mmlab2/lib/python3.7/site-packages/torch/lib/libtorch_python.so)
frame #5: <unknown function> + 0x67bb66 (0x2b57ffe69b66 in /home/jhli/project/.conda/envs/open-mmlab2/lib/python3.7/site-packages/torch/lib/libtorch_python.so)
frame #6: <unknown function> + 0x19dfce (0x561487cb5fce in /home/jhli/project/.conda/envs/open-mmlab2/bin/python)
frame #7: <unknown function> + 0x113a6b (0x561487c2ba6b in /home/jhli/project/.conda/envs/open-mmlab2/bin/python)
frame #8: <unknown function> + 0x113bc7 (0x561487c2bbc7 in /home/jhli/project/.conda/envs/open-mmlab2/bin/python)
frame #9: <unknown function> + 0x103948 (0x561487c1b948 in /home/jhli/project/.conda/envs/open-mmlab2/bin/python)
frame #10: <unknown function> + 0x114267 (0x561487c2c267 in /home/jhli/project/.conda/envs/open-mmlab2/bin/python)
frame #11: <unknown function> + 0x11427d (0x561487c2c27d in /home/jhli/project/.conda/envs/open-mmlab2/bin/python)
frame #12: <unknown function> + 0x11427d (0x561487c2c27d in /home/jhli/project/.conda/envs/open-mmlab2/bin/python)
frame #13: <unknown function> + 0x11427d (0x561487c2c27d in /home/jhli/project/.conda/envs/open-mmlab2/bin/python)
frame #14: <unknown function> + 0x11427d (0x561487c2c27d in /home/jhli/project/.conda/envs/open-mmlab2/bin/python)
frame #15: <unknown function> + 0x11427d (0x561487c2c27d in /home/jhli/project/.conda/envs/open-mmlab2/bin/python)
frame #16: <unknown function> + 0x11427d (0x561487c2c27d in /home/jhli/project/.conda/envs/open-mmlab2/bin/python)
frame #17: <unknown function> + 0x11427d (0x561487c2c27d in /home/jhli/project/.conda/envs/open-mmlab2/bin/python)
frame #18: <unknown function> + 0x11427d (0x561487c2c27d in /home/jhli/project/.conda/envs/open-mmlab2/bin/python)
frame #19: PyDict_SetItem + 0x502 (0x561487c77602 in /home/jhli/project/.conda/envs/open-mmlab2/bin/python)
frame #20: PyDict_SetItemString + 0x4f (0x561487c780cf in /home/jhli/project/.conda/envs/open-mmlab2/bin/python)
frame #21: PyImport_Cleanup + 0x9e (0x561487cb791e in /home/jhli/project/.conda/envs/open-mmlab2/bin/python)
frame #22: Py_FinalizeEx + 0x67 (0x561487d2d367 in /home/jhli/project/.conda/envs/open-mmlab2/bin/python)
frame #23: <unknown function> + 0x227d93 (0x561487d3fd93 in /home/jhli/project/.conda/envs/open-mmlab2/bin/python)
frame #24: _Py_UnixMain + 0x3c (0x561487d400bc in /home/jhli/project/.conda/envs/open-mmlab2/bin/python)
frame #25: __libc_start_main + 0xf5 (0x2b57ea8eb505 in /lib64/libc.so.6)
frame #26: <unknown function> + 0x1d0990 (0x561487ce8990 in /home/jhli/project/.conda/envs/open-mmlab2/bin/python)
```

Expected behavior
No such error should occur.
Environment
```
PyTorch version: 1.4.0
Is debug build: No
CUDA used to build PyTorch: 10.1

OS: CentOS Linux 7 (Core)
GCC version: (GCC) 4.8.5 20150623 (Red Hat 4.8.5-39)
CMake version: Could not collect

Python version: 3.7
Is CUDA available: Yes
CUDA runtime version: 10.1.243
GPU models and configuration:
GPU 0: Tesla V100-SXM2-16GB
GPU 1: Tesla V100-SXM2-16GB
GPU 2: Tesla V100-SXM2-16GB
GPU 3: Tesla V100-SXM2-16GB

Nvidia driver version: 440.64.00
cuDNN version: Could not collect

Versions of relevant libraries:
[pip] numpy==1.17.0
[pip] torch==1.4.0
[pip] torchvision==0.4.0a0
[conda] _pytorch_select    0.2      gpu_0                    defaults
[conda] blas               1.0      mkl                      defaults
[conda] mkl                2019.4   243                      defaults
[conda] mkl-service        2.3.0    py37he904b0f_0           defaults
[conda] mkl_fft            1.0.15   py37ha843d7b_0           defaults
[conda] mkl_random         1.1.0    py37hd6b4f25_0           defaults
[conda] pytorch            1.2.0    cuda100py37h938c94c_0    defaults
[conda] torch              1.4.0    pypi_0                   pypi
[conda] torchvision        0.4.0    cuda100py37hecfc37a_0    defaults
```
Additional context
None.
cc @ngimel