CUDA runtime error (700): an illegal memory access was encountered

Issue #42077 · pytorch/pytorch · Closed · Labels: module: cuda, needs reproduction, triaged

Description

@HawkRong opened this issue on Jul 26, 2020

🐛 Bug

I perform some torch.chunk and torch.cat operations on tensors (Conv2D feature maps) during the training of my object detection network, and the forward pass consistently crashes there. The relevant code is:

centernesses = [centerness.sigmoid() for centerness in centernesses]
centernesses = [centerness.permute(0, 2, 3, 1).reshape(-1) for centerness in centernesses]
centernesses = [torch.chunk(centerness, num_imgs, dim = 0) for centerness in centernesses]
#print('centernesses (in forward_nms):')
#print(centernesses)
centernesses = [
    torch.cat(list(map(lambda x: x[i], centernesses)), dim = 0)
    for i in range(num_imgs)]

The error message is:

Traceback (most recent call last):
  File "./tools/train.py", line 109, in <module>
    main()
  File "./tools/train.py", line 104, in main
    logger=logger)
  File "/project/jhli/rzh/mmdetection_old/mmdet/apis/train.py", line 58, in train_detector
    _dist_train(model, dataset, cfg, validate=validate)
  File "/project/jhli/rzh/mmdetection_old/mmdet/apis/train.py", line 186, in _dist_train
    runner.run(data_loaders, cfg.workflow, cfg.total_epochs)
  File "/home/jhli/project/.conda/envs/open-mmlab2/lib/python3.7/site-packages/mmcv/runner/runner.py", line 358, in run
    epoch_runner(data_loaders[i], **kwargs)
  File "/home/jhli/project/.conda/envs/open-mmlab2/lib/python3.7/site-packages/mmcv/runner/runner.py", line 264, in train
    self.model, data_batch, train_mode=True, **kwargs)
  File "/project/jhli/rzh/mmdetection_old/mmdet/apis/train.py", line 38, in batch_processor
    losses = model(**data)
  File "/home/jhli/project/.conda/envs/open-mmlab2/lib/python3.7/site-packages/torch/nn/modules/module.py", line 532, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/jhli/project/.conda/envs/open-mmlab2/lib/python3.7/site-packages/mmcv/parallel/distributed.py", line 50, in forward
    return self.module(*inputs[0], **kwargs[0])
  File "/home/jhli/project/.conda/envs/open-mmlab2/lib/python3.7/site-packages/torch/nn/modules/module.py", line 532, in __call__
    result = self.forward(*input, **kwargs)
  File "/project/jhli/rzh/mmdetection_old/mmdet/core/fp16/decorators.py", line 49, in new_func
    return old_func(*args, **kwargs)
  File "/project/jhli/rzh/mmdetection_old/mmdet/models/detectors/base.py", line 86, in forward
    return self.forward_train(img, img_meta, **kwargs)
  File "/project/jhli/rzh/mmdetection_old/mmdet/models/detectors/single_stage_nms.py", line 69, in forward_train
    self.test_cfg)
  File "/project/jhli/rzh/mmdetection_old/mmdet/models/anchor_heads/fcos_head_nms.py", line 280, in forward_nms
    print(centernesses)
  File "/home/jhli/project/.conda/envs/open-mmlab2/lib/python3.7/site-packages/torch/tensor.py", line 159, in __repr__
    return torch._tensor_str._str(self)
  File "/home/jhli/project/.conda/envs/open-mmlab2/lib/python3.7/site-packages/torch/_tensor_str.py", line 311, in _str
    tensor_str = _tensor_str(self, indent)
  File "/home/jhli/project/.conda/envs/open-mmlab2/lib/python3.7/site-packages/torch/_tensor_str.py", line 209, in _tensor_str
    formatter = _Formatter(get_summarized_data(self) if summarize else self)
  File "/home/jhli/project/.conda/envs/open-mmlab2/lib/python3.7/site-packages/torch/_tensor_str.py", line 235, in get_summarized_data
    return torch.cat((self[:PRINT_OPTS.edgeitems], self[-PRINT_OPTS.edgeitems:]))
RuntimeError: cuda runtime error (700) : an illegal memory access was encountered at /pytorch/aten/src/THC/THCCachingHostAllocator.cpp:278
terminate called after throwing an instance of 'c10::Error'
  what():  CUDA error: an illegal memory access was encountered (insert_events at /pytorch/c10/cuda/CUDACachingAllocator.cpp:764)

The traceback points to print(centernesses) when that statement is uncommented; otherwise it points to the next line:

centernesses = [ torch.cat(list(map(lambda x: x[i], centernesses)), dim = 0) for i in range(num_imgs)]
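
Since CUDA kernels are launched asynchronously, the Python line blamed in the traceback (the print or the cat) may only be where the error is first noticed, not where the illegal access actually happens. Below is a minimal sketch of how one might force synchronous launches to get a more precise trace; where exactly this fits in the training entry point is an assumption on my part, and it is a debugging aid rather than part of the original code:

import os

# Debugging aid: force synchronous kernel launches so a failing kernel is
# reported at its actual call site. Must take effect before CUDA is
# initialized; slows training, so for debugging runs only.
os.environ['CUDA_LAUNCH_BLOCKING'] = '1'

import torch  # imported only after the environment variable is set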

To Reproduce

Steps to reproduce the behavior: unknown. I tried creating some tensors on the GPU and running the same operations in isolation, but could not trigger the error that way (see the sketch below). However, my object detection training crashes here consistently.
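
For reference, this is roughly the kind of standalone test I tried; num_imgs and the feature-map shapes are made up for illustration and do not come from my actual detector config:

import torch

num_imgs = 4
# Hypothetical stand-ins for the per-level centerness feature maps.
centernesses = [torch.randn(num_imgs, 1, h, w, device='cuda')
                for h, w in [(100, 152), (50, 76), (25, 38)]]

# Same sequence of operations as in the detector head.
centernesses = [c.sigmoid() for c in centernesses]
centernesses = [c.permute(0, 2, 3, 1).reshape(-1) for c in centernesses]
centernesses = [torch.chunk(c, num_imgs, dim=0) for c in centernesses]
# Regroup per image: concatenate the i-th chunk of every level.
centernesses = [torch.cat([level[i] for level in centernesses], dim=0)
                for i in range(num_imgs)]
print([c.shape for c in centernesses])  # completes without error in isolation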

There is also a detailed native call stack, which to me suggests the error is raised while tensor resources are being released (e.g. during garbage collection or interpreter shutdown); a debugging sketch follows the stack:

frame #0: c10::Error::Error(c10::SourceLocation, std::string const&) + 0x33 (0x2b584b18f193 in /home/jhli/project/.conda/envs/open-mmlab2/lib/python3.7/site-packages/torch/lib/libc10.so)
frame #1: <unknown function> + 0x17f66 (0x2b584af4af66 in /home/jhli/project/.conda/envs/open-mmlab2/lib/python3.7/site-packages/torch/lib/libc10_cuda.so)
frame #2: <unknown function> + 0x19cbd (0x2b584af4ccbd in /home/jhli/project/.conda/envs/open-mmlab2/lib/python3.7/site-packages/torch/lib/libc10_cuda.so)
frame #3: c10::TensorImpl::release_resources() + 0x4d (0x2b584b17f63d in /home/jhli/project/.conda/envs/open-mmlab2/lib/python3.7/site-packages/torch/lib/libc10.so)
frame #4: <unknown function> + 0x67bac2 (0x2b57ffe69ac2 in /home/jhli/project/.conda/envs/open-mmlab2/lib/python3.7/site-packages/torch/lib/libtorch_python.so)
frame #5: <unknown function> + 0x67bb66 (0x2b57ffe69b66 in /home/jhli/project/.conda/envs/open-mmlab2/lib/python3.7/site-packages/torch/lib/libtorch_python.so)
frame #6: <unknown function> + 0x19dfce (0x561487cb5fce in /home/jhli/project/.conda/envs/open-mmlab2/bin/python)
frame #7: <unknown function> + 0x113a6b (0x561487c2ba6b in /home/jhli/project/.conda/envs/open-mmlab2/bin/python)
frame #8: <unknown function> + 0x113bc7 (0x561487c2bbc7 in /home/jhli/project/.conda/envs/open-mmlab2/bin/python)
frame #9: <unknown function> + 0x103948 (0x561487c1b948 in /home/jhli/project/.conda/envs/open-mmlab2/bin/python)
frame #10: <unknown function> + 0x114267 (0x561487c2c267 in /home/jhli/project/.conda/envs/open-mmlab2/bin/python)
frame #11: <unknown function> + 0x11427d (0x561487c2c27d in /home/jhli/project/.conda/envs/open-mmlab2/bin/python)
frame #12: <unknown function> + 0x11427d (0x561487c2c27d in /home/jhli/project/.conda/envs/open-mmlab2/bin/python)
frame #13: <unknown function> + 0x11427d (0x561487c2c27d in /home/jhli/project/.conda/envs/open-mmlab2/bin/python)
frame #14: <unknown function> + 0x11427d (0x561487c2c27d in /home/jhli/project/.conda/envs/open-mmlab2/bin/python)
frame #15: <unknown function> + 0x11427d (0x561487c2c27d in /home/jhli/project/.conda/envs/open-mmlab2/bin/python)
frame #16: <unknown function> + 0x11427d (0x561487c2c27d in /home/jhli/project/.conda/envs/open-mmlab2/bin/python)
frame #17: <unknown function> + 0x11427d (0x561487c2c27d in /home/jhli/project/.conda/envs/open-mmlab2/bin/python)
frame #18: <unknown function> + 0x11427d (0x561487c2c27d in /home/jhli/project/.conda/envs/open-mmlab2/bin/python)
frame #19: PyDict_SetItem + 0x502 (0x561487c77602 in /home/jhli/project/.conda/envs/open-mmlab2/bin/python)
frame #20: PyDict_SetItemString + 0x4f (0x561487c780cf in /home/jhli/project/.conda/envs/open-mmlab2/bin/python)
frame #21: PyImport_Cleanup + 0x9e (0x561487cb791e in /home/jhli/project/.conda/envs/open-mmlab2/bin/python)
frame #22: Py_FinalizeEx + 0x67 (0x561487d2d367 in /home/jhli/project/.conda/envs/open-mmlab2/bin/python)
frame #23: <unknown function> + 0x227d93 (0x561487d3fd93 in /home/jhli/project/.conda/envs/open-mmlab2/bin/python)
frame #24: _Py_UnixMain + 0x3c (0x561487d400bc in /home/jhli/project/.conda/envs/open-mmlab2/bin/python)
frame #25: __libc_start_main + 0xf5 (0x2b57ea8eb505 in /lib64/libc.so.6)
frame #26: <unknown function> + 0x1d0990 (0x561487ce8990 in /home/jhli/project/.conda/envs/open-mmlab2/bin/python)
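
Since the stack shows the error being raised from c10::TensorImpl::release_resources() during Py_FinalizeEx, my reading is that an earlier kernel already performed the illegal access and the failure only surfaces when the cached allocator later frees those tensors. One way to confirm that, which I have not yet applied to the original code, would be to synchronize right after the suspect operations so any pending error is reported there:

import torch

# Hypothetical placement, directly after the chunk/cat block in forward_nms:
# surface any pending asynchronous CUDA error at this point rather than at a
# later, unrelated one (tensor printing, garbage collection, or shutdown).
torch.cuda.synchronize()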

Expected behavior

No such error occurs.

Environment

PyTorch version: 1.4.0
Is debug build: No
CUDA used to build PyTorch: 10.1

OS: CentOS Linux 7 (Core)
GCC version: (GCC) 4.8.5 20150623 (Red Hat 4.8.5-39)
CMake version: Could not collect

Python version: 3.7
Is CUDA available: Yes
CUDA runtime version: 10.1.243
GPU models and configuration:
GPU 0: Tesla V100-SXM2-16GB
GPU 1: Tesla V100-SXM2-16GB
GPU 2: Tesla V100-SXM2-16GB
GPU 3: Tesla V100-SXM2-16GB

Nvidia driver version: 440.64.00
cuDNN version: Could not collect

Versions of relevant libraries:
[pip] numpy==1.17.0
[pip] torch==1.4.0
[pip] torchvision==0.4.0a0
[conda] _pytorch_select 0.2 gpu_0 defaults
[conda] blas 1.0 mkl defaults
[conda] mkl 2019.4 243 defaults
[conda] mkl-service 2.3.0 py37he904b0f_0 defaults
[conda] mkl_fft 1.0.15 py37ha843d7b_0 defaults
[conda] mkl_random 1.1.0 py37hd6b4f25_0 defaults
[conda] pytorch 1.2.0 cuda100py37h938c94c_0 defaults
[conda] torch 1.4.0 pypi_0 pypi
[conda] torchvision 0.4.0 cuda100py37hecfc37a_0 defaults

Additional context

None.

cc @ngimel
