
Memory errors? #25

@mattloose

Hi,

I'm trying to run Herro on an A100 card, and it's not clear why it is running out of memory. I was running with -b 128, so I've dropped that down.

I'm running it through Singularity. Any suggestions?

Thanks

[00:01:26] Processing 1/? batch
[>---------------------------------------] 93/90774
[W manager.cpp:340] Warning: FALLBACK path has been taken inside: runCudaFusionGroup. This is an indication that codegen Failed for some reason.
To debug try disable codegen fallback path via setting the env variable export PYTORCH_NVFUSER_DISABLE=fallback
(function runCudaFusionGroup)
thread '' panicked at src/inference.rs:172:64:
called `Result::unwrap()` on an `Err` value: Torch("The following operation failed in the TorchScript interpreter.
Traceback of TorchScript (most recent call last):
RuntimeError: The following operation failed in the TorchScript interpreter.
Traceback of TorchScript, serialized code (most recent call last):
  File "code/__torch__/model.py", line 36, in fallback_cuda_fuser
    x0 = torch.permute(x, [0, 3, 1, 2])
    qn = self.qn
    sliced_sequences_concatenated = (qn).forward(x0, target_positions, lengths, )
                                    ~~~~~~~~~~~ <--- HERE
    fc2 = self.fc2
    _1 = (fc2).forward(sliced_sequences_concatenated, )
  File "code/__torch__/transformer.py", line 16, in forward
    _0 = __torch__.torch.nn.utils.rnn.pad_sequence
    context_read = self.context_read
    x0 = (context_read).forward(x, )
         ~~~~~~~~~~~~~~~~~~~~~ <--- HERE
    context_pos = self.context_pos
    x1 = (context_pos).forward(x0, )
  File "code/__torch__/torch/nn/modules/container.py", line 15, in forward
    _2 = getattr(self, "2")
    input0 = (_0).forward(input, )
    input1 = (_1).forward(input0, )
             ~~~~~~~~~~~ <--- HERE
    return (_2).forward(input1, )
  def __len__(self: __torch__.torch.nn.modules.container.Sequential) -> int:
  File "code/__torch__/torch/nn/modules/batchnorm.py", line 35, in forward
    weight = self.weight
    bias = self.bias
    _3 = _0(input, running_mean, running_var, weight, bias, bn_training, 0.10000000000000001, 1.0000000000000001e-05, )
         ~~ <--- HERE
    return _3
  def _check_input_dim(self: __torch__.torch.nn.modules.batchnorm.BatchNorm2d,
  File "code/__torch__/torch/nn/functional.py", line 52, in batch_norm
    else:
      pass
    _6 = torch.batch_norm(input, weight, bias, running_mean, running_var, training, momentum, eps, True)
         ~~~~~~~~~~~~~~~~ <--- HERE
    return _6
def relu(input: Tensor,

Traceback of TorchScript, original code (most recent call last):
  File "/raid/scratch/stanojevicd/projects/haec-BigBird/model.py", line 157, in fallback_cuda_fuser
    sliced_sequences_concatenated = torch.cat(encoded)'''
    x = x.permute((0, 3, 1, 2))
    sliced_sequences_concatenated = self.qn(x, target_positions, lengths)
                                    ~~~~~~~ <--- HERE

    # list of tensors of shape (selected_token_number, 1) -> (selected_token_number)
  File "/raid/scratch/stanojevicd/projects/haec-BigBird/transformer.py", line 36, in forward
    def forward(self, x: Tensor, target_positions: List[Tensor],
                lengths: Tensor) -> Tensor:
        x = self.context_read(x)  # [B, I, L, R] -> [B, 128, L, R]
        ~~~~~~~~~~~~~~~~~ <--- HERE
        x = self.context_pos(x)  # [B, 128, L, R] -> [B, 256, L, 1]
        x = x.squeeze(-1).transpose(1, 2)  # [B, L, 256]
  File "/home/stanojevicd/miniforge3/envs/haec/lib/python3.11/site-packages/torch/nn/modules/container.py", line 215, in forward
    def forward(self, input):
        for module in self:
            input = module(input)
            ~~~~~~ <--- HERE
        return input
  File "/home/stanojevicd/miniforge3/envs/haec/lib/python3.11/site-packages/torch/nn/modules/batchnorm.py", line 171, in forward
        used for normalization (i.e. in eval mode when buffers are not None).
        """
        return F.batch_norm(
               ~~~~~~~~~~~~ <--- HERE
            input,
            # If buffers are not to be tracked, ensure that they won't be updated
  File "/home/stanojevicd/miniforge3/envs/haec/lib/python3.11/site-packages/torch/nn/functional.py", line 2478, in batch_norm
        _verify_batch_size(input.size())

    return torch.batch_norm(
           ~~~~~~~~~~~~~~~~ <--- HERE
        input, weight, bias, running_mean, running_var, training, momentum, eps, torch.backends.cudnn.enabled
    )
RuntimeError: CUDA out of memory. Tried to allocate 4.64 GiB (GPU 0; 9.50 GiB total capacity; 4.74 GiB already allocated; 1.93 GiB free; 7.06 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF

")
note: run with RUST_BACKTRACE=1 environment variable to display a backtrace
Aborted (core dumped)
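
One detail that may matter: the OOM message reports only 9.50 GiB total capacity on GPU 0, which is far less than a full A100 (40 or 80 GiB), so it's possible the job isn't seeing the card I expect (for example a MIG slice or a different device). A quick way to confirm what PyTorch actually sees from inside the container is a short check like the sketch below; this is just an illustrative snippet, not part of Herro, and it assumes a Python environment with a CUDA-enabled PyTorch is available inside (or alongside) the image.

```python
import torch

# Confirm which CUDA device the process can see and how much memory it has.
print("CUDA available:", torch.cuda.is_available())
print("Visible devices:", torch.cuda.device_count())

props = torch.cuda.get_device_properties(0)
print("Device 0:", torch.cuda.get_device_name(0))
print("Total memory: %.2f GiB" % (props.total_memory / 2**30))  # compare with the 9.50 GiB in the error

free, total = torch.cuda.mem_get_info(0)  # free/total bytes reported by the driver for device 0
print("Free right now: %.2f GiB of %.2f GiB" % (free / 2**30, total / 2**30))
```

If that also reports ~9.5 GiB, the process isn't actually getting the full A100. If it reports the full card, then dropping -b further (or setting PYTORCH_CUDA_ALLOC_CONF with max_split_size_mb, as the error suggests) is the next thing I'll try.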
