Multi-GPU K80s #1637

@davidmascharka

Description

I'm having trouble getting multi-GPU training to work via DataParallel across two Tesla K80 GPUs. The code I'm using is a modification of the MNIST example:

import torch
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim
from torchvision import datasets, transforms
from torch.autograd import Variable
from data_parallel import DataParallel

train_loader = torch.utils.data.DataLoader(
    datasets.MNIST('../data', train=True, download=True,
                   transform=transforms.Compose([
                       transforms.ToTensor(),
                       transforms.Normalize((0.1307,), (0.3081,))
                   ])),
    batch_size=256, shuffle=True, num_workers=2, pin_memory=True)

class Net(nn.Module):
    def __init__(self):
        super(Net, self).__init__()
        self.conv1 = nn.Conv2d(1, 10, kernel_size=5)
        self.conv2 = nn.Conv2d(10, 20, kernel_size=5)
        self.fc1 = nn.Linear(320, 50)
        self.fc2 = nn.Linear(50, 10)

    def forward(self, x):
        x = F.relu(F.max_pool2d(self.conv1(x), 2))
        x = F.relu(F.max_pool2d(self.conv2(x), 2))
        x = x.view(-1, 320)
        x = F.relu(self.fc1(x))
        x = self.fc2(x)
        return F.log_softmax(x)

model = DataParallel(Net())  # replicate the model across the visible GPUs
model.cuda()

optimizer = optim.SGD(model.parameters(), lr=0.01, momentum=0.9)
criterion = nn.NLLLoss().cuda()

model.train()
for batch_idx, (data, target) in enumerate(train_loader):
    input_var = Variable(data.cuda())
    target_var = Variable(target.cuda())

    print('Getting model output')
    output = model(input_var)  # hangs here when both GPUs are visible
    print('Got model output')

    loss = criterion(output, target_var)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

print('Finished')

This doesn't throw an error, but it hangs after printing "Getting model output" and never returns. I traced this down to parallel_apply: it spawns worker threads on both GPU 0 and GPU 1, but those threads never finish, so the forward call blocks forever.
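
For context, here is a simplified sketch of what parallel_apply does at that point (this is not the actual torch.nn.parallel source; the names are illustrative): one Python thread per replica, each pinned to its own GPU, with the main thread waiting for all of them to finish.

import threading
import torch

def parallel_apply_sketch(replicas, inputs, device_ids):
    # one result slot per replica
    results = [None] * len(replicas)

    def worker(i, module, inp, device_id):
        # run this replica's forward pass on its assigned device
        with torch.cuda.device(device_id):
            results[i] = module(inp)

    threads = [threading.Thread(target=worker, args=(i, m, x, d))
               for i, (m, x, d) in enumerate(zip(replicas, inputs, device_ids))]
    for t in threads:
        t.start()
    for t in threads:
        t.join()  # in my run, this wait never returns
    return results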

This is only a problem when CUDA_VISIBLE_DEVICES=0,1; both GPU 0 and GPU 1 work perfectly well individually.
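
For reference, the single-GPU checks are just the same script run with one device exposed (the script name below is only a placeholder):

# run the unmodified script with only one device visible:
#   CUDA_VISIBLE_DEVICES=0 python mnist_dataparallel.py    # works
#   CUDA_VISIBLE_DEVICES=1 python mnist_dataparallel.py    # works
#   CUDA_VISIBLE_DEVICES=0,1 python mnist_dataparallel.py  # hangs
# confirm from inside Python which devices PyTorch sees:
import torch
print(torch.cuda.is_available())   # True
print(torch.cuda.device_count())   # 1 in the single-GPU runs, 2 with 0,1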

Before running this, nvidia-smi shows:

+------------------------------------------------------+
| NVIDIA-SMI 352.68     Driver Version: 352.68         |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  Tesla K80           Off  | 0000:06:00.0     Off |                    0 |
| N/A   40C    P0    57W / 149W |     55MiB / 11519MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   1  Tesla K80           Off  | 0000:07:00.0     Off |                    0 |
| N/A   35C    P0    76W / 149W |     55MiB / 11519MiB |     99%      Default |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID  Type  Process name                               Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+

After launching the script (while it hangs), nvidia-smi shows:

+------------------------------------------------------+
| NVIDIA-SMI 352.68     Driver Version: 352.68         |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  Tesla K80           Off  | 0000:06:00.0     Off |                    0 |
| N/A   42C    P0    69W / 149W |    251MiB / 11519MiB |     99%      Default |
+-------------------------------+----------------------+----------------------+
|   1  Tesla K80           Off  | 0000:07:00.0     Off |                    0 |
| N/A   36C    P0    90W / 149W |    249MiB / 11519MiB |     99%      Default |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID  Type  Process name                               Usage      |
|=============================================================================|
|    0      4785    C   python                                         194MiB |
|    1      4785    C   python                                         192MiB |
+-----------------------------------------------------------------------------+

top shows the main python process and the two python subprocesses. I'm wondering if this could be something similar to #554.
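
A bare device-to-device copy might help narrow this down; a minimal sketch (my assumption being that if this also hangs, the problem is in the driver / peer-to-peer path rather than in DataParallel itself):

import torch

# plain GPU 0 -> GPU 1 copy, outside of DataParallel
x = torch.randn(1024, 1024).cuda(0)
y = x.cuda(1)  # device-to-device copy
torch.cuda.synchronize()
print((x.cpu() - y.cpu()).abs().max())  # 0 if the copy completed correctly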

Using this TensorFlow example, I get linear speedup across multiple GPUs as I change CUDA_VISIBLE_DEVICES, so multi-GPU on these K80s should certainly be viable.
