Multi-GPU K80s #1637

@davidmascharka

Description

I'm having trouble getting multi-GPU training to work via DataParallel across two Tesla K80 GPUs. The code I'm using is a modification of the MNIST example:

import torch
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim
from torchvision import datasets, transforms
from torch.autograd import Variable
from data_parallel import DataParallel

train_loader = torch.utils.data.DataLoader(
    datasets.MNIST('../data', train=True, download=True,
                   transform=transforms.Compose([
                       transforms.ToTensor(),
                       transforms.Normalize((0.1307,), (0.3081,))
                   ])),
    batch_size=256, shuffle=True, num_workers=2, pin_memory=True)

class Net(nn.Module):
    def __init__(self):
        super(Net, self).__init__()
        self.conv1 = nn.Conv2d(1, 10, kernel_size=5)
        self.conv2 = nn.Conv2d(10, 20, kernel_size=5)
        self.fc1 = nn.Linear(320, 50)
        self.fc2 = nn.Linear(50, 10)

    def forward(self, x):
        x = F.relu(F.max_pool2d(self.conv1(x), 2))
        x = F.relu(F.max_pool2d(self.conv2(x), 2))
        x = x.view(-1, 320)
        x = F.relu(self.fc1(x))
        x = self.fc2(x)
        return F.log_softmax(x)

model = DataParallel(Net())  # replicate the model across the visible GPUs
model.cuda()

optimizer = optim.SGD(model.parameters(), lr=0.01, momentum=0.9)
criterion = nn.NLLLoss().cuda()

model.train()
for batch_idx, (data, target) in enumerate(train_loader):
    input_var = Variable(data.cuda())
    target_var = Variable(target.cuda())

    print('Getting model output')
    output = model(input_var)  # hangs here when both GPUs are visible
    print('Got model output')

    loss = criterion(output, target_var)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

print('Finished')

This doesn't throw an error, but it hangs after printing "Getting model output" and never returns. I traced this down to parallel_apply: it spawns worker threads on both GPU 0 and GPU 1, but those threads never finish, so the forward call blocks forever.
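
For context, here is a simplified sketch of what parallel_apply does at that point (this is not the actual torch.nn.parallel source; the names are illustrative): one Python thread per replica, each pinned to its own GPU, with the main thread waiting for all of them to finish.

import threading
import torch

def parallel_apply_sketch(replicas, inputs, device_ids):
    # one result slot per replica
    results = [None] * len(replicas)

    def worker(i, module, inp, device_id):
        # run this replica's forward pass on its assigned device
        with torch.cuda.device(device_id):
            results[i] = module(inp)

    threads = [threading.Thread(target=worker, args=(i, m, x, d))
               for i, (m, x, d) in enumerate(zip(replicas, inputs, device_ids))]
    for t in threads:
        t.start()
    for t in threads:
        t.join()  # in my run, this wait never returns
    return results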

This is only a problem when CUDA_VISIBLE_DEVICES=0,1; both GPU 0 and GPU 1 work perfectly well individually.
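
For reference, the single-GPU checks are just the same script run with one device exposed (the script name below is only a placeholder):

# run the unmodified script with only one device visible:
#   CUDA_VISIBLE_DEVICES=0 python mnist_dataparallel.py    # works
#   CUDA_VISIBLE_DEVICES=1 python mnist_dataparallel.py    # works
#   CUDA_VISIBLE_DEVICES=0,1 python mnist_dataparallel.py  # hangs
# confirm from inside Python which devices PyTorch sees:
import torch
print(torch.cuda.is_available())   # True
print(torch.cuda.device_count())   # 1 in the single-GPU runs, 2 with 0,1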

Before running this, nvidia-smi shows:

+------------------------------------------------------+
| NVIDIA-SMI 352.68     Driver Version: 352.68         |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  Tesla K80           Off  | 0000:06:00.0     Off |                    0 |
| N/A   40C    P0    57W / 149W |     55MiB / 11519MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   1  Tesla K80           Off  | 0000:07:00.0     Off |                    0 |
| N/A   35C    P0    76W / 149W |     55MiB / 11519MiB |     99%      Default |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID  Type  Process name                               Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+

After launching the script (while it hangs), nvidia-smi shows:

+------------------------------------------------------+
| NVIDIA-SMI 352.68     Driver Version: 352.68         |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  Tesla K80           Off  | 0000:06:00.0     Off |                    0 |
| N/A   42C    P0    69W / 149W |    251MiB / 11519MiB |     99%      Default |
+-------------------------------+----------------------+----------------------+
|   1  Tesla K80           Off  | 0000:07:00.0     Off |                    0 |
| N/A   36C    P0    90W / 149W |    249MiB / 11519MiB |     99%      Default |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID  Type  Process name                               Usage      |
|=============================================================================|
|    0      4785    C   python                                         194MiB |
|    1      4785    C   python                                         192MiB |
+-----------------------------------------------------------------------------+

top shows the main python process and the two python subprocesses. I'm wondering if this could be something similar to #554.
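
A bare device-to-device copy might help narrow this down; a minimal sketch (my assumption being that if this also hangs, the problem is in the driver / peer-to-peer path rather than in DataParallel itself):

import torch

# plain GPU 0 -> GPU 1 copy, outside of DataParallel
x = torch.randn(1024, 1024).cuda(0)
y = x.cuda(1)  # device-to-device copy
torch.cuda.synchronize()
print((x.cpu() - y.cpu()).abs().max())  # 0 if the copy completed correctly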

Using this TensorFlow example, I get linear speedup across multiple GPUs as I change CUDA_VISIBLE_DEVICES, so multi-GPU on these K80s should certainly be viable.
