I tried to run the ImageNet example at https://github.com/NVIDIA/apex/blob/master/examples/imagenet/main_amp.py
At the end of the epoch I got the following error message:
Traceback (most recent call last):
File "main_amp.py", line 520, in <module>
main()
File "main_amp.py", line 239, in main
train(train_loader, model, criterion, optimizer, epoch)
File "main_amp.py", line 345, in train
prec1, prec5 = accuracy(output.data, target, topk=(1, 5))
File "main_amp.py", line 504, in accuracy
correct = pred.eq(target.view(1, -1).expand_as(pred))
RuntimeError: The expanded size of the tensor (128) must match the existing size (96) at non-singleton dimension 1. Target sizes: [5, 128]. Tensor sizes: [1, 96]
The problem is that the targets and the output have different sizes. Checking the code, I see that the inputs and targets for the next batch are fetched at line 337:
input, target = prefetcher.next()
This new target is then used to calculate accuracy at line 345:
prec1, prec5 = accuracy(output.data, target, topk=(1, 5))
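The output.data on line 345 was computed from the previous batch's input, not from the input fetched on line 337. When the newly fetched batch is smaller than the previous one (here the last batch of the epoch has 96 samples instead of 128), the sizes no longer match and expand_as fails. Even when the sizes do match, the accuracy is computed against the next batch's targets, so the reported values are wrong.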
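To illustrate why the error only shows up at the end of the epoch, here is a small pure-Python sketch (no PyTorch needed). The dataset size of 480 and the helper names are made up for illustration. It only tracks batch sizes and pairs each output with the target of the following batch, which is what happens when the next batch is fetched before accuracy is computed.

```python
def batch_sizes(num_samples, batch_size):
    """Sizes of the batches a loader yields when the last partial batch is kept."""
    full, rest = divmod(num_samples, batch_size)
    return [batch_size] * full + ([rest] if rest else [])


def buggy_pairs(sizes):
    """(output size, target size) pairs when the next batch is fetched
    before accuracy is computed: output i is compared with target i + 1."""
    return list(zip(sizes[:-1], sizes[1:]))


# Hypothetical dataset size: 480 = 3 * 128 + 96
sizes = batch_sizes(num_samples=480, batch_size=128)
pairs = buggy_pairs(sizes)
mismatched = [p for p in pairs if p[0] != p[1]]
print(sizes)       # [128, 128, 128, 96]
print(mismatched)  # [(128, 96)]
```

All full-size batches pair up without a size error (while still being scored against the wrong targets), and only the final, smaller batch triggers the 128 vs. 96 mismatch seen in the traceback.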
I suggest moving line 337 down to line 376, after the accuracy calculation and the printing of results.
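A minimal sketch of the suggested ordering, not the actual main_amp.py code. ListPrefetcher, top1_accuracy and train_one_epoch are hypothetical stand-ins, and I am assuming the prefetcher returns (None, None) once the data is exhausted. The point is only that the next batch is fetched as the last step of the loop body, so output and target always belong to the same batch.

```python
class ListPrefetcher:
    """Hypothetical stand-in for the example's data prefetcher."""

    def __init__(self, batches):
        self._it = iter(batches)

    def next(self):
        # Assumption: (None, None) signals that the epoch is finished.
        return next(self._it, (None, None))


def top1_accuracy(output, target):
    """Percentage of predictions equal to the target; rejects mismatched sizes."""
    if len(output) != len(target):
        raise ValueError(f"size mismatch: {len(output)} vs {len(target)}")
    correct = sum(o == t for o, t in zip(output, target))
    return 100.0 * correct / len(target)


def train_one_epoch(prefetcher, model):
    accuracies = []
    input, target = prefetcher.next()
    while input is not None:
        output = model(input)
        # Accuracy uses the target that belongs to this output.
        accuracies.append(top1_accuracy(output, target))
        # Fetch the next batch only after all per-batch work is done.
        input, target = prefetcher.next()
    return accuracies


# Two toy batches of different sizes; the "model" just echoes its input.
batches = [([1, 2, 3, 4], [1, 2, 3, 4]), ([5, 6], [5, 0])]
result = train_one_epoch(ListPrefetcher(batches), model=lambda x: list(x))
print(result)  # [100.0, 50.0]
```

With this ordering the smaller last batch is handled without a size error, because its output is only ever compared with its own targets.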