Description
I used the datasets you provided, but training fails with the error below. What should I do? Looking forward to your reply, thank you!
[07/03 15:22:15 d2.engine.train_loop]: Starting training from iteration 0
Failed to converge. l_inf norm is: 3.1583547592163086
Failed to converge. l_inf norm is: 4.175591468811035
Failed to converge. l_inf norm is: 12.365741729736328
Failed to converge. l_inf norm is: 6.024055480957031
Failed to converge. l_inf norm is: nan
Failed to converge. l_inf norm is: nan
Failed to converge. l_inf norm is: nan
Failed to converge. l_inf norm is: nan
Failed to converge. l_inf norm is: nan
Failed to converge. l_inf norm is: nan
Failed to converge. l_inf norm is: 1.2032318115234375
Failed to converge. l_inf norm is: 1.0705947875976562
Failed to converge. l_inf norm is: 6.957664489746094
Failed to converge. l_inf norm is: 4.854583740234375
Failed to converge. l_inf norm is: 27.565383911132812
Failed to converge. l_inf norm is: nan
Failed to converge. l_inf norm is: 18.025360107421875
Failed to converge. l_inf norm is: nan
Failed to converge. l_inf norm is: 19.686927795410156
Failed to converge. l_inf norm is: 12.671699523925781
Failed to converge. l_inf norm is: 10.474720001220703
Failed to converge. l_inf norm is: 21.66767120361328
Failed to converge. l_inf norm is: 9.996528625488281
Failed to converge. l_inf norm is: 18.128585815429688
Failed to converge. l_inf norm is: nan
Failed to converge. l_inf norm is: nan
ERROR [07/03 15:22:45 d2.engine.train_loop]: Exception during training:
Traceback (most recent call last):
File "/opt/detectron2_repo/detectron2/engine/train_loop.py", line 149, in train
self.run_step()
File "/host/deepformable/engine/trainers.py", line 203, in run_step
loss_dict = self.model(data)
File "/opt/conda/lib/python3.7/site-packages/torch/nn/modules/module.py", line 889, in _call_impl
result = self.forward(*input, **kwargs)
File "/opt/conda/lib/python3.7/site-packages/torch/nn/parallel/distributed.py", line 705, in forward
output = self.module(*inputs[0], **kwargs[0])
File "/opt/conda/lib/python3.7/site-packages/torch/nn/modules/module.py", line 889, in _call_impl
result = self.forward(*input, **kwargs)
File "/host/deepformable/engine/trainers.py", line 126, in forward
output = self.model(data)
File "/opt/conda/lib/python3.7/site-packages/torch/nn/modules/module.py", line 889, in _call_impl
result = self.forward(*input, **kwargs)
File "/opt/detectron2_repo/detectron2/modeling/meta_arch/rcnn.py", line 157, in forward
proposals, proposal_losses = self.proposal_generator(images, features, gt_instances)
File "/opt/conda/lib/python3.7/site-packages/torch/nn/modules/module.py", line 889, in _call_impl
result = self.forward(*input, **kwargs)
File "/opt/detectron2_repo/detectron2/modeling/proposal_generator/rpn.py", line 478, in forward
anchors, pred_objectness_logits, pred_anchor_deltas, images.image_sizes
File "/opt/detectron2_repo/detectron2/modeling/proposal_generator/rpn.py", line 511, in predict_proposals
self.training,
File "/opt/detectron2_repo/detectron2/modeling/proposal_generator/proposal_utils.py", line 104, in find_top_rpn_proposals
"Predicted boxes or scores contain Inf/NaN. Training has diverged."
FloatingPointError: Predicted boxes or scores contain Inf/NaN. Training has diverged.
[07/03 15:22:45 d2.engine.hooks]: Total training time: 0:00:05 (0:00:00 on hooks)
[07/03 15:22:45 d2.utils.events]: iter: 2 total_loss: -5.302 loss_sample_reg: -0.09662 loss_corner_reg: -0.0106 objectness_loss: -0.816 decoding_loss: -1.006 loss_rpn_cls: -1.459 loss_rpn_loc: -1.914 data_time: 1.0866 lr: 4.036e-07 max_mem: 15650M
Traceback (most recent call last):
File "tools/train.py", line 98, in
args=(args,),
File "/opt/detectron2_repo/detectron2/engine/launch.py", line 79, in launch
daemon=False,
File "/opt/conda/lib/python3.7/site-packages/torch/multiprocessing/spawn.py", line 230, in spawn
return start_processes(fn, args, nprocs, join, daemon, start_method='spawn')
File "/opt/conda/lib/python3.7/site-packages/torch/multiprocessing/spawn.py", line 188, in start_processes
while not context.join():
File "/opt/conda/lib/python3.7/site-packages/torch/multiprocessing/spawn.py", line 150, in join
raise ProcessRaisedException(msg, error_index, failed_process.pid)
torch.multiprocessing.spawn.ProcessRaisedException:
-- Process 1 terminated with the following error:
Traceback (most recent call last):
File "/opt/conda/lib/python3.7/site-packages/torch/multiprocessing/spawn.py", line 59, in _wrap
fn(i, *args)
File "/opt/detectron2_repo/detectron2/engine/launch.py", line 126, in _distributed_worker
main_func(*args)
File "/host/tools/train.py", line 73, in main
return trainer.train()
File "/host/deepformable/engine/trainers.py", line 189, in train
super().train(self.start_iter, self.max_iter)
File "/opt/detectron2_repo/detectron2/engine/train_loop.py", line 149, in train
self.run_step()
File "/host/deepformable/engine/trainers.py", line 203, in run_step
loss_dict = self.model(data)
File "/opt/conda/lib/python3.7/site-packages/torch/nn/modules/module.py", line 889, in _call_impl
result = self.forward(*input, **kwargs)
File "/opt/conda/lib/python3.7/site-packages/torch/nn/parallel/distributed.py", line 705, in forward
output = self.module(*inputs[0], **kwargs[0])
File "/opt/conda/lib/python3.7/site-packages/torch/nn/modules/module.py", line 889, in _call_impl
result = self.forward(*input, **kwargs)
File "/host/deepformable/engine/trainers.py", line 126, in forward
output = self.model(data)
File "/opt/conda/lib/python3.7/site-packages/torch/nn/modules/module.py", line 889, in _call_impl
result = self.forward(*input, **kwargs)
File "/opt/detectron2_repo/detectron2/modeling/meta_arch/rcnn.py", line 157, in forward
proposals, proposal_losses = self.proposal_generator(images, features, gt_instances)
File "/opt/conda/lib/python3.7/site-packages/torch/nn/modules/module.py", line 889, in _call_impl
result = self.forward(*input, **kwargs)
File "/opt/detectron2_repo/detectron2/modeling/proposal_generator/rpn.py", line 478, in forward
anchors, pred_objectness_logits, pred_anchor_deltas, images.image_sizes
File "/opt/detectron2_repo/detectron2/modeling/proposal_generator/rpn.py", line 511, in predict_proposals
self.training,
File "/opt/detectron2_repo/detectron2/modeling/proposal_generator/proposal_utils.py", line 104, in find_top_rpn_proposals
"Predicted boxes or scores contain Inf/NaN. Training has diverged."
FloatingPointError: Predicted boxes or scores contain Inf/NaN. Training has diverged.
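If it helps: would lowering the learning rate or enabling gradient clipping be a reasonable workaround while this is investigated? This is only a guess on my side; the snippet below uses stock detectron2 solver config keys and is not taken from this repo's configs, and the values are placeholders.

```python
# Guess at a mitigation for the NaN divergence, using standard detectron2 config keys
# (SOLVER.BASE_LR and SOLVER.CLIP_GRADIENTS); values are placeholders, not from deepformable.
from detectron2.config import get_cfg

cfg = get_cfg()
cfg.SOLVER.BASE_LR = 0.001                    # smaller base learning rate
cfg.SOLVER.CLIP_GRADIENTS.ENABLED = True      # clip gradients so the RPN deltas stay finite
cfg.SOLVER.CLIP_GRADIENTS.CLIP_TYPE = "norm"  # clip by total gradient norm
cfg.SOLVER.CLIP_GRADIENTS.CLIP_VALUE = 1.0
cfg.SOLVER.CLIP_GRADIENTS.NORM_TYPE = 2.0
```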