Thanks to visit codestin.com
Credit goes to github.com

Skip to content

nccl_ops.all_sum does not correctly reduce gradients #41539

@ppwwyyxx

Description

@ppwwyyxx

System information

  • Have I written custom code (as opposed to using a stock example script provided in TensorFlow): yes
  • OS Platform and Distribution (e.g., Linux Ubuntu 16.04): ubuntu 18.04
  • Mobile device (e.g. iPhone 8, Pixel 2, Samsung Galaxy) if the issue happens on mobile device:n/a
  • TensorFlow installed from (source or binary):binary
  • TensorFlow version (use command below): v2.2.0-rc3-33-g70087ab4f4 2.2.0-rc4
  • Python version:3.7
  • Bazel version (if compiling from source):n/a
  • GCC/Compiler version (if compiling from source):n/a
  • CUDA/cuDNN version:10.1/7.6.5
  • GPU model and memory:P100, V100

Describe the current behavior
The allreduce operation nccl_ops.all_sum does not correctly sum gradients. The results are incorrect.

Standalone code to reproduce the issue

#!/usr/bin/env python
import argparse
from tensorflow.compat import v1 as tf
import tqdm

def split_grad_list(grad_list):
    g = []
    v = []
    for tower in grad_list:
        g.append([x[0] for x in tower])
        v.append([x[1] for x in tower])
    return g, v

def allreduce_grads(all_grads):
    # reduce gradients for N variables on K devices
    from tensorflow.python.ops import nccl_ops as nccl
    nr_tower = len(all_grads)
    assert nr_tower > 1
    new_all_grads = []  # N x K
    for grads in zip(*all_grads):
        # k grads
        summed = nccl.all_sum(grads)

        grads_for_devices = []  # K
        true_sum = tf.add_n(grads)
        for g in summed:
            diff = tf.abs(true_sum - g)
            eql = diff < 1e-4
            nccl_res_correct = tf.reduce_all(eql, name="corr_" + grads[0].op.name)

            def flat(x):
                x = tf.reshape(x, [-1])
                x = tf.slice(x, [0], [tf.minimum(tf.size(x), 200)])
                return x

            assert_op = tf.debugging.Assert(nccl_res_correct, [
                tf.reduce_max(diff), flat(true_sum), flat(g)], summarize=1000,
                name='assert_' + grads[0].op.name)
            with tf.control_dependencies([assert_op]):
                g = tf.identity(g)
            grads_for_devices.append(g)
        new_all_grads.append(grads_for_devices)
    # transpose to K x N
    ret = list(zip(*new_all_grads))
    return ret

def build_graph(image, label, idx):
    v1 = tf.get_variable('aaa/W', shape=[3, 3, 3, 64], trainable=True)
    v2 = tf.get_variable('bbb/W', shape=[3, 3, 3, 64], trainable=True)
    v = v1 if idx == 0 else v2
    image = tf.nn.conv2d(image, v, 1, padding='SAME', data_format='NCHW')

    def conv(name, x, chan, stride=1):
        with tf.variable_scope(name):
            in_chan = x.shape[1]
            W = tf.get_variable('W', [3, 3, in_chan, chan])
            ret = tf.nn.conv2d(x, W, strides=stride, padding="SAME", data_format="NCHW")
            return tf.nn.relu(ret)

    x = conv('conv1', image, 64)
    x = conv('conv2', x, 64)
    x = conv('conv3', x, 1280, stride=2)
    x = conv('conv4', x, 1280, stride=2)
    x = conv('conv5', x, 10)
    logits = tf.reduce_mean(x, axis=[2, 3])
    cost = tf.nn.sparse_softmax_cross_entropy_with_logits(logits=logits, labels=label)
    cost = tf.reduce_mean(cost, name='cross_entropy_loss')
    return cost


if __name__ == '__main__':
    parser = argparse.ArgumentParser()
    parser.add_argument('--gpu', type=int)
    args = parser.parse_args()
    num_gpu = args.gpu

    with tf.Graph().as_default():
        opt = tf.train.GradientDescentOptimizer(0.001)

        grad_list = []
        for k in range(num_gpu):
            with tf.device("/gpu:{}".format(k)), tf.variable_scope("tower{}".format(k)):
                print("Building {} ...".format(k))
                image = tf.random.uniform([32, 3, 30, 30])
                label = tf.random.uniform([32], maxval=9, dtype=tf.int32)
                cost = build_graph(image, label, k)
                varlist = [x for x in tf.trainable_variables() if x.name.startswith("tower{}".format(k))]
                print("Varlist for tower {}: ".format(k), [x.name for x in varlist])
                wd_cost = [tf.reduce_sum(x) * 1e-3 for x in varlist]
                cost = tf.add_n([cost] + wd_cost)
                grads = opt.compute_gradients(cost, var_list=varlist)
                grad_list.append(grads)

        all_grads, all_vars = split_grad_list(grad_list)
        all_grads = allreduce_grads(all_grads)
        grad_list = [list(zip(gs, vs)) for gs, vs in zip(all_grads, all_vars)]

        train_ops = []
        for idx, grad_and_vars in enumerate(grad_list):
            with tf.device('/gpu:{}'.format(idx)):
                train_ops.append(opt.apply_gradients(
                    grad_and_vars, name='apply_grad_{}'.format(idx)))
        train_op = tf.group(*train_ops)

        sess = tf.Session()
        sess.run(tf.global_variables_initializer())
        print("Training ...")
        for k in tqdm.trange(5000):
            sess.run(train_op)

The above code trains a toy network on random data, and allreduce the gradients using nccl_ops.all_sum. It checks the allreduce results against the sum of gradients computed by a naive add_n, and asserts that the difference is reasonably small. However, the difference can be quite large sometimes and the assertion usually fails within 100 steps of training.

The code above (written in TF1 style) can be run on a machine with >=2 GPUs using

$ TF2_BEHAVIOR=0 python a.py --gpu 2
Building 0 ...
 Varlist for tower 0:  ['tower0/aaa/W:0', 'tower0/bbb/W:0', 'tower0/conv1/W:0', 'tower0/conv2/W:0', 'tower0/conv3/W:0', 'tower0/conv4/W:0', 'tower0/conv5/W:0']                                      
Building 1 ...  
Varlist for tower 1:  ['tower1/aaa/W:0', 'tower1/bbb/W:0', 'tower1/conv1/W:0', 'tower1/conv2/W:0', 'tower1/conv3/W:0', 'tower1/conv4/W:0', 'tower1/conv5/W:0'] 
1%|▉                                                                    | 71/5000 [00:06<07:39, 10.73it/s]    
Traceback (most recent call last):                                                                                                                                                                  
  File "/private/home/yuxinwu/env/py37-tf2.2v2/lib/python3.7/site-packages/tensorflow/python/client/session.py", line 1365, in _do_call                                                             
    return fn(*args)                                                                                                                                                                                
  File "/private/home/yuxinwu/env/py37-tf2.2v2/lib/python3.7/site-packages/tensorflow/python/client/session.py", line 1350, in _run_fn                                                              
    target_list, run_metadata)                                                                                                                                                                      
  File "/private/home/yuxinwu/env/py37-tf2.2v2/lib/python3.7/site-packages/tensorflow/python/client/session.py", line 1443, in _call_tf_sessionrun                                                  
    run_metadata)                                                                                                                                                                                   
tensorflow.python.framework.errors_impl.InvalidArgumentError: 2 root error(s) found.                                                                                                                
  (0) Invalid argument: assertion failed: [0.00100000016] [0.00234295963 0.00230941921 0.00176228327 0.00197261758 0.00213356828 0.00188576151 0.00211580051 0.00221353304 

My initial investigation suggests (no proof, just a guess) that the bug might appear because the gradients are computed on each GPU in different order.

The bug was found to exist in TF 1.15 as well. Have not tested earlier versions.
The bug rarely triggers itself if I revert #31481, which is a PR that make allreduce ops scheduled as early as possible.
collective_ops.all_reduce with the ring implementation does not seem to have similar issue, but it significantly slows down my training.

cc @dubey @yuefengz @chsigg who may have context on this issue.

Metadata

Metadata

Assignees

Labels

TF 2.2Issues related to TF 2.2comp:opsOPs related issuestype:bugBug

Type

No type
No fields configured for issues without a type.

Projects

No projects

Milestone

No milestone

Relationships

None yet

Development

No branches or pull requests

Issue actions