nccl_ops.all_sum does not correctly reduce gradients

**System information**
- Have I written custom code (as opposed to using a stock example script provided in TensorFlow): yes
- OS Platform and Distribution (e.g., Linux Ubuntu 16.04): Ubuntu 18.04
- Mobile device (e.g. iPhone 8, Pixel 2, Samsung Galaxy) if the issue happens on mobile device: n/a
- TensorFlow installed from (source or binary): binary
- TensorFlow version (use command below): v2.2.0-rc3-33-g70087ab4f4 2.2.0-rc4
- Python version: 3.7
- Bazel version (if compiling from source): n/a
- GCC/Compiler version (if compiling from source): n/a
- CUDA/cuDNN version: 10.1/7.6.5
- GPU model and memory: P100, V100

**Describe the current behavior**
The allreduce operation `nccl_ops.all_sum` does not correctly sum gradients. The results are __incorrect__.


**Standalone code to reproduce the issue**
```python
#!/usr/bin/env python
import argparse
from tensorflow.compat import v1 as tf
import tqdm

def split_grad_list(grad_list):
    g = []
    v = []
    for tower in grad_list:
        g.append([x[0] for x in tower])
        v.append([x[1] for x in tower])
    return g, v

def allreduce_grads(all_grads):
    # reduce gradients for N variables on K devices
    from tensorflow.python.ops import nccl_ops as nccl
    nr_tower = len(all_grads)
    assert nr_tower > 1
    new_all_grads = []  # N x K
    for grads in zip(*all_grads):
        # k grads
        summed = nccl.all_sum(grads)

        grads_for_devices = []  # K
        true_sum = tf.add_n(grads)
        for g in summed:
            diff = tf.abs(true_sum - g)
            eql = diff < 1e-4
            nccl_res_correct = tf.reduce_all(eql, name="corr_" + grads[0].op.name)

            def flat(x):
                x = tf.reshape(x, [-1])
                x = tf.slice(x, [0], [tf.minimum(tf.size(x), 200)])
                return x

            assert_op = tf.debugging.Assert(nccl_res_correct, [
                tf.reduce_max(diff), flat(true_sum), flat(g)], summarize=1000,
                name='assert_' + grads[0].op.name)
            with tf.control_dependencies([assert_op]):
                g = tf.identity(g)
            grads_for_devices.append(g)
        new_all_grads.append(grads_for_devices)
    # transpose to K x N
    ret = list(zip(*new_all_grads))
    return ret

def build_graph(image, label, idx):
    v1 = tf.get_variable('aaa/W', shape=[3, 3, 3, 64], trainable=True)
    v2 = tf.get_variable('bbb/W', shape=[3, 3, 3, 64], trainable=True)
    v = v1 if idx == 0 else v2
    image = tf.nn.conv2d(image, v, 1, padding='SAME', data_format='NCHW')

    def conv(name, x, chan, stride=1):
        with tf.variable_scope(name):
            in_chan = x.shape[1]
            W = tf.get_variable('W', [3, 3, in_chan, chan])
            ret = tf.nn.conv2d(x, W, strides=stride, padding="SAME", data_format="NCHW")
            return tf.nn.relu(ret)

    x = conv('conv1', image, 64)
    x = conv('conv2', x, 64)
    x = conv('conv3', x, 1280, stride=2)
    x = conv('conv4', x, 1280, stride=2)
    x = conv('conv5', x, 10)
    logits = tf.reduce_mean(x, axis=[2, 3])
    cost = tf.nn.sparse_softmax_cross_entropy_with_logits(logits=logits, labels=label)
    cost = tf.reduce_mean(cost, name='cross_entropy_loss')
    return cost


if __name__ == '__main__':
    parser = argparse.ArgumentParser()
    parser.add_argument('--gpu', type=int)
    args = parser.parse_args()
    num_gpu = args.gpu

    with tf.Graph().as_default():
        opt = tf.train.GradientDescentOptimizer(0.001)

        grad_list = []
        for k in range(num_gpu):
            with tf.device("/gpu:{}".format(k)), tf.variable_scope("tower{}".format(k)):
                print("Building {} ...".format(k))
                image = tf.random.uniform([32, 3, 30, 30])
                label = tf.random.uniform([32], maxval=9, dtype=tf.int32)
                cost = build_graph(image, label, k)
                varlist = [x for x in tf.trainable_variables() if x.name.startswith("tower{}".format(k))]
                print("Varlist for tower {}: ".format(k), [x.name for x in varlist])
                wd_cost = [tf.reduce_sum(x) * 1e-3 for x in varlist]
                cost = tf.add_n([cost] + wd_cost)
                grads = opt.compute_gradients(cost, var_list=varlist)
                grad_list.append(grads)

        all_grads, all_vars = split_grad_list(grad_list)
        all_grads = allreduce_grads(all_grads)
        grad_list = [list(zip(gs, vs)) for gs, vs in zip(all_grads, all_vars)]

        train_ops = []
        for idx, grad_and_vars in enumerate(grad_list):
            with tf.device('/gpu:{}'.format(idx)):
                train_ops.append(opt.apply_gradients(
                    grad_and_vars, name='apply_grad_{}'.format(idx)))
        train_op = tf.group(*train_ops)

        sess = tf.Session()
        sess.run(tf.global_variables_initializer())
        print("Training ...")
        for k in tqdm.trange(5000):
            sess.run(train_op)
```

The above code trains a toy network on random data and allreduces the gradients using `nccl_ops.all_sum`. It checks the allreduce results against the sum of gradients computed by a naive `add_n`, and asserts that the elementwise difference is below 1e-4. However, the difference is sometimes much larger than that, and the assertion usually fails within 100 steps of training.
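For readers without a multi-GPU machine, the check itself can be sketched in plain Python. This is only an illustration of the comparison logic: `check_allreduce` and `max_abs_diff` are hypothetical helper names, the numbers are made up rather than taken from the failing run, and no TensorFlow or NCCL call is involved.

```python
def max_abs_diff(a, b):
    """Largest elementwise absolute difference between two equal-length lists."""
    return max(abs(x - y) for x, y in zip(a, b))


def check_allreduce(per_device_grads, reduced, tol=1e-4):
    """Compare each device's allreduce output with a naive elementwise sum.

    per_device_grads: K lists, the gradient of one variable on each device.
    reduced: K lists, what the allreduce returned on each device.
    Returns one (max_diff, ok) pair per device.
    """
    # Reference result, playing the role of tf.add_n in the script above.
    true_sum = [sum(vals) for vals in zip(*per_device_grads)]
    report = []
    for out in reduced:
        diff = max_abs_diff(true_sum, out)
        report.append((diff, diff < tol))
    return report


# Made-up numbers (assumption), not taken from the failing run.
grads = [[0.5, 0.25], [0.25, 0.5]]    # two devices, one variable
good = [[0.75, 0.75], [0.75, 0.75]]   # both devices hold the true sum
bad = [[0.75, 0.75], [0.75, 0.875]]   # device 1 is off by 0.125

print(check_allreduce(grads, good))
print(check_allreduce(grads, bad))
```

In the real script the same comparison is done per device inside the graph, so a failure on any single device's output trips the assertion.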

The code above (written in TF1 style) can be run on a machine with at least 2 GPUs using:
```
$ TF2_BEHAVIOR=0 python a.py --gpu 2
Building 0 ...
Varlist for tower 0:  ['tower0/aaa/W:0', 'tower0/bbb/W:0', 'tower0/conv1/W:0', 'tower0/conv2/W:0', 'tower0/conv3/W:0', 'tower0/conv4/W:0', 'tower0/conv5/W:0']
Building 1 ...
Varlist for tower 1:  ['tower1/aaa/W:0', 'tower1/bbb/W:0', 'tower1/conv1/W:0', 'tower1/conv2/W:0', 'tower1/conv3/W:0', 'tower1/conv4/W:0', 'tower1/conv5/W:0']
1%|▉                                                                    | 71/5000 [00:06<07:39, 10.73it/s]
Traceback (most recent call last):
  File "/private/home/yuxinwu/env/py37-tf2.2v2/lib/python3.7/site-packages/tensorflow/python/client/session.py", line 1365, in _do_call
    return fn(*args)
  File "/private/home/yuxinwu/env/py37-tf2.2v2/lib/python3.7/site-packages/tensorflow/python/client/session.py", line 1350, in _run_fn
    target_list, run_metadata)
  File "/private/home/yuxinwu/env/py37-tf2.2v2/lib/python3.7/site-packages/tensorflow/python/client/session.py", line 1443, in _call_tf_sessionrun
    run_metadata)
tensorflow.python.framework.errors_impl.InvalidArgumentError: 2 root error(s) found.
  (0) Invalid argument: assertion failed: [0.00100000016] [0.00234295963 0.00230941921 0.00176228327 0.00197261758 0.00213356828 0.00188576151 0.00211580051 0.00221353304
```

My initial investigation suggests (no proof, just a guess) that the bug might appear because the gradients are computed on each GPU in a different order.
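To make that guess concrete, here is a toy model in plain Python. It rests on an assumption that is not verified here: that a collective pairs up tensors across devices purely by launch order. Nothing in it reflects how NCCL or TensorFlow actually schedule or match collectives, and `allreduce_by_launch_order` is a hypothetical name used only for illustration.

```python
def allreduce_by_launch_order(queues):
    """Toy collective (assumption, not NCCL): the i-th tensor launched on each
    device is summed with the i-th tensor launched on every other device,
    regardless of which variable it belongs to.

    queues: one list per device of (variable_name, value) in launch order.
    Returns one dict per device mapping variable_name -> reduced value.
    """
    results = [dict() for _ in queues]
    for launches in zip(*queues):
        total = sum(value for _, value in launches)
        for dev, (name, _) in enumerate(launches):
            results[dev][name] = total
    return results


# Made-up gradient values for two variables on two devices.
same_order = [
    [("aaa", 1.0), ("bbb", 2.0)],    # device 0
    [("aaa", 10.0), ("bbb", 20.0)],  # device 1, same launch order
]
diff_order = [
    [("aaa", 1.0), ("bbb", 2.0)],    # device 0
    [("bbb", 20.0), ("aaa", 10.0)],  # device 1 launches bbb first
]

print(allreduce_by_launch_order(same_order))
print(allreduce_by_launch_order(diff_order))
```

Under this toy model, matching launch orders give the true sums on both devices, while mismatched orders silently add gradients of different variables together. Whether anything like this happens in the real op is exactly what is unconfirmed.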

The bug exists in TF 1.15 as well. I have not tested earlier versions.
The bug rarely triggers if I revert https://github.com/tensorflow/tensorflow/pull/31481, a PR that makes allreduce ops get scheduled as early as possible.
`collective_ops.all_reduce` with the ring implementation does not seem to have a similar issue, but it significantly slows down my training.

cc @dubey @yuefengz @chsigg  who may have context on this issue.

nccl_ops.all_sum does not correctly reduce gradients #41539