SVD on GPU is slower than SVD on CPU

### System information
- **OS Platform and Distribution (e.g., Linux Ubuntu 16.04)**: CentOS release 7.4.1708
- **TensorFlow installed from (source or binary)**: From source
- **Python version**: 2.7.13
- **Bazel version**: 0.6.1
- **CUDA/cuDNN version**: CUDA 8.0/cuDNN 6.0.21
- **GPU model and memory**: GeForce GTX 950M, memory 4GB

Output of `tf_env_collect.sh`:
```

== cat /etc/issue ===============================================
Linux zhanghao 3.10.0-693.2.2.el7.x86_64 #1 SMP Tue Sep 12 22:26:13 UTC 2017 x86_64 x86_64 x86_64 GNU/Linux
VERSION="7 (Core)"
VERSION_ID="7"
CENTOS_MANTISBT_PROJECT_VERSION="7"
REDHAT_SUPPORT_PRODUCT_VERSION="7"

== are we in docker =============================================
No

== compiler =====================================================
c++ (GCC) 4.8.5 20150623 (Red Hat 4.8.5-16)
Copyright © 2015 Free Software Foundation, Inc.
This is free software; see the source for copying conditions.  There is NO
warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.

== uname -a =====================================================
Linux zhanghao 3.10.0-693.2.2.el7.x86_64 #1 SMP Tue Sep 12 22:26:13 UTC 2017 x86_64 x86_64 x86_64 GNU/Linux

== check pips ===================================================

== check for virtualenv =========================================
False

== tensorflow import ============================================
Traceback (most recent call last):
  File "<string>", line 1, in <module>
ImportError: No module named tensorflow

== env ==========================================================
LD_LIBRARY_PATH is unset
DYLD_LIBRARY_PATH is unset

== nvidia-smi ===================================================
Tue Oct 10 16:36:08 2017       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 384.90                 Driver Version: 384.90                    |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  GeForce GTX 950M    Off  | 00000000:0A:00.0 Off |                  N/A |
| N/A   45C    P0    N/A /  N/A |      0MiB /  4044MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+

== cuda libs  ===================================================

== cat /etc/issue ===============================================
Linux zhanghao 3.10.0-693.2.2.el7.x86_64 #1 SMP Tue Sep 12 22:26:13 UTC 2017 x86_64 x86_64 x86_64 GNU/Linux
VERSION="7 (Core)"
VERSION_ID="7"
CENTOS_MANTISBT_PROJECT_VERSION="7"
REDHAT_SUPPORT_PRODUCT_VERSION="7"

== are we in docker =============================================
No

== compiler =====================================================
c++ (GCC) 4.8.5 20150623 (Red Hat 4.8.5-16)
Copyright © 2015 Free Software Foundation, Inc.
This is free software; see the source for copying conditions.  There is NO
warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.

== uname -a =====================================================
Linux zhanghao 3.10.0-693.2.2.el7.x86_64 #1 SMP Tue Sep 12 22:26:13 UTC 2017 x86_64 x86_64 x86_64 GNU/Linux

== check pips ===================================================
numpy (1.12.1)
protobuf (3.4.0)
tensorflow (1.4.0rc0)
tensorflow-tensorboard (0.4.0rc1)

== check for virtualenv =========================================
False

== tensorflow import ============================================
tf.VERSION = 1.4.0-rc0
tf.GIT_VERSION = v1.3.0-rc1-3111-g4196d6d
tf.COMPILER_VERSION = v1.3.0-rc1-3111-g4196d6d
Sanity check: array([1], dtype=int32)

== env ==========================================================
LD_LIBRARY_PATH /usr/local/cuda/lib64/:/usr/local/cuda/lib64/stubs/:/usr/local/cuda/extras/CUPTI/lib64/:/usr/local/cuda/nvvm/lib64/:/usr/lib64/nvidia/:/opt/intel/compilers_and_libraries_2017.4.196/linux/compiler/lib/intel64:/opt/intel/compilers_and_libraries_2017.4.196/linux/compiler/lib/intel64_lin:/opt/intel/compilers_and_libraries_2017.4.196/linux/mpi/intel64/lib:/opt/intel/compilers_and_libraries_2017.4.196/linux/mpi/mic/lib:/opt/intel/compilers_and_libraries_2017.4.196/linux/ipp/lib/intel64:/opt/intel/compilers_and_libraries_2017.4.196/linux/compiler/lib/intel64_lin:/opt/intel/compilers_and_libraries_2017.4.196/linux/mkl/lib/intel64_lin:/opt/intel/compilers_and_libraries_2017.4.196/linux/tbb/lib/intel64/gcc4.7:/opt/intel/debugger_2017/iga/lib:/opt/intel/debugger_2017/libipt/intel64/lib:/opt/intel/compilers_and_libraries_2017.4.196/linux/daal/lib/intel64_lin:/opt/intel/compilers_and_libraries_2017.4.196/linux/daal/../tbb/lib/intel64_lin/gcc4.4
DYLD_LIBRARY_PATH is unset

== nvidia-smi ===================================================
Tue Oct 10 16:36:37 2017       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 384.90                 Driver Version: 384.90                    |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  GeForce GTX 950M    Off  | 00000000:0A:00.0 Off |                  N/A |
| N/A   45C    P0    N/A /  N/A |      0MiB /  4044MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+

== cuda libs  ===================================================
/usr/local/cuda-8.0/doc/man/man7/libcudart.so.7
/usr/local/cuda-8.0/doc/man/man7/libcudart.7
/usr/local/cuda-8.0/targets/x86_64-linux/lib/libcudart.so.8.0.61
/usr/local/cuda-8.0/targets/x86_64-linux/lib/libcudart_static.a
```

Output of `python -c "import tensorflow as tf; print(tf.GIT_VERSION, tf.VERSION)"`:
```
('v1.3.0-rc1-3111-g4196d6d', '1.4.0-rc0')
```

### Describe the problem

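`tf.svd` is slower on the GPU than on the CPU. Running the script below, which computes the SVD of a 1024x1024 `float32` matrix ten times, takes about 29.7 s of wall-clock time on the GPU (GeForce GTX 950M) but only about 7.2 s on the CPU. Both figures come from `time` on the whole script, so they include start-up.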

### Source code / logs

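File `main.py`:
```
import sys

import numpy as np
import tensorflow as tf

D = 1024
# Random D x D input, cast to match the float32 placeholder below.
dA = np.random.normal(size=(D, D)).astype(np.float32)

# No command-line argument: run on the GPU. Any argument (e.g. "-"): run on the CPU.
dev = "/gpu:0" if len(sys.argv) == 1 else "/cpu:0"

with tf.device(dev):
    A = tf.placeholder(shape=(D, D), dtype=tf.float32)
    S, U, V = tf.svd(A)

config = tf.ConfigProto()
config.log_device_placement = True  # log which device each op is placed on
config.graph_options.optimizer_options.global_jit_level = tf.OptimizerOptions.ON_1
sess = tf.Session(config=config)

# Compute the SVD ten times; range works on both Python 2 and Python 3.
for _ in range(10):
    dS, dU, dV = sess.run((S, U, V), feed_dict={A: dA})
```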
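Because the script is timed as a whole process, the reported wall-clock figure also covers TensorFlow start-up and device initialization, not only the ten `sess.run` calls. A minimal sketch of per-call timing that keeps the first (warm-up) call separate from the rest might look like the following. The helper names `time_calls` and `summarize` are hypothetical and not part of TensorFlow, and the workload is a stand-in so the sketch runs without a GPU.

```python
import time


def time_calls(fn, repeats=10):
    """Call fn() `repeats` times; return (first_call_seconds, later_call_seconds)."""
    durations = []
    for _ in range(repeats):
        start = time.time()
        fn()
        durations.append(time.time() - start)
    return durations[0], durations[1:]


def summarize(first, later):
    """Return the warm-up time and the mean of the remaining calls."""
    mean_later = sum(later) / len(later) if later else float("nan")
    return {"first": first, "mean_later": mean_later}


if __name__ == "__main__":
    # Stand-in workload (assumption) so the sketch runs anywhere.
    # In main.py the callable would be something like:
    #   lambda: sess.run((S, U, V), feed_dict={A: dA})
    first, later = time_calls(lambda: sum(i * i for i in range(100000)), repeats=5)
    print(summarize(first, later))
```

If the first call turned out to dominate, the gap would point at one-time setup rather than at the SVD kernel itself; if every call were slow, the kernel would be the more likely culprit. Neither outcome is established by the logs in this report.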

### Run on GPU
`time python main.py`
```
2017-10-10 16:28:49.047703: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:892] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2017-10-10 16:28:49.048176: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1030] Found device 0 with properties:
name: GeForce GTX 950M major: 5 minor: 0 memoryClockRate(GHz): 1.124
pciBusID: 0000:0a:00.0
totalMemory: 3.95GiB freeMemory: 3.91GiB
2017-10-10 16:28:49.048205: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1120] Creating TensorFlow device (/device:GPU:0) -> (device: 0, name: GeForce GTX 950M, pci bus id: 0000:0a:00.0, compute capability: 5.0)
Device mapping:
/job:localhost/replica:0/task:0/device:XLA_GPU:0 -> device: XLA_GPU device
/job:localhost/replica:0/task:0/device:XLA_CPU:0 -> device: XLA_CPU device
/job:localhost/replica:0/task:0/device:GPU:0 -> device: 0, name: GeForce GTX 950M, pci bus id: 0000:0a:00.0, compute capability: 5.0
2017-10-10 16:28:49.064960: I tensorflow/core/common_runtime/direct_session.cc:299] Device mapping:
/job:localhost/replica:0/task:0/device:XLA_GPU:0 -> device: XLA_GPU device
/job:localhost/replica:0/task:0/device:XLA_CPU:0 -> device: XLA_CPU device
/job:localhost/replica:0/task:0/device:GPU:0 -> device: 0, name: GeForce GTX 950M, pci bus id: 0000:0a:00.0, compute capability: 5.0

Svd: (Svd): /job:localhost/replica:0/task:0/device:GPU:0
2017-10-10 16:28:49.067234: I tensorflow/core/common_runtime/placer.cc:874] Svd: (Svd)/job:localhost/replica:0/task:0/device:GPU:0
Placeholder: (Placeholder): /job:localhost/replica:0/task:0/device:GPU:0
2017-10-10 16:28:49.067302: I tensorflow/core/common_runtime/placer.cc:874] Placeholder: (Placeholder)/job:localhost/replica:0/task:0/device:GPU:0
2017-10-10 16:28:49.074053: I tensorflow/core/kernels/cuda_solvers.cc:159] Creating CudaSolver handles for stream 0x488e860
python main.py  27.50s user 2.30s system 100% cpu 29.658 total
```

### Run on CPU
`time python main.py -`
```
2017-10-10 16:29:53.252138: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:892] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2017-10-10 16:29:53.252572: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1030] Found device 0 with properties:
name: GeForce GTX 950M major: 5 minor: 0 memoryClockRate(GHz): 1.124
pciBusID: 0000:0a:00.0
totalMemory: 3.95GiB freeMemory: 3.91GiB
2017-10-10 16:29:53.252600: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1120] Creating TensorFlow device (/device:GPU:0) -> (device: 0, name: GeForce GTX 950M, pci bus id: 0000:0a:00.0, compute capability: 5.0)
Device mapping:
/job:localhost/replica:0/task:0/device:XLA_GPU:0 -> device: XLA_GPU device
/job:localhost/replica:0/task:0/device:XLA_CPU:0 -> device: XLA_CPU device
/job:localhost/replica:0/task:0/device:GPU:0 -> device: 0, name: GeForce GTX 950M, pci bus id: 0000:0a:00.0, compute capability: 5.0
2017-10-10 16:29:53.269242: I tensorflow/core/common_runtime/direct_session.cc:299] Device mapping:
/job:localhost/replica:0/task:0/device:XLA_GPU:0 -> device: XLA_GPU device
/job:localhost/replica:0/task:0/device:XLA_CPU:0 -> device: XLA_CPU device
/job:localhost/replica:0/task:0/device:GPU:0 -> device: 0, name: GeForce GTX 950M, pci bus id: 0000:0a:00.0, compute capability: 5.0

Svd: (Svd): /job:localhost/replica:0/task:0/device:CPU:0
2017-10-10 16:29:53.271505: I tensorflow/core/common_runtime/placer.cc:874] Svd: (Svd)/job:localhost/replica:0/task:0/device:CPU:0
Placeholder: (Placeholder): /job:localhost/replica:0/task:0/device:CPU:0
2017-10-10 16:29:53.271544: I tensorflow/core/common_runtime/placer.cc:874] Placeholder: (Placeholder)/job:localhost/replica:0/task:0/device:CPU:0
python main.py -  34.33s user 10.68s system 621% cpu 7.241 total
```
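The two `time` lines can be compared with a little arithmetic: the average number of busy CPU cores is (user + system) / wall-clock. The sketch below only restates the numbers from the two runs above; the function name `cpu_utilization` is chosen here for illustration.

```python
def cpu_utilization(user_s, system_s, wall_s):
    """Average number of busy CPU cores: (user + system) / wall-clock."""
    return (user_s + system_s) / wall_s


# Figures copied from the two `time` lines above.
gpu_run = {"user": 27.50, "system": 2.30, "wall": 29.658}
cpu_run = {"user": 34.33, "system": 10.68, "wall": 7.241}

gpu_cores = cpu_utilization(gpu_run["user"], gpu_run["system"], gpu_run["wall"])
cpu_cores = cpu_utilization(cpu_run["user"], cpu_run["system"], cpu_run["wall"])
slowdown = gpu_run["wall"] / cpu_run["wall"]

print("GPU run: %.2f busy host cores" % gpu_cores)
print("CPU run: %.2f busy host cores" % cpu_cores)
print("GPU wall-clock / CPU wall-clock: %.2f" % slowdown)
```

This gives about 1.0 busy host core for the GPU run and about 6.2 for the CPU run, with the GPU run roughly 4.1 times slower in wall-clock time. During the GPU run the host process keeps about one core busy for the whole duration, while the CPU kernel spreads over several cores. These totals include process start-up, so they are only a coarse comparison.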