jamesr66a · 2019-04-08T22:03:02Z

I've been messing around with vectorizing the fusion compiler in JIT, and noticed that these ops were pathologically slow. I moved them to use TensorIterator + Vec256<> and got some speed wins.

Benchmark script:

import torch, time

ops = ['abs', 'neg', 'reciprocal', 'frac']

x = torch.rand(1024, 1024)
NITER = 10000

print('op', 'time per iter (ms)', 'gops/s', 'GB/s', sep='\t')

for op in ops:
    s = time.time()
    for i in range(NITER):
        getattr(x, op)()
    elapsed_sec = ((time.time() - s) / NITER)
    print(op, elapsed_sec * 1000, (1024*1024/elapsed_sec)/1e9, (1024*1024*4*2) / elapsed_sec / 1e9, sep='\t')

Before this change (on my mac with a skylake):

op      time per iter (ms)      gops/s  GB/s
abs     0.9730974197387695      1.0775652866097343      8.620522292877874
neg     1.0723679780960083      0.9778136063534356      7.822508850827485
reciprocal      1.2610594034194946      0.8315040490215421      6.6520323921723366
frac    1.1681334018707275      0.8976509004200546      7.181207203360437

After this change:

op      time per iter (ms)      gops/s  GB/s
abs     0.5031076192855835      2.084198210889721       16.673585687117768
neg     0.4433974027633667      2.3648672578256087      18.91893806260487
reciprocal      0.47145988941192624     2.2241043693195985      17.79283495455679
frac    0.5036592721939087      2.0819154096627024      16.65532327730162

So, after this change it looks like we are hitting machine peak for bandwidth and are bandwidth bound.
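For reference, the GB/s column follows from the script's own formula: each iteration reads one 1024x1024 float32 tensor and writes one of the same size, so it moves 1024*1024*4*2 bytes. A minimal sketch that recomputes the derived columns from a measured time per iteration (the helper name `throughput` is hypothetical and not part of this PR):

```python
# Recompute the benchmark's derived columns from a measured time per iteration.
# `throughput` is a hypothetical helper for illustration; it is not part of the PR.

NUMEL = 1024 * 1024   # elements in the benchmarked tensor
BYTES_PER_ELEM = 4    # torch.rand produces float32 by default
TRAFFIC = 2           # one read of the input plus one write of the output


def throughput(ms_per_iter):
    """Return (gops/s, GB/s) for a unary op over NUMEL float32 elements."""
    sec = ms_per_iter / 1000.0
    gops = NUMEL / sec / 1e9
    gbps = NUMEL * BYTES_PER_ELEM * TRAFFIC / sec / 1e9
    return gops, gbps


before = throughput(0.9730974197387695)  # abs, before the change
after = throughput(0.5031076192855835)   # abs, after the change
print("abs before: %.2f gops/s, %.2f GB/s" % before)
print("abs after:  %.2f gops/s, %.2f GB/s" % after)
print("speedup: %.2fx" % (after[1] / before[1]))
```

Whether the "after" figure really is machine peak depends on the memory bandwidth of the specific machine, which the numbers above do not establish on their own.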

cpuhrsch · 2019-04-08T23:56:38Z

  Tensor & fill_(const Tensor & value);
  Tensor floor() const;
  Tensor & floor_();
+  Tensor frac() const;


Why are these moving around?

I moved the declarations in native_functions.yaml so that they're not under this comment: https://github.com/pytorch/pytorch/blob/master/aten/src/ATen/native/native_functions.yaml#L3182

cpuhrsch · 2019-04-08T23:57:26Z

Did you look at Declarations.cwrap to see if you can remove some legacy functions or restrict them to CUDA etc.?

cpuhrsch · 2019-04-09T00:01:47Z

  return result;
 }

+// ***** abs *****


I think you can make use of some of the below macros, no?

If not, I think it'd be cleaner to add one; there seems to be a lot of commonality between these functions.

cpuhrsch · 2019-04-09T00:03:04Z

+        iter,
+        [=](scalar_t a) -> scalar_t { return a - std::trunc(a); },
+        [=](Vec256<scalar_t> a) {
+          return a - a.trunc();


It's better to add this to vec256 as a `.frac()` function in case we find a better way of doing this down the road. Same for neg below. Then you can also use the macros to get rid of the boilerplate.
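The scalar lambda in the diff defines frac as `a - trunc(a)`, which keeps the sign of the input, unlike `a - floor(a)`. A small pure-Python sketch of that definition, independent of ATen (the function `frac` below is an illustrative stand-in, not the Vec256 method):

```python
import math


def frac(a):
    """Fractional part as in the scalar lambda: a - trunc(a)."""
    return a - math.trunc(a)


# Compare the trunc-based definition with a floor-based one on a few inputs.
for value in (2.75, -2.75, 3.0, -0.5):
    print(value, frac(value), value - math.floor(value))
```

The two definitions agree for non-negative inputs and differ for negative ones, so any vectorized `.frac()` would need to match the trunc-based form to stay consistent with the scalar path.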

cpuhrsch · 2019-04-09T03:29:31Z

 #include <ATen/Parallel.h>
 #include <ATen/native/UnaryOps.h>
 #include <ATen/native/TensorIterator.h>
+#include <ATen/cpu/vec256/vec256_base.h>


Why is this one needed?

This is old from the original implementation. Can delete

cpuhrsch · 2019-04-09T03:33:38Z

Could you revisit the relevant tests for this and check for dtype coverage and large input tensors? This has been an issue in the past.

cpuhrsch · 2019-04-09T03:38:36Z


+// Negation. Defined here so we can utilize operator-
+
+Vec256<int64_t> Vec256<int64_t>::neg() const {


There does exist an xor instruction for integers as well (for AVX2+). This could aid further optimization.
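The comment does not spell out how xor would be used. One possible reading, stated here as an assumption rather than as what the PR does (the diff defines `neg` through `operator-`), is the two's-complement identity `-x == (x XOR all-ones) + 1`. A pure-Python sketch of that identity for a single 64-bit lane, with hypothetical helper names:

```python
MASK64 = (1 << 64) - 1  # all-ones bit pattern for one 64-bit lane


def to_signed64(bits):
    """Interpret a 64-bit pattern as a two's-complement signed integer."""
    return bits - (1 << 64) if bits & (1 << 63) else bits


def neg_via_xor(x):
    """Negate a signed 64-bit value as (x XOR all-ones) + 1, wrapping at 64 bits."""
    bits = x & MASK64        # two's-complement bit pattern of x
    flipped = bits ^ MASK64  # XOR with all ones inverts every bit
    return to_signed64((flipped + 1) & MASK64)


for value in (5, -7, 0, -2**63):
    print(value, neg_via_xor(value))
```

As with ordinary fixed-width negation, the minimum value `-2**63` maps to itself because its positive counterpart is not representable in 64 bits.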

gchanan · 2019-04-09T15:43:51Z

        ('cosh', (S, S, S), NO_ARGS, '', (True,)),
        ('cosh', (), NO_ARGS, 'scalar', (True,)),
-        ('abs', (S, S, S), NO_ARGS, '', (True,)),
+        ('abs', (L, L, L), NO_ARGS, '', (True,)),


wait, are these being used for benchmarking? These tests are really slow when run under test_autograd and they don't help correctness.

I can change them back. I'm just trying to get them to be large enough to make sure they go down the vector path.

Well we should definitely have some tests for the vector path

We can just make it a slow test (and maybe only the forward?).

My decision here is to just leave common_method_invocations as it is on master and rely on the test_torch tests to ensure the functionality of this PR

gchanan · 2019-04-09T15:47:03Z

  types:
    - floating_point
  backends:
-    - CPU


why can't you kill the entire entry instead of just the CPU one? How are these called?

They're called from the stubs in CUDAUnaryOps.cpp

~~Actually turns out they're not so I'm gonna delete these~~

facebook-github-bot

@jamesr66a is landing this pull request. If you are a Facebook employee, you can view this diff on Phabricator.

Pull Request resolved: pytorch/pytorch#19041
Differential Revision: D14862037
Pulled By: jamesr66a
fbshipit-source-id: e2032ac0ca962dbf4120bb36812277c260e22912

facebook-github-bot · 2019-04-10T07:08:09Z

@jamesr66a merged this pull request in 82b5705.

Summary: This is a follow-up on James's PR: #19041. The idea is to replace the legacy `sinh` / `cosh` ops that are being dispatched to TH with the operations defined in `Vec256` for better performance.

Benchmark (from James's script):

```python
import torch, time

ops = ['sinh', 'cosh']

x = torch.rand(1024, 1024)
NITER = 10000

print('op', 'time per iter (ms)', 'gops/s', 'GB/s', sep='\t')

for op in ops:
    s = time.time()
    for i in range(NITER):
        getattr(x, op)()
    elapsed_sec = ((time.time() - s) / NITER)
    print(op, elapsed_sec * 1000, (1024*1024/elapsed_sec)/1e9, (1024*1024*4*2) / elapsed_sec / 1e9, sep='\t')
```

Code on master:

```
op      time per iter (ms)      gops/s  GB/s
sinh    3.37614369392395        0.3105839369002935      2.484671495202348
cosh    3.480502033233643       0.3012714803748572      2.4101718429988574
```

After change (on MacBook Pro 2018):

```
op      time per iter (ms)      gops/s  GB/s
sinh    0.8956503868103027      1.1707425301677301      9.365940241341841
cosh    0.9392147302627564      1.1164390487217428      8.931512389773943
```

Pull Request resolved: #21115
Reviewed By: ljk53
Differential Revision: D15574580
Pulled By: xta0
fbshipit-source-id: 392546a0df11ed4f0945f2bc84bf5dea2750b60e

jamesr66a requested review from colesbury, cpuhrsch and gchanan April 8, 2019 22:03

Move abs, frac, reciprocal, and neg to TensorIterator

3d0664f

jamesr66a force-pushed the unary_block branch from 88e516f to 3d0664f Compare April 8, 2019 22:34

cuda build

52fa6a7

cpuhrsch reviewed Apr 8, 2019

cpuhrsch reviewed Apr 9, 2019

jamesr66a force-pushed the unary_block branch from 795d828 to 36c8385 Compare April 9, 2019 00:40

address comments

b05aa71

jamesr66a force-pushed the unary_block branch from 36c8385 to b05aa71 Compare April 9, 2019 00:43

cpuhrsch reviewed Apr 9, 2019

James Reed added 2 commits April 8, 2019 20:39

fix some small stuff

4557603

Modify tests

076729b

gchanan requested changes Apr 9, 2019

address some comments

6bd7ef8

jamesr66a force-pushed the unary_block branch from 832befe to 6bd7ef8 Compare April 9, 2019 19:58

ezyang added the oncall: jit and module: vectorization labels Apr 9, 2019

gchanan approved these changes Apr 9, 2019

facebook-github-bot reviewed Apr 9, 2019

jamesr66a mentioned this pull request Apr 10, 2019

[PyTorch] Unary Operator Vectorization #19088

Closed

9 tasks

facebook-github-bot closed this in 82b5705 Apr 10, 2019

facebook-github-bot added the merged label Apr 10, 2019

xta0 mentioned this pull request May 30, 2019

Move legacy TH functions(sinh,cosh) to TensorIterator + Vec256 #21115

Closed

fmassa mentioned this pull request Jul 15, 2019

Port sign operator from the TH code to Aten #22806

Closed


6 participants
