VitalyFedyunin · 2019-03-29T16:42:14Z

Replace cpu_apply functions with the TensorIterator.
Vectorize copy and clone functions.
Move big pieces of the code to cpu kernels folder to be able to use AVX2.
Add a fast path for the copy_ function when the tensor types match.

A slowdown is observed on smaller tensors (up to 10%, or about 1us per op), which might be explained by the bigger CPU footprint of TensorIterator compared to the simpler cpu_apply. Conversely, on bigger tensors we see a 2x-3x performance improvement (single-threaded; multithreading gives an even bigger boost).

…ace_cpu_apply

facebook-github-bot

@VitalyFedyunin has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.

cpuhrsch · 2019-04-16T17:32:51Z

  auto builder = TensorIterator::Builder();
  builder.add_output(out);
  builder.add_input(a);
+  if (!resize_outputs) {


Down the road, is this something we want to remove again? This can be part of a wider discussion around operator contracts.

This is a topic for a larger discussion. I assume copy should only accept src and dst of the same shape (as before), but if we want broadcasting here, it becomes possible as soon as we rewrite the TH code.

…ace_cpu_apply

watch: pytorch#19345 - same as this one pytorch#18618 - use tensor iterator

facebook-github-bot

@VitalyFedyunin has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.

VitalyFedyunin · 2019-04-18T15:13:43Z

@pytorchbot retest this please

…ace_cpu_apply

This reverts commit 72538dd.

…ace_cpu_apply

VitalyFedyunin · 2019-04-19T21:34:09Z

The move is done, as is the cleanup. I am still planning to look into what the deal is with the Scalar and to run benchmarks.

facebook-github-bot

@VitalyFedyunin has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.

…ace_cpu_apply

VitalyFedyunin · 2019-04-22T19:14:23Z

Done with all changes, please review.
PS. Going to publish benchmarks here too (today).

cpuhrsch · 2019-04-22T19:16:46Z

+  if (self.scalar_type() == src.scalar_type()) {
+    copy_kernel_same_type(kCPU, self, src);
+  } else {
+    AT_CHECK(self.numel() == src.numel(), "sizes do not match");


Numel doesn't give you all info about the size. I'm not sure what a better word might be.

It is a copy of the current aten/src/ATen/native/Copy.cpp:23. I have no better wording for it.

Also "sizes do not match" is a bad error message. We should mention what the shapes were, and what the problem is.

This should probably be an AT_ASSERT, but I don't think it's super important. We do a proper shape check before this in copy_ via expand_inplace.
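To make the request for a better message concrete, here is a minimal sketch in plain Python of a check that reports both shapes and both element counts. It is an illustration only, not the ATen code: `check_copy_sizes` is a hypothetical helper, and the exact wording of the message is an assumption.

```python
import math


def check_copy_sizes(dst_shape, src_shape):
    """Raise a descriptive error when dst and src hold different element counts.

    Hypothetical helper for illustration; it mirrors the numel() comparison
    in the diff above but names the shapes involved.
    """
    dst_numel = math.prod(dst_shape)
    src_numel = math.prod(src_shape)
    if dst_numel != src_numel:
        raise ValueError(
            f"copy_: element counts differ: dst has shape {tuple(dst_shape)} "
            f"({dst_numel} elements) but src has shape {tuple(src_shape)} "
            f"({src_numel} elements)"
        )


# Same element count: passes silently, even though the shapes differ.
check_copy_sizes((2, 3), (3, 2))

# Different element count: the message names both shapes.
try:
    check_copy_sizes((2, 3), (2, 4))
except ValueError as err:
    message = str(err)
    print(message)
```

Note that the first call passes, which is the point made earlier in this thread: numel alone does not describe the size, so a proper shape check still has to happen elsewhere.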

cpuhrsch · 2019-04-22T19:17:26Z


-  parallel_for(0, dst.nbytes(), COPY_GRAIN_SIZE, sample);
+  if (self.scalar_type() == at::ScalarType::Half) {
+    unary_kernel(*iter, [=](at::Half a) -> at::Half { return a; });


You might be able to use int16 here and vectorize.
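The idea behind this suggestion is that a same-type copy never interprets the values, so 16-bit floats can be moved as opaque 16-bit integers. A minimal sketch of that idea using only Python's standard `struct` module follows; it is not the ATen kernel, and `copy_half_as_int16` is a hypothetical name.

```python
import struct


def copy_half_as_int16(src: bytes) -> bytes:
    """Copy a buffer of float16 values by moving them as raw int16 values.

    The bit patterns are never interpreted as floating point, so the copy
    is exact for every value, including NaN payloads.
    """
    count = len(src) // 2
    as_int16 = struct.unpack(f"<{count}h", src)   # reinterpret as int16
    return struct.pack(f"<{count}h", *as_int16)   # write the same bits back


values = [1.5, -2.0, 0.0, 65504.0]  # all exactly representable in float16
src = struct.pack(f"<{len(values)}e", *values)    # 'e' is the half-float format
dst = copy_half_as_int16(src)
roundtrip = list(struct.unpack(f"<{len(values)}e", dst))
print(roundtrip)
```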

cpuhrsch · 2019-04-22T19:20:04Z


+// output is contiguous, arg1 is scalar
+template <typename traits>
+static inline bool is_unary_contiguous_s1(const int64_t* strides) {


Might this be more accurately described as a nullary operation, as is about to be added via #18876?

Yes, and it will save me from writing a vectorized loop for the scalar case. Right now it is unnecessary as we never broadcast during the copy. But I look forward to both PRs landing to complete the vectorization.

Right now it is unnecessary as we never broadcast during the copy

We support broadcasting in copy-operations (on the src but not dst tensor). We should have a test case for this, but you should double-check that it still works.

This is not quite the same as nullary (zero-input) operation, although they are somewhat similar. For example:

x = torch.randn(8, 1024)
y = torch.randn(8, 1)
x.copy_(y)

will use this case. (Unlike fill_, y is a Tensor here)

I should add that the broadcasting happens in between Tensor::copy_ and Tensor::s_copy_ by some auto-generated code. The copy code doesn't handle it explicitly. (Maybe that's what you meant?)

A lot of the functions that use TensorIterator no longer use this pattern (there's no longer an s_add or s_mul) because TensorIterator handles broadcasting and shape checks. Eventually we can do that for copy_ (it'll reduce some boilerplate) but that will come later.

I should add that the broadcasting happens in between Tensor::copy_ and Tensor::s_copy_ by some auto-generated code. The copy code doesn't handle it explicitly. (Maybe that's what you meant?)

Yes. It actually happens inside _copy, via

std::tie(b_src) = expand_inplace(self, src, "copy");

A lot of the functions that use TensorIterator no longer use this pattern (there's no longer an s_add or s_mul) because TensorIterator handles broadcasting and shape checks. Eventually we can do that for copy_ (it'll reduce some boilerplate) but that will come later.

Agreed. As soon as I check the transpose edge case (I suspect it is slower than the vectorized TensorIterator), we can remove the resize and leave broadcasting to TensorIterator.
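The one-directional rule discussed in this thread (src may be broadcast to dst, never the other way round) can be sketched in plain Python. This only illustrates the shape check, not the ATen expand_inplace code; `check_expand_src` is a hypothetical helper name.

```python
def check_expand_src(dst_shape, src_shape):
    """Return, per dst dimension, whether src must be expanded along it.

    Raises ValueError if src cannot be broadcast to dst. dst itself is
    never expanded, matching the behaviour described for copy_.
    """
    dst_shape, src_shape = tuple(dst_shape), tuple(src_shape)
    if len(src_shape) > len(dst_shape):
        raise ValueError(
            f"src of shape {src_shape} has more dimensions than dst of shape {dst_shape}"
        )
    # Left-pad src with size-1 dimensions, as broadcasting does.
    padded = (1,) * (len(dst_shape) - len(src_shape)) + src_shape
    expanded = []
    for dim, (d, s) in enumerate(zip(dst_shape, padded)):
        if s == d:
            expanded.append(False)
        elif s == 1:
            expanded.append(True)   # src is repeated along this dimension
        else:
            raise ValueError(
                f"cannot expand src of shape {src_shape} to dst of shape "
                f"{dst_shape}: mismatch at dimension {dim}"
            )
    return expanded


# The x.copy_(y) example from this thread: y of shape (8, 1) into x of shape (8, 1024).
print(check_expand_src((8, 1024), (8, 1)))

# The reverse direction is rejected, because dst is never broadcast.
try:
    check_expand_src((8, 1), (8, 1024))
except ValueError as err:
    print(err)
```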

cpuhrsch · 2019-04-22T19:36:14Z

Accepted under the assumption that nullary ops will be implemented and replace the unary_s1 case.

colesbury · 2019-04-22T20:55:09Z


+// output is contiguous, arg1 is scalar
+template <typename traits>
+static inline bool is_unary_contiguous_s1(const int64_t* strides) {


Right now it is unnecessary as we never broadcast during the copy

We support broadcasting in copy-operations (on the src but not dst tensor). We should have a test case for this, but you should double-check that it still works.

This is not quite the same as nullary (zero-input) operation, although they are somewhat similar. For example:

x = torch.randn(8, 1024)
y = torch.randn(8, 1)
x.copy_(y)

will use this case. (Unlike fill_, y is a Tensor here)

colesbury · 2019-04-22T21:06:05Z

-constexpr int64_t COPY_GRAIN_SIZE = 20000;
+template <typename self_T>
+void copy_kernel_cast_t_impl(Tensor& self, const Tensor& src) {
+  auto builder = TensorIterator::Builder();


I think it's a bit better to construct the TensorIterator once in _s_copy__cpu and pass it as an argument (instead of passing the Tensors as arguments). The reasons for this are:

It's a step towards centralizing error checks (reduces duplicate code and risk of missing shape or dtype checks for certain cases)

It better matches the pattern of the other kernels and people are likely to use this as an example

On second thought, right now the TensorIterator construction and the loop stay together, and Copy.cpp doesn't need to know anything about the implementation details of the kernel. This cleanly isolates the implementation from the dispatch and allows us to replace kernels (if necessary) with ease.

colesbury · 2019-04-22T21:06:08Z

-    memcpy(self_seg, src_seg, len);
-  };
+static void copy_kernel_same_type_impl(Tensor& self, const Tensor& src) {
+  auto builder = TensorIterator::Builder();


same here (see above comment)

…ace_cpu_apply

VitalyFedyunin · 2019-04-23T19:31:33Z

Single threaded:
OMP_NUM_THREADS=1 sudo nice --20 numactl --membind=0 --cpubind=0 taskset -c 0 /private/home/vitalyf/anaconda3/envs/py3/bin/python a.py

Before optimization:

torch.Size([100, 100, 100])
torch.DoubleTensor
times: [16.76904845237732, 16.771162509918213, 16.771618604660034, 16.77182149887085, 16.772063732147217, 16.772103309631348, 16.77212142944336, 16.772297382354736, 16.79311442375183, 16.815782070159912]
mean : 16.778113341331483
std  : 0.014137423384807224

After optimization:

torch.Size([100, 100, 100])
torch.DoubleTensor
times: [11.295104503631592, 11.295217275619507, 11.30241322517395, 11.303941249847412, 11.308543682098389, 11.311513423919678, 11.328369140625, 11.337308645248413, 11.349253416061401, 11.424124956130981]
mean : 11.325578951835633
std  : 0.03709524226320387

About 33% perf improvement on bigger tensors, all data types.

Multithreaded:

Before

torch.Size([100, 100, 100])
torch.DoubleTensor
times: [0.978071928024292, 0.97812819480896, 0.9785230159759521, 0.9785904884338379, 0.9786961078643799, 0.9792697429656982, 0.9793672561645508, 0.9795773029327393, 0.9799015522003174, 0.9880003929138184]
mean : 0.9798125982284546
std  : 0.002790457901376675

After

torch.Size([100, 100, 100])
torch.DoubleTensor
times: [0.7371759414672852, 0.7373862266540527, 0.7375473976135254, 0.7377481460571289, 0.7379317283630371, 0.7379584312438965, 0.7388496398925781, 0.740281343460083, 0.7441291809082031, 0.7923455238342285]
mean : 0.7441353559494018
std  : 0.01619207759240831

About 25% improvement.
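For reference, the two percentages can be recomputed from the reported means. The sketch below uses only the mean values printed above; `summarize` is a hypothetical helper, and "improvement" is taken here to mean the reduction in wall-clock time.

```python
def summarize(before, after):
    """Return (fractional reduction in time, speedup factor)."""
    return 1.0 - after / before, before / after


# Mean times in seconds, copied from the runs above.
single = summarize(16.778113341331483, 11.325578951835633)
multi = summarize(0.9798125982284546, 0.7441353559494018)

print(f"single-threaded: {single[0]:.1%} less time, {single[1]:.2f}x speedup")
print(f"multithreaded:   {multi[0]:.1%} less time, {multi[1]:.2f}x speedup")
```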

Testing script:

import time

import numpy as np
import torch

# Benchmark clone() on a permuted (non-contiguous) view of a double tensor.
x = torch.randn((100, 100, 100), dtype=torch.double)
z = x.permute(0, 2, 1)

times = []
for i in range(10):
    print(i)  # progress indicator
    start = time.time()
    for _ in range(10000):
        z.clone()
    times.append(time.time() - start)
times = np.array(times)
print("")
print(x.size())
print(x.type())
print("times: " + str(times))
print("mean : " + str(np.mean(times)))
print("std  : " + str(np.std(times)))

Numbers are reproducible between multiple invocations.

Longer pytorch/benchmarks are running now.

VitalyFedyunin · 2019-04-24T19:22:29Z

BEFORE:

----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
     Job number   ETA (mm:ss)         Benchmark       Time mean (us)        Time std (us)        CPU mean (us)         CPU std (us)      Iter. mean             Rep.   mag   dim    cont          function           dtype   trans   framework                          strides                            sizes
----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
           1/32        78:33       CPUCopyBench                    4                    0                    4                    0         1008132               30     1     3    True   ('copy_', None)   torch.float32   False       Torch                        (4, 2, 1)            torch.Size([2, 2, 2])
           2/32        76:00       CPUCopyBench                    4                    0                    4                    0         1008481               30     1     3    True   ('copy_', None)   torch.float64   False       Torch                        (4, 2, 1)            torch.Size([2, 2, 2])
           3/32        73:28       CPUCopyBench                    4                    0                    4                    0         1001868               30     1     3   False   ('copy_', None)   torch.float32   False       Torch                     (72, 36, 18)            torch.Size([2, 2, 2])
           4/32        70:56       CPUCopyBench                    4                    0                    4                    0         1004311               30     1     3   False   ('copy_', None)   torch.float64   False       Torch                     (72, 36, 18)            torch.Size([2, 2, 2])
           5/32        68:24       CPUCopyBench                    6                    0                    6                    0          824922               30     3     3    True   ('copy_', None)   torch.float32   False       Torch                       (81, 9, 1)            torch.Size([9, 9, 9])
           6/32        65:52       CPUCopyBench                    6                    0                    6                    0          820990               30     3     3    True   ('copy_', None)   torch.float64   False       Torch                       (81, 9, 1)            torch.Size([9, 9, 9])
           7/32        63:20       CPUCopyBench                    7                    0                    7                    0          695782               30     3     3   False   ('copy_', None)   torch.float32   False       Torch                  (1458, 162, 18)            torch.Size([9, 9, 9])
           8/32        60:48       CPUCopyBench                    7                    0                    7                    0          658744               30     3     3   False   ('copy_', None)   torch.float64   False       Torch                  (1458, 162, 18)            torch.Size([9, 9, 9])
           9/32        58:16       CPUCopyBench                 1338                    0                 1338                    0            3734               30     6     3    True   ('copy_', None)   torch.float32   False       Torch                    (9801, 99, 1)         torch.Size([99, 99, 99])
          10/32        55:44       CPUCopyBench                 1342                    0                 1341                    0            3725               30     6     3    True   ('copy_', None)   torch.float64   False       Torch                    (9801, 99, 1)         torch.Size([99, 99, 99])
          11/32        53:13       CPUCopyBench                 8991                   24                 8987                   24             556               30     6     3   False   ('copy_', None)   torch.float32   False       Torch               (176418, 1782, 18)         torch.Size([99, 99, 99])
          12/32        50:42       CPUCopyBench                13484                   45                13479                   45             371               30     6     3   False   ('copy_', None)   torch.float64   False       Torch               (176418, 1782, 18)         torch.Size([99, 99, 99])
          13/32        48:11       CPUCopyBench                13953                    3                13947                    3             359               30     7     3    True   ('copy_', None)   torch.float32   False       Torch                  (46225, 215, 1)      torch.Size([215, 215, 215])
          14/32        45:39       CPUCopyBench                18650                   33                18642                   33             268               30     7     3    True   ('copy_', None)   torch.float64   False       Torch                  (46225, 215, 1)      torch.Size([215, 215, 215])
          15/32        43:10       CPUCopyBench               100410                  278               100369                  277              50               30     7     3   False   ('copy_', None)   torch.float32   False       Torch               (832050, 3870, 18)      torch.Size([215, 215, 215])
          16/32        40:43       CPUCopyBench               140376                  250               140320                  249              36               30     7     3   False   ('copy_', None)   torch.float64   False       Torch               (832050, 3870, 18)      torch.Size([215, 215, 215])
          17/32        38:09       CPUCopyBench                    5                    0                    5                    0          999073               30     1     3    True   ('copy_', None)   torch.float32    True       Torch                        (2, 4, 1)            torch.Size([2, 2, 2])
          18/32        35:36       CPUCopyBench                    4                    0                    4                    0         1002029               30     1     3    True   ('copy_', None)   torch.float64    True       Torch                        (2, 4, 1)            torch.Size([2, 2, 2])
          19/32        33:03       CPUCopyBench                    5                    0                    5                    0          992863               30     1     3   False   ('copy_', None)   torch.float32    True       Torch                     (36, 72, 18)            torch.Size([2, 2, 2])
          20/32        30:30       CPUCopyBench                    4                    0                    4                    0         1000780               30     1     3   False   ('copy_', None)   torch.float64    True       Torch                     (36, 72, 18)            torch.Size([2, 2, 2])
          21/32        27:58       CPUCopyBench                    6                    0                    6                    0          736492               30     3     3    True   ('copy_', None)   torch.float32    True       Torch                       (9, 81, 1)            torch.Size([9, 9, 9])
          22/32        25:25       CPUCopyBench                    6                    0                    6                    0          740196               30     3     3    True   ('copy_', None)   torch.float64    True       Torch                       (9, 81, 1)            torch.Size([9, 9, 9])
          23/32        22:52       CPUCopyBench                    7                    0                    7                    0          668404               30     3     3   False   ('copy_', None)   torch.float32    True       Torch                  (162, 1458, 18)            torch.Size([9, 9, 9])
          24/32        20:19       CPUCopyBench                    7                    0                    7                    0          626416               30     3     3   False   ('copy_', None)   torch.float64    True       Torch                  (162, 1458, 18)            torch.Size([9, 9, 9])
          25/32        17:47       CPUCopyBench                 1494                    8                 1493                    8            3346               30     6     3    True   ('copy_', None)   torch.float32    True       Torch                    (99, 9801, 1)         torch.Size([99, 99, 99])
          26/32        15:14       CPUCopyBench                 1503                    5                 1502                    5            3326               30     6     3    True   ('copy_', None)   torch.float64    True       Torch                    (99, 9801, 1)         torch.Size([99, 99, 99])
          27/32        12:42       CPUCopyBench                 9302                   25                 9298                   25             537               30     6     3   False   ('copy_', None)   torch.float32    True       Torch               (1782, 176418, 18)         torch.Size([99, 99, 99])
          28/32        10:09       CPUCopyBench                14004                   36                13999                   36             357               30     6     3   False   ('copy_', None)   torch.float64    True       Torch               (1782, 176418, 18)         torch.Size([99, 99, 99])
          29/32         7:37       CPUCopyBench                14893                   46                14887                   46             336               30     7     3    True   ('copy_', None)   torch.float32    True       Torch                  (215, 46225, 1)      torch.Size([215, 215, 215])
          30/32         5:04       CPUCopyBench                19426                   57                19418                   57             258               30     7     3    True   ('copy_', None)   torch.float64    True       Torch                  (215, 46225, 1)      torch.Size([215, 215, 215])
          31/32         2:32       CPUCopyBench               101009                  254               100968                  253              50               30     7     3   False   ('copy_', None)   torch.float32    True       Torch               (3870, 832050, 18)      torch.Size([215, 215, 215])
          32/32         0:00       CPUCopyBench               141436                  529               141380                  530              35               30     7     3   False   ('copy_', None)   torch.float64    True       Torch               (3870, 832050, 18)      torch.Size([215, 215, 215])

AFTER:

----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
     Job number   ETA (mm:ss)         Benchmark       Time mean (us)        Time std (us)        CPU mean (us)         CPU std (us)      Iter. mean             Rep.          function   trans    cont   mag           dtype   dim   framework                          strides                            sizes
----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
           1/32        78:34       CPUCopyBench                    5                    0                    5                    0          957369               30   ('copy_', None)   False    True     1   torch.float32     3       Torch                        (4, 2, 1)            torch.Size([2, 2, 2])
           2/32        76:02       CPUCopyBench                    5                    0                    5                    0          904700               30   ('copy_', None)    True    True     1   torch.float32     3       Torch                        (2, 4, 1)            torch.Size([2, 2, 2])
           3/32        73:30       CPUCopyBench                    5                    0                    5                    0          957313               30   ('copy_', None)   False    True     1   torch.float64     3       Torch                        (4, 2, 1)            torch.Size([2, 2, 2])
           4/32        70:58       CPUCopyBench                    5                    0                    5                    0          897145               30   ('copy_', None)    True    True     1   torch.float64     3       Torch                        (2, 4, 1)            torch.Size([2, 2, 2])
           5/32        68:26       CPUCopyBench                    5                    0                    5                    0          922954               30   ('copy_', None)   False    True     3   torch.float32     3       Torch                       (81, 9, 1)            torch.Size([9, 9, 9])
           6/32        65:54       CPUCopyBench                    7                    0                    7                    0          641419               30   ('copy_', None)    True    True     3   torch.float32     3       Torch                       (9, 81, 1)            torch.Size([9, 9, 9])
           7/32        63:22       CPUCopyBench                    5                    0                    5                    0          877614               30   ('copy_', None)   False    True     3   torch.float64     3       Torch                       (81, 9, 1)            torch.Size([9, 9, 9])
           8/32        60:49       CPUCopyBench                    7                    0                    7                    0          667641               30   ('copy_', None)    True    True     3   torch.float64     3       Torch                       (9, 81, 1)            torch.Size([9, 9, 9])
           9/32        58:17       CPUCopyBench                  397                    0                  397                    0           12574               30   ('copy_', None)   False    True     6   torch.float32     3       Torch                    (9801, 99, 1)         torch.Size([99, 99, 99])
          10/32        55:45       CPUCopyBench                  528                    0                  528                    0            9462               30   ('copy_', None)    True    True     6   torch.float32     3       Torch                    (99, 9801, 1)         torch.Size([99, 99, 99])
          11/32        53:13       CPUCopyBench                  784                    0                  784                    0            6374               30   ('copy_', None)   False    True     6   torch.float64     3       Torch                    (9801, 99, 1)         torch.Size([99, 99, 99])
          12/32        50:41       CPUCopyBench                  900                    0                  899                    0            5554               30   ('copy_', None)    True    True     6   torch.float64     3       Torch                    (99, 9801, 1)         torch.Size([99, 99, 99])
          13/32        48:10       CPUCopyBench                 8531                   25                 8527                   25             586               30   ('copy_', None)   False    True     7   torch.float32     3       Torch                  (46225, 215, 1)      torch.Size([215, 215, 215])
          14/32        45:38       CPUCopyBench                 9581                   15                 9577                   15             522               30   ('copy_', None)    True    True     7   torch.float32     3       Torch                  (215, 46225, 1)      torch.Size([215, 215, 215])
          15/32        43:07       CPUCopyBench                18342                   19                18335                   19             273               30   ('copy_', None)   False    True     7   torch.float64     3       Torch                  (46225, 215, 1)      torch.Size([215, 215, 215])
          16/32        40:35       CPUCopyBench                19461                   18                19453                   18             257               30   ('copy_', None)    True    True     7   torch.float64     3       Torch                  (215, 46225, 1)      torch.Size([215, 215, 215])
          17/32        38:03       CPUCopyBench                    5                    0                    5                    0          952869               30   ('copy_', None)   False   False     1   torch.float32     3       Torch                     (72, 36, 18)            torch.Size([2, 2, 2])
          18/32        35:30       CPUCopyBench                    5                    0                    5                    0          900599               30   ('copy_', None)    True   False     1   torch.float32     3       Torch                     (36, 72, 18)            torch.Size([2, 2, 2])
          19/32        32:58       CPUCopyBench                    5                    0                    5                    0          950621               30   ('copy_', None)   False   False     1   torch.float64     3       Torch                     (72, 36, 18)            torch.Size([2, 2, 2])
          20/32        30:26       CPUCopyBench                    5                    0                    5                    0          907684               30   ('copy_', None)    True   False     1   torch.float64     3       Torch                     (36, 72, 18)            torch.Size([2, 2, 2])
          21/32        27:53       CPUCopyBench                    7                    0                    7                    0          653460               30   ('copy_', None)   False   False     3   torch.float32     3       Torch                  (1458, 162, 18)            torch.Size([9, 9, 9])
          22/32        25:21       CPUCopyBench                    8                    0                    8                    0          589592               30   ('copy_', None)    True   False     3   torch.float32     3       Torch                  (162, 1458, 18)            torch.Size([9, 9, 9])
          23/32        22:49       CPUCopyBench                    8                    0                    8                    0          606024               30   ('copy_', None)   False   False     3   torch.float64     3       Torch                  (1458, 162, 18)            torch.Size([9, 9, 9])
          24/32        20:17       CPUCopyBench                    9                    0                    9                    0          553237               30   ('copy_', None)    True   False     3   torch.float64     3       Torch                  (162, 1458, 18)            torch.Size([9, 9, 9])
          25/32        17:45       CPUCopyBench                 9014                   23                 9011                   23             555               30   ('copy_', None)   False   False     6   torch.float32     3       Torch               (176418, 1782, 18)         torch.Size([99, 99, 99])
          26/32        15:13       CPUCopyBench                 9287                   20                 9283                   20             538               30   ('copy_', None)    True   False     6   torch.float32     3       Torch               (1782, 176418, 18)         torch.Size([99, 99, 99])
          27/32        12:41       CPUCopyBench                13255                   29                13250                   29             377               30   ('copy_', None)   False   False     6   torch.float64     3       Torch               (176418, 1782, 18)         torch.Size([99, 99, 99])
          28/32        10:08       CPUCopyBench                13853                   25                13847                   25             361               30   ('copy_', None)    True   False     6   torch.float64     3       Torch               (1782, 176418, 18)         torch.Size([99, 99, 99])
          29/32         7:37       CPUCopyBench                99736                  191                99695                  190              51               30   ('copy_', None)   False   False     7   torch.float32     3       Torch               (832050, 3870, 18)      torch.Size([215, 215, 215])
          30/32         5:05       CPUCopyBench               100864                  164               100822                  163              50               30   ('copy_', None)    True   False     7   torch.float32     3       Torch               (3870, 832050, 18)      torch.Size([215, 215, 215])
          31/32         2:32       CPUCopyBench               136722                  284               136666                  283              37               30   ('copy_', None)   False   False     7   torch.float64     3       Torch               (832050, 3870, 18)      torch.Size([215, 215, 215])
          32/32         0:00       CPUCopyBench               138618                  284               138560                  282              37               30   ('copy_', None)    True   False     7   torch.float64     3       Torch               (3870, 832050, 18)      torch.Size([215, 215, 215])

CONCLUSION:

A slowdown is observed on smaller tensors (up to 10%, or about 1us per op), which might be explained by the bigger CPU footprint of TensorIterator compared to the simpler cpu_apply. Conversely, on bigger tensors we see a 2x-3x performance improvement (single-threaded; multithreading gives an even bigger boost).
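Because the BEFORE and AFTER tables list their columns and rows in different orders, rows have to be matched on dtype, cont, trans and sizes before they can be compared. The sketch below does this by hand for a few non-transposed rows and computes the speedup from the "Time mean (us)" column. The row selection and the labels are mine; only the numbers come from the tables.

```python
# (label, before_us, after_us): "Time mean (us)" for trans=False rows,
# matched by dtype, cont and sizes across the two tables above.
rows = [
    ("2x2x2 float32 contiguous", 4, 5),
    ("99x99x99 float32 contiguous", 1338, 397),
    ("99x99x99 float64 contiguous", 1342, 784),
    ("99x99x99 float32 non-contiguous", 8991, 9014),
    ("215x215x215 float32 contiguous", 13953, 8531),
    ("215x215x215 float64 contiguous", 18650, 18342),
]

# Speedup > 1 means the AFTER build is faster.
speedups = {label: before / after for label, before, after in rows}

for label, factor in speedups.items():
    print(f"{label:34s} {factor:.2f}x")
```

On these rows the gain is largest for the contiguous float32 case at 99x99x99, while the non-contiguous row is essentially unchanged.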

cpuhrsch · 2019-04-24T19:54:04Z

@VitalyFedyunin - I'd say the slowdown on the small Tensors is fine, given that constant overhead is a separate topic in itself that we need to tackle holistically. Otherwise this looks great from a performance perspective.

facebook-github-bot

@VitalyFedyunin is landing this pull request. If you are a Facebook employee, you can view this diff on Phabricator.

Summary: Replace cpu_apply functions with the TensorIterator. Vectorize copy and clone functions. Move big pieces of the code to cpu kernels folder to be able to use AVX2. Add a fast path for the copy_ function when the tensor types match. A slowdown is observed on smaller tensors (up to 10%, or about 1us per op), which might be explained by the bigger CPU footprint of TensorIterator compared to the simpler cpu_apply. Conversely, on bigger tensors we see a 2x-3x performance improvement (single-threaded; multithreading gives an even bigger boost). Pull Request resolved: pytorch/pytorch#18618 Differential Revision: D14954118 Pulled By: VitalyFedyunin fbshipit-source-id: 9d9bdf3fd9d5e539a03071cced50d0a47bac1615

facebook-github-bot · 2019-04-25T19:07:09Z

@VitalyFedyunin merged this pull request in 465799f.

apaszke

Are we doing anything to prevent the slowdown for small tensors? This is as important as the large-tensor perf.

apaszke · 2019-04-30T10:23:57Z

+  if (self.scalar_type() == src.scalar_type()) {
+    copy_kernel_same_type(kCPU, self, src);
+  } else {
+    AT_CHECK(self.numel() == src.numel(), "sizes do not match");


Also, "sizes do not match" is a bad error message. We should mention what the shapes were and what the problem is.
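A more informative message could look something like the sketch below. It is not code from this PR and it deliberately avoids ATen so that it compiles on its own: `shape_to_string`, `numel`, `copy_size_mismatch_message` and `check_copy_sizes` are hypothetical helpers, and shapes are modelled as plain `std::vector<std::int64_t>`. Like the check quoted above, it compares element counts rather than full shapes.

```cpp
#include <cstddef>
#include <cstdint>
#include <sstream>
#include <stdexcept>
#include <string>
#include <vector>

// Hypothetical helper: render a shape such as {2, 3} as "[2, 3]".
std::string shape_to_string(const std::vector<std::int64_t>& shape) {
  std::ostringstream os;
  os << "[";
  for (std::size_t i = 0; i < shape.size(); ++i) {
    if (i > 0) os << ", ";
    os << shape[i];
  }
  os << "]";
  return os.str();
}

// Hypothetical helper: number of elements implied by a shape
// (an empty shape is treated as a scalar with one element).
std::int64_t numel(const std::vector<std::int64_t>& shape) {
  std::int64_t n = 1;
  for (std::int64_t d : shape) n *= d;
  return n;
}

// Build a message that names both shapes and states the violated rule.
std::string copy_size_mismatch_message(const std::vector<std::int64_t>& dst_shape,
                                       const std::vector<std::int64_t>& src_shape) {
  std::ostringstream os;
  os << "copy_: cannot copy a source of shape " << shape_to_string(src_shape)
     << " (" << numel(src_shape) << " elements) into a destination of shape "
     << shape_to_string(dst_shape) << " (" << numel(dst_shape)
     << " elements); the element counts must match";
  return os.str();
}

// Throw with the detailed message when the element counts differ.
void check_copy_sizes(const std::vector<std::int64_t>& dst_shape,
                      const std::vector<std::int64_t>& src_shape) {
  if (numel(dst_shape) != numel(src_shape)) {
    throw std::invalid_argument(copy_size_mismatch_message(dst_shape, src_shape));
  }
}
```

In the actual code the same information would presumably come from `self.sizes()` and `src.sizes()` and be passed to the existing check macro; how that is wired up is left open here.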

colesbury · 2019-04-30T14:11:18Z

@apaszke I'm working on refactoring the copy dispatch, which I think will reduce the overhead of this op. I don't think anyone is working on the broader problem of overhead across ops.

VitalyFedyunin · 2019-04-30T21:20:50Z

I'm planning to look into the TensorIterator small-tensor problem, but that is about a month from now.

resistor · 2019-05-01T18:37:11Z

@VitalyFedyunin You just undid my change from #19198, which eliminated type dispatch from the same-type path and tested as both faster and smaller in code size for me. It'd be nice to be pinged when you're planning to undo something I just did.

VitalyFedyunin · 2019-05-01T19:37:11Z

@resistor I did not eliminate the same-type dispatch, but the memcpy implementation from #19198 was indeed replaced by TensorIterator with Vec256, which per the tests is faster than the previous version of the code.

It is easy to switch it back. But I think that until we have a reliable benchmarking tool, it would be chasing shadows.

@colesbury @cpuhrsch any thoughts?
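For readers following the memcpy versus TensorIterator discussion, the general shape of a same-type fast path can be sketched as below. This is only an illustration under stated assumptions, not the code from this PR or from #19198: `StridedView` and `copy_same_type` are hypothetical names, the view is one-dimensional, and the fallback is a plain scalar loop rather than a Vec256 kernel.

```cpp
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <type_traits>

// Hypothetical 1-D view: a pointer, an element count, and a stride in elements.
template <typename T>
struct StridedView {
  T* data;
  std::int64_t size;
  std::int64_t stride;
};

// Copy src into dst when both hold the same element type.
// Assumes dst.size == src.size and that the two ranges do not overlap.
template <typename T>
void copy_same_type(StridedView<T> dst, StridedView<const T> src) {
  static_assert(std::is_trivially_copyable<T>::value,
                "the memcpy fast path requires a trivially copyable type");
  if (dst.size <= 0) {
    return;  // nothing to copy
  }
  if (dst.stride == 1 && src.stride == 1) {
    // Fast path: both sides are contiguous, so one bulk copy is enough.
    std::memcpy(dst.data, src.data,
                static_cast<std::size_t>(dst.size) * sizeof(T));
    return;
  }
  // Fallback: element-by-element copy that honours both strides.
  for (std::int64_t i = 0; i < dst.size; ++i) {
    dst.data[i * dst.stride] = src.data[i * src.stride];
  }
}
```

Which of the two branches wins for a given tensor size is exactly the kind of question that needs the reliable benchmarking tool mentioned above; the sketch makes no claim either way.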

VitalyFedyunin added 5 commits March 29, 2019 09:41

Replace cpu_apply with TensorIterator

ff5d0df

Ruin sparse

5895015

Merge branch 'master' of https://github.com/pytorch/pytorch into repl…

74b9195

…ace_cpu_apply

Lets see if anything fails really

a1eece5

Added booleans and serial support

2a6d302

facebook-github-bot reviewed Apr 16, 2019

serial -> parallel

fffecde

VitalyFedyunin requested a review from cpuhrsch April 16, 2019 17:25

facebook-github-bot reviewed Apr 16, 2019

cpuhrsch reviewed Apr 16, 2019

Comment thread aten/src/ATen/native/Copy.cpp Outdated

cpuhrsch reviewed Apr 16, 2019

Comment thread aten/src/ATen/native/Copy.cpp Outdated

cpuhrsch reviewed Apr 16, 2019

Simplify Copy

bfddf1e

VitalyFedyunin changed the title ~~[WIP] Replace cpu_apply with TensorIterator~~ Replace cpu_apply with TensorIterator inside of Copy function Apr 16, 2019

VitalyFedyunin added 2 commits April 16, 2019 12:16

Merge branch 'master' of https://github.com/pytorch/pytorch into repl…

375ef7d

…ace_cpu_apply

Merge branch 'master' of https://github.com/pytorch/pytorch into repl…

c725b16

…ace_cpu_apply

fmassa mentioned this pull request Apr 17, 2019

Improve tensor copy performance on CPU device #19345

Closed

Merge branch 'master' of https://github.com/pytorch/pytorch into repl…

43f6401

…ace_cpu_apply

colesbury reviewed Apr 17, 2019

Comment thread aten/src/ATen/native/cpu/Loops.h Outdated

Comment thread aten/src/ATen/native/Copy.cpp Outdated

colesbury mentioned this pull request Apr 17, 2019

Enable vectorized dim repeat. #19276

Closed

mingfeima added a commit to mingfeima/pytorch that referenced this pull request Apr 18, 2019

parallel copy when non-contiguous

68dd99f

watch: pytorch#19345 - same as this one pytorch#18618 - use tensor iterator

facebook-github-bot reviewed Apr 18, 2019

VitalyFedyunin added 7 commits April 18, 2019 08:47

WIP

0f1fb2a

Merge branch 'master' of https://github.com/pytorch/pytorch into repl…

9e26be4

…ace_cpu_apply

More updates;

72538dd

Revert "More updates;"

4048bb2

This reverts commit 72538dd.

Supporting different types now

de3a789

Vec256 for copy

ad1b49e

Merge branch 'master' of https://github.com/pytorch/pytorch into repl…

1000d94

…ace_cpu_apply

facebook-github-bot reviewed Apr 19, 2019

VitalyFedyunin added 2 commits April 22, 2019 08:56

Merge branch 'master' of https://github.com/pytorch/pytorch into repl…

91cde44

…ace_cpu_apply

Add scalar optim.

6e8f528

cpuhrsch reviewed Apr 22, 2019

cpuhrsch approved these changes Apr 22, 2019

colesbury reviewed Apr 22, 2019

VitalyFedyunin mentioned this pull request Apr 23, 2019

improve performance of common CPU clone / contiguous calls with HPTT #3468

Open

VitalyFedyunin added 2 commits April 23, 2019 10:27

Add auto-vectorization trick

38006c7

Merge branch 'master' of https://github.com/pytorch/pytorch into repl…

4b943ab

…ace_cpu_apply

facebook-github-bot reviewed Apr 24, 2019

facebook-github-bot closed this in 465799f Apr 25, 2019

facebook-github-bot added the merged label Apr 25, 2019

apaszke reviewed Apr 30, 2019
