Replace cpu_apply with TensorIterator inside of Copy function #18618

Closed
VitalyFedyunin wants to merge 28 commits into pytorch:master from VitalyFedyunin:replace_cpu_apply

VitalyFedyunin (Contributor) commented Mar 29, 2019

Replace cpu_apply functions with TensorIterator.
Vectorize the copy and clone functions.
Move large pieces of the code to the cpu kernels folder so they can use AVX2.
Add a fast path for the copy_ function when the tensor types match.

A slowdown is observed on smaller tensors (up to 10%, or about 1us per op), which might be explained by the bigger CPU footprint of TensorIterator compared to the simpler cpu_apply. Conversely, on bigger tensors we see a 2x-3x performance improvement (single-threaded; multithreading gives an even bigger boost).

facebook-github-bot (Contributor) left a comment

@VitalyFedyunin has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.

@VitalyFedyunin VitalyFedyunin requested a review from cpuhrsch April 16, 2019 17:25

Comment thread aten/src/ATen/native/Copy.cpp Outdated
Comment thread aten/src/ATen/native/Copy.cpp Outdated
Comment thread aten/src/ATen/native/TensorIterator.cpp Outdated
auto builder = TensorIterator::Builder();
builder.add_output(out);
builder.add_input(a);
if (!resize_outputs) {
Contributor

Down the road, is this something we want to remove again? This can be part of a wider discussion around operator contracts.

Contributor Author

This is a topic for a larger discussion. I assume copy should only accept src and dst of the same shape (as before), but if we want broadcasting here, it becomes possible as soon as we rewrite the TH code.

VitalyFedyunin changed the title from "[WIP] Replace cpu_apply with TensorIterator" to "Replace cpu_apply with TensorIterator inside of Copy function" Apr 16, 2019
Comment thread aten/src/ATen/native/cpu/Loops.h Outdated
Comment thread aten/src/ATen/native/Copy.cpp Outdated
mingfeima added a commit to mingfeima/pytorch that referenced this pull request Apr 18, 2019
watch:
pytorch#19345 - same as this one
pytorch#18618 - use tensor iterator

VitalyFedyunin (Contributor Author)

@pytorchbot retest this please

VitalyFedyunin (Contributor Author)

The move is done, as is the cleanup. I still plan to look into what is going on with the Scalar case and to run benchmarks.

VitalyFedyunin (Contributor Author)

Done with all changes, please review.
PS. Going to publish benchmarks here too (today).

if (self.scalar_type() == src.scalar_type()) {
copy_kernel_same_type(kCPU, self, src);
} else {
AT_CHECK(self.numel() == src.numel(), "sizes do not match");
Contributor

Numel doesn't give you all info about the size. I'm not sure what a better word might be.

VitalyFedyunin (Contributor Author), Apr 22, 2019

It is a copy of the current aten/src/ATen/native/Copy.cpp:23. I have no better wording for it.

Contributor

Also, "sizes do not match" is a bad error message. We should mention what the shapes were and what the problem is.

Member

This should probably be an AT_ASSERT, but I don't think it's super important. We do a proper shape check before this in copy_ via expand_inplace.
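To make the request for a more informative message concrete, here is a minimal sketch in Python rather than ATen C++. The function name `check_copy_numel` and the exact wording are assumptions for illustration only, not what the PR ended up using.

```python
from math import prod


def check_copy_numel(dst_shape, src_shape):
    """Raise a descriptive error when dst and src hold different element counts.

    Hypothetical helper: it mirrors the numel comparison in the snippet above,
    but reports both shapes and both element counts in the message.
    """
    dst_numel, src_numel = prod(dst_shape), prod(src_shape)
    if dst_numel != src_numel:
        raise ValueError(
            "copy_: expected src and dst to have the same number of elements, "
            f"but dst has shape {tuple(dst_shape)} ({dst_numel} elements) and "
            f"src has shape {tuple(src_shape)} ({src_numel} elements)"
        )


# Same element count, different shapes: passes, which is exactly why the
# reviewers note that numel alone does not describe the size.
check_copy_numel((2, 3), (3, 2))
```

Note that the check still passes for (2, 3) versus (3, 2); it relies on the proper shape check having happened earlier, as mentioned above.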


parallel_for(0, dst.nbytes(), COPY_GRAIN_SIZE, sample);
if (self.scalar_type() == at::ScalarType::Half) {
unary_kernel(*iter, [=](at::Half a) -> at::Half { return a; });
cpuhrsch (Contributor), Apr 22, 2019

You might be able to use int16 here and vectorize.
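The reason int16 can stand in for Half is that a same-type copy never needs to interpret the bits. A small standard-library sketch of that idea follows; it says nothing about how the vectorized kernel itself would be written, and the helper name `copy_half_as_int16` is made up for illustration.

```python
import struct


def copy_half_as_int16(src_bytes: bytes) -> bytes:
    """Copy a buffer of half-precision values by moving 16-bit integers.

    The bytes are reinterpreted as int16 ("h"), never decoded as floats,
    so every bit pattern is preserved exactly.
    """
    count = len(src_bytes) // 2
    as_ints = struct.unpack(f"<{count}h", src_bytes)
    return struct.pack(f"<{count}h", *as_ints)


values = [1.5, -2.0, 0.0, 65504.0]  # all exactly representable in half precision
src = struct.pack(f"<{len(values)}e", *values)  # "e" is IEEE 754 binary16
dst = copy_half_as_int16(src)

assert dst == src  # bit-identical copy
assert list(struct.unpack(f"<{len(values)}e", dst)) == values
```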


// output is contiguous, arg1 is scalar
template <typename traits>
static inline bool is_unary_contiguous_s1(const int64_t* strides) {
Contributor

Might this be more accurately described as a nullary operation, which is about to be added via #18876?

Contributor Author

Yes, and it will save me from writing a vectorized loop for the scalar case. Right now it is unnecessary, as we never broadcast during the copy. But I look forward to both PRs landing so the vectorization can be completed.

Member

> Right now it is unnecessary as we never broadcast during the copy

We support broadcasting in copy-operations (on the src but not dst tensor). We should have a test case for this, but you should double-check that it still works.

This is not quite the same as nullary (zero-input) operation, although they are somewhat similar. For example:

x = torch.randn(8, 1024)
y = torch.randn(8, 1)
x.copy_(y)

will use this case. (Unlike fill_, y is a Tensor here)
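One way to picture why a broadcast src looks like a "scalar" argument to the inner loop: once a size-1 dimension is expanded, its stride becomes 0, so the loop re-reads the same element. The sketch below uses plain Python lists instead of tensors; `broadcast_strides` and `copy_2d` are hypothetical helpers, and same-rank 2-D operands are assumed.

```python
def broadcast_strides(src_shape, src_strides, dst_shape):
    """Strides that let src be read as if it had dst_shape (same rank assumed)."""
    strides = []
    for size, stride, target in zip(src_shape, src_strides, dst_shape):
        if size == target:
            strides.append(stride)
        elif size == 1:
            strides.append(0)  # re-read the same element along this dimension
        else:
            raise ValueError(f"cannot broadcast src shape {src_shape} to {dst_shape}")
    return tuple(strides)


def copy_2d(dst, dst_shape, src, src_strides):
    """Copy into a contiguous row-major dst, reading src through src_strides."""
    rows, cols = dst_shape
    for i in range(rows):
        for j in range(cols):
            dst[i * cols + j] = src[i * src_strides[0] + j * src_strides[1]]


dst_shape = (3, 4)
dst = [0.0] * 12
src = [10.0, 20.0, 30.0]  # shape (3, 1), contiguous strides (1, 1)
strides = broadcast_strides((3, 1), (1, 1), dst_shape)
copy_2d(dst, dst_shape, src, strides)
```

Within each row the source stride is 0 while the output is contiguous, which is the situation the `is_unary_contiguous_s1` check above is describing.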

Member

I should add that the broadcasting happens in between Tensor::copy_ and Tensor::s_copy_ by some auto-generated code. The copy code doesn't handle it explicitly. (Maybe that's what you meant?)

A lot of the functions that use TensorIterator no longer use this pattern (there's no longer an s_add or s_mul) because TensorIterator handles broadcasting and shape checks. Eventually we can do that for copy_ (it'll reduce some boilerplate) but that will come later.

Contributor Author

> I should add that the broadcasting happens in between Tensor::copy_ and Tensor::s_copy_ by some auto-generated code. The copy code doesn't handle it explicitly. (Maybe that's what you meant?)

Yes. It actually happens inside _copy, via

std::tie(b_src) = expand_inplace(self, src, "copy");

> A lot of the functions that use TensorIterator no longer use this pattern (there's no longer an s_add or s_mul) because TensorIterator handles broadcasting and shape checks. Eventually we can do that for copy_ (it'll reduce some boilerplate) but that will come later.

Agreed. As soon as I check the transpose edge case (I suspect it is slower than the vectorized TensorIterator), we can remove the resize and leave broadcasting to TensorIterator.

cpuhrsch (Contributor)

Accepted under the assumption that nullary ops will be implemented and replace the unary_s1 case.

constexpr int64_t COPY_GRAIN_SIZE = 20000;
template <typename self_T>
void copy_kernel_cast_t_impl(Tensor& self, const Tensor& src) {
auto builder = TensorIterator::Builder();
Member

I think it's a bit better to construct the TensorIterator once in _s_copy__cpu and pass it as an argument (instead of passing the Tensors as arguments). The reasons for this are:

  • It's a step towards centralizing error checks (reduces duplicate code and risk of missing shape or dtype checks for certain cases)
  • It better matches the pattern of the other kernels and people are likely to use this as an example
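A rough Python analogy of the suggested structure, with the operand checks done once where the iterator is built and the kernels receiving the iterator instead of the raw tensors. Every name here (`CopyIter`, `build_copy_iter`, the two kernels) is hypothetical and only stands in for the C++ pieces discussed above.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class CopyIter:
    """Hypothetical stand-in for a TensorIterator: validated operands of one copy."""
    dst: List[float]
    src: List[float]


def build_copy_iter(dst: List[float], src: List[float]) -> CopyIter:
    # Centralized check: every kernel below can rely on it having run.
    if len(dst) != len(src):
        raise ValueError(
            f"copy: dst has {len(dst)} elements but src has {len(src)}"
        )
    return CopyIter(dst, src)


def same_type_kernel(it: CopyIter) -> None:
    it.dst[:] = it.src


def cast_kernel(it: CopyIter) -> None:
    it.dst[:] = [float(v) for v in it.src]


def copy_(dst: List[float], src: List[float], same_type: bool) -> List[float]:
    it = build_copy_iter(dst, src)  # built once, at the dispatch site
    kernel = same_type_kernel if same_type else cast_kernel
    kernel(it)
    return dst
```

The trade-off is that the dispatch site now knows how the iterator is configured, whereas building it inside each kernel keeps that detail private to the kernel.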

Contributor Author

Good point

Contributor Author

On second thought, right now the TensorIterator construction and the loop stay together, and Copy.cpp doesn't need to know anything about the implementation details of the kernel. This cleanly isolates implementation from dispatch and lets us replace kernels (if necessary) with ease.

memcpy(self_seg, src_seg, len);
};
static void copy_kernel_same_type_impl(Tensor& self, const Tensor& src) {
auto builder = TensorIterator::Builder();
Member

same here (see above comment)

VitalyFedyunin (Contributor Author) commented Apr 23, 2019

Single threaded:
OMP_NUM_THREADS=1 sudo nice --20 numactl --membind=0 --cpubind=0 taskset -c 0 /private/home/vitalyf/anaconda3/envs/py3/bin/python a.py

Before optimization:

torch.Size([100, 100, 100])
torch.DoubleTensor
times: [16.76904845237732, 16.771162509918213, 16.771618604660034, 16.77182149887085, 16.772063732147217, 16.772103309631348, 16.77212142944336, 16.772297382354736, 16.79311442375183, 16.815782070159912]
mean : 16.778113341331483
std  : 0.014137423384807224

After optimization:

torch.Size([100, 100, 100])
torch.DoubleTensor
times: [11.295104503631592, 11.295217275619507, 11.30241322517395, 11.303941249847412, 11.308543682098389, 11.311513423919678, 11.328369140625, 11.337308645248413, 11.349253416061401, 11.424124956130981]
mean : 11.325578951835633
std  : 0.03709524226320387

About 33% perf improvement on bigger tensors, all data types.

Multithreaded:

Before

torch.Size([100, 100, 100])
torch.DoubleTensor
times: [0.978071928024292, 0.97812819480896, 0.9785230159759521, 0.9785904884338379, 0.9786961078643799, 0.9792697429656982, 0.9793672561645508, 0.9795773029327393, 0.9799015522003174, 0.9880003929138184]
mean : 0.9798125982284546
std  : 0.002790457901376675

After

torch.Size([100, 100, 100])
torch.DoubleTensor
times: [0.7371759414672852, 0.7373862266540527, 0.7375473976135254, 0.7377481460571289, 0.7379317283630371, 0.7379584312438965, 0.7388496398925781, 0.740281343460083, 0.7441291809082031, 0.7923455238342285]
mean : 0.7441353559494018
std  : 0.01619207759240831

About 25% improvement.
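For reference, the percentages can be recomputed from the reported means. This is only a small arithmetic sketch; the helper name `improvement` is made up, and the inputs are the mean values copied from the runs above.

```python
from typing import Tuple


def improvement(before_s: float, after_s: float) -> Tuple[float, float]:
    """Return (percent reduction in wall time, speedup factor)."""
    reduction_pct = (before_s - after_s) / before_s * 100.0
    speedup = before_s / after_s
    return reduction_pct, speedup


# Mean wall time in seconds per 10000 clone() calls, taken from the output above.
single = improvement(16.778113341331483, 11.325578951835633)
multi = improvement(0.9798125982284546, 0.7441353559494018)

print(f"single-threaded: {single[0]:.1f}% less time, {single[1]:.2f}x speedup")
print(f"multithreaded:   {multi[0]:.1f}% less time, {multi[1]:.2f}x speedup")
```

The reduction in wall time comes out at roughly 32% single-threaded and roughly 24% multithreaded, in line with the rounded figures quoted in the comment.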

Testing script:

import time

import numpy as np
import torch

# z is a non-contiguous view of x, so z.clone() has to read through strides.
x = torch.randn((100, 100, 100), dtype=torch.double)
z = x.permute(0, 2, 1)

times = []
for i in range(10):
    print(i)
    start = time.time()
    for _ in range(10000):
        z.clone()
    times.append(time.time() - start)
times = np.array(times)
print("")
print(x.size())
print(x.type())
print("times: " + str(times))
print("mean : " + str(np.mean(times)))
print("std  : " + str(np.std(times)))

Numbers are reproducible between multiple invocations.

Longer pytorch/benchmarks are running now.

VitalyFedyunin (Contributor Author) commented Apr 24, 2019

BEFORE:

----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
     Job number   ETA (mm:ss)         Benchmark       Time mean (us)        Time std (us)        CPU mean (us)         CPU std (us)      Iter. mean             Rep.   mag   dim    cont          function           dtype   trans   framework                          strides                            sizes
----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
           1/32        78:33       CPUCopyBench                    4                    0                    4                    0         1008132               30     1     3    True   ('copy_', None)   torch.float32   False       Torch                        (4, 2, 1)            torch.Size([2, 2, 2])
           2/32        76:00       CPUCopyBench                    4                    0                    4                    0         1008481               30     1     3    True   ('copy_', None)   torch.float64   False       Torch                        (4, 2, 1)            torch.Size([2, 2, 2])
           3/32        73:28       CPUCopyBench                    4                    0                    4                    0         1001868               30     1     3   False   ('copy_', None)   torch.float32   False       Torch                     (72, 36, 18)            torch.Size([2, 2, 2])
           4/32        70:56       CPUCopyBench                    4                    0                    4                    0         1004311               30     1     3   False   ('copy_', None)   torch.float64   False       Torch                     (72, 36, 18)            torch.Size([2, 2, 2])
           5/32        68:24       CPUCopyBench                    6                    0                    6                    0          824922               30     3     3    True   ('copy_', None)   torch.float32   False       Torch                       (81, 9, 1)            torch.Size([9, 9, 9])
           6/32        65:52       CPUCopyBench                    6                    0                    6                    0          820990               30     3     3    True   ('copy_', None)   torch.float64   False       Torch                       (81, 9, 1)            torch.Size([9, 9, 9])
           7/32        63:20       CPUCopyBench                    7                    0                    7                    0          695782               30     3     3   False   ('copy_', None)   torch.float32   False       Torch                  (1458, 162, 18)            torch.Size([9, 9, 9])
           8/32        60:48       CPUCopyBench                    7                    0                    7                    0          658744               30     3     3   False   ('copy_', None)   torch.float64   False       Torch                  (1458, 162, 18)            torch.Size([9, 9, 9])
           9/32        58:16       CPUCopyBench                 1338                    0                 1338                    0            3734               30     6     3    True   ('copy_', None)   torch.float32   False       Torch                    (9801, 99, 1)         torch.Size([99, 99, 99])
          10/32        55:44       CPUCopyBench                 1342                    0                 1341                    0            3725               30     6     3    True   ('copy_', None)   torch.float64   False       Torch                    (9801, 99, 1)         torch.Size([99, 99, 99])
          11/32        53:13       CPUCopyBench                 8991                   24                 8987                   24             556               30     6     3   False   ('copy_', None)   torch.float32   False       Torch               (176418, 1782, 18)         torch.Size([99, 99, 99])
          12/32        50:42       CPUCopyBench                13484                   45                13479                   45             371               30     6     3   False   ('copy_', None)   torch.float64   False       Torch               (176418, 1782, 18)         torch.Size([99, 99, 99])
          13/32        48:11       CPUCopyBench                13953                    3                13947                    3             359               30     7     3    True   ('copy_', None)   torch.float32   False       Torch                  (46225, 215, 1)      torch.Size([215, 215, 215])
          14/32        45:39       CPUCopyBench                18650                   33                18642                   33             268               30     7     3    True   ('copy_', None)   torch.float64   False       Torch                  (46225, 215, 1)      torch.Size([215, 215, 215])
          15/32        43:10       CPUCopyBench               100410                  278               100369                  277              50               30     7     3   False   ('copy_', None)   torch.float32   False       Torch               (832050, 3870, 18)      torch.Size([215, 215, 215])
          16/32        40:43       CPUCopyBench               140376                  250               140320                  249              36               30     7     3   False   ('copy_', None)   torch.float64   False       Torch               (832050, 3870, 18)      torch.Size([215, 215, 215])
          17/32        38:09       CPUCopyBench                    5                    0                    5                    0          999073               30     1     3    True   ('copy_', None)   torch.float32    True       Torch                        (2, 4, 1)            torch.Size([2, 2, 2])
          18/32        35:36       CPUCopyBench                    4                    0                    4                    0         1002029               30     1     3    True   ('copy_', None)   torch.float64    True       Torch                        (2, 4, 1)            torch.Size([2, 2, 2])
          19/32        33:03       CPUCopyBench                    5                    0                    5                    0          992863               30     1     3   False   ('copy_', None)   torch.float32    True       Torch                     (36, 72, 18)            torch.Size([2, 2, 2])
          20/32        30:30       CPUCopyBench                    4                    0                    4                    0         1000780               30     1     3   False   ('copy_', None)   torch.float64    True       Torch                     (36, 72, 18)            torch.Size([2, 2, 2])
          21/32        27:58       CPUCopyBench                    6                    0                    6                    0          736492               30     3     3    True   ('copy_', None)   torch.float32    True       Torch                       (9, 81, 1)            torch.Size([9, 9, 9])
          22/32        25:25       CPUCopyBench                    6                    0                    6                    0          740196               30     3     3    True   ('copy_', None)   torch.float64    True       Torch                       (9, 81, 1)            torch.Size([9, 9, 9])
          23/32        22:52       CPUCopyBench                    7                    0                    7                    0          668404               30     3     3   False   ('copy_', None)   torch.float32    True       Torch                  (162, 1458, 18)            torch.Size([9, 9, 9])
          24/32        20:19       CPUCopyBench                    7                    0                    7                    0          626416               30     3     3   False   ('copy_', None)   torch.float64    True       Torch                  (162, 1458, 18)            torch.Size([9, 9, 9])
          25/32        17:47       CPUCopyBench                 1494                    8                 1493                    8            3346               30     6     3    True   ('copy_', None)   torch.float32    True       Torch                    (99, 9801, 1)         torch.Size([99, 99, 99])
          26/32        15:14       CPUCopyBench                 1503                    5                 1502                    5            3326               30     6     3    True   ('copy_', None)   torch.float64    True       Torch                    (99, 9801, 1)         torch.Size([99, 99, 99])
          27/32        12:42       CPUCopyBench                 9302                   25                 9298                   25             537               30     6     3   False   ('copy_', None)   torch.float32    True       Torch               (1782, 176418, 18)         torch.Size([99, 99, 99])
          28/32        10:09       CPUCopyBench                14004                   36                13999                   36             357               30     6     3   False   ('copy_', None)   torch.float64    True       Torch               (1782, 176418, 18)         torch.Size([99, 99, 99])
          29/32         7:37       CPUCopyBench                14893                   46                14887                   46             336               30     7     3    True   ('copy_', None)   torch.float32    True       Torch                  (215, 46225, 1)      torch.Size([215, 215, 215])
          30/32         5:04       CPUCopyBench                19426                   57                19418                   57             258               30     7     3    True   ('copy_', None)   torch.float64    True       Torch                  (215, 46225, 1)      torch.Size([215, 215, 215])
          31/32         2:32       CPUCopyBench               101009                  254               100968                  253              50               30     7     3   False   ('copy_', None)   torch.float32    True       Torch               (3870, 832050, 18)      torch.Size([215, 215, 215])
          32/32         0:00       CPUCopyBench               141436                  529               141380                  530              35               30     7     3   False   ('copy_', None)   torch.float64    True       Torch               (3870, 832050, 18)      torch.Size([215, 215, 215])

AFTER:

----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
     Job number   ETA (mm:ss)         Benchmark       Time mean (us)        Time std (us)        CPU mean (us)         CPU std (us)      Iter. mean             Rep.          function   trans    cont   mag           dtype   dim   framework                          strides                            sizes
----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
           1/32        78:34       CPUCopyBench                    5                    0                    5                    0          957369               30   ('copy_', None)   False    True     1   torch.float32     3       Torch                        (4, 2, 1)            torch.Size([2, 2, 2])
           2/32        76:02       CPUCopyBench                    5                    0                    5                    0          904700               30   ('copy_', None)    True    True     1   torch.float32     3       Torch                        (2, 4, 1)            torch.Size([2, 2, 2])
           3/32        73:30       CPUCopyBench                    5                    0                    5                    0          957313               30   ('copy_', None)   False    True     1   torch.float64     3       Torch                        (4, 2, 1)            torch.Size([2, 2, 2])
           4/32        70:58       CPUCopyBench                    5                    0                    5                    0          897145               30   ('copy_', None)    True    True     1   torch.float64     3       Torch                        (2, 4, 1)            torch.Size([2, 2, 2])
           5/32        68:26       CPUCopyBench                    5                    0                    5                    0          922954               30   ('copy_', None)   False    True     3   torch.float32     3       Torch                       (81, 9, 1)            torch.Size([9, 9, 9])
           6/32        65:54       CPUCopyBench                    7                    0                    7                    0          641419               30   ('copy_', None)    True    True     3   torch.float32     3       Torch                       (9, 81, 1)            torch.Size([9, 9, 9])
           7/32        63:22       CPUCopyBench                    5                    0                    5                    0          877614               30   ('copy_', None)   False    True     3   torch.float64     3       Torch                       (81, 9, 1)            torch.Size([9, 9, 9])
           8/32        60:49       CPUCopyBench                    7                    0                    7                    0          667641               30   ('copy_', None)    True    True     3   torch.float64     3       Torch                       (9, 81, 1)            torch.Size([9, 9, 9])
           9/32        58:17       CPUCopyBench                  397                    0                  397                    0           12574               30   ('copy_', None)   False    True     6   torch.float32     3       Torch                    (9801, 99, 1)         torch.Size([99, 99, 99])
          10/32        55:45       CPUCopyBench                  528                    0                  528                    0            9462               30   ('copy_', None)    True    True     6   torch.float32     3       Torch                    (99, 9801, 1)         torch.Size([99, 99, 99])
          11/32        53:13       CPUCopyBench                  784                    0                  784                    0            6374               30   ('copy_', None)   False    True     6   torch.float64     3       Torch                    (9801, 99, 1)         torch.Size([99, 99, 99])
          12/32        50:41       CPUCopyBench                  900                    0                  899                    0            5554               30   ('copy_', None)    True    True     6   torch.float64     3       Torch                    (99, 9801, 1)         torch.Size([99, 99, 99])
          13/32        48:10       CPUCopyBench                 8531                   25                 8527                   25             586               30   ('copy_', None)   False    True     7   torch.float32     3       Torch                  (46225, 215, 1)      torch.Size([215, 215, 215])
          14/32        45:38       CPUCopyBench                 9581                   15                 9577                   15             522               30   ('copy_', None)    True    True     7   torch.float32     3       Torch                  (215, 46225, 1)      torch.Size([215, 215, 215])
          15/32        43:07       CPUCopyBench                18342                   19                18335                   19             273               30   ('copy_', None)   False    True     7   torch.float64     3       Torch                  (46225, 215, 1)      torch.Size([215, 215, 215])
          16/32        40:35       CPUCopyBench                19461                   18                19453                   18             257               30   ('copy_', None)    True    True     7   torch.float64     3       Torch                  (215, 46225, 1)      torch.Size([215, 215, 215])
          17/32        38:03       CPUCopyBench                    5                    0                    5                    0          952869               30   ('copy_', None)   False   False     1   torch.float32     3       Torch                     (72, 36, 18)            torch.Size([2, 2, 2])
          18/32        35:30       CPUCopyBench                    5                    0                    5                    0          900599               30   ('copy_', None)    True   False     1   torch.float32     3       Torch                     (36, 72, 18)            torch.Size([2, 2, 2])
          19/32        32:58       CPUCopyBench                    5                    0                    5                    0          950621               30   ('copy_', None)   False   False     1   torch.float64     3       Torch                     (72, 36, 18)            torch.Size([2, 2, 2])
          20/32        30:26       CPUCopyBench                    5                    0                    5                    0          907684               30   ('copy_', None)    True   False     1   torch.float64     3       Torch                     (36, 72, 18)            torch.Size([2, 2, 2])
          21/32        27:53       CPUCopyBench                    7                    0                    7                    0          653460               30   ('copy_', None)   False   False     3   torch.float32     3       Torch                  (1458, 162, 18)            torch.Size([9, 9, 9])
          22/32        25:21       CPUCopyBench                    8                    0                    8                    0          589592               30   ('copy_', None)    True   False     3   torch.float32     3       Torch                  (162, 1458, 18)            torch.Size([9, 9, 9])
          23/32        22:49       CPUCopyBench                    8                    0                    8                    0          606024               30   ('copy_', None)   False   False     3   torch.float64     3       Torch                  (1458, 162, 18)            torch.Size([9, 9, 9])
          24/32        20:17       CPUCopyBench                    9                    0                    9                    0          553237               30   ('copy_', None)    True   False     3   torch.float64     3       Torch                  (162, 1458, 18)            torch.Size([9, 9, 9])
          25/32        17:45       CPUCopyBench                 9014                   23                 9011                   23             555               30   ('copy_', None)   False   False     6   torch.float32     3       Torch               (176418, 1782, 18)         torch.Size([99, 99, 99])
          26/32        15:13       CPUCopyBench                 9287                   20                 9283                   20             538               30   ('copy_', None)    True   False     6   torch.float32     3       Torch               (1782, 176418, 18)         torch.Size([99, 99, 99])
          27/32        12:41       CPUCopyBench                13255                   29                13250                   29             377               30   ('copy_', None)   False   False     6   torch.float64     3       Torch               (176418, 1782, 18)         torch.Size([99, 99, 99])
          28/32        10:08       CPUCopyBench                13853                   25                13847                   25             361               30   ('copy_', None)    True   False     6   torch.float64     3       Torch               (1782, 176418, 18)         torch.Size([99, 99, 99])
          29/32         7:37       CPUCopyBench                99736                  191                99695                  190              51               30   ('copy_', None)   False   False     7   torch.float32     3       Torch               (832050, 3870, 18)      torch.Size([215, 215, 215])
          30/32         5:05       CPUCopyBench               100864                  164               100822                  163              50               30   ('copy_', None)    True   False     7   torch.float32     3       Torch               (3870, 832050, 18)      torch.Size([215, 215, 215])
          31/32         2:32       CPUCopyBench               136722                  284               136666                  283              37               30   ('copy_', None)   False   False     7   torch.float64     3       Torch               (832050, 3870, 18)      torch.Size([215, 215, 215])
          32/32         0:00       CPUCopyBench               138618                  284               138560                  282              37               30   ('copy_', None)    True   False     7   torch.float64     3       Torch               (3870, 832050, 18)      torch.Size([215, 215, 215])

CONCLUSION:

A slowdown is observed on smaller tensors (up to 10%, or about 1 us per op), which might be explained by the larger CPU footprint of TensorIterator compared to the simpler cpu_apply. Conversely, on bigger tensors we see a 2x-3x performance improvement (single-threaded; multithreading gives an even bigger boost).
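
One way to read this conclusion is as a linear cost model: each call pays a fixed setup cost plus a per-element cost. The sketch below is only an illustration of that reasoning. `CostModel`, `predicted_time_us`, and `break_even_numel` are hypothetical names, and any constants fed into them are assumptions, not measurements from the table above.

```cpp
#include <cstddef>

// Hypothetical linear cost model: time(n) = fixed_overhead + per_element * n.
// The fields hold illustrative values supplied by the caller; nothing here
// is measured from the benchmark in this PR.
struct CostModel {
  double fixed_overhead_us;  // per-call setup cost, in microseconds
  double per_element_us;     // marginal cost per element, in microseconds
};

// Predicted time for copying `numel` elements under the model.
inline double predicted_time_us(const CostModel& m, std::size_t numel) {
  return m.fixed_overhead_us + m.per_element_us * static_cast<double>(numel);
}

// Element count at which both models predict the same time. Above it, the
// model with the lower per-element cost wins. Returns -1.0 when `candidate`
// has no per-element advantage, so no break-even point exists.
inline double break_even_numel(const CostModel& baseline,
                               const CostModel& candidate) {
  const double per_element_gain =
      baseline.per_element_us - candidate.per_element_us;
  if (per_element_gain <= 0.0) {
    return -1.0;
  }
  return (candidate.fixed_overhead_us - baseline.fixed_overhead_us) /
         per_element_gain;
}
```

Under this model, a candidate with a higher fixed overhead but a lower per-element cost loses below the break-even size and wins above it, which matches the shape of the results reported here.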

@cpuhrsch
Contributor

@VitalyFedyunin - I'd say the slowdown on the small Tensors is fine, given that constant overhead is a separate topic in itself that we need to tackle holistically. Otherwise this looks great from a performance perspective.

Contributor

@facebook-github-bot facebook-github-bot left a comment

@VitalyFedyunin is landing this pull request. If you are a Facebook employee, you can view this diff on Phabricator.

zdevito pushed a commit to zdevito/ATen that referenced this pull request Apr 25, 2019
Summary:
Replace cpu_apply functions with the TensorIterator.
Vectorize copy and clone functions.
Move big pieces of the code to cpu kernels folder to be able to use AVX2.
Add fast path for copy_ function if tensor types match.

A slowdown is observed on smaller tensors (up to 10%, or about 1 us per op), which might be explained by the larger CPU footprint of TensorIterator compared to the simpler cpu_apply. Conversely, on bigger tensors we see a 2x-3x performance improvement (single-threaded; multithreading gives an even bigger boost).
Pull Request resolved: pytorch/pytorch#18618

Differential Revision: D14954118

Pulled By: VitalyFedyunin

fbshipit-source-id: 9d9bdf3fd9d5e539a03071cced50d0a47bac1615
@facebook-github-bot
Contributor

@VitalyFedyunin merged this pull request in 465799f.

Contributor

@apaszke apaszke left a comment

Are we doing anything to prevent the slowdown for small tensors? This is as important as the large-tensor perf.

if (self.scalar_type() == src.scalar_type()) {
  copy_kernel_same_type(kCPU, self, src);
} else {
  AT_CHECK(self.numel() == src.numel(), "sizes do not match");

Also, "sizes do not match" is a bad error message. We should mention what the shapes were and what the problem is.
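
A minimal sketch of the kind of message this comment asks for, written as standalone C++ rather than against ATen. `format_shape`, `numel`, and `check_copy_sizes` are hypothetical helpers standing in for the `AT_CHECK` call quoted above; the exact wording of the message is an assumption.

```cpp
#include <cstddef>
#include <cstdint>
#include <sstream>
#include <stdexcept>
#include <string>
#include <vector>

// Formats a shape such as {2, 3, 4} as "[2, 3, 4]".
std::string format_shape(const std::vector<std::int64_t>& sizes) {
  std::ostringstream out;
  out << '[';
  for (std::size_t i = 0; i < sizes.size(); ++i) {
    if (i != 0) {
      out << ", ";
    }
    out << sizes[i];
  }
  out << ']';
  return out.str();
}

// Number of elements implied by a shape; an empty shape counts as a scalar.
std::int64_t numel(const std::vector<std::int64_t>& sizes) {
  std::int64_t n = 1;
  for (std::int64_t s : sizes) {
    n *= s;
  }
  return n;
}

// Hypothetical stand-in for the element-count check: instead of a bare
// "sizes do not match", the error names both shapes and the violated rule.
void check_copy_sizes(const std::vector<std::int64_t>& dst,
                      const std::vector<std::int64_t>& src) {
  if (numel(dst) == numel(src)) {
    return;
  }
  std::ostringstream msg;
  msg << "copy_: destination of shape " << format_shape(dst) << " ("
      << numel(dst) << " elements) cannot receive source of shape "
      << format_shape(src) << " (" << numel(src)
      << " elements); element counts must match";
  throw std::runtime_error(msg.str());
}
```

With this shape of message, a failing call reports both operands, so the reader does not have to reproduce the failure to learn which tensor was the wrong size.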

@colesbury
Member

@apaszke I'm working on refactoring the copy dispatch, which I think will reduce the overhead of this op. I don't think anyone is working on the broader problem of overhead across ops.

@VitalyFedyunin
Contributor Author

I'm planning to look at the TensorIterator overhead on smaller tensors, but not until about a month from now.

@resistor
Contributor

resistor commented May 1, 2019

@VitalyFedyunin You just undid my change from #19198, which eliminated type dispatch from the same-type path and tested as both faster and smaller in code size for me. It'd be nice to be pinged when you're planning to undo something I just did.

@VitalyFedyunin
Contributor Author

VitalyFedyunin commented May 1, 2019

@resistor I did not eliminate the same-type dispatch, but the memcpy implementation from #19198 was indeed replaced by TensorIterator with Vec256, which per my tests is faster than the previous version of the code.

It is easy to switch it back. But I think that until we get a reliable benchmarking tool, we would be hunting shadows.

@colesbury @cpuhrsch any thoughts?
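
For readers following the exchange above, the two strategies being compared can be contrasted in a standalone sketch: a bulk `memcpy` when both sides have the same element type and are contiguous, and a per-element loop otherwise. This is not the ATen code from either PR. `copy_strided` is a hypothetical name, and the Vec256-based TensorIterator kernel is not reproduced here.

```cpp
#include <cstddef>
#include <cstring>
#include <type_traits>

// Copies `count` elements from `src` to `dst`. Strides are expressed in
// elements, not bytes. Hypothetical helper for illustration only.
template <typename Dst, typename Src>
void copy_strided(Dst* dst, std::ptrdiff_t dst_stride,
                  const Src* src, std::ptrdiff_t src_stride,
                  std::size_t count) {
  static_assert(std::is_trivially_copyable<Dst>::value &&
                    std::is_trivially_copyable<Src>::value,
                "sketch assumes plain numeric element types");
  const bool same_type = std::is_same<Dst, Src>::value;
  if (same_type && dst_stride == 1 && src_stride == 1) {
    // Fast path: same element type and both sides contiguous, so the
    // bytes can be moved in bulk with no per-element conversion.
    std::memcpy(dst, src, count * sizeof(Src));
    return;
  }
  // General path: walk both buffers by their strides and convert each
  // element to the destination type.
  for (std::size_t i = 0; i < count; ++i) {
    const std::ptrdiff_t k = static_cast<std::ptrdiff_t>(i);
    dst[k * dst_stride] = static_cast<Dst>(src[k * src_stride]);
  }
}
```

Which path is faster in practice depends on tensor size, layout, and the surrounding dispatch cost, which is the point of contention in this thread; the sketch only shows where the two paths diverge.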

zhangguanheng66 pushed a commit to zhangguanheng66/pytorch that referenced this pull request May 6, 2019
laurentdupin pushed a commit to laurentdupin/pytorch that referenced this pull request Apr 24, 2026