Replace cpu_apply with TensorIterator inside of Copy function #18618
VitalyFedyunin wants to merge 28 commits
facebook-github-bot
left a comment
@VitalyFedyunin has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.
auto builder = TensorIterator::Builder();
builder.add_output(out);
builder.add_input(a);
if (!resize_outputs) {
Down the road, is this something we want to remove again? This can be part of a wider discussion around operator contracts.
This is a topic for a larger discussion. I assume copy should only accept src and dst of the same shape (as it did before), but if we want broadcasting here, that is also possible as soon as we rewrite the TH code.
watch: pytorch#19345 - same as this one pytorch#18618 - use tensor iterator
@pytorchbot retest this please
This reverts commit 72538dd.
The move is done, as well as the cleanup. I am still planning to look into what the deal is with the Scalar and to run benchmarks.
Done with all changes, please review.
if (self.scalar_type() == src.scalar_type()) {
  copy_kernel_same_type(kCPU, self, src);
} else {
  AT_CHECK(self.numel() == src.numel(), "sizes do not match");
Numel doesn't give you all the information about the size. I'm not sure what a better word might be.
It is a copy of the current aten/src/ATen/native/Copy.cpp:23. I have no better wording for it.
Also, "sizes do not match" is a bad error message. We should mention what the shapes were and what the problem is.
This should probably be an AT_ASSERT, but I don't think it's super important. We do a proper shape check before this in copy_ via expand_inplace.
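To make the error-message suggestion above concrete, here is a minimal Python sketch of a check that reports both shapes and both element counts. It is an illustration only, not the ATen code: the function name `check_same_numel` and the message wording are hypothetical.

```python
import math


def check_same_numel(self_shape, src_shape):
    """Raise a descriptive error if the two shapes hold different element counts.

    Hypothetical helper: it mirrors the intent of the numel() check in the
    diff, but reports the offending shapes instead of "sizes do not match".
    """
    self_numel = math.prod(self_shape)
    src_numel = math.prod(src_shape)
    if self_numel != src_numel:
        raise ValueError(
            "copy_: number of elements does not match: destination has shape "
            f"{tuple(self_shape)} ({self_numel} elements) but source has shape "
            f"{tuple(src_shape)} ({src_numel} elements)"
        )


# Same element count, different shapes: passes, as the numel-only check does.
check_same_numel((2, 3), (3, 2))
```

Note that the check still accepts `(2, 3)` against `(3, 2)`, which is the reviewer's point that numel alone does not describe the size.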
parallel_for(0, dst.nbytes(), COPY_GRAIN_SIZE, sample);
if (self.scalar_type() == at::ScalarType::Half) {
  unary_kernel(*iter, [=](at::Half a) -> at::Half { return a; });
You might be able to use int16 here and vectorize.
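The idea behind this suggestion is that a same-type copy never needs to interpret the values, so 16-bit floats can be moved as opaque 16-bit integers. A small Python sketch of that property, using only the standard `struct` module (the helper name `copy_half_as_int16` is hypothetical, and this shows bit preservation only, not vectorization):

```python
import struct


def copy_half_as_int16(src_bytes):
    """Copy a little-endian float16 buffer by treating each value as an int16."""
    count = len(src_bytes) // 2
    # Reinterpret the raw bytes as signed 16-bit integers; no float decoding.
    as_int16 = struct.unpack(f"<{count}h", src_bytes)
    return struct.pack(f"<{count}h", *as_int16)


# Pack four half-precision values ("e" is the struct code for float16).
halves = struct.pack("<4e", 1.0, -2.5, 0.0, 65504.0)
assert copy_half_as_int16(halves) == halves
```

Because the bits are never decoded, special values such as NaN payloads and negative zero come through unchanged as well.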
// output is contiguous, arg1 is scalar
template <typename traits>
static inline bool is_unary_contiguous_s1(const int64_t* strides) {
Might this be more accurately described as a nullary operation, as is about to be added via #18876?
Yes, and it will save me from writing the vectorized loop for the scalar case. Right now it is unnecessary, as we never broadcast during the copy. But I look forward to both PRs landing to complete the vectorization.
> Right now it is unnecessary as we never broadcast during the copy
We support broadcasting in copy-operations (on the src but not dst tensor). We should have a test case for this, but you should double-check that it still works.
This is not quite the same as nullary (zero-input) operation, although they are somewhat similar. For example:
x = torch.randn(8, 1024)
y = torch.randn(8, 1)
x.copy_(y)
will use this case. (Unlike fill_, y is a Tensor here)
I should add that the broadcasting happens in between Tensor::copy_ and Tensor::s_copy_ by some auto-generated code. The copy code doesn't handle it explicitly. (Maybe that's what you meant?)
A lot of the functions that use TensorIterator no longer use this pattern (there's no longer an s_add or s_mul) because TensorIterator handles broadcasting and shape checks. Eventually we can do that for copy_ (it'll reduce some boilerplate) but that will come later.
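As a rough illustration of why a broadcast source lands in the "arg1 is scalar" case, the pure-Python sketch below computes the element strides a contiguous source would have once expanded to the destination shape; broadcast dimensions get stride 0. The function name `expanded_strides` is hypothetical, and this models only the shape arithmetic under the usual broadcasting rules, not the actual expand_inplace implementation.

```python
def expanded_strides(dst_shape, src_shape):
    """Element strides of a contiguous src viewed with dst_shape via broadcasting."""
    if len(src_shape) > len(dst_shape):
        raise ValueError(f"cannot broadcast {tuple(src_shape)} to {tuple(dst_shape)}")

    # Contiguous (row-major) strides of the source, in elements.
    strides = []
    step = 1
    for size in reversed(src_shape):
        strides.append(step)
        step *= size
    strides.reverse()

    # Left-pad the source with size-1 dimensions to match the destination rank.
    pad = len(dst_shape) - len(src_shape)
    padded_shape = (1,) * pad + tuple(src_shape)
    padded_strides = [0] * pad + strides

    result = []
    for dst_size, src_size, stride in zip(dst_shape, padded_shape, padded_strides):
        if src_size == dst_size:
            result.append(stride)   # dimension matches, keep the stride
        elif src_size == 1:
            result.append(0)        # broadcast dimension, reuse the same element
        else:
            raise ValueError(f"cannot broadcast {tuple(src_shape)} to {tuple(dst_shape)}")
    return tuple(result)


# The example from the comment: y of shape (8, 1) copied into x of shape (8, 1024).
assert expanded_strides((8, 1024), (8, 1)) == (1, 0)
```

With a stride of 0 along the inner dimension, each row of the copy reads a single source element repeatedly, which is the situation the scalar-argument loop is meant to handle.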
> I should add that the broadcasting happens in between Tensor::copy_ and Tensor::s_copy_ by some auto-generated code. The copy code doesn't handle it explicitly. (Maybe that's what you meant?)
Yes. It actually happens inside of _copy, via
std::tie(b_src) = expand_inplace(self, src, "copy");
> A lot of the functions that use TensorIterator no longer use this pattern (there's no longer an s_add or s_mul) because TensorIterator handles broadcasting and shape checks. Eventually we can do that for copy_ (it'll reduce some boilerplate) but that will come later.
Agree. As soon as I check the transpose edge case (as I suspect it to be slower than the vectorized TensorIterator), we can remove the resize and leave broadcasting to TensorIterator.
Accepted under the assumption that nullary ops will be implemented and replace the unary_s1 case.
constexpr int64_t COPY_GRAIN_SIZE = 20000;
template <typename self_T>
void copy_kernel_cast_t_impl(Tensor& self, const Tensor& src) {
  auto builder = TensorIterator::Builder();
I think it's a bit better to construct the TensorIterator once in _s_copy__cpu and pass it as an argument (instead of passing the Tensors as arguments). The reasons for this are:
- It's a step towards centralizing error checks (reduces duplicate code and the risk of missing shape or dtype checks for certain cases)
- It better matches the pattern of the other kernels, and people are likely to use this as an example
On second thought, right now the TensorIterator construction and the loop stay together, and Copy.cpp doesn't need to know anything about the implementation details of the kernel. This cleanly isolates the implementation from the dispatch and allows us to replace kernels (if necessary) with ease.
  memcpy(self_seg, src_seg, len);
};
static void copy_kernel_same_type_impl(Tensor& self, const Tensor& src) {
  auto builder = TensorIterator::Builder();
Same here (see the comment above).
Single-threaded [before/after timings not preserved]: about 33% performance improvement on bigger tensors, for all data types. Multithreaded [before/after timings not preserved]: about 25% improvement. [Testing script not preserved.] Numbers are reproducible across multiple invocations. Longer pytorch/benchmarks runs are in progress now.
Conclusion [before/after timings not preserved]: a slowdown is observed on smaller tensors (up to 10%, or about 1us per op), which might be explained by the bigger CPU footprint of TensorIterator compared with the simpler cpu_apply. Conversely, on bigger tensors we see a 2x-3x performance improvement (single-threaded; multithreading gives an even better performance boost).
@VitalyFedyunin - I'd say the slowdown on the small Tensors is fine, given that constant overhead is a separate topic in itself that we need to tackle holistically. Otherwise this looks great from a performance perspective.
facebook-github-bot
left a comment
@VitalyFedyunin is landing this pull request. If you are a Facebook employee, you can view this diff on Phabricator.
Summary: Replace cpu_apply functions with the TensorIterator. Vectorize copy and clone functions. Move big pieces of the code to the cpu kernels folder to be able to use AVX2. Add a fast path for the copy_ function if the tensor types match. A slowdown is observed on smaller tensors (up to 10%, or about 1us per op), which might be explained by the bigger CPU footprint of TensorIterator compared with the simpler cpu_apply. Conversely, on bigger tensors we see a 2x-3x performance improvement (single-threaded; multithreading gives an even better performance boost). Pull Request resolved: pytorch/pytorch#18618 Differential Revision: D14954118 Pulled By: VitalyFedyunin fbshipit-source-id: 9d9bdf3fd9d5e539a03071cced50d0a47bac1615
@VitalyFedyunin merged this pull request in 465799f.
apaszke
left a comment
Are we doing anything to prevent the slowdown for small tensors? This is as important as the large-tensor perf.
@apaszke I'm working on refactoring the copy dispatch, which I think will reduce the overhead of this op. I don't think anyone is working on the broader problem of overhead across ops.
I'm planning to look at the problem of TensorIterator with smaller tensors, but that is about one month from now.
@VitalyFedyunin You just undid my change from #19198, which eliminated type dispatch from the same-type path and which tested as both faster and smaller in code size for me. It'd be nice to be pinged when you're planning to undo something I just did.
@resistor I did not eliminate same type dispatch, but indeed it is easy to put it back. But I think that until we get a reliable benchmarking tool, it would be hunting shadows. @colesbury @cpuhrsch any thoughts?
Replace cpu_apply functions with the TensorIterator.
Vectorize copy and clone functions.
Move big pieces of the code to cpu kernels folder to be able to use AVX2.
Add a fast path for the copy_ function if the tensor types match.
A slowdown is observed on smaller tensors (up to 10%, or about 1us per op), which might be explained by the bigger CPU footprint of TensorIterator compared with the simpler cpu_apply. Conversely, on bigger tensors we see a 2x-3x performance improvement (single-threaded; multithreading gives an even better performance boost).