Add a fast path for batch-norm CPU inference.#19152

Closed
zheng-xq wants to merge 1 commit into pytorch:master from zheng-xq:export-D14889728

@zheng-xq
Contributor

@zheng-xq zheng-xq commented Apr 11, 2019

Summary:
Adding a fast path for batch-norm CPU inference when all tensors are contiguous.

  • Leverage vectorization through simple loops.
  • Folding linear terms before computation.
  • For resnext-101, this version gets 18.95 times faster.

Differential Revision: D14889728

== Benchmark Results ==

batch_norm: data shape: [1, 256, 3136], bandwidth: 22.26 GB/s
batch_norm: data shape: [1, 65536, 1], bandwidth: 5.57 GB/s
batch_norm: data shape: [128, 2048, 1], bandwidth: 18.21 GB/s
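
The "folding linear terms" idea in the summary can be sketched as follows. This is a minimal illustration, not the code from the PR: the helper names `fold_batch_norm_params` and `apply_folded_batch_norm` are hypothetical, and the sketch assumes `float` data in a contiguous `[N, C, image_size]` layout. Inference-time batch norm computes `(x - mean) / sqrt(var + eps) * weight + bias` per channel, which can be rewritten as `x * alpha + beta` with `alpha` and `beta` computed once per channel.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical helper (not from the PR): folds the per-channel parameters into
// a scale (alpha) and a shift (beta) so the hot loop is one multiply-add.
void fold_batch_norm_params(const std::vector<float>& weight,
                            const std::vector<float>& bias,
                            const std::vector<float>& mean,
                            const std::vector<float>& var,
                            float eps,
                            std::vector<float>& alpha,
                            std::vector<float>& beta) {
  const std::size_t n_channel = weight.size();
  assert(bias.size() == n_channel && mean.size() == n_channel &&
         var.size() == n_channel);
  alpha.resize(n_channel);
  beta.resize(n_channel);
  for (std::size_t c = 0; c < n_channel; ++c) {
    const float inv_std = 1.0f / std::sqrt(var[c] + eps);
    alpha[c] = weight[c] * inv_std;          // scale
    beta[c] = bias[c] - mean[c] * alpha[c];  // shift
  }
}

// Applies y = x * alpha[c] + beta[c] over a contiguous [N, C, image_size] buffer.
void apply_folded_batch_norm(const float* input, float* output,
                             const std::vector<float>& alpha,
                             const std::vector<float>& beta,
                             int64_t n_batch, int64_t image_size) {
  const int64_t n_channel = static_cast<int64_t>(alpha.size());
  for (int64_t n = 0; n < n_batch; ++n) {
    for (int64_t c = 0; c < n_channel; ++c) {
      const int64_t base = (n * n_channel + c) * image_size;
      const float a = alpha[static_cast<std::size_t>(c)];
      const float b = beta[static_cast<std::size_t>(c)];
      // Simple inner loop over contiguous memory, friendly to auto-vectorization.
      for (int64_t i = 0; i < image_size; ++i) {
        output[base + i] = input[base + i] * a + b;
      }
    }
  }
}
```

The fold costs one square root and one division per channel, so it is cheap whenever the channel count is much smaller than the total element count.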

Collaborator

@soumith soumith left a comment

ship if tests pass

Collaborator

@soumith soumith left a comment

remember that this file is not compiled with AVX2 or AVX enabled (internally or externally).

We have to use dispatch for that. The same folder (ATen/native) has some examples on that.

Collaborator

@soumith soumith left a comment

apparently it's already saturating memory bandwidth even without being vectorized

@jamesr66a jamesr66a added the module: vectorization Related to SIMD vectorization, e.g., Vec256 label Apr 11, 2019
@jamesr66a
Collaborator

I believe we actually do compile with -mavx in fbcode. I'm going to patch this and verify the performance in OSS

@cpuhrsch
Contributor

@zheng-xq - If you want to make use of the dispatch we use in OSS you need to move 'native/Normalization.cpp' into 'native/cpu/Normalization.cpp'. We can also do that as a separate step afterwards.

Comment thread aten/src/ATen/native/Normalization.cpp Outdated
Collaborator

@jamesr66a jamesr66a Apr 11, 2019

The RHS of this statement is computed in double precision when scalar_t=float, then it is being narrowed to float on assignment. Is that intended?

Collaborator

The LLVM vectorizer does this with that:

        vpslld  $31, %ymm5, %ymm5
        vpsrad  $31, %ymm5, %ymm5
        xorl    %r10d, %r10d
        movq    -64(%rbp), %r11         ## 8-byte Reload
        movq    -56(%rbp), %r14         ## 8-byte Reload
        .p2align        4, 0x90
LBB0_33:                                ## =>This Inner Loop Header: Depth=1
        vcvtps2pd       (%rdi,%rbx), %ymm6
        vcvtps2pd       16(%rdi,%rbx), %ymm7
        vaddpd  %ymm7, %ymm2, %ymm7
        vaddpd  %ymm6, %ymm2, %ymm6
        vsqrtpd %ymm6, %ymm6
        vsqrtpd %ymm7, %ymm7
        vdivpd  %ymm7, %ymm3, %ymm7
        vdivpd  %ymm6, %ymm3, %ymm6
        vcvtpd2ps       %ymm6, %xmm6
        vcvtpd2ps       %ymm7, %xmm7
        vinsertf128     $1, %xmm7, %ymm6, %ymm6
        vmovups (%rsi,%rbx), %ymm7
        vmaskmovps      (%rdx,%rbx), %ymm4, %ymm8
        vpandn  %ymm8, %ymm5, %ymm8
        vmulps  %ymm6, %ymm7, %ymm9
        vmovups %ymm9, (%r14,%rbx)
        vmulps  (%rcx,%rbx), %ymm6, %ymm6
        vmulps  %ymm6, %ymm7, %ymm6
        vsubps  %ymm6, %ymm8, %ymm6
        vmovups %ymm6, (%r11,%rbx)
        vcvtps2pd       32(%rdi,%rbx), %ymm6
        vcvtps2pd       48(%rdi,%rbx), %ymm7
        vaddpd  %ymm7, %ymm2, %ymm7
        vaddpd  %ymm6, %ymm2, %ymm6
        vsqrtpd %ymm6, %ymm6
        vsqrtpd %ymm7, %ymm7
        vdivpd  %ymm7, %ymm3, %ymm7
        vdivpd  %ymm6, %ymm3, %ymm6
        vcvtpd2ps       %ymm6, %xmm6
        vcvtpd2ps       %ymm7, %xmm7
        vinsertf128     $1, %xmm7, %ymm6, %ymm6
        vmovups 32(%rsi,%rbx), %ymm7
        vmaskmovps      32(%rdx,%rbx), %ymm4, %ymm8
        vpandn  %ymm8, %ymm5, %ymm8
        vmulps  %ymm6, %ymm7, %ymm9
        vmovups %ymm9, 32(%r14,%rbx)
        vmulps  32(%rcx,%rbx), %ymm6, %ymm6
        vmulps  %ymm6, %ymm7, %ymm6
        vsubps  %ymm6, %ymm8, %ymm6
        vmovups %ymm6, 32(%r11,%rbx)
        addq    $16, %r10
        addq    $64, %rbx
        addq    $2, %rax
        jne     LBB0_33
## %bb.34:
        testq   %r8, %r8
        je      LBB0_36

Contributor

I pulled it into godbolt here: https://godbolt.org/z/OdsS-W

Collaborator

I think we should figure out what flags we should use when analyzing this stuff. iirc the configuration is:

  1. FBCode: -mavx
  2. OSS build (default): (none)
  3. I build with -march=native -mavx -mavx2 on my mac

Do you know what flags are used in the distributed binaries?

Collaborator

I just confirmed that the OSS build with default settings emits SSE instructions

Contributor

The distributed binaries use runtime dispatch and are compiled with -mavx and -mavx2 -mfma if I remember this correctly.

Member

OSS only compiles code with -mavx (and -mavx2) if it's in the cpu/ subdirectory. This isn't in that directory so you'll only get standard x86-64 instructions (including SSE)

Contributor Author

Regarding "The RHS of this statement is computed in double precision when scalar_t=float, then it is being narrowed to float on assignment. Is that intended?".

It was intended for better accuracy because the epsilon is often very small. It would have no performance implication if (C << total_size). However, in degenerate cases where (N==1 && image_size==1), meaning (C==total_size), this is quite slow. So I changed the default to cast epsilon to scalar_t in this case.
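
The precision trade-off discussed in this thread can be sketched as below. The two helpers `inv_std_double_eps` and `inv_std_float_eps` are hypothetical names used only for illustration and are not taken from the PR. The sketch assumes `scalar_t = float` and an epsilon held as `double`: adding a `double` to a `float` promotes the expression to `double`, which is why the assembly above shows `vcvtps2pd`, `vsqrtpd`, `vdivpd`, and `vcvtpd2ps`.

```cpp
#include <cassert>
#include <cmath>
#include <type_traits>

// float + double is evaluated in double under the usual arithmetic conversions.
static_assert(std::is_same<decltype(1.0f + 1.0), double>::value,
              "float + double promotes to double");

// Hypothetical helper: epsilon stays double, so the add, sqrt, and divide run
// in double precision and the result is narrowed to float at the end.
inline float inv_std_double_eps(float var, double eps) {
  assert(var + eps > 0.0);
  return static_cast<float>(1.0 / std::sqrt(var + eps));
}

// Hypothetical helper: epsilon is cast to float first, so all arithmetic stays
// in single precision and no float<->double conversions are needed.
inline float inv_std_float_eps(float var, double eps) {
  const float eps_f = static_cast<float>(eps);
  assert(var + eps_f > 0.0f);
  return 1.0f / std::sqrt(var + eps_f);
}
```

The second form trades a small amount of accuracy for a loop that needs half as many vector lanes' worth of conversion and double-precision work, which matters most when the channel count equals the total element count.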

@jamesr66a
Collaborator

FWIW for BatchNorm1D this falls back to scalar code because only the inner loop at line 101 is vectorized, and image_size=1

@jamesr66a
Collaborator

This specialization gives me a 1.6x speedup on BatchNorm1D:

--- a/aten/src/ATen/native/Normalization.cpp
+++ b/aten/src/ATen/native/Normalization.cpp
@@ -96,13 +96,24 @@ void batch_norm_cpu_inference_contiguous(Tensor* output, const Tensor& input,
   // No need to use parallel_for as this function is supposed to be
   // memory-limited.
   // Keep the loop struture simple to make sure compiler vetorization kicks in.
-  for (int64_t n = 0; n < n_batch; ++n) {
-    for (int64_t c = 0; c < n_channel; ++c) {
-      for (int64_t i = 0; i < image_size; ++i) {
-        // Keep all the offset calculation within the inner loop for simplicity.
-        // Compilers are very good at hoisting the common part outside.
-        int64_t offset = n * n_channel * image_size + c * image_size + i;
-        output_data[offset] = input_data[offset] * alpha_data[c] + beta_data[c];
+  if (image_size != 1) {
+    for (int64_t n = 0; n < n_batch; ++n) {
+      for (int64_t c = 0; c < n_channel; ++c) {
+        for (int64_t i = 0; i < image_size; ++i) {
+          // Keep all the offset calculation within the inner loop for simplicity.
+          // Compilers are very good at hoisting the common part outside.
+          int64_t offset = n * n_channel * image_size + c * image_size + i;
+          output_data[offset] = input_data[offset] * alpha_data[c] + beta_data[c];
+        }
+      }
+    }
+  } else {
+    for (int64_t n = 0; n < n_batch; ++n) {
+      for (int64_t c = 0; c < n_channel; ++c) {
+          // Keep all the offset calculation within the inner loop for simplicity.
+          // Compilers are very good at hoisting the common part outside.
+          int64_t offset = n * n_channel + c;
+          output_data[offset] = input_data[offset] * alpha_data[c] + beta_data[c];

Comment thread aten/src/ATen/native/Normalization.cpp Outdated
Contributor

@gchanan gchanan Apr 11, 2019

can you make output "Tensor &" instead? That's the usual pattern for output parameters currently (you can search in this directory for functions ending in "_out").

Contributor Author

Done

Collaborator

@jamesr66a jamesr66a left a comment

We should address the things discussed in the comments

Comment thread aten/src/ATen/native/Normalization.cpp Outdated
Contributor

I don't know how much it matters, but computing the rsqrt as a distinct expression loses precision compared to doing an fdiv for each of alpha_data and beta_data. It's probably fine if that is necessary for performance, but if it's not, it would be nice to preserve the extra precision.

Contributor Author

This portion is performance critical. And from other experiments we ran before, it shouldn't cause any accuracy problem.

Collaborator

This loop is unvectorized in llvm because of the presence of the sqrt function (unless -ffast-math is set). Benchmark for the image_size=1 case:

batch_norm: data shape: [1, 65536, 1], bandwidth: 6.23 GB/s

And as a debug I printed out the timing for this loop ("first" loop) versus the loop below ("second" loop):

first time 0.000150609
second time 2.1667e-05

So, this loop is taking ~6x longer than the loop below. I am working with @ZolotukhinM to see if we can fix this

Collaborator

@cpuhrsch we've found that setting -fno-math-errno allows the sqrt call to be vectorized. This actually seems fairly safe. Do you think we can turn this flag on in ATen?
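
For anyone reproducing this, here is a minimal stand-in for the "first" loop. It is a sketch under assumptions: the function name `compute_inv_std` is hypothetical and the body is not the PR's exact code. The point is only that the loop contains a `std::sqrt` call, which by default may have to report errors through `errno`, and that side effect can keep the vectorizer from touching the loop.

```cpp
#include <cassert>
#include <cmath>
#include <cstdint>

// Hypothetical stand-in for the per-channel loop that contains the sqrt.
// Suggested experiment (flags exist in Clang; results are not verified here):
//   clang++ -O2 -Rpass=loop-vectorize -Rpass-missed=loop-vectorize file.cpp
//   clang++ -O2 -fno-math-errno -Rpass=loop-vectorize file.cpp
// The first command is expected to report a missed vectorization for this
// loop, the second a successful one, per the discussion above.
void compute_inv_std(const float* var, float eps, float* inv_std,
                     int64_t n_channel) {
  assert(n_channel >= 0);
  for (int64_t c = 0; c < n_channel; ++c) {
    inv_std[c] = 1.0f / std::sqrt(var[c] + eps);
  }
}
```

Unlike `-ffast-math`, `-fno-math-errno` does not relax floating-point semantics for the arithmetic itself; it only drops the `errno` reporting requirement for math library calls.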

@resistor
Contributor

resistor commented Apr 12, 2019

@jamesr66a

This specialization gives me a 1.6x speedup on BatchNorm1D:

--- a/aten/src/ATen/native/Normalization.cpp
+++ b/aten/src/ATen/native/Normalization.cpp
@@ -96,13 +96,24 @@ void batch_norm_cpu_inference_contiguous(Tensor* output, const Tensor& input,
   // No need to use parallel_for as this function is supposed to be
   // memory-limited.
   // Keep the loop struture simple to make sure compiler vetorization kicks in.
-  for (int64_t n = 0; n < n_batch; ++n) {
-    for (int64_t c = 0; c < n_channel; ++c) {
-      for (int64_t i = 0; i < image_size; ++i) {
-        // Keep all the offset calculation within the inner loop for simplicity.
-        // Compilers are very good at hoisting the common part outside.
-        int64_t offset = n * n_channel * image_size + c * image_size + i;
-        output_data[offset] = input_data[offset] * alpha_data[c] + beta_data[c];
+  if (image_size != 1) {
+    for (int64_t n = 0; n < n_batch; ++n) {
+      for (int64_t c = 0; c < n_channel; ++c) {
+        for (int64_t i = 0; i < image_size; ++i) {
+          // Keep all the offset calculation within the inner loop for simplicity.
+          // Compilers are very good at hoisting the common part outside.
+          int64_t offset = n * n_channel * image_size + c * image_size + i;
+          output_data[offset] = input_data[offset] * alpha_data[c] + beta_data[c];
+        }
+      }
+    }
+  } else {
+    for (int64_t n = 0; n < n_batch; ++n) {
+      for (int64_t c = 0; c < n_channel; ++c) {
+          // Keep all the offset calculation within the inner loop for simplicity.
+          // Compilers are very good at hoisting the common part outside.
+          int64_t offset = n * n_channel + c;
+          output_data[offset] = input_data[offset] * alpha_data[c] + beta_data[c];

Wouldn't flattening the loop nest be a simpler fix?
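
One possible reading of the flattening suggestion is sketched below. The function name `batch_norm_flat` is hypothetical, and whether a given compiler vectorizes this form (the index computation uses a division and a modulo) has not been verified here. The sketch assumes `float` data in a contiguous `[N, C, image_size]` layout with precomputed per-channel `alpha` and `beta`.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical flattened variant: a single loop over all elements, with the
// channel index recovered from the flat offset. This removes the dependence on
// image_size being large enough for the innermost loop to vectorize.
void batch_norm_flat(const float* input, float* output,
                     const float* alpha, const float* beta,
                     int64_t n_batch, int64_t n_channel, int64_t image_size) {
  assert(n_batch >= 0 && n_channel >= 0 && image_size >= 0);
  const int64_t total = n_batch * n_channel * image_size;
  for (int64_t offset = 0; offset < total; ++offset) {
    // For a contiguous NCHW-style layout: offset = (n * C + c) * image_size + i.
    const int64_t c = (offset / image_size) % n_channel;
    output[offset] = input[offset] * alpha[c] + beta[c];
  }
}
```

Compared with the two-branch specialization in the quoted diff, this keeps a single code path at the cost of per-element index arithmetic and a gather-like access to `alpha` and `beta`.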

@jamesr66a
Collaborator

jamesr66a commented Apr 12, 2019 via email

@cpuhrsch cpuhrsch requested a review from VitalyFedyunin April 12, 2019 20:04
@zheng-xq
Contributor Author

Thanks for the comments! PTAL

Collaborator

@jamesr66a jamesr66a left a comment

Looking good, but a few more comments.

Collaborator

Can you post the results from this benchmark in the PR description?

Comment thread aten/src/ATen/native/Normalization.cpp Outdated
Collaborator

I confirmed that having this file outside of cpu/ emits SSE instructions, but surprisingly (to me) we're still hitting machine BW peak in my tests. For future proofing, maybe we should add a TODO to indicate we should enable newer vector extensions in the future, in case the balance of BW vs compute throughput on future CPU SKUs changes

Summary:
Pull Request resolved: #19152

Adding a fast path for batch-norm CPU inference when all tensors are contiguous.
* Leverage vectorization through simple loops.
* Folding linear terms before computation.
* For resnext-101, this version gets 18.95 times faster.
* Add a microbenchmark:
* (buck build mode/opt -c python.package_style=inplace --show-output //caffe2/benchmarks/operator_benchmark:batchnorm_benchmark) && \
(OMP_NUM_THREADS=1 MKL_NUM_THREADS=1 buck-out/gen/caffe2/benchmarks/operator_benchmark/batchnorm_benchmark#binary.par)
* batch_norm: data shape: [1, 256, 3136], bandwidth: 22.26 GB/s
* batch_norm: data shape: [1, 65536, 1], bandwidth: 5.57 GB/s
* batch_norm: data shape: [128, 2048, 1], bandwidth: 18.21 GB/s

Reviewed By: soumith, BIT-silence

Differential Revision: D14889728

fbshipit-source-id: b2264ada175410f06c505ae57bd50d651410ced5
Collaborator

@jamesr66a jamesr66a left a comment

LGTM

zdevito pushed a commit to zdevito/ATen that referenced this pull request Apr 17, 2019
Summary:
Pull Request resolved: pytorch/pytorch#19152

Adding a fast path for batch-norm CPU inference when all tensors are contiguous.
* Leverage vectorization through simple loops.
* Folding linear terms before computation.
* For resnext-101, this version gets 18.95 times faster.
* Add a microbenchmark:
* (buck build mode/opt -c python.package_style=inplace --show-output //caffe2/benchmarks/operator_benchmark:batchnorm_benchmark) && \
(OMP_NUM_THREADS=1 MKL_NUM_THREADS=1 buck-out/gen/caffe2/benchmarks/operator_benchmark/batchnorm_benchmark#binary.par)
* batch_norm: data shape: [1, 256, 3136], bandwidth: 22.26 GB/s
* batch_norm: data shape: [1, 65536, 1], bandwidth: 5.57 GB/s
* batch_norm: data shape: [128, 2048, 1], bandwidth: 18.21 GB/s

Reviewed By: soumith, BIT-silence

Differential Revision: D14889728

fbshipit-source-id: 20c9e567e38ff7dbb9097873b85160eca2b0a795
@facebook-github-bot
Contributor

This pull request has been merged in 5627940.

@salexspb
Contributor

salexspb commented Apr 18, 2019

Do we actually want to compare PT vs Caffe2 implementations as well?

I see that in Caffe2 we use Eigen; could it have worked here?
Btw, do we want to start sharing ops rather than implementing new ones from scratch and keeping old ones? (cc @dzhulgakov )

@soumith
Collaborator

soumith commented Apr 19, 2019

About Eigen in particular: we put some effort into developing our own packet abstraction (vec256.h) and a switchable threadpool, so Eigen doesn't add a lot of value and may not be necessary.

@cpuhrsch
Contributor

cpuhrsch commented Apr 19, 2019

@salexspb - Eigen is an entirely separate topic in itself. So far we haven't seen any advantages in using it. Ping me directly if you want to talk more and see what we've done already. In general, a rigorous comparison to evaluate whether adding that dependency is worthwhile is a whole project in and of itself.

@salexspb
Contributor

I am confused about the dependency part. Don't we already use it in Caffe2? How is the unified build working then?

@cpuhrsch
Contributor

@salexspb - yes, that's true we already have it as part of our build chain. In terms of dependency, I mean includes. It's adding complexity to the code.

In general, we should have comparison benchmarks for operators like this which are already supported by Caffe2 and then see if we can recycle them. But it's often also worthwhile to see whether we can write faster code by using ATen's abstractions.

For this particular operator, I agree that it could be good to write a side-by-side comparison in a single script to compare Caffe2's and PyTorch's implementation.

facebook-github-bot pushed a commit that referenced this pull request Apr 23, 2019
Summary:
As suggested in #19152 (comment), this may give the compiler more opportunities for auto-vectorization
Pull Request resolved: #19552

Differential Revision: D15048358

Pulled By: jamesr66a

fbshipit-source-id: db2c2c515c3e9f7d22305c039ab0c8a867fc43a2