[PERFORMANCE] [v1.x] Layer normalization code from Marian for CPU #19601
Conversation
Experiment with `OMP_NUM_THREADS=4`; times in seconds on a c5.12xlarge.
| batch x channel | New code (s) | MKL (s) |
|---|---|---|
| 1x 32 | 0.0000288| 0.0000278|
| 128x 32 | 0.0000308| 0.0000311|
| 2560x 32 | 0.0000712| 0.0000672|
| 4096x 32 | 0.0000946| 0.0000910|
| 8192x 32 | 0.0001597| 0.0001523|
|16384x 32 | 0.0002905| 0.0002619|
| 1x 64 | 0.0000264| 0.0000256|
| 128x 64 | 0.0000339| 0.0000330|
| 2560x 64 | 0.0000829| 0.0000972|
| 4096x 64 | 0.0001137| 0.0001356|
| 8192x 64 | 0.0002027| 0.0002435|
|16384x 64 | 0.0003715| 0.0004639|
| 1x 128 | 0.0000262| 0.0000263|
| 128x 128 | 0.0000325| 0.0000389|
| 2560x 128 | 0.0001074| 0.0001580|
| 4096x 128 | 0.0001505| 0.0002336|
| 8192x 128 | 0.0002861| 0.0004481|
|16384x 128 | 0.0005648| 0.0008613|
| 1x 256 | 0.0000273| 0.0000276|
| 128x 256 | 0.0000390| 0.0000431|
| 2560x 256 | 0.0001533| 0.0002811|
| 4096x 256 | 0.0002258| 0.0004300|
| 8192x 256 | 0.0004300| 0.0008464|
|16384x 256 | 0.0010436| 0.0017613|
| 1x 512 | 0.0000256| 0.0000302|
| 128x 512 | 0.0000408| 0.0000551|
| 2560x 512 | 0.0002444| 0.0005225|
| 4096x 512 | 0.0003828| 0.0008147|
| 8192x 512 | 0.0008832| 0.0017192|
|16384x 512 | 0.0058463| 0.0074497|
| 1x 768 | 0.0000252| 0.0000308|
| 128x 768 | 0.0000450| 0.0000676|
| 2560x 768 | 0.0003440| 0.0007719|
| 4096x 768 | 0.0005890| 0.0013346|
| 8192x 768 | 0.0014946| 0.0026145|
|16384x 768 | 0.0089495| 0.0113557|
| 1x 1024 | 0.0000285| 0.0000308|
| 128x 1024 | 0.0000487| 0.0000786|
| 2560x 1024 | 0.0004614| 0.0010190|
| 4096x 1024 | 0.0008083| 0.0017376|
| 8192x 1024 | 0.0059020| 0.0075588|
|16384x 1024 | 0.0116553| 0.0146855|
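For quick comparison, the speedup of the new code over MKL can be computed row by row from the table above. A minimal sketch, using three rows copied verbatim from the table (the variable names are illustrative):

```python
# Speedup of the new code relative to MKL for selected rows of the
# table above (times in seconds, copied from the table).
rows = {
    (16384, 32):  (0.0002905, 0.0002619),   # MKL is slightly faster here
    (2560, 768):  (0.0003440, 0.0007719),
    (16384, 512): (0.0058463, 0.0074497),
}
for (batch, channel), (new_code, mkl) in rows.items():
    speedup = mkl / new_code  # >1 means the new code is faster
    print("{:5d}x{:4d}: {:.2f}x".format(batch, channel, speedup))
```

This matches the claimed 0.9x lower bound (MKL wins at narrow channels like 32) and the larger gains at wide channels.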
Benchmark program
```python
import mxnet as mx
import time
def time_procedure(shape, count):
    data = mx.nd.random_uniform(shape=shape, low=-1.0, high=1.0)
    factors = mx.nd.random_uniform(shape=(shape[-1],))
    mx.nd.waitall()
    begin = time.time()
    for i in range(0, count):
        out = mx.nd.LayerNorm(data, factors, factors)
    mx.nd.waitall()
    return (time.time() - begin) / count

count = 200
for channel in [32, 64, 128, 256, 512, 768, 1024]:
    for batch in [1, 128, 2560, 4096, 8192, 16384]:
        s = (batch, channel)
        timing = time_procedure(s, count)
        print("{:5d}x{:5d} | {:.7f}".format(s[0], s[1], timing))
```
Hey @kpuatamazon, thanks for submitting the PR!

CI supported jobs: [clang, windows-cpu, centos-cpu, miscellaneous, windows-gpu, unix-gpu, edge, sanity, unix-cpu, website, centos-gpu]

Lint broken?

Jenkins CI successfully triggered: [sanity]

@mxnet-bot run ci [sanity] Maybe #19604 fixed lint?

Jenkins CI successfully triggered: [sanity]

@mxnet-bot run ci [centos-cpu, centos-gpu, clang, edge, miscellaneous, unix-cpu, unix-gpu, website, windows-cpu, windows-gpu] These have been "Expected" for days; it seems the results got lost.

@mxnet-bot run ci [website] The bot didn't respond, is anybody home?

@mxnet-bot run ci [unix-cpu] #19081 seed 675318784 causes the test to fail in v1.x as well.

Jenkins CI successfully triggered: [unix-cpu]

I restarted the CI jobs a few times; it looks like it's passing now. Is it possible that the MKL implementation's performance might improve in the future? Should we keep that and hide it behind a build flag, making the Marian implementation the default?

Hi @samskalicky, as requested there is now a build option. My one-day-a-week contract ends 31 December 2020, so this is partly a goodbye and a hope to get this in. I will be in today and probably 28 December. Afterwards, I am just @kpu.
samskalicky left a comment:
Thanks for adding the MKL option, LGTM!
I've merged the latest v1.x in, added the …. Today is my last day. Hope it works.

What are the next steps for this PR? Is this ready to be merged?
Description
Adds a CPU kernel for LayerNorm that handles the common case of axis = -1. This is based upon the implementation from Marian at https://github.com/marian-nmt/marian-dev/blob/3b468e462809fe42a01a717c8d9307c465e6c35e/src/tensors/cpu/tensor_operators.cpp#L1047-L1087 .
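For reference, LayerNorm over the last axis normalizes each row to zero mean and unit variance, then scales and shifts it. A minimal NumPy sketch of the math the kernel computes (the function name and `eps` value here are illustrative, not the PR's code):

```python
import numpy as np

def layer_norm_last_axis(x, gamma, beta, eps=1e-5):
    # Normalize each row over the last axis (axis=-1), the common
    # case this PR's CPU kernel specializes.
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return gamma * (x - mean) / np.sqrt(var + eps) + beta

x = np.random.rand(4, 8).astype(np.float32)
out = layer_norm_last_axis(x, np.ones(8, np.float32), np.zeros(8, np.float32))
# Each row of `out` now has mean ~0 and variance ~1.
```

Specializing to axis=-1 makes each row contiguous in memory, which is what lets the Marian-style kernel vectorize the mean/variance passes.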
Compared to the MXNet-internal generic implementation, the kernel is 1.6-29x faster. When used in Sockeye, end-to-end translation is 14% faster.
Compared to the MKL implementation, the kernel is 0.9-2.28x as fast. Marian's is faster than MKL for all tested channel widths above 32.
Checklist
Essentials
`test_operator.py:test_layer_norm` covers this well and it passes.
Changes
Benchmarks
Speed
Compiled with `cmake -DCMAKE_BUILD_TYPE=Release -DUSE_MKLDNN=ON -DUSE_CUDA=OFF -DUSE_TVM_OP=OFF -DUSE_MKL_IF_AVAILABLE=OFF -DCMAKE_C_COMPILER=gcc-8 -DCMAKE_CXX_COMPILER=g++-8 -GNinja`, except for the MKL case, which used `-DUSE_MKL_IF_AVAILABLE=ON`. Run with `export OMP_NUM_THREADS=4`.
Benchmark program
Here are the results (in seconds). Yes, I included the first run. Make your JIT faster.
AWS Sockeye
Observed a 14% speedup in end-to-end machine translation with Sockeye: Sockeye 2.2 (29795b82) on a c5.12xlarge with `export OMP_NUM_THREADS=4`, translating a test set. Compiled on Ubuntu 18 with `cmake -DCMAKE_BUILD_TYPE=Release -DUSE_MKLDNN=ON -DUSE_CUDA=OFF -DUSE_TVM_OP=OFF -DUSE_MKL_IF_AVAILABLE=OFF -DCMAKE_C_COMPILER=gcc-8 -DCMAKE_CXX_COMPILER=g++-8 -GNinja ..` Note: no MKL.
Before
After
The above runs were done as normal, without the profiler. I then turned the profiler on. We can see that LayerNorm is consuming a substantial amount of time:
Before
After
The new implementation is 7.21x as fast on average according to the profiler.
The number of LayerNorm invocations changes by 0.02% because beam search iterations are affected by tie breaking.
Unit test
Before: 62.210s
After: 61.321s
But note unit tests spend most of their time comparing things rather than running the kernels.
Comments