-
Notifications
You must be signed in to change notification settings - Fork 6.7k
Poor performance of the libmxnet if OMP_PLACES environment variable is present #14087
Description
Description
Default build depends on both GOMP and Intel OMP runtime.
Environment info (Required)
----------Python Info----------
Version : 3.5.2
Compiler : GCC 5.4.0 20160609
Build : ('default', 'Nov 12 2018 13:43:14')
Arch : ('64bit', 'ELF')
------------Pip Info-----------
Version : 8.1.1
Directory : /usr/lib/python3/dist-packages/pip
----------MXNet Info-----------
Version : 1.5.0
Directory : /home/stsukrov/workspace/mxnet/python/mxnet
Hashtag not found. Not installed from pre-built package.
----------System Info----------
Platform : Linux-4.4.0-1074-aws-x86_64-with-Ubuntu-16.04-xenial
system : Linux
node : ip-172-31-1-212
release : 4.4.0-1074-aws
version : #84-Ubuntu SMP Thu Dec 6 08:57:58 UTC 2018
----------Hardware Info----------
machine : x86_64
processor : x86_64
Architecture: x86_64
CPU op-mode(s): 32-bit, 64-bit
Byte Order: Little Endian
CPU(s): 72
On-line CPU(s) list: 0-71
Thread(s) per core: 2
Core(s) per socket: 18
Socket(s): 2
NUMA node(s): 2
Vendor ID: GenuineIntel
CPU family: 6
Model: 85
Model name: Intel(R) Xeon(R) Platinum 8124M CPU @ 3.00GHz
Stepping: 4
CPU MHz: 3000.000
BogoMIPS: 6000.00
Hypervisor vendor: KVM
Virtualization type: full
L1d cache: 32K
L1i cache: 32K
L2 cache: 1024K
L3 cache: 25344K
NUMA node0 CPU(s): 0-17,36-53
NUMA node1 CPU(s): 18-35,54-71
Flags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ss ht syscall nx pdpe1gb rdtscp lm constant_tsc arch_perfmon rep_good nopl xtopology nonstop_tsc aperfmperf pni pclmulqdq monitor ssse3 fma cx16 pcid sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand hypervisor lahf_lm abm 3dnowprefetch invpcid_single kaiser fsgsbase tsc_adjust bmi1 hle avx2 smep bmi2 erms invpcid rtm mpx avx512f rdseed adx smap clflushopt clwb avx512cd xsaveopt xsavec xgetbv1 ida arat pku
----------Network Test----------
Setting timeout: 10
Timing for Conda: https://repo.continuum.io/pkgs/free/, DNS: 0.0155 sec, LOAD: 0.0708 sec.
Timing for FashionMNIST: https://apache-mxnet.s3-accelerate.dualstack.amazonaws.com/gluon/dataset/fashion-mnist/train-labels-idx1-ubyte.gz, DNS: 0.0348 sec, LOAD: 0.2342 sec.
Timing for PYPI: https://pypi.python.org/pypi/pip, DNS: 0.0010 sec, LOAD: 0.3944 sec.
Timing for Gluon Tutorial(en): http://gluon.mxnet.io, DNS: 0.0896 sec, LOAD: 0.7671 sec.
Timing for MXNet: https://github.com/apache/incubator-mxnet, DNS: 0.0006 sec, LOAD: 0.9998 sec.
Timing for Gluon Tutorial(cn): https://zh.gluon.ai, DNS: 0.2689 sec, LOAD: 1.0834 sec.
Compiler (gcc/clang/mingw/visual studio):
gcc-5
MXNet commit hash: d684c59
Build info (Required if built from source)
Build config:
make -j20
Minimum reproducible example
Get mxnet
Build from scratch
run the benchmark - get performance on mxnet/batch size=1 like 200img/sec
set OMP_PLACES=cores
rerun the benchmark
get performance like 40 - which is poor
Steps to reproduce
(Paste the commands you ran that produced the error.)
- git clone ...
- make
- python3 benchmark_score.py
- export OMP_PLACES=cores
Rerunning with
export OMP_DISPLAY_ENV=verbose we have:
OPENMP DISPLAY ENVIRONMENT BEGIN
_OPENMP = '201511'
OMP_DYNAMIC = 'FALSE'
OMP_NESTED = 'FALSE'
OMP_NUM_THREADS = '72'
OMP_SCHEDULE = 'DYNAMIC'
OMP_PROC_BIND = 'TRUE'
OMP_PLACES = '{0,36},{1,37},{2,38},{3,39},{4,40},{5,41},{6,42},{7,43},{8,44},{9,45},{10,46},{11,47},{12,48},{13,49},{14,50},{15,51},{16,52},{17,53},{18,54},{19,55},{20,56},{21,57},{22,58},{23,59},{24,60},{25,61},{26,62},{27,63},{28,64},{29,65},{30,66},{31,67},{32,68},{33,69},{34,70},{35,71}'
OMP_STACKSIZE = '0'
OMP_WAIT_POLICY = 'PASSIVE'
OMP_THREAD_LIMIT = '4294967295'
OMP_MAX_ACTIVE_LEVELS = '2147483647'
OMP_CANCELLATION = 'FALSE'
OMP_DEFAULT_DEVICE = '0'
OMP_MAX_TASK_PRIORITY = '0'
GOMP_CPU_AFFINITY = ''
GOMP_STACKSIZE = '0'
GOMP_SPINCOUNT = '300000'
OPENMP DISPLAY ENVIRONMENT END
OPENMP DISPLAY ENVIRONMENT BEGIN
_OPENMP='201611'
[host] KMP_ABORT_DELAY='0'
[host] KMP_ADAPTIVE_LOCK_PROPS='1,1024'
[host] KMP_ALIGN_ALLOC='64'
[host] KMP_ALL_THREADPRIVATE='288'
[host] KMP_ATOMIC_MODE='2'
[host] KMP_BLOCKTIME='200'
[host] KMP_CPUINFO_FILE: value is not defined
[host] KMP_DETERMINISTIC_REDUCTION='FALSE'
[host] KMP_DEVICE_THREAD_LIMIT='2147483647'
[host] KMP_DISP_HAND_THREAD='FALSE'
[host] KMP_DISP_NUM_BUFFERS='7'
[host] KMP_DUPLICATE_LIB_OK='FALSE'
[host] KMP_FORCE_REDUCTION: value is not defined
[host] KMP_FOREIGN_THREADS_THREADPRIVATE='TRUE'
[host] KMP_FORKJOIN_BARRIER='2,2'
[host] KMP_FORKJOIN_BARRIER_PATTERN='hyper,hyper'
[host] KMP_FORKJOIN_FRAMES='TRUE'
[host] KMP_FORKJOIN_FRAMES_MODE='3'
[host] KMP_GTID_MODE='3'
[host] KMP_HANDLE_SIGNALS='FALSE'
[host] KMP_HOT_TEAMS_MAX_LEVEL='1'
[host] KMP_HOT_TEAMS_MODE='0'
[host] KMP_INIT_AT_FORK='TRUE'
[host] KMP_INIT_WAIT='2048'
[host] KMP_ITT_PREPARE_DELAY='0'
[host] KMP_LIBRARY='throughput'
[host] KMP_LOCK_KIND='queuing'
[host] KMP_MALLOC_POOL_INCR='1M'
[host] KMP_NEXT_WAIT='1024'
[host] KMP_NUM_LOCKS_IN_BLOCK='1'
[host] KMP_PLAIN_BARRIER='2,2'
[host] KMP_PLAIN_BARRIER_PATTERN='hyper,hyper'
[host] KMP_REDUCTION_BARRIER='1,1'
[host] KMP_REDUCTION_BARRIER_PATTERN='hyper,hyper'
[host] KMP_SCHEDULE='static,balanced;guided,iterative'
[host] KMP_SETTINGS='FALSE'
[host] KMP_SPIN_BACKOFF_PARAMS='4096,100'
[host] KMP_STACKOFFSET='64'
[host] KMP_STACKPAD='0'
[host] KMP_STACKSIZE='4M'
[host] KMP_STORAGE_MAP='FALSE'
[host] KMP_TASKING='2'
[host] KMP_TASKLOOP_MIN_TASKS='0'
[host] KMP_TASK_STEALING_CONSTRAINT='1'
[host] KMP_TEAMS_THREAD_LIMIT='72'
[host] KMP_TOPOLOGY_METHOD='all'
[host] KMP_USER_LEVEL_MWAIT='FALSE'
[host] KMP_VERSION='FALSE'
[host] KMP_WARNINGS='TRUE'
[host] OMP_AFFINITY_FORMAT='OMP: pid %P tid %T thread %n bound to OS proc set {%a}'
[host] OMP_ALLOCATOR='omp_default_mem_alloc'
[host] OMP_CANCELLATION='FALSE'
[host] OMP_DEFAULT_DEVICE='0'
[host] OMP_DISPLAY_AFFINITY='FALSE'
[host] OMP_DISPLAY_ENV='VERBOSE'
[host] OMP_DYNAMIC='FALSE'
[host] OMP_MAX_ACTIVE_LEVELS='2147483647'
[host] OMP_MAX_TASK_PRIORITY='0'
[host] OMP_NESTED='FALSE'
[host] OMP_NUM_THREADS: value is not defined
[host] OMP_PLACES='cores'
[host] OMP_PROC_BIND='spread'
[host] OMP_SCHEDULE='static'
[host] OMP_STACKSIZE='4M'
OMP_TARGET_OFFLOAD=DEFAULT
[host] OMP_THREAD_LIMIT='2147483647'
[host] OMP_TOOL='enabled'
[host] OMP_TOOL_LIBRARIES: value is not defined
[host] OMP_WAIT_POLICY='PASSIVE'
[host] KMP_AFFINITY='noverbose,warnings,respect,granularity=core,compact,0,0'
OPENMP DISPLAY ENVIRONMENT END
The first one is probably GOMP
The second one is IOMP
Setting KMP_AFFINITY=verbose gives this:
OMP: Info #212: KMP_AFFINITY: decoding x2APIC ids.
OMP: Info #210: KMP_AFFINITY: Affinity capable, using global cpuid leaf 11 info
OMP: Info #154: KMP_AFFINITY: Initial OS proc set respected: 0,36
OMP: Info #156: KMP_AFFINITY: 2 available OS procs
OMP: Info #157: KMP_AFFINITY: Uniform topology
OMP: Info #179: KMP_AFFINITY: 1 packages x 1 cores/pkg x 2 threads/core (1 total cores)
OMP: Info #214: KMP_AFFINITY: OS proc to physical thread map:
OMP: Info #171: KMP_AFFINITY: OS proc 0 maps to package 0 thread 0
OMP: Info #171: KMP_AFFINITY: OS proc 36 maps to package 0 thread 1
which may mean Intel OMP suddenly thinks, it's just one core
Without OMP_PLACES looks fine:
OMP: Info #157: KMP_AFFINITY: Uniform topology
OMP: Info #159: KMP_AFFINITY: 2 packages x 18 cores/pkg x 2 threads/core (36 total cores)
What have you tried to solve it?
- Symlink libgomp.so.1 to intel omp
- Got the reasonable performance back