ENH: move exp, log, frexp, ldexp to SIMD dispatching #18101

Merged: 1 commit merged into numpy:master on Jan 7, 2021

Conversation

seiko2plus (Member)

This is the second patch in a series of pull requests that aim to facilitate the migration process to our new SIMD interface (NPYV).

The series focuses on getting rid of the main umath SIMD source simd.inc, which contains almost all the SIMD kernels, by splitting it into several dispatch-able sources without changing the base code; this keeps the review process manageable during the move to NPYV (universal intrinsics).

In this patch, we have moved the following raw SIMD loops to the new dispatcher:

  • FLOAT_exp, DOUBLE_exp
  • FLOAT_log, DOUBLE_log
  • FLOAT_frexp, DOUBLE_frexp
  • FLOAT_ldexp, DOUBLE_ldexp
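
For readers unfamiliar with the machinery, a rough sketch of the dispatch-able source pattern follows; the target list, file skeleton, and loop body are illustrative, not quoted from this PR.

/*
 * Illustrative *.dispatch.c.src skeleton. The build scans the @targets
 * directive and compiles the file once per enabled feature set;
 * NPY_CPU_DISPATCH_CURFX suffixes each symbol with its target.
 * The target list below is an example only.
 */
/*@targets
 ** $maxopt baseline
 ** (avx2 fma3) avx512f
 **/

NPY_NO_EXPORT void NPY_CPU_DISPATCH_CURFX(FLOAT_exp)
(char **args, npy_intp const *dimensions, npy_intp const *steps, void *NPY_UNUSED(data))
{
    /* the unchanged raw SIMD loop body moves here */
}

The generic umath code then resolves the best compiled variant at runtime via NPY_CPU_DISPATCH_CALL, so the kernels themselves carry no per-architecture boilerplate.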

@seiko2plus force-pushed the ditch_simd_exp_log branch 3 times, most recently from 2fa0b60 to 6a4b53a on January 2, 2021 08:28
@seiko2plus marked this pull request as ready for review on January 2, 2021 09:16
@mattip changed the title from "ENH, SIMD: Ditching the old CPU dispatcher (Exp & Log)" to "ENH: move exp, log, frexp, ldexp to SIMD dispatching" on Jan 3, 2021
@mattip added the "component: SIMD" (Issues in SIMD (fast instruction sets) code or machinery) label on Jan 3, 2021
mattip (Member) commented Jan 3, 2021

Could you benchmark the loops to make sure we did not negatively impact x86_64 performance, and if possible on ARM64 as well?

It seems coverage thinks some of the refactored code is not covered. Is there a test we could add to hit that code path?

/* Review context: AVX2 masked-gather wrapper; the opening lines are
 * reconstructed (hypothetical name), only the gather call is from the diff. */
static NPY_INLINE __m256d
avx2_masked_gather_pd(__m256d src, npy_double *addr, __m128i vindex, __m256d mask)
{
    return _mm256_mask_i32gather_pd(src, addr, vindex, mask, 8);
}
Qiyu8 (Member)

I'm considering reconstructing exp/log using universal intrinsics so that the performance improvement benefits other architectures, but some intrinsics such as _mm256_mask_i32gather_pd, _mm256_blendv_ps, and _mm512_getmant_ps are hard to simulate or risk being slower than scalar code.

Member

@Qiyu8 is this an actionable review comment on this PR? It seems like more of a general statement of intent, correct?

seiko2plus (Member Author), Jan 7, 2021

but some intrinsics such as _mm256_mask_i32gather_pd, _mm256_blendv_ps, and _mm512_getmant_ps are hard to simulate or risk being slower than scalar code.

What do you mean by hard to simulate? We have already done it, except for _mm512_getmant_ps:

*_i32gather_* -> npyv_loadn_* and npyv_loadn_till_*
*_i32scatter_* -> npyv_storen_* and npyv_storen_till_*
_mm256_blend* -> npyv_select_*
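
A hedged sketch (not code from this PR) of that mapping with the universal intrinsics; it assumes numpy's internal simd/simd.h and a build where NPY_SIMD is enabled, and the helper name is hypothetical:

#include "simd/simd.h"

#if NPY_SIMD
/* Gather one vector of floats spaced `stride` elements apart
 * (npyv_loadn_* standing in for *_i32gather_*), then blend with
 * `fallback` wherever `mask` is unset (npyv_select_* standing in
 * for _mm256_blend*). */
static NPY_INLINE npyv_f32
gather_then_blend_f32(const npy_float *src, npy_intp stride,
                      npyv_b32 mask, npyv_f32 fallback)
{
    npyv_f32 gathered = npyv_loadn_f32(src, stride);
    return npyv_select_f32(mask, gathered, fallback);
}
#endif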

risk being slower than scalar code.

Native x86 gather and scatter operations are too expensive, and the same goes for the emulated ones; we should use them without specializing only in large kernels. Without specializing -> loops_trigonometric.dispatch.c.src; with specializing -> loops_unary_fp.dispatch.c.src.

seiko2plus (Member Author)

Note: _mm256_blend* blend operations are not expensive on almost all SIMD extensions.

mattip (Member) commented Jan 7, 2021

LGTM. This is a reshuffle to enable changes in the future, with no real code changes.

@mattip merged commit 73fe877 into numpy:master on Jan 7, 2021
seiko2plus (Member Author) commented Jan 7, 2021

@mattip,

Could you benchmark the loops to make sure we did not negatively impact x86_64 performance, and if possible on ARM64 as well?

Same as before; nothing has changed.

It seems coverage thinks some of the refactored code is not covered. Is there a test we could add to hit that code path?

All cases are covered by tests, but the issue here is that coverage runs only for the highest SIMD target supported by the local machine and leaves the other paths untested.
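
To make that concrete, here is a self-contained, hedged illustration (hypothetical names and feature probes, not numpy's actual dispatch macros) of why a single machine only ever exercises one path:

#include <stdio.h>

static void exp_baseline(void) { puts("baseline"); }
static void exp_avx2(void)     { puts("AVX2"); }
static void exp_avx512f(void)  { puts("AVX512F"); }

/* stand-ins for real CPU-feature probes; pretend this box has AVX2 only */
static int cpu_has_avx512f(void) { return 0; }
static int cpu_has_avx2(void)    { return 1; }

int main(void)
{
    /* the dispatcher resolves once to the highest supported target,
     * so coverage on this machine sees only the AVX2 kernel */
    void (*exp_loop)(void) =
        cpu_has_avx512f() ? exp_avx512f :
        cpu_has_avx2()    ? exp_avx2    :
                            exp_baseline;
    exp_loop();
    return 0;
}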

Currently, I'm working on a lightweight execution tracer similar to opencv/7101; it will allow us to test the dispatcher paths and also provide elapsed CPU ticks for each execution region.

Labels: 01 - Enhancement, component: SIMD