ENH: Convert arithmetic from C universal intrinsics to C++ using Highway #27402

Open · wants to merge 3 commits into base: main

Conversation


@abhishek-iitmadras abhishek-iitmadras commented Sep 16, 2024

This patch converts numpy/_core/src/umath/loops_arithmetic.dispatch.c.src to numpy/_core/src/umath/loops_arithmetic.dispatch.cpp using Google Highway, as per the NEP 54 resolutions.

Requirements status:

  1. The build succeeds on all relevant platforms.
  2. All tests pass, especially the relevant ones: numpy/_core/tests/test_ufunc.py, numpy/_core/tests/test_umath.py, and numpy/_core/tests/test_umath_accuracy.py.
  3. Performance regressions of up to ~2.7x (see the benchmark results further down).
  4. CI passes on all relevant platforms (only tested once the OSS PR was raised).

@Mousius Mousius added the "component: SIMD" label (Issues in SIMD (fast instruction sets) code or machinery) on Sep 16, 2024
@abhishek-iitmadras abhishek-iitmadras changed the title from "Convert arithmetic from C universal intrinsics to C++ using Highway" to "ENH: Convert arithmetic from C universal intrinsics to C++ using Highway" on Sep 16, 2024
 || (defined __clang__ && __clang_major__ < 8))
-# define NPY_ALIGNOF(type) offsetof(struct {char c; type v;}, v)
+#define NPY_ALIGNOF(type) __alignof__(type)
Member

What is the motivation to change these macros? You don't seem to be using them anywhere either?

Contributor Author

[screenshot: compiler error caused by the offsetof-based NPY_ALIGNOF definition]

Hi @r-devulap,
The change to the NPY_ALIGNOF macro, replacing offsetof with __alignof__, is motivated by the need to comply with modern C++ standards and to eliminate compiler errors. The original use of offsetof with an anonymous struct is not valid in standard C++11 and later, which leads to the errors shown in the screenshot above.

On the other hand, modern compilers provide alignof, which is built in, efficient, and compatible with both GCC and Clang, ensuring adherence to the C++11 standard and simplifying the alignment calculation.

Contributor Author

@abhishek-iitmadras abhishek-iitmadras Sep 27, 2024

  1. On GCC and Clang (Linux/macOS), the __alignof__ operator is supported and returns the required alignment of a type.
  2. On MSVC, __alignof__ is not recognized, so alignof has to be used instead.

This mismatch is what makes the Windows CI, which builds with the MSVC compiler, fail.
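
For reference, a minimal sketch of one way to reconcile the two spellings (hypothetical macro name MY_ALIGNOF; the real NumPy macro is NPY_ALIGNOF and its actual definition may differ):

```cpp
// The old C trick is shown only as a comment: defining a struct inside
// offsetof() is not valid standard C++, which is what triggers the errors above.
//   #define NPY_ALIGNOF(type) offsetof(struct {char c; type v;}, v)

#if defined(_MSC_VER) && !defined(__clang__)
    // MSVC does not recognize __alignof__, so use the standard C++11 alignof.
    #define MY_ALIGNOF(type) alignof(type)
#else
    // GCC/Clang accept the __alignof__ extension.
    #define MY_ALIGNOF(type) __alignof__(type)
#endif

static_assert(MY_ALIGNOF(int) == alignof(int), "alignment sanity check");
```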

Member

The original use of offsetof with an anonymous struct is not valid in standard C++11 and later, which leads to the errors shown in the screenshot above.

Does this happen on the main branch as well? I suggest we make a separate PR for this change, since it seems unrelated to this patch.

Contributor Author

Does this happen on the main branch as well?

No

@r-devulap
Member

@abhishek-iitmadras Is this patch still WIP or ready for review? I noticed this patch is missing the vsx4-specific optimizations from the original source file:

static inline void
vsx4_simd_divide_contig_@sfx@(char **args, npy_intp len)
{
npyv_lanetype_@sfx@ *src1 = (npyv_lanetype_@sfx@ *) args[0];
npyv_lanetype_@sfx@ *src2 = (npyv_lanetype_@sfx@ *) args[1];
npyv_lanetype_@sfx@ *dst1 = (npyv_lanetype_@sfx@ *) args[2];
const npyv_@sfx@ vzero = npyv_zero_@sfx@();
const int vstep = npyv_nlanes_@sfx@;
for (; len >= vstep; len -= vstep, src1 += vstep, src2 += vstep,
dst1 += vstep) {
npyv_@sfx@ a = npyv_load_@sfx@(src1);
npyv_@sfx@ b = npyv_load_@sfx@(src2);
npyv_@sfx@ c = vsx4_div_@sfx@(a, b);
npyv_store_@sfx@(dst1, c);
if (NPY_UNLIKELY(vec_any_eq(b, vzero))) {
npy_set_floatstatus_divbyzero();
}
}
for (; len > 0; --len, ++src1, ++src2, ++dst1) {
const npyv_lanetype_@sfx@ a = *src1;
const npyv_lanetype_@sfx@ b = *src2;
if (NPY_UNLIKELY(b == 0)) {
npy_set_floatstatus_divbyzero();
*dst1 = 0;
} else{
*dst1 = a / b;
}
}
npyv_cleanup();
}
/**end repeat**/
/**begin repeat
* Signed types
* #sfx = s8, s16, s32, s64#
* #len = 8, 16, 32, 64#
*/
static inline void
vsx4_simd_divide_contig_@sfx@(char **args, npy_intp len)
{
npyv_lanetype_@sfx@ *src1 = (npyv_lanetype_@sfx@ *) args[0];
npyv_lanetype_@sfx@ *src2 = (npyv_lanetype_@sfx@ *) args[1];
npyv_lanetype_@sfx@ *dst1 = (npyv_lanetype_@sfx@ *) args[2];
const npyv_@sfx@ vneg_one = npyv_setall_@sfx@(-1);
const npyv_@sfx@ vzero = npyv_zero_@sfx@();
const npyv_@sfx@ vmin = npyv_setall_@sfx@(NPY_MIN_INT@len@);
npyv_b@len@ warn_zero = npyv_cvt_b@len@_@sfx@(npyv_zero_@sfx@());
npyv_b@len@ warn_overflow = npyv_cvt_b@len@_@sfx@(npyv_zero_@sfx@());
const int vstep = npyv_nlanes_@sfx@;
for (; len >= vstep; len -= vstep, src1 += vstep, src2 += vstep,
dst1 += vstep) {
npyv_@sfx@ a = npyv_load_@sfx@(src1);
npyv_@sfx@ b = npyv_load_@sfx@(src2);
npyv_@sfx@ quo = vsx4_div_@sfx@(a, b);
npyv_@sfx@ rem = npyv_sub_@sfx@(a, vec_mul(b, quo));
// (b == 0 || (a == NPY_MIN_INT@len@ && b == -1))
npyv_b@len@ bzero = npyv_cmpeq_@sfx@(b, vzero);
npyv_b@len@ amin = npyv_cmpeq_@sfx@(a, vmin);
npyv_b@len@ bneg_one = npyv_cmpeq_@sfx@(b, vneg_one);
npyv_b@len@ overflow = npyv_and_@sfx@(bneg_one, amin);
warn_zero = npyv_or_@sfx@(bzero, warn_zero);
warn_overflow = npyv_or_@sfx@(overflow, warn_overflow);
// handle mixed case the way Python does
// ((a > 0) == (b > 0) || rem == 0)
npyv_b@len@ a_gt_zero = npyv_cmpgt_@sfx@(a, vzero);
npyv_b@len@ b_gt_zero = npyv_cmpgt_@sfx@(b, vzero);
npyv_b@len@ ab_eq_cond = npyv_cmpeq_@sfx@(a_gt_zero, b_gt_zero);
npyv_b@len@ rem_zero = npyv_cmpeq_@sfx@(rem, vzero);
npyv_b@len@ or = npyv_or_@sfx@(ab_eq_cond, rem_zero);
npyv_@sfx@ to_sub = npyv_select_@sfx@(or, vzero, vneg_one);
quo = npyv_add_@sfx@(quo, to_sub);
// Divide by zero
quo = npyv_select_@sfx@(bzero, vzero, quo);
// Overflow
quo = npyv_select_@sfx@(overflow, vmin, quo);
npyv_store_@sfx@(dst1, quo);
}
if (!vec_all_eq(warn_zero, vzero)) {
npy_set_floatstatus_divbyzero();
}
if (!vec_all_eq(warn_overflow, vzero)) {
npy_set_floatstatus_overflow();
}
for (; len > 0; --len, ++src1, ++src2, ++dst1) {
const npyv_lanetype_@sfx@ a = *src1;
const npyv_lanetype_@sfx@ b = *src2;
if (NPY_UNLIKELY(b == 0)) {
npy_set_floatstatus_divbyzero();
*dst1 = 0;
} else if (NPY_UNLIKELY((a == NPY_MIN_INT@len@) && (b == -1))) {
npy_set_floatstatus_overflow();
*dst1 = NPY_MIN_INT@len@;
} else {
*dst1 = a / b;
if (((a > 0) != (b > 0)) && ((*dst1 * b) != a)) {
*dst1 -= 1;
}
}
}
npyv_cleanup();
}

@abhishek-iitmadras
Contributor Author

Is this patch still WIP or ready for review?

I am currently trying to modify the code to ensure all CI tests pass. At present, I am facing a challenge in obtaining a Windows machine with the MSVC compiler.

@r-devulap r-devulap added the HWY features related to google Highway label Oct 7, 2024
    memcpy(dst, src, len * sizeof(T));
} else if (scalar == static_cast<T>(-1)) {
    const auto vec_min_val = Set(d, std::numeric_limits<T>::min());
    for (npy_intp i = 0; i < len; i += N) {
Contributor

I strongly recommend splitting up the loops into a main body and a tail. The LoadN/StoreN functions are quite expensive, so in the main loop they should be replaced with LoadU/StoreU.

Contributor Author

Ack, done.
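
For readers following along, a hedged sketch of the split being discussed (assumed helper name and simplified signature, not the PR's actual code; it assumes a non-zero scalar divisor and a Highway version where hn::Div is defined for the element type):

```cpp
#include <cstddef>
#include <hwy/highway.h>

namespace hn = hwy::HWY_NAMESPACE;

// Main body uses full-width LoadU/StoreU; only the final partial vector pays
// for LoadN/StoreN, which are comparatively expensive.
template <typename T>
void DivideByScalar(const T* src, T* dst, size_t len, T scalar) {
    const hn::ScalableTag<T> d;
    const size_t N = hn::Lanes(d);
    const auto vec_scalar = hn::Set(d, scalar);   // assumed non-zero
    size_t i = 0;
    for (; i + N <= len; i += N) {                // main body: whole vectors
        const auto v = hn::LoadU(d, src + i);
        hn::StoreU(hn::Div(v, vec_scalar), d, dst + i);
    }
    if (i < len) {                                // tail: one partial vector
        const size_t remaining = len - i;
        const auto v = hn::LoadN(d, src + i, remaining);  // unloaded lanes are zero
        hn::StoreN(hn::Div(v, vec_scalar), d, dst + i, remaining);
    }
}
```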

@jan-wassenberg
Contributor

I am currently trying to modify the code to ensure all CI tests pass. At present, I am facing a challenge in obtaining a Windows machine with the MSVC compiler.

Might it help to run MSVC under an emulator (WINE)?

Comment on lines 56 to 64

+    return (void *)(*(volatile uint64_t *)obj);
 #elif defined(_M_ARM64)
-    return (uint64_t)__ldar64((unsigned __int64 volatile *)obj);
+    return (void *)(uint64_t)__ldar64((unsigned __int64 volatile *)obj);
 #endif
 #else
 #if defined(_M_X64) || defined(_M_IX86)
-    return *(volatile uint32_t *)obj;
+    return (void *)(*(volatile uint32_t *)obj);
 #elif defined(_M_ARM64)
-    return (uint32_t)__ldar32((unsigned __int32 volatile *)obj);
+    return (void *)(uint32_t)__ldar32((unsigned __int32 volatile *)obj);
Contributor Author

This Windows fix has already been merged, causing a conflict here.

@seiko2plus
Member

@abhishek-iitmadras,

No performance regressions.

It's surprising to hear there are no performance regressions, especially on architectures like x86. As far as I understand, Highway will fall back to the FPU unit to handle integer vector division on architectures that don't provide native instructions for it. This fallback could potentially introduce a performance hit.

The C code, on the other hand, relies on division by invariant integers using multiplication, as mentioned at the beginning of the C source file:

* Floor division of signed is based on T. Granlund and P. L. Montgomery
* "Division by invariant integers using multiplication(see [Figure 6.1]
* https://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.1.2556)"

For further clarification, please take a look at the following sources:

/**********************************************************************************
** Integer division
**********************************************************************************
* Almost all architecture (except Power10) doesn't support integer vector division,
* also the cost of scalar division in architectures like x86 is too high it can take
* 30 to 40 cycles on modern chips and up to 100 on old ones.
*
* Therefore we are using division by multiplying with precomputed reciprocal technique,
* the method that been used in this implementation is based on T. Granlund and P. L. Montgomery
* “Division by invariant integers using multiplication(see [Figure 4.1, 5.1]
* https://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.1.2556)
*
* It shows a good impact for all architectures especially on X86,
* however computing divisor parameters is kind of expensive so this implementation
* should only works when divisor is a scalar and used multiple of times.
*
* The division process is separated into two intrinsics for each data type
*
* 1- npyv_{dtype}x3 npyv_divisor_{dtype} ({dtype} divisor);
* For computing the divisor parameters (multiplier + shifters + sign of divisor(signed only))
*
* 2- npyv_{dtype} npyv_divisor_{dtype} (npyv_{dtype} dividend, npyv_{dtype}x3 divisor_parms);
* For performing the final division.
*
** For example:
* int vstep = npyv_nlanes_s32; // number of lanes
* int x = 0x6e70;
* npyv_s32x3 divisor = npyv_divisor_s32(x); // init divisor params
* for (; len >= vstep; src += vstep, dst += vstep, len -= vstep) {
* npyv_s32 a = npyv_load_s32(*src); // load s32 vector from memory
* a = npyv_divc_s32(a, divisor); // divide all elements by x
* npyv_store_s32(dst, a); // store s32 vector into memory
* }
*
** NOTES:
* - For 64-bit division on Aarch64 and IBM/Power, we fall-back to the scalar division
* since emulating multiply-high is expensive and both architectures have very fast dividers.
*
***************************************************************
** Figure 4.1: Unsigned division by run–time invariant divisor
***************************************************************
* Initialization (given uword d with 1 ≤ d < 2^N):
* int l = ceil(log2(d));
* uword m = 2^N * (2^l− d) / d + 1;
* int sh1 = min(l, 1);
* int sh2 = max(l − 1, 0);
*
* For q = FLOOR(a/d), all uword:
* uword t1 = MULUH(m, a);
* q = SRL(t1 + SRL(a − t1, sh1), sh2);
*
************************************************************************************
** Figure 5.1: Signed division by run–time invariant divisor, rounded towards zero
************************************************************************************
* Initialization (given constant sword d with d !=0):
* int l = max(ceil(log2(abs(d))), 1);
* udword m0 = 1 + (2^(N+l-1)) / abs(d);
* sword m = m0 − 2^N;
* sword dsign = XSIGN(d);
* int sh = l − 1;
*
* For q = TRUNC(a/d), all sword:
* sword q0 = a + MULSH(m, a);
* q0 = SRA(q0, sh) − XSIGN(a);
* q = EOR(q0, dsign) − dsign;
*/
/**
* bit-scan reverse for non-zeros. returns the index of the highest set bit.
* equivalent to floor(log2(a))

/***************************
* Integer Division
***************************/
// See simd/intdiv.h for more clarification
// divide each unsigned 8-bit element by a precomputed divisor
NPY_FINLINE npyv_u8 npyv_divc_u8(npyv_u8 a, const npyv_u8x3 divisor)
{
const __m128i bmask = _mm_set1_epi32(0x00FF00FF);
const __m128i shf1b = _mm_set1_epi8(0xFFU >> _mm_cvtsi128_si32(divisor.val[1]));
const __m128i shf2b = _mm_set1_epi8(0xFFU >> _mm_cvtsi128_si32(divisor.val[2]));
// high part of unsigned multiplication
__m128i mulhi_even = _mm_mullo_epi16(_mm_and_si128(a, bmask), divisor.val[0]);
__m128i mulhi_odd = _mm_mullo_epi16(_mm_srli_epi16(a, 8), divisor.val[0]);
mulhi_even = _mm_srli_epi16(mulhi_even, 8);
__m128i mulhi = npyv_select_u8(bmask, mulhi_even, mulhi_odd);
// floor(a/d) = (mulhi + ((a-mulhi) >> sh1)) >> sh2
__m128i q = _mm_sub_epi8(a, mulhi);
q = _mm_and_si128(_mm_srl_epi16(q, divisor.val[1]), shf1b);
q = _mm_add_epi8(mulhi, q);
q = _mm_and_si128(_mm_srl_epi16(q, divisor.val[2]), shf2b);
return q;
}
// divide each signed 8-bit element by a precomputed divisor (round towards zero)
NPY_FINLINE npyv_s16 npyv_divc_s16(npyv_s16 a, const npyv_s16x3 divisor);
NPY_FINLINE npyv_s8 npyv_divc_s8(npyv_s8 a, const npyv_s8x3 divisor)
{
const __m128i bmask = _mm_set1_epi32(0x00FF00FF);
// instead of _mm_cvtepi8_epi16/_mm_packs_epi16 to wrap around overflow
__m128i divc_even = npyv_divc_s16(_mm_srai_epi16(_mm_slli_epi16(a, 8), 8), divisor);
__m128i divc_odd = npyv_divc_s16(_mm_srai_epi16(a, 8), divisor);
divc_odd = _mm_slli_epi16(divc_odd, 8);
return npyv_select_u8(bmask, divc_even, divc_odd);
}
// divide each unsigned 16-bit element by a precomputed divisor
NPY_FINLINE npyv_u16 npyv_divc_u16(npyv_u16 a, const npyv_u16x3 divisor)
{
// high part of unsigned multiplication
__m128i mulhi = _mm_mulhi_epu16(a, divisor.val[0]);
// floor(a/d) = (mulhi + ((a-mulhi) >> sh1)) >> sh2
__m128i q = _mm_sub_epi16(a, mulhi);
q = _mm_srl_epi16(q, divisor.val[1]);
q = _mm_add_epi16(mulhi, q);
q = _mm_srl_epi16(q, divisor.val[2]);
return q;
}
// divide each signed 16-bit element by a precomputed divisor (round towards zero)
NPY_FINLINE npyv_s16 npyv_divc_s16(npyv_s16 a, const npyv_s16x3 divisor)
{
// high part of signed multiplication
__m128i mulhi = _mm_mulhi_epi16(a, divisor.val[0]);
// q = ((a + mulhi) >> sh1) - XSIGN(a)
// trunc(a/d) = (q ^ dsign) - dsign
__m128i q = _mm_sra_epi16(_mm_add_epi16(a, mulhi), divisor.val[1]);
q = _mm_sub_epi16(q, _mm_srai_epi16(a, 15));
q = _mm_sub_epi16(_mm_xor_si128(q, divisor.val[2]), divisor.val[2]);
return q;
}
// divide each unsigned 32-bit element by a precomputed divisor
NPY_FINLINE npyv_u32 npyv_divc_u32(npyv_u32 a, const npyv_u32x3 divisor)
{
// high part of unsigned multiplication
__m128i mulhi_even = _mm_srli_epi64(_mm_mul_epu32(a, divisor.val[0]), 32);
__m128i mulhi_odd = _mm_mul_epu32(_mm_srli_epi64(a, 32), divisor.val[0]);
#ifdef NPY_HAVE_SSE41
__m128i mulhi = _mm_blend_epi16(mulhi_even, mulhi_odd, 0xCC);
#else
__m128i mask_13 = _mm_setr_epi32(0, -1, 0, -1);
mulhi_odd = _mm_and_si128(mulhi_odd, mask_13);
__m128i mulhi = _mm_or_si128(mulhi_even, mulhi_odd);
#endif
// floor(a/d) = (mulhi + ((a-mulhi) >> sh1)) >> sh2
__m128i q = _mm_sub_epi32(a, mulhi);
q = _mm_srl_epi32(q, divisor.val[1]);
q = _mm_add_epi32(mulhi, q);
q = _mm_srl_epi32(q, divisor.val[2]);
return q;
}
// divide each signed 32-bit element by a precomputed divisor (round towards zero)
NPY_FINLINE npyv_s32 npyv_divc_s32(npyv_s32 a, const npyv_s32x3 divisor)
{
__m128i asign = _mm_srai_epi32(a, 31);
#ifdef NPY_HAVE_SSE41
// high part of signed multiplication
__m128i mulhi_even = _mm_srli_epi64(_mm_mul_epi32(a, divisor.val[0]), 32);
__m128i mulhi_odd = _mm_mul_epi32(_mm_srli_epi64(a, 32), divisor.val[0]);
__m128i mulhi = _mm_blend_epi16(mulhi_even, mulhi_odd, 0xCC);
#else // not SSE4.1
// high part of "unsigned" multiplication
__m128i mulhi_even = _mm_srli_epi64(_mm_mul_epu32(a, divisor.val[0]), 32);
__m128i mulhi_odd = _mm_mul_epu32(_mm_srli_epi64(a, 32), divisor.val[0]);
__m128i mask_13 = _mm_setr_epi32(0, -1, 0, -1);
mulhi_odd = _mm_and_si128(mulhi_odd, mask_13);
__m128i mulhi = _mm_or_si128(mulhi_even, mulhi_odd);
// convert unsigned to signed high multiplication
// mulhi - ((a < 0) ? m : 0) - ((m < 0) ? a : 0);
const __m128i msign= _mm_srai_epi32(divisor.val[0], 31);
__m128i m_asign = _mm_and_si128(divisor.val[0], asign);
__m128i a_msign = _mm_and_si128(a, msign);
mulhi = _mm_sub_epi32(mulhi, m_asign);
mulhi = _mm_sub_epi32(mulhi, a_msign);
#endif
// q = ((a + mulhi) >> sh1) - XSIGN(a)
// trunc(a/d) = (q ^ dsign) - dsign
__m128i q = _mm_sra_epi32(_mm_add_epi32(a, mulhi), divisor.val[1]);
q = _mm_sub_epi32(q, asign);
q = _mm_sub_epi32(_mm_xor_si128(q, divisor.val[2]), divisor.val[2]);
return q;
}
// returns the high 64 bits of unsigned 64-bit multiplication
// xref https://stackoverflow.com/a/28827013
NPY_FINLINE npyv_u64 npyv__mullhi_u64(npyv_u64 a, npyv_u64 b)
{
__m128i lomask = npyv_setall_s64(0xffffffff);
__m128i a_hi = _mm_srli_epi64(a, 32); // a0l, a0h, a1l, a1h
__m128i b_hi = _mm_srli_epi64(b, 32); // b0l, b0h, b1l, b1h
// compute partial products
__m128i w0 = _mm_mul_epu32(a, b); // a0l*b0l, a1l*b1l
__m128i w1 = _mm_mul_epu32(a, b_hi); // a0l*b0h, a1l*b1h
__m128i w2 = _mm_mul_epu32(a_hi, b); // a0h*b0l, a1h*b0l
__m128i w3 = _mm_mul_epu32(a_hi, b_hi); // a0h*b0h, a1h*b1h
// sum partial products
__m128i w0h = _mm_srli_epi64(w0, 32);
__m128i s1 = _mm_add_epi64(w1, w0h);
__m128i s1l = _mm_and_si128(s1, lomask);
__m128i s1h = _mm_srli_epi64(s1, 32);
__m128i s2 = _mm_add_epi64(w2, s1l);
__m128i s2h = _mm_srli_epi64(s2, 32);
__m128i hi = _mm_add_epi64(w3, s1h);
hi = _mm_add_epi64(hi, s2h);
return hi;
}
// divide each unsigned 64-bit element by a precomputed divisor
NPY_FINLINE npyv_u64 npyv_divc_u64(npyv_u64 a, const npyv_u64x3 divisor)
{
// high part of unsigned multiplication
__m128i mulhi = npyv__mullhi_u64(a, divisor.val[0]);
// floor(a/d) = (mulhi + ((a-mulhi) >> sh1)) >> sh2
__m128i q = _mm_sub_epi64(a, mulhi);
q = _mm_srl_epi64(q, divisor.val[1]);
q = _mm_add_epi64(mulhi, q);
q = _mm_srl_epi64(q, divisor.val[2]);
return q;
}
// divide each signed 64-bit element by a precomputed divisor (round towards zero)
NPY_FINLINE npyv_s64 npyv_divc_s64(npyv_s64 a, const npyv_s64x3 divisor)
{
// high part of unsigned multiplication
__m128i mulhi = npyv__mullhi_u64(a, divisor.val[0]);
// convert unsigned to signed high multiplication
// mulhi - ((a < 0) ? m : 0) - ((m < 0) ? a : 0);
#ifdef NPY_HAVE_SSE42
const __m128i msign= _mm_cmpgt_epi64(_mm_setzero_si128(), divisor.val[0]);
__m128i asign = _mm_cmpgt_epi64(_mm_setzero_si128(), a);
#else
const __m128i msign= _mm_srai_epi32(_mm_shuffle_epi32(divisor.val[0], _MM_SHUFFLE(3, 3, 1, 1)), 31);
__m128i asign = _mm_srai_epi32(_mm_shuffle_epi32(a, _MM_SHUFFLE(3, 3, 1, 1)), 31);
#endif
__m128i m_asign = _mm_and_si128(divisor.val[0], asign);
__m128i a_msign = _mm_and_si128(a, msign);
mulhi = _mm_sub_epi64(mulhi, m_asign);
mulhi = _mm_sub_epi64(mulhi, a_msign);
// q = (a + mulhi) >> sh
__m128i q = _mm_add_epi64(a, mulhi);
// emulate arithmetic right shift
const __m128i sigb = npyv_setall_s64(1LL << 63);
q = _mm_srl_epi64(_mm_add_epi64(q, sigb), divisor.val[1]);
q = _mm_sub_epi64(q, _mm_srl_epi64(sigb, divisor.val[1]));
// q = q - XSIGN(a)
// trunc(a/d) = (q ^ dsign) - dsign
q = _mm_sub_epi64(q, asign);
q = _mm_sub_epi64(_mm_xor_si128(q, divisor.val[2]), divisor.val[2]);
return q;
}
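
To make Figure 4.1 above concrete, here is a hedged scalar sketch of the same unsigned path for 32-bit words (illustrative helper names; the intrinsics above apply the same two-step scheme per vector lane):

```cpp
#include <cassert>
#include <cstdint>

struct DivParamsU32 { uint32_t m; int sh1; int sh2; };

// Initialization, per Figure 4.1: m = 2^32 * (2^l - d) / d + 1, l = ceil(log2(d)).
DivParamsU32 PrecomputeU32(uint32_t d) {
    assert(d != 0);
    int l = 0;
    while ((uint64_t{1} << l) < d) ++l;                        // l = ceil(log2(d))
    const uint64_t m = (((uint64_t{1} << l) - d) << 32) / d + 1;
    return { static_cast<uint32_t>(m), l < 1 ? l : 1, l > 1 ? l - 1 : 0 };
}

// q = floor(a / d) without a divide: one high multiply plus two shifts.
uint32_t DivideU32(uint32_t a, const DivParamsU32& p) {
    const uint32_t t1 = static_cast<uint32_t>((uint64_t{p.m} * a) >> 32);  // MULUH(m, a)
    return (t1 + ((a - t1) >> p.sh1)) >> p.sh2;                // SRL(t1 + SRL(a - t1, sh1), sh2)
}
```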

@jan-wassenberg
Contributor

hm, are we sure floating-point div instructions are actually used? I also have seen clang transform div into mul in a similar way.

@r-devulap
Member

@abhishek-iitmadras could you rebase onto main to fix the QEMU CI failures? Apart from that, you still seem to have a few more CI failures:

/run_32_bit_linux_docker.sh: line 14:  2250 Aborted                 (core dumped) python3 -m pytest --pyargs numpy

@abhishek-iitmadras
Contributor Author

abhishek-iitmadras commented Jan 2, 2025

Out of all 9 CI failures, I am focusing on the one associated with the Pyodide build, as it is the only one that provides a clear reason for the failure.
Furthermore, there are currently 8 test failures in the Pyodide build (see the GitHub Actions run here: link ).
These failures do not occur on the aarch64, x86, or macOS builds and tests.

I would appreciate any insights into why these failures appear only in the Pyodide build, as well as any suggestions on how to resolve them.
Thanks

@abhishek-iitmadras abhishek-iitmadras force-pushed the HWY2 branch 2 times, most recently from dcf8087 to 364d295 on January 11, 2025 18:12
@jan-wassenberg
Contributor

For anyone interested, here are some techniques for speeding up division by non-constant: http://0x80.pl/notesen/2024-12-21-uint8-division.html
One thing that surprised me is that approximate reciprocal can be made to work by an additional multiply.
There are also integer-only techniques.

@abhishek-iitmadras
Contributor Author

abhishek-iitmadras commented Feb 10, 2025

Hi @seiko2plus

I am extremely sorry for the huge delay in replying to you regarding benchmarking performance on AVX512 and AVX2, as I was busy with some other framework optimization.

Below are the detailed performance measurements:
AVX512


| Change   | Before [06f987b7] <main>   | After [58d3c8e1] <HWY2>   |   Ratio | Benchmark (Parameter)                                                                         |
|----------|----------------------------|---------------------------|---------|-----------------------------------------------------------------------------------------------|
| +        | 3.35±0.4μs                 | 7.21±0.1μs                |    2.15 | bench_ufunc.CustomScalarFloorDivideInt.time_floor_divide_int(<class 'numpy.int32'>, 8)        |
| +        | 3.36±0.2μs                 | 7.19±0.5μs                |    2.14 | bench_ufunc.CustomScalarFloorDivideInt.time_floor_divide_int(<class 'numpy.int32'>, -43)      |
| +        | 3.36±0.2μs                 | 7.17±0.01μs               |    2.14 | bench_ufunc.CustomScalarFloorDivideInt.time_floor_divide_int(<class 'numpy.int32'>, -8)       |
| +        | 1.36±0μs                   | 2.90±0.01μs               |    2.14 | bench_ufunc.CustomScalarFloorDivideUInt.time_floor_divide_uint(<class 'numpy.uint8'>, 43)     |
| +        | 3.35±0.2μs                 | 7.15±0.01μs               |    2.13 | bench_ufunc.CustomScalarFloorDivideInt.time_floor_divide_int(<class 'numpy.int32'>, 43)       |
| +        | 1.36±0.01μs                | 2.90±0.02μs               |    2.13 | bench_ufunc.CustomScalarFloorDivideUInt.time_floor_divide_uint(<class 'numpy.uint8'>, 8)      |
| +        | 4.09±0.9μs                 | 7.39±0.3μs                |    1.81 | bench_ufunc.CustomScalarFloorDivideUInt.time_floor_divide_uint(<class 'numpy.uint32'>, 43)    |
| +        | 1.70±0μs                   | 2.99±0.01μs               |    1.76 | bench_ufunc.CustomScalarFloorDivideInt.time_floor_divide_int(<class 'numpy.int8'>, 43)        |
| +        | 1.70±0.01μs                | 3.00±0.01μs               |    1.76 | bench_ufunc.CustomScalarFloorDivideInt.time_floor_divide_int(<class 'numpy.int8'>, 8)         |
| +        | 1.71±0.04μs                | 2.99±0.04μs               |    1.75 | bench_ufunc.CustomScalarFloorDivideInt.time_floor_divide_int(<class 'numpy.int8'>, -43)       |
| +        | 1.71±0.01μs                | 2.99±0.01μs               |    1.75 | bench_ufunc.CustomScalarFloorDivideInt.time_floor_divide_int(<class 'numpy.int8'>, -8)        |
| +        | 4.09±0.9μs                 | 7.09±0.1μs                |    1.73 | bench_ufunc.CustomScalarFloorDivideUInt.time_floor_divide_uint(<class 'numpy.uint32'>, 8)     |
| +        | 1.87±0.02μs                | 2.99±0.1μs                |    1.6  | bench_ufunc.CustomScalarFloorDivideInt.time_floor_divide_int(<class 'numpy.int16'>, -43)      |
| +        | 1.86±0.02μs                | 2.98±0.01μs               |    1.6  | bench_ufunc.CustomScalarFloorDivideInt.time_floor_divide_int(<class 'numpy.int16'>, -8)       |
| +        | 1.88±0.02μs                | 2.98±0.01μs               |    1.59 | bench_ufunc.CustomScalarFloorDivideInt.time_floor_divide_int(<class 'numpy.int16'>, 43)       |
| +        | 1.87±0.03μs                | 2.97±0.01μs               |    1.59 | bench_ufunc.CustomScalarFloorDivideInt.time_floor_divide_int(<class 'numpy.int16'>, 8)        |
| +        | 1.87±0.1μs                 | 2.91±0μs                  |    1.56 | bench_ufunc.CustomScalarFloorDivideUInt.time_floor_divide_uint(<class 'numpy.uint16'>, 43)    |
| +        | 1.87±0.1μs                 | 2.91±0.1μs                |    1.56 | bench_ufunc.CustomScalarFloorDivideUInt.time_floor_divide_uint(<class 'numpy.uint16'>, 8)     |
| +        | 7.22±0.1μs                 | 10.2±0.2μs                |    1.42 | bench_ufunc.CustomScalarFloorDivideInt.time_floor_divide_int(<class 'numpy.int64'>, 8)        |
| +        | 7.22±0.06μs                | 10.2±0.3μs                |    1.42 | bench_ufunc.CustomScalarFloorDivideInt.time_floor_divide_int(<class 'numpy.longlong'>, -8)    |
| +        | 7.24±0.06μs                | 10.2±0.3μs                |    1.41 | bench_ufunc.CustomScalarFloorDivideInt.time_floor_divide_int(<class 'numpy.int64'>, -43)      |
| +        | 7.23±0.2μs                 | 10.2±0.3μs                |    1.41 | bench_ufunc.CustomScalarFloorDivideInt.time_floor_divide_int(<class 'numpy.int64'>, -8)       |
| +        | 7.22±0.06μs                | 10.2±0.2μs                |    1.41 | bench_ufunc.CustomScalarFloorDivideInt.time_floor_divide_int(<class 'numpy.int64'>, 43)       |
| +        | 7.22±0.06μs                | 10.2±0.3μs                |    1.41 | bench_ufunc.CustomScalarFloorDivideInt.time_floor_divide_int(<class 'numpy.longlong'>, -43)   |
| +        | 7.22±0.06μs                | 10.2±0.2μs                |    1.41 | bench_ufunc.CustomScalarFloorDivideInt.time_floor_divide_int(<class 'numpy.longlong'>, 43)    |
| +        | 7.22±0.05μs                | 10.2±0.2μs                |    1.41 | bench_ufunc.CustomScalarFloorDivideInt.time_floor_divide_int(<class 'numpy.longlong'>, 8)     |
| +        | 6.32±0.8μs                 | 7.72±0.05μs               |    1.22 | bench_ufunc.CustomScalarFloorDivideUInt.time_floor_divide_uint(<class 'numpy.uint64'>, 43)    |
| +        | 6.32±0.9μs                 | 7.74±0.1μs                |    1.22 | bench_ufunc.CustomScalarFloorDivideUInt.time_floor_divide_uint(<class 'numpy.uint64'>, 8)     |
| +        | 6.32±0.8μs                 | 7.72±0.02μs               |    1.22 | bench_ufunc.CustomScalarFloorDivideUInt.time_floor_divide_uint(<class 'numpy.ulonglong'>, 43) |
| +        | 6.32±0.8μs                 | 7.73±0.2μs                |    1.22 | bench_ufunc.CustomScalarFloorDivideUInt.time_floor_divide_uint(<class 'numpy.ulonglong'>, 8)  |

SOME BENCHMARKS HAVE CHANGED SIGNIFICANTLY.
PERFORMANCE DECREASED.

AVX2


| Change   | Before [06f987b7] <main>   | After [72a98a6c] <HWY2>   |   Ratio | Benchmark (Parameter)                                                                         |
|----------|----------------------------|---------------------------|---------|-----------------------------------------------------------------------------------------------|
| +        | 3.13±0.02μs                | 7.12±0.01μs               |    2.27 | bench_ufunc.CustomScalarFloorDivideInt.time_floor_divide_int(<class 'numpy.int32'>, -43)      |
| +        | 3.14±0.02μs                | 7.09±0μs                  |    2.26 | bench_ufunc.CustomScalarFloorDivideInt.time_floor_divide_int(<class 'numpy.int32'>, -8)       |
| +        | 3.16±0.03μs                | 7.12±0μs                  |    2.25 | bench_ufunc.CustomScalarFloorDivideInt.time_floor_divide_int(<class 'numpy.int32'>, 43)       |
| +        | 3.16±0.03μs                | 7.09±0.01μs               |    2.24 | bench_ufunc.CustomScalarFloorDivideInt.time_floor_divide_int(<class 'numpy.int32'>, 8)        |
| +        | 3.38±0.01μs                | 7.01±0μs                  |    2.08 | bench_ufunc.CustomScalarFloorDivideUInt.time_floor_divide_uint(<class 'numpy.uint32'>, 43)    |
| +        | 3.39±0.02μs                | 7.04±0.01μs               |    2.08 | bench_ufunc.CustomScalarFloorDivideUInt.time_floor_divide_uint(<class 'numpy.uint32'>, 8)     |
| +        | 1.42±0.02μs                | 2.87±0.01μs               |    2.02 | bench_ufunc.CustomScalarFloorDivideUInt.time_floor_divide_uint(<class 'numpy.uint8'>, 43)     |
| +        | 1.45±0.03μs                | 2.87±0.01μs               |    1.98 | bench_ufunc.CustomScalarFloorDivideUInt.time_floor_divide_uint(<class 'numpy.uint8'>, 8)      |
| +        | 1.74±0.02μs                | 2.95±0.01μs               |    1.69 | bench_ufunc.CustomScalarFloorDivideInt.time_floor_divide_int(<class 'numpy.int8'>, -43)       |
| +        | 1.74±0.02μs                | 2.95±0.01μs               |    1.69 | bench_ufunc.CustomScalarFloorDivideInt.time_floor_divide_int(<class 'numpy.int8'>, -8)        |
| +        | 1.75±0.03μs                | 2.95±0.01μs               |    1.69 | bench_ufunc.CustomScalarFloorDivideInt.time_floor_divide_int(<class 'numpy.int8'>, 43)        |
| +        | 1.74±0.03μs                | 2.95±0.01μs               |    1.69 | bench_ufunc.CustomScalarFloorDivideInt.time_floor_divide_int(<class 'numpy.int8'>, 8)         |
| +        | 1.96±0.03μs                | 2.97±0.01μs               |    1.52 | bench_ufunc.CustomScalarFloorDivideInt.time_floor_divide_int(<class 'numpy.int16'>, 43)       |
| +        | 1.95±0.02μs                | 2.96±0.01μs               |    1.51 | bench_ufunc.CustomScalarFloorDivideInt.time_floor_divide_int(<class 'numpy.int16'>, -43)      |
| +        | 1.95±0.03μs                | 2.96±0.01μs               |    1.51 | bench_ufunc.CustomScalarFloorDivideInt.time_floor_divide_int(<class 'numpy.int16'>, -8)       |
| +        | 1.96±0.02μs                | 2.95±0.01μs               |    1.51 | bench_ufunc.CustomScalarFloorDivideInt.time_floor_divide_int(<class 'numpy.int16'>, 8)        |
| +        | 7.79±0.3μs                 | 11.5±1μs                  |    1.48 | bench_ufunc.CustomScalarFloorDivideInt.time_floor_divide_int(<class 'numpy.int64'>, -43)      |
| +        | 7.79±0.3μs                 | 11.5±0.6μs                |    1.48 | bench_ufunc.CustomScalarFloorDivideInt.time_floor_divide_int(<class 'numpy.int64'>, -8)       |
| +        | 7.79±0.3μs                 | 11.5±0.6μs                |    1.48 | bench_ufunc.CustomScalarFloorDivideInt.time_floor_divide_int(<class 'numpy.longlong'>, -43)   |
| +        | 7.80±0.3μs                 | 11.5±0.6μs                |    1.47 | bench_ufunc.CustomScalarFloorDivideInt.time_floor_divide_int(<class 'numpy.longlong'>, -8)    |
| +        | 7.92±0.3μs                 | 11.5±0.6μs                |    1.45 | bench_ufunc.CustomScalarFloorDivideInt.time_floor_divide_int(<class 'numpy.int64'>, 43)       |
| +        | 7.91±0.3μs                 | 11.5±0.6μs                |    1.45 | bench_ufunc.CustomScalarFloorDivideInt.time_floor_divide_int(<class 'numpy.int64'>, 8)        |
| +        | 7.96±0.9μs                 | 11.5±0.6μs                |    1.45 | bench_ufunc.CustomScalarFloorDivideInt.time_floor_divide_int(<class 'numpy.longlong'>, 43)    |
| +        | 7.91±0.3μs                 | 11.5±0.6μs                |    1.45 | bench_ufunc.CustomScalarFloorDivideInt.time_floor_divide_int(<class 'numpy.longlong'>, 8)     |
| +        | 2.01±0.07μs                | 2.88±0.01μs               |    1.43 | bench_ufunc.CustomScalarFloorDivideUInt.time_floor_divide_uint(<class 'numpy.uint16'>, 43)    |
| +        | 2.02±0.06μs                | 2.88±0.02μs               |    1.43 | bench_ufunc.CustomScalarFloorDivideUInt.time_floor_divide_uint(<class 'numpy.uint16'>, 8)     |
| +        | 7.39±0.3μs                 | 8.91±0.6μs                |    1.21 | bench_ufunc.CustomScalarFloorDivideUInt.time_floor_divide_uint(<class 'numpy.uint64'>, 8)     |
| +        | 7.38±0.3μs                 | 8.91±0.6μs                |    1.21 | bench_ufunc.CustomScalarFloorDivideUInt.time_floor_divide_uint(<class 'numpy.ulonglong'>, 8)  |
| +        | 7.39±0.3μs                 | 8.89±0.6μs                |    1.2  | bench_ufunc.CustomScalarFloorDivideUInt.time_floor_divide_uint(<class 'numpy.uint64'>, 43)    |
| +        | 7.38±0.3μs                 | 8.89±0.6μs                |    1.2  | bench_ufunc.CustomScalarFloorDivideUInt.time_floor_divide_uint(<class 'numpy.ulonglong'>, 43) |

SOME BENCHMARKS HAVE CHANGED SIGNIFICANTLY.
PERFORMANCE DECREASED.

I will update the commit message accordingly.

@mattip
Member

mattip commented Feb 11, 2025

So the benchmarks got significantly slower?

@abhishek-iitmadras
Contributor Author

So the benchmarks got significantly slower?

Yes. After addressing these 9 CI failures, I will proceed with optimization.

@abhishek-iitmadras abhishek-iitmadras force-pushed the HWY2 branch 2 times, most recently from cc9e074 to 795cc9b on February 19, 2025 05:34
@seiko2plus
Member

@abhishek-iitmadras,

I am extremely sorry for the huge delay in replying to you regarding benchmarking performance on AVX512 and AVX2, as I was busy with some other framework optimization.
Below are the detailed performance measurements:

In my opinion, some axioms typically don’t require explicit testing. My intention was to guide you toward a more efficient approach to save time and effort. However, these benchmark results don’t seem accurate to me. The AVX2 and AVX512 benchmarks appear equivalent. Are you certain that AVX512 was correctly disabled before running the AVX2 benchmark? Also, have you tried increasing the array length used in the test to improve stability? see:
https://github.com/numpy/numpy/blob/main/benchmarks/benchmarks/bench_ufunc.py#L454

Yes. After addressing these 9 CI failures, I will proceed with optimization.

What kind of optimization are you planning to perform? Could you elaborate?

@abhishek-iitmadras
Contributor Author

abhishek-iitmadras commented Mar 3, 2025

Are you certain that AVX512 was correctly disabled before running the AVX2 benchmark?

I used the following method to build for AVX2 only:

1. git clone https://github.com/abhishek-iitmadras/numpy.git
2. cd numpy
3. git checkout HWY2
4. git submodule update --init
5. python -m pip install -r requirements/all_requirements.txt
6. python -m build --wheel \
   -Csetup-args=-Dcpu-dispatch="max \
                                -avx512f  \
                                -avx512cd \
                                -avx512_knl \
                                -avx512_knm \
                                -avx512_skx \
                                -avx512_clx \
                                -avx512_cnl \
                                -avx512_icl \
                                -avx512_spr"
7.  pip install dist/numpy****.whl

Below is a screenshot of the build configuration:
[screenshot: build configuration output showing the AVX512 features disabled]

Now for benchmarking:

1. pip install asv
2. pip install virtualenv
3. spin bench -c main -t bench_ufunc.CustomScalarFloorDivide

Correct me if I am doing something wrong. @seiko2plus

Also, have you tried increasing the array length used in the test to improve stability? see: https://github.com/numpy/numpy/blob/main/benchmarks/benchmarks/bench_ufunc.py#L454

No

What kind of optimization are you planning to perform? Could you elaborate?

  1. Replace the Div operation with a Mul by a precomputed reciprocal.
  2. Replace IfThenElse with MaskedSubOr, as per masked arithmetic.
  3. Try to understand the various suggestions from @jan-wassenberg and replicate them.

I have already tried points 1 and 2; the build is OK afterwards, but a few tests fail even on my aarch64 machine. Maybe I need to understand this better.

@r-devulap r-devulap self-assigned this Mar 3, 2025
@abhishek-iitmadras abhishek-iitmadras force-pushed the HWY2 branch 5 times, most recently from c3ecde7 to 4ac89ba on April 14, 2025 17:04
@abhishek-iitmadras abhishek-iitmadras force-pushed the HWY2 branch 5 times, most recently from 5fdd529 to 8699aa9 on May 5, 2025 11:54
const auto different_signs = hn::Xor(src_sign, scalar_sign);

auto adjustment = hn::And(different_signs, has_remainder);
vec_div = hn::IfThenElse(adjustment, hn::Sub(vec_div, hn::Set(d, static_cast<T>(1))), vec_div);
Contributor

Instead of IfThenElse(adjust, Sub(div, one), div) we can use MaskedSubOr(div, adjust, div, one). This would be faster on AVX-512.

Contributor Author

Done
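
For context, a hedged sketch of that change as a small standalone helper (hypothetical helper name FloorAdjust; the PR's real code is only partially quoted above):

```cpp
#include <hwy/highway.h>

namespace hn = hwy::HWY_NAMESPACE;

// Applies the floor-division correction: subtract 1 where `adjustment` is set.
template <class D>
hn::Vec<D> FloorAdjust(D d, hn::Vec<D> vec_div, hn::Mask<D> adjustment) {
    const auto one = hn::Set(d, static_cast<hn::TFromD<D>>(1));
    // Before: hn::IfThenElse(adjustment, hn::Sub(vec_div, one), vec_div)
    // After: one masked subtract-or-pass-through, which AVX-512 can emit as a
    // single masked instruction instead of a subtract plus a blend.
    return hn::MaskedSubOr(vec_div, adjustment, vec_div, one);
}
```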

const auto has_remainder = hn::Ne(vec_src, vec_mul);
const auto src_sign = hn::Lt(vec_src, vec_zero);
const auto scalar_sign = hn::Lt(vec_scalar, vec_zero);
const auto different_signs = hn::Xor(src_sign, scalar_sign);
Contributor

Instead of Lt, Lt, Xor, we can Xor src and scalar, and then test only that one sign bit.

Contributor Author

Done
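
For context, a hedged sketch of what the suggestion amounts to for signed integer lanes (assumed helper name; names otherwise mirror the excerpt above):

```cpp
#include <hwy/highway.h>

namespace hn = hwy::HWY_NAMESPACE;

// Returns a mask that is true where src and scalar have different signs.
template <class D>
hn::Mask<D> DifferentSigns(D d, hn::Vec<D> vec_src, hn::Vec<D> vec_scalar) {
    const auto vec_zero = hn::Zero(d);
    // Before: hn::Xor(hn::Lt(vec_src, vec_zero), hn::Lt(vec_scalar, vec_zero))
    // After: the sign bit of (src ^ scalar) is set exactly when the signs
    // differ, so one Xor plus a single signed compare suffices.
    return hn::Lt(hn::Xor(vec_src, vec_scalar), vec_zero);
}
```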

const auto vec_src = hn::LoadU(d, src + i);
auto vec_div = hn::Div(vec_src, vec_scalar);
const auto vec_mul = hn::Mul(vec_div, vec_scalar);
const auto has_remainder = hn::Ne(vec_src, vec_mul);
Contributor

Instead of And(diff, Ne(src, mul)) consider AndNot(Eq(src, mul), diff) - bit faster on x86.

Contributor Author

Done
thanks @jan-wassenberg
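
Again for context, a hedged sketch of that rewrite (assumed helper name; the And-based form it replaces is the one quoted earlier in this review):

```cpp
#include <hwy/highway.h>

namespace hn = hwy::HWY_NAMESPACE;

// Mask of lanes needing the floor correction: signs differ AND the division
// left a remainder (i.e. src != div * scalar).
template <class V, class M>
M NeedsAdjustment(V vec_src, V vec_mul, M different_signs) {
    // Before: hn::And(different_signs, hn::Ne(vec_src, vec_mul))
    // After: Eq + AndNot; on x86, Ne is a compare plus a mask inversion, and
    // AndNot absorbs that inversion.
    return hn::AndNot(hn::Eq(vec_src, vec_mul), different_signs);
}
```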

@abhishek-iitmadras abhishek-iitmadras force-pushed the HWY2 branch 2 times, most recently from 218f319 to 512e512 on May 6, 2025 22:02
Labels
  component: SIMD (Issues in SIMD (fast instruction sets) code or machinery)
  HWY (features related to google Highway)
6 participants