ENH: Convert arithmetic from C universal intrinsics to C++ using Highway #27402

Open · wants to merge 3 commits into base: main

Conversation


@abhishek-iitmadras abhishek-iitmadras commented Sep 16, 2024

This patch converts numpy/_core/src/umath/loops_arithmetic.dispatch.c.src to numpy/_core/src/umath/loops_arithmetic.dispatch.cpp using Google Highway, as per the NEP 54 resolutions.

Requirements status:

  1. The build succeeds on all relevant platforms.
  2. All tests pass, especially the relevant ones: numpy/_core/tests/test_ufunc.py, numpy/_core/tests/test_umath.py, and numpy/_core/tests/test_umath_accuracy.py.
  3. Performance regressions of up to ~2.7x (see the benchmark results further down).
  4. CI passes on all relevant platforms (only tested once the OSS PR was raised).

@Mousius Mousius added the "component: SIMD" label (Issues in SIMD (fast instruction sets) code or machinery) on Sep 16, 2024
@abhishek-iitmadras abhishek-iitmadras changed the title from "Convert arithmetic from C universal intrinsics to C++ using Highway" to "ENH: Convert arithmetic from C universal intrinsics to C++ using Highway" on Sep 16, 2024
 || (defined __clang__ && __clang_major__ < 8))
-# define NPY_ALIGNOF(type) offsetof(struct {char c; type v;}, v)
+#define NPY_ALIGNOF(type) __alignof__(type)
Member

What is the motivation to change these macros? You don't seem to be using them anywhere either?

Contributor Author

[screenshot: compiler error caused by the offsetof-based NPY_ALIGNOF definition]

Hi @r-devulap,
The change to the NPY_ALIGNOF macro, replacing offsetof with __alignof__, is motivated by the need to comply with modern C++ standards and to eliminate compiler errors. The original use of offsetof with an anonymous struct is not valid in standard C++11 and later, which leads to the errors shown in the screenshot above.

On the other hand, modern compilers provide alignof, which is built in, efficient, and compatible with both GCC and Clang, ensuring adherence to the C++11 standard and simplifying the alignment calculation.

Contributor Author

@abhishek-iitmadras abhishek-iitmadras Sep 27, 2024

  1. On GCC and Clang (Linux/macOS), the __alignof__ operator is supported and returns the required alignment of a type.
  2. On MSVC, __alignof__ is not recognized, so alignof has to be used instead.

This mismatch is what makes the Windows CI, which builds with the MSVC compiler, fail.
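
For reference, a minimal sketch of one way to reconcile the two spellings (hypothetical macro name MY_ALIGNOF; the real NumPy macro is NPY_ALIGNOF and its actual definition may differ):

```cpp
// The old C trick is shown only as a comment: defining a struct inside
// offsetof() is not valid standard C++, which is what triggers the errors above.
//   #define NPY_ALIGNOF(type) offsetof(struct {char c; type v;}, v)

#if defined(_MSC_VER) && !defined(__clang__)
    // MSVC does not recognize __alignof__, so use the standard C++11 alignof.
    #define MY_ALIGNOF(type) alignof(type)
#else
    // GCC/Clang accept the __alignof__ extension.
    #define MY_ALIGNOF(type) __alignof__(type)
#endif

static_assert(MY_ALIGNOF(int) == alignof(int), "alignment sanity check");
```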

Member

The original use of offsetof with an anonymous struct is not valid in standard C++11 and later, which leads to the errors shown in the screenshot above.

Does this happen on the main branch as well? I suggest we make a separate PR for this change, since it seems unrelated to this patch.

Contributor Author

Does this happen on the main branch as well?

No

@r-devulap
Member

@abhishek-iitmadras Is this patch still WIP or ready for review? I noticed this patch is missing the vsx4-specific optimizations from the original source file:

static inline void
vsx4_simd_divide_contig_@sfx@(char **args, npy_intp len)
{
npyv_lanetype_@sfx@ *src1 = (npyv_lanetype_@sfx@ *) args[0];
npyv_lanetype_@sfx@ *src2 = (npyv_lanetype_@sfx@ *) args[1];
npyv_lanetype_@sfx@ *dst1 = (npyv_lanetype_@sfx@ *) args[2];
const npyv_@sfx@ vzero = npyv_zero_@sfx@();
const int vstep = npyv_nlanes_@sfx@;
for (; len >= vstep; len -= vstep, src1 += vstep, src2 += vstep,
dst1 += vstep) {
npyv_@sfx@ a = npyv_load_@sfx@(src1);
npyv_@sfx@ b = npyv_load_@sfx@(src2);
npyv_@sfx@ c = vsx4_div_@sfx@(a, b);
npyv_store_@sfx@(dst1, c);
if (NPY_UNLIKELY(vec_any_eq(b, vzero))) {
npy_set_floatstatus_divbyzero();
}
}
for (; len > 0; --len, ++src1, ++src2, ++dst1) {
const npyv_lanetype_@sfx@ a = *src1;
const npyv_lanetype_@sfx@ b = *src2;
if (NPY_UNLIKELY(b == 0)) {
npy_set_floatstatus_divbyzero();
*dst1 = 0;
} else{
*dst1 = a / b;
}
}
npyv_cleanup();
}
/**end repeat**/
/**begin repeat
* Signed types
* #sfx = s8, s16, s32, s64#
* #len = 8, 16, 32, 64#
*/
static inline void
vsx4_simd_divide_contig_@sfx@(char **args, npy_intp len)
{
npyv_lanetype_@sfx@ *src1 = (npyv_lanetype_@sfx@ *) args[0];
npyv_lanetype_@sfx@ *src2 = (npyv_lanetype_@sfx@ *) args[1];
npyv_lanetype_@sfx@ *dst1 = (npyv_lanetype_@sfx@ *) args[2];
const npyv_@sfx@ vneg_one = npyv_setall_@sfx@(-1);
const npyv_@sfx@ vzero = npyv_zero_@sfx@();
const npyv_@sfx@ vmin = npyv_setall_@sfx@(NPY_MIN_INT@len@);
npyv_b@len@ warn_zero = npyv_cvt_b@len@_@sfx@(npyv_zero_@sfx@());
npyv_b@len@ warn_overflow = npyv_cvt_b@len@_@sfx@(npyv_zero_@sfx@());
const int vstep = npyv_nlanes_@sfx@;
for (; len >= vstep; len -= vstep, src1 += vstep, src2 += vstep,
dst1 += vstep) {
npyv_@sfx@ a = npyv_load_@sfx@(src1);
npyv_@sfx@ b = npyv_load_@sfx@(src2);
npyv_@sfx@ quo = vsx4_div_@sfx@(a, b);
npyv_@sfx@ rem = npyv_sub_@sfx@(a, vec_mul(b, quo));
// (b == 0 || (a == NPY_MIN_INT@len@ && b == -1))
npyv_b@len@ bzero = npyv_cmpeq_@sfx@(b, vzero);
npyv_b@len@ amin = npyv_cmpeq_@sfx@(a, vmin);
npyv_b@len@ bneg_one = npyv_cmpeq_@sfx@(b, vneg_one);
npyv_b@len@ overflow = npyv_and_@sfx@(bneg_one, amin);
warn_zero = npyv_or_@sfx@(bzero, warn_zero);
warn_overflow = npyv_or_@sfx@(overflow, warn_overflow);
// handle mixed case the way Python does
// ((a > 0) == (b > 0) || rem == 0)
npyv_b@len@ a_gt_zero = npyv_cmpgt_@sfx@(a, vzero);
npyv_b@len@ b_gt_zero = npyv_cmpgt_@sfx@(b, vzero);
npyv_b@len@ ab_eq_cond = npyv_cmpeq_@sfx@(a_gt_zero, b_gt_zero);
npyv_b@len@ rem_zero = npyv_cmpeq_@sfx@(rem, vzero);
npyv_b@len@ or = npyv_or_@sfx@(ab_eq_cond, rem_zero);
npyv_@sfx@ to_sub = npyv_select_@sfx@(or, vzero, vneg_one);
quo = npyv_add_@sfx@(quo, to_sub);
// Divide by zero
quo = npyv_select_@sfx@(bzero, vzero, quo);
// Overflow
quo = npyv_select_@sfx@(overflow, vmin, quo);
npyv_store_@sfx@(dst1, quo);
}
if (!vec_all_eq(warn_zero, vzero)) {
npy_set_floatstatus_divbyzero();
}
if (!vec_all_eq(warn_overflow, vzero)) {
npy_set_floatstatus_overflow();
}
for (; len > 0; --len, ++src1, ++src2, ++dst1) {
const npyv_lanetype_@sfx@ a = *src1;
const npyv_lanetype_@sfx@ b = *src2;
if (NPY_UNLIKELY(b == 0)) {
npy_set_floatstatus_divbyzero();
*dst1 = 0;
} else if (NPY_UNLIKELY((a == NPY_MIN_INT@len@) && (b == -1))) {
npy_set_floatstatus_overflow();
*dst1 = NPY_MIN_INT@len@;
} else {
*dst1 = a / b;
if (((a > 0) != (b > 0)) && ((*dst1 * b) != a)) {
*dst1 -= 1;
}
}
}
npyv_cleanup();
}

@abhishek-iitmadras
Contributor Author

Is this patch still WIP or ready for review?

I am currently trying to modify the code to ensure all CI tests pass. At present, I am facing a challenge in obtaining a Windows machine with the MSVC compiler.

@r-devulap r-devulap added the HWY features related to google Highway label Oct 7, 2024
    memcpy(dst, src, len * sizeof(T));
} else if (scalar == static_cast<T>(-1)) {
    const auto vec_min_val = Set(d, std::numeric_limits<T>::min());
    for (npy_intp i = 0; i < len; i += N) {
Contributor

I strongly recommend splitting up the loops into a main body and a tail. The LoadN/StoreN functions are quite expensive, so in the main loop they should be replaced with LoadU/StoreU.

Contributor Author

Ack, done.
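
For readers following along, a hedged sketch of the split being discussed (assumed helper name and simplified signature, not the PR's actual code; it assumes a non-zero scalar divisor and a Highway version where hn::Div is defined for the element type):

```cpp
#include <cstddef>
#include <hwy/highway.h>

namespace hn = hwy::HWY_NAMESPACE;

// Main body uses full-width LoadU/StoreU; only the final partial vector pays
// for LoadN/StoreN, which are comparatively expensive.
template <typename T>
void DivideByScalar(const T* src, T* dst, size_t len, T scalar) {
    const hn::ScalableTag<T> d;
    const size_t N = hn::Lanes(d);
    const auto vec_scalar = hn::Set(d, scalar);   // assumed non-zero
    size_t i = 0;
    for (; i + N <= len; i += N) {                // main body: whole vectors
        const auto v = hn::LoadU(d, src + i);
        hn::StoreU(hn::Div(v, vec_scalar), d, dst + i);
    }
    if (i < len) {                                // tail: one partial vector
        const size_t remaining = len - i;
        const auto v = hn::LoadN(d, src + i, remaining);  // unloaded lanes are zero
        hn::StoreN(hn::Div(v, vec_scalar), d, dst + i, remaining);
    }
}
```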

@jan-wassenberg
Contributor

I am currently trying to modify the code to ensure all CI tests pass. At present, I am facing a challenge in obtaining a Windows machine with the MSVC compiler.

Might it help to run MSVC under an emulator (WINE)?

Comment on lines 56 to 64

+    return (void *)(*(volatile uint64_t *)obj);
 #elif defined(_M_ARM64)
-    return (uint64_t)__ldar64((unsigned __int64 volatile *)obj);
+    return (void *)(uint64_t)__ldar64((unsigned __int64 volatile *)obj);
 #endif
 #else
 #if defined(_M_X64) || defined(_M_IX86)
-    return *(volatile uint32_t *)obj;
+    return (void *)(*(volatile uint32_t *)obj);
 #elif defined(_M_ARM64)
-    return (uint32_t)__ldar32((unsigned __int32 volatile *)obj);
+    return (void *)(uint32_t)__ldar32((unsigned __int32 volatile *)obj);
Contributor Author

This Windows fix has already been merged, causing a conflict here.

@seiko2plus
Member

@abhishek-iitmadras,

No performance regressions.

It's surprising to hear there are no performance regressions, especially on architectures like x86. As far as I understand, Highway will fall back to the FPU unit to handle integer vector division on architectures that don't provide native instructions for it. This fallback could potentially introduce a performance hit.

The C code, on the other hand, relies on division by invariant integers using multiplication, as mentioned at the beginning of the C source file:

* Floor division of signed is based on T. Granlund and P. L. Montgomery
* "Division by invariant integers using multiplication(see [Figure 6.1]
* https://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.1.2556)"

For further clarification, please take a look at the following sources:

/**********************************************************************************
** Integer division
**********************************************************************************
* Almost all architecture (except Power10) doesn't support integer vector division,
* also the cost of scalar division in architectures like x86 is too high it can take
* 30 to 40 cycles on modern chips and up to 100 on old ones.
*
* Therefore we are using division by multiplying with precomputed reciprocal technique,
* the method that been used in this implementation is based on T. Granlund and P. L. Montgomery
* “Division by invariant integers using multiplication(see [Figure 4.1, 5.1]
* https://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.1.2556)
*
* It shows a good impact for all architectures especially on X86,
* however computing divisor parameters is kind of expensive so this implementation
* should only works when divisor is a scalar and used multiple of times.
*
* The division process is separated into two intrinsics for each data type
*
* 1- npyv_{dtype}x3 npyv_divisor_{dtype} ({dtype} divisor);
* For computing the divisor parameters (multiplier + shifters + sign of divisor(signed only))
*
* 2- npyv_{dtype} npyv_divisor_{dtype} (npyv_{dtype} dividend, npyv_{dtype}x3 divisor_parms);
* For performing the final division.
*
** For example:
* int vstep = npyv_nlanes_s32; // number of lanes
* int x = 0x6e70;
* npyv_s32x3 divisor = npyv_divisor_s32(x); // init divisor params
* for (; len >= vstep; src += vstep, dst += vstep, len -= vstep) {
* npyv_s32 a = npyv_load_s32(*src); // load s32 vector from memory
* a = npyv_divc_s32(a, divisor); // divide all elements by x
* npyv_store_s32(dst, a); // store s32 vector into memory
* }
*
** NOTES:
* - For 64-bit division on Aarch64 and IBM/Power, we fall-back to the scalar division
* since emulating multiply-high is expensive and both architectures have very fast dividers.
*
***************************************************************
** Figure 4.1: Unsigned division by run–time invariant divisor
***************************************************************
* Initialization (given uword d with 1 ≤ d < 2^N):
* int l = ceil(log2(d));
* uword m = 2^N * (2^l− d) / d + 1;
* int sh1 = min(l, 1);
* int sh2 = max(l − 1, 0);
*
* For q = FLOOR(a/d), all uword:
* uword t1 = MULUH(m, a);
* q = SRL(t1 + SRL(a − t1, sh1), sh2);
*
************************************************************************************
** Figure 5.1: Signed division by run–time invariant divisor, rounded towards zero
************************************************************************************
* Initialization (given constant sword d with d !=0):
* int l = max(ceil(log2(abs(d))), 1);
* udword m0 = 1 + (2^(N+l-1)) / abs(d);
* sword m = m0 − 2^N;
* sword dsign = XSIGN(d);
* int sh = l − 1;
*
* For q = TRUNC(a/d), all sword:
* sword q0 = a + MULSH(m, a);
* q0 = SRA(q0, sh) − XSIGN(a);
* q = EOR(q0, dsign) − dsign;
*/
/**
* bit-scan reverse for non-zeros. returns the index of the highest set bit.
* equivalent to floor(log2(a))

/***************************
* Integer Division
***************************/
// See simd/intdiv.h for more clarification
// divide each unsigned 8-bit element by a precomputed divisor
NPY_FINLINE npyv_u8 npyv_divc_u8(npyv_u8 a, const npyv_u8x3 divisor)
{
const __m128i bmask = _mm_set1_epi32(0x00FF00FF);
const __m128i shf1b = _mm_set1_epi8(0xFFU >> _mm_cvtsi128_si32(divisor.val[1]));
const __m128i shf2b = _mm_set1_epi8(0xFFU >> _mm_cvtsi128_si32(divisor.val[2]));
// high part of unsigned multiplication
__m128i mulhi_even = _mm_mullo_epi16(_mm_and_si128(a, bmask), divisor.val[0]);
__m128i mulhi_odd = _mm_mullo_epi16(_mm_srli_epi16(a, 8), divisor.val[0]);
mulhi_even = _mm_srli_epi16(mulhi_even, 8);
__m128i mulhi = npyv_select_u8(bmask, mulhi_even, mulhi_odd);
// floor(a/d) = (mulhi + ((a-mulhi) >> sh1)) >> sh2
__m128i q = _mm_sub_epi8(a, mulhi);
q = _mm_and_si128(_mm_srl_epi16(q, divisor.val[1]), shf1b);
q = _mm_add_epi8(mulhi, q);
q = _mm_and_si128(_mm_srl_epi16(q, divisor.val[2]), shf2b);
return q;
}
// divide each signed 8-bit element by a precomputed divisor (round towards zero)
NPY_FINLINE npyv_s16 npyv_divc_s16(npyv_s16 a, const npyv_s16x3 divisor);
NPY_FINLINE npyv_s8 npyv_divc_s8(npyv_s8 a, const npyv_s8x3 divisor)
{
const __m128i bmask = _mm_set1_epi32(0x00FF00FF);
// instead of _mm_cvtepi8_epi16/_mm_packs_epi16 to wrap around overflow
__m128i divc_even = npyv_divc_s16(_mm_srai_epi16(_mm_slli_epi16(a, 8), 8), divisor);
__m128i divc_odd = npyv_divc_s16(_mm_srai_epi16(a, 8), divisor);
divc_odd = _mm_slli_epi16(divc_odd, 8);
return npyv_select_u8(bmask, divc_even, divc_odd);
}
// divide each unsigned 16-bit element by a precomputed divisor
NPY_FINLINE npyv_u16 npyv_divc_u16(npyv_u16 a, const npyv_u16x3 divisor)
{
// high part of unsigned multiplication
__m128i mulhi = _mm_mulhi_epu16(a, divisor.val[0]);
// floor(a/d) = (mulhi + ((a-mulhi) >> sh1)) >> sh2
__m128i q = _mm_sub_epi16(a, mulhi);
q = _mm_srl_epi16(q, divisor.val[1]);
q = _mm_add_epi16(mulhi, q);
q = _mm_srl_epi16(q, divisor.val[2]);
return q;
}
// divide each signed 16-bit element by a precomputed divisor (round towards zero)
NPY_FINLINE npyv_s16 npyv_divc_s16(npyv_s16 a, const npyv_s16x3 divisor)
{
// high part of signed multiplication
__m128i mulhi = _mm_mulhi_epi16(a, divisor.val[0]);
// q = ((a + mulhi) >> sh1) - XSIGN(a)
// trunc(a/d) = (q ^ dsign) - dsign
__m128i q = _mm_sra_epi16(_mm_add_epi16(a, mulhi), divisor.val[1]);
q = _mm_sub_epi16(q, _mm_srai_epi16(a, 15));
q = _mm_sub_epi16(_mm_xor_si128(q, divisor.val[2]), divisor.val[2]);
return q;
}
// divide each unsigned 32-bit element by a precomputed divisor
NPY_FINLINE npyv_u32 npyv_divc_u32(npyv_u32 a, const npyv_u32x3 divisor)
{
// high part of unsigned multiplication
__m128i mulhi_even = _mm_srli_epi64(_mm_mul_epu32(a, divisor.val[0]), 32);
__m128i mulhi_odd = _mm_mul_epu32(_mm_srli_epi64(a, 32), divisor.val[0]);
#ifdef NPY_HAVE_SSE41
__m128i mulhi = _mm_blend_epi16(mulhi_even, mulhi_odd, 0xCC);
#else
__m128i mask_13 = _mm_setr_epi32(0, -1, 0, -1);
mulhi_odd = _mm_and_si128(mulhi_odd, mask_13);
__m128i mulhi = _mm_or_si128(mulhi_even, mulhi_odd);
#endif
// floor(a/d) = (mulhi + ((a-mulhi) >> sh1)) >> sh2
__m128i q = _mm_sub_epi32(a, mulhi);
q = _mm_srl_epi32(q, divisor.val[1]);
q = _mm_add_epi32(mulhi, q);
q = _mm_srl_epi32(q, divisor.val[2]);
return q;
}
// divide each signed 32-bit element by a precomputed divisor (round towards zero)
NPY_FINLINE npyv_s32 npyv_divc_s32(npyv_s32 a, const npyv_s32x3 divisor)
{
__m128i asign = _mm_srai_epi32(a, 31);
#ifdef NPY_HAVE_SSE41
// high part of signed multiplication
__m128i mulhi_even = _mm_srli_epi64(_mm_mul_epi32(a, divisor.val[0]), 32);
__m128i mulhi_odd = _mm_mul_epi32(_mm_srli_epi64(a, 32), divisor.val[0]);
__m128i mulhi = _mm_blend_epi16(mulhi_even, mulhi_odd, 0xCC);
#else // not SSE4.1
// high part of "unsigned" multiplication
__m128i mulhi_even = _mm_srli_epi64(_mm_mul_epu32(a, divisor.val[0]), 32);
__m128i mulhi_odd = _mm_mul_epu32(_mm_srli_epi64(a, 32), divisor.val[0]);
__m128i mask_13 = _mm_setr_epi32(0, -1, 0, -1);
mulhi_odd = _mm_and_si128(mulhi_odd, mask_13);
__m128i mulhi = _mm_or_si128(mulhi_even, mulhi_odd);
// convert unsigned to signed high multiplication
// mulhi - ((a < 0) ? m : 0) - ((m < 0) ? a : 0);
const __m128i msign= _mm_srai_epi32(divisor.val[0], 31);
__m128i m_asign = _mm_and_si128(divisor.val[0], asign);
__m128i a_msign = _mm_and_si128(a, msign);
mulhi = _mm_sub_epi32(mulhi, m_asign);
mulhi = _mm_sub_epi32(mulhi, a_msign);
#endif
// q = ((a + mulhi) >> sh1) - XSIGN(a)
// trunc(a/d) = (q ^ dsign) - dsign
__m128i q = _mm_sra_epi32(_mm_add_epi32(a, mulhi), divisor.val[1]);
q = _mm_sub_epi32(q, asign);
q = _mm_sub_epi32(_mm_xor_si128(q, divisor.val[2]), divisor.val[2]);
return q;
}
// returns the high 64 bits of unsigned 64-bit multiplication
// xref https://stackoverflow.com/a/28827013
NPY_FINLINE npyv_u64 npyv__mullhi_u64(npyv_u64 a, npyv_u64 b)
{
__m128i lomask = npyv_setall_s64(0xffffffff);
__m128i a_hi = _mm_srli_epi64(a, 32); // a0l, a0h, a1l, a1h
__m128i b_hi = _mm_srli_epi64(b, 32); // b0l, b0h, b1l, b1h
// compute partial products
__m128i w0 = _mm_mul_epu32(a, b); // a0l*b0l, a1l*b1l
__m128i w1 = _mm_mul_epu32(a, b_hi); // a0l*b0h, a1l*b1h
__m128i w2 = _mm_mul_epu32(a_hi, b); // a0h*b0l, a1h*b0l
__m128i w3 = _mm_mul_epu32(a_hi, b_hi); // a0h*b0h, a1h*b1h
// sum partial products
__m128i w0h = _mm_srli_epi64(w0, 32);
__m128i s1 = _mm_add_epi64(w1, w0h);
__m128i s1l = _mm_and_si128(s1, lomask);
__m128i s1h = _mm_srli_epi64(s1, 32);
__m128i s2 = _mm_add_epi64(w2, s1l);
__m128i s2h = _mm_srli_epi64(s2, 32);
__m128i hi = _mm_add_epi64(w3, s1h);
hi = _mm_add_epi64(hi, s2h);
return hi;
}
// divide each unsigned 64-bit element by a precomputed divisor
NPY_FINLINE npyv_u64 npyv_divc_u64(npyv_u64 a, const npyv_u64x3 divisor)
{
// high part of unsigned multiplication
__m128i mulhi = npyv__mullhi_u64(a, divisor.val[0]);
// floor(a/d) = (mulhi + ((a-mulhi) >> sh1)) >> sh2
__m128i q = _mm_sub_epi64(a, mulhi);
q = _mm_srl_epi64(q, divisor.val[1]);
q = _mm_add_epi64(mulhi, q);
q = _mm_srl_epi64(q, divisor.val[2]);
return q;
}
// divide each signed 64-bit element by a precomputed divisor (round towards zero)
NPY_FINLINE npyv_s64 npyv_divc_s64(npyv_s64 a, const npyv_s64x3 divisor)
{
// high part of unsigned multiplication
__m128i mulhi = npyv__mullhi_u64(a, divisor.val[0]);
// convert unsigned to signed high multiplication
// mulhi - ((a < 0) ? m : 0) - ((m < 0) ? a : 0);
#ifdef NPY_HAVE_SSE42
const __m128i msign= _mm_cmpgt_epi64(_mm_setzero_si128(), divisor.val[0]);
__m128i asign = _mm_cmpgt_epi64(_mm_setzero_si128(), a);
#else
const __m128i msign= _mm_srai_epi32(_mm_shuffle_epi32(divisor.val[0], _MM_SHUFFLE(3, 3, 1, 1)), 31);
__m128i asign = _mm_srai_epi32(_mm_shuffle_epi32(a, _MM_SHUFFLE(3, 3, 1, 1)), 31);
#endif
__m128i m_asign = _mm_and_si128(divisor.val[0], asign);
__m128i a_msign = _mm_and_si128(a, msign);
mulhi = _mm_sub_epi64(mulhi, m_asign);
mulhi = _mm_sub_epi64(mulhi, a_msign);
// q = (a + mulhi) >> sh
__m128i q = _mm_add_epi64(a, mulhi);
// emulate arithmetic right shift
const __m128i sigb = npyv_setall_s64(1LL << 63);
q = _mm_srl_epi64(_mm_add_epi64(q, sigb), divisor.val[1]);
q = _mm_sub_epi64(q, _mm_srl_epi64(sigb, divisor.val[1]));
// q = q - XSIGN(a)
// trunc(a/d) = (q ^ dsign) - dsign
q = _mm_sub_epi64(q, asign);
q = _mm_sub_epi64(_mm_xor_si128(q, divisor.val[2]), divisor.val[2]);
return q;
}
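
To make Figure 4.1 above concrete, here is a hedged scalar sketch of the same unsigned path for 32-bit words (illustrative helper names; the intrinsics above apply the same two-step scheme per vector lane):

```cpp
#include <cassert>
#include <cstdint>

struct DivParamsU32 { uint32_t m; int sh1; int sh2; };

// Initialization, per Figure 4.1: m = 2^32 * (2^l - d) / d + 1, l = ceil(log2(d)).
DivParamsU32 PrecomputeU32(uint32_t d) {
    assert(d != 0);
    int l = 0;
    while ((uint64_t{1} << l) < d) ++l;                        // l = ceil(log2(d))
    const uint64_t m = (((uint64_t{1} << l) - d) << 32) / d + 1;
    return { static_cast<uint32_t>(m), l < 1 ? l : 1, l > 1 ? l - 1 : 0 };
}

// q = floor(a / d) without a divide: one high multiply plus two shifts.
uint32_t DivideU32(uint32_t a, const DivParamsU32& p) {
    const uint32_t t1 = static_cast<uint32_t>((uint64_t{p.m} * a) >> 32);  // MULUH(m, a)
    return (t1 + ((a - t1) >> p.sh1)) >> p.sh2;                // SRL(t1 + SRL(a - t1, sh1), sh2)
}
```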

@jan-wassenberg
Contributor

hm, are we sure floating-point div instructions are actually used? I also have seen clang transform div into mul in a similar way.

@r-devulap
Member

@abhishek-iitmadras could you rebase onto main to fix the QEMU CI failures? Apart from that, you still seem to have a few more CI failures:

/run_32_bit_linux_docker.sh: line 14:  2250 Aborted                 (core dumped) python3 -m pytest --pyargs numpy

@abhishek-iitmadras
Contributor Author

abhishek-iitmadras commented Jan 2, 2025

Out of all 9 CI failures, I am focusing on the one associated with the Pyodide build, as it is the only one that provides a clear reason for the failure.
Furthermore, there are currently 8 test failures in the Pyodide build (see the GitHub Actions run here: link ).
These failures do not occur on the aarch64, x86, or macOS builds and tests.

I would appreciate any insights into why these failures appear only in the Pyodide build, as well as any suggestions on how to resolve them.
Thanks

@abhishek-iitmadras abhishek-iitmadras force-pushed the HWY2 branch 2 times, most recently from dcf8087 to 364d295 on January 11, 2025 18:12
@jan-wassenberg
Contributor

For anyone interested, here are some techniques for speeding up division by non-constant: http://0x80.pl/notesen/2024-12-21-uint8-division.html
One thing that surprised me is that approximate reciprocal can be made to work by an additional multiply.
There are also integer-only techniques.

@abhishek-iitmadras
Contributor Author

abhishek-iitmadras commented Feb 10, 2025

Hi @seiko2plus

I am extremely sorry for the huge delay in replying to you regarding benchmarking performance on AVX512 and AVX2, as I was busy with some other framework optimization.

Below are the detailed performance measurements:
AVX512


| Change   | Before [06f987b7] <main>   | After [58d3c8e1] <HWY2>   |   Ratio | Benchmark (Parameter)                                                                         |
|----------|----------------------------|---------------------------|---------|-----------------------------------------------------------------------------------------------|
| +        | 3.35±0.4μs                 | 7.21±0.1μs                |    2.15 | bench_ufunc.CustomScalarFloorDivideInt.time_floor_divide_int(<class 'numpy.int32'>, 8)        |
| +        | 3.36±0.2μs                 | 7.19±0.5μs                |    2.14 | bench_ufunc.CustomScalarFloorDivideInt.time_floor_divide_int(<class 'numpy.int32'>, -43)      |
| +        | 3.36±0.2μs                 | 7.17±0.01μs               |    2.14 | bench_ufunc.CustomScalarFloorDivideInt.time_floor_divide_int(<class 'numpy.int32'>, -8)       |
| +        | 1.36±0μs                   | 2.90±0.01μs               |    2.14 | bench_ufunc.CustomScalarFloorDivideUInt.time_floor_divide_uint(<class 'numpy.uint8'>, 43)     |
| +        | 3.35±0.2μs                 | 7.15±0.01μs               |    2.13 | bench_ufunc.CustomScalarFloorDivideInt.time_floor_divide_int(<class 'numpy.int32'>, 43)       |
| +        | 1.36±0.01μs                | 2.90±0.02μs               |    2.13 | bench_ufunc.CustomScalarFloorDivideUInt.time_floor_divide_uint(<class 'numpy.uint8'>, 8)      |
| +        | 4.09±0.9μs                 | 7.39±0.3μs                |    1.81 | bench_ufunc.CustomScalarFloorDivideUInt.time_floor_divide_uint(<class 'numpy.uint32'>, 43)    |
| +        | 1.70±0μs                   | 2.99±0.01μs               |    1.76 | bench_ufunc.CustomScalarFloorDivideInt.time_floor_divide_int(<class 'numpy.int8'>, 43)        |
| +        | 1.70±0.01μs                | 3.00±0.01μs               |    1.76 | bench_ufunc.CustomScalarFloorDivideInt.time_floor_divide_int(<class 'numpy.int8'>, 8)         |
| +        | 1.71±0.04μs                | 2.99±0.04μs               |    1.75 | bench_ufunc.CustomScalarFloorDivideInt.time_floor_divide_int(<class 'numpy.int8'>, -43)       |
| +        | 1.71±0.01μs                | 2.99±0.01μs               |    1.75 | bench_ufunc.CustomScalarFloorDivideInt.time_floor_divide_int(<class 'numpy.int8'>, -8)        |
| +        | 4.09±0.9μs                 | 7.09±0.1μs                |    1.73 | bench_ufunc.CustomScalarFloorDivideUInt.time_floor_divide_uint(<class 'numpy.uint32'>, 8)     |
| +        | 1.87±0.02μs                | 2.99±0.1μs                |    1.6  | bench_ufunc.CustomScalarFloorDivideInt.time_floor_divide_int(<class 'numpy.int16'>, -43)      |
| +        | 1.86±0.02μs                | 2.98±0.01μs               |    1.6  | bench_ufunc.CustomScalarFloorDivideInt.time_floor_divide_int(<class 'numpy.int16'>, -8)       |
| +        | 1.88±0.02μs                | 2.98±0.01μs               |    1.59 | bench_ufunc.CustomScalarFloorDivideInt.time_floor_divide_int(<class 'numpy.int16'>, 43)       |
| +        | 1.87±0.03μs                | 2.97±0.01μs               |    1.59 | bench_ufunc.CustomScalarFloorDivideInt.time_floor_divide_int(<class 'numpy.int16'>, 8)        |
| +        | 1.87±0.1μs                 | 2.91±0μs                  |    1.56 | bench_ufunc.CustomScalarFloorDivideUInt.time_floor_divide_uint(<class 'numpy.uint16'>, 43)    |
| +        | 1.87±0.1μs                 | 2.91±0.1μs                |    1.56 | bench_ufunc.CustomScalarFloorDivideUInt.time_floor_divide_uint(<class 'numpy.uint16'>, 8)     |
| +        | 7.22±0.1μs                 | 10.2±0.2μs                |    1.42 | bench_ufunc.CustomScalarFloorDivideInt.time_floor_divide_int(<class 'numpy.int64'>, 8)        |
| +        | 7.22±0.06μs                | 10.2±0.3μs                |    1.42 | bench_ufunc.CustomScalarFloorDivideInt.time_floor_divide_int(<class 'numpy.longlong'>, -8)    |
| +        | 7.24±0.06μs                | 10.2±0.3μs                |    1.41 | bench_ufunc.CustomScalarFloorDivideInt.time_floor_divide_int(<class 'numpy.int64'>, -43)      |
| +        | 7.23±0.2μs                 | 10.2±0.3μs                |    1.41 | bench_ufunc.CustomScalarFloorDivideInt.time_floor_divide_int(<class 'numpy.int64'>, -8)       |
| +        | 7.22±0.06μs                | 10.2±0.2μs                |    1.41 | bench_ufunc.CustomScalarFloorDivideInt.time_floor_divide_int(<class 'numpy.int64'>, 43)       |
| +        | 7.22±0.06μs                | 10.2±0.3μs                |    1.41 | bench_ufunc.CustomScalarFloorDivideInt.time_floor_divide_int(<class 'numpy.longlong'>, -43)   |
| +        | 7.22±0.06μs                | 10.2±0.2μs                |    1.41 | bench_ufunc.CustomScalarFloorDivideInt.time_floor_divide_int(<class 'numpy.longlong'>, 43)    |
| +        | 7.22±0.05μs                | 10.2±0.2μs                |    1.41 | bench_ufunc.CustomScalarFloorDivideInt.time_floor_divide_int(<class 'numpy.longlong'>, 8)     |
| +        | 6.32±0.8μs                 | 7.72±0.05μs               |    1.22 | bench_ufunc.CustomScalarFloorDivideUInt.time_floor_divide_uint(<class 'numpy.uint64'>, 43)    |
| +        | 6.32±0.9μs                 | 7.74±0.1μs                |    1.22 | bench_ufunc.CustomScalarFloorDivideUInt.time_floor_divide_uint(<class 'numpy.uint64'>, 8)     |
| +        | 6.32±0.8μs                 | 7.72±0.02μs               |    1.22 | bench_ufunc.CustomScalarFloorDivideUInt.time_floor_divide_uint(<class 'numpy.ulonglong'>, 43) |
| +        | 6.32±0.8μs                 | 7.73±0.2μs                |    1.22 | bench_ufunc.CustomScalarFloorDivideUInt.time_floor_divide_uint(<class 'numpy.ulonglong'>, 8)  |

SOME BENCHMARKS HAVE CHANGED SIGNIFICANTLY.
PERFORMANCE DECREASED.

AVX2


| Change   | Before [06f987b7] <main>   | After [72a98a6c] <HWY2>   |   Ratio | Benchmark (Parameter)                                                                         |
|----------|----------------------------|---------------------------|---------|-----------------------------------------------------------------------------------------------|
| +        | 3.13±0.02μs                | 7.12±0.01μs               |    2.27 | bench_ufunc.CustomScalarFloorDivideInt.time_floor_divide_int(<class 'numpy.int32'>, -43)      |
| +        | 3.14±0.02μs                | 7.09±0μs                  |    2.26 | bench_ufunc.CustomScalarFloorDivideInt.time_floor_divide_int(<class 'numpy.int32'>, -8)       |
| +        | 3.16±0.03μs                | 7.12±0μs                  |    2.25 | bench_ufunc.CustomScalarFloorDivideInt.time_floor_divide_int(<class 'numpy.int32'>, 43)       |
| +        | 3.16±0.03μs                | 7.09±0.01μs               |    2.24 | bench_ufunc.CustomScalarFloorDivideInt.time_floor_divide_int(<class 'numpy.int32'>, 8)        |
| +        | 3.38±0.01μs                | 7.01±0μs                  |    2.08 | bench_ufunc.CustomScalarFloorDivideUInt.time_floor_divide_uint(<class 'numpy.uint32'>, 43)    |
| +        | 3.39±0.02μs                | 7.04±0.01μs               |    2.08 | bench_ufunc.CustomScalarFloorDivideUInt.time_floor_divide_uint(<class 'numpy.uint32'>, 8)     |
| +        | 1.42±0.02μs                | 2.87±0.01μs               |    2.02 | bench_ufunc.CustomScalarFloorDivideUInt.time_floor_divide_uint(<class 'numpy.uint8'>, 43)     |
| +        | 1.45±0.03μs                | 2.87±0.01μs               |    1.98 | bench_ufunc.CustomScalarFloorDivideUInt.time_floor_divide_uint(<class 'numpy.uint8'>, 8)      |
| +        | 1.74±0.02μs                | 2.95±0.01μs               |    1.69 | bench_ufunc.CustomScalarFloorDivideInt.time_floor_divide_int(<class 'numpy.int8'>, -43)       |
| +        | 1.74±0.02μs                | 2.95±0.01μs               |    1.69 | bench_ufunc.CustomScalarFloorDivideInt.time_floor_divide_int(<class 'numpy.int8'>, -8)        |
| +        | 1.75±0.03μs                | 2.95±0.01μs               |    1.69 | bench_ufunc.CustomScalarFloorDivideInt.time_floor_divide_int(<class 'numpy.int8'>, 43)        |
| +        | 1.74±0.03μs                | 2.95±0.01μs               |    1.69 | bench_ufunc.CustomScalarFloorDivideInt.time_floor_divide_int(<class 'numpy.int8'>, 8)         |
| +        | 1.96±0.03μs                | 2.97±0.01μs               |    1.52 | bench_ufunc.CustomScalarFloorDivideInt.time_floor_divide_int(<class 'numpy.int16'>, 43)       |
| +        | 1.95±0.02μs                | 2.96±0.01μs               |    1.51 | bench_ufunc.CustomScalarFloorDivideInt.time_floor_divide_int(<class 'numpy.int16'>, -43)      |
| +        | 1.95±0.03μs                | 2.96±0.01μs               |    1.51 | bench_ufunc.CustomScalarFloorDivideInt.time_floor_divide_int(<class 'numpy.int16'>, -8)       |
| +        | 1.96±0.02μs                | 2.95±0.01μs               |    1.51 | bench_ufunc.CustomScalarFloorDivideInt.time_floor_divide_int(<class 'numpy.int16'>, 8)        |
| +        | 7.79±0.3μs                 | 11.5±1μs                  |    1.48 | bench_ufunc.CustomScalarFloorDivideInt.time_floor_divide_int(<class 'numpy.int64'>, -43)      |
| +        | 7.79±0.3μs                 | 11.5±0.6μs                |    1.48 | bench_ufunc.CustomScalarFloorDivideInt.time_floor_divide_int(<class 'numpy.int64'>, -8)       |
| +        | 7.79±0.3μs                 | 11.5±0.6μs                |    1.48 | bench_ufunc.CustomScalarFloorDivideInt.time_floor_divide_int(<class 'numpy.longlong'>, -43)   |
| +        | 7.80±0.3μs                 | 11.5±0.6μs                |    1.47 | bench_ufunc.CustomScalarFloorDivideInt.time_floor_divide_int(<class 'numpy.longlong'>, -8)    |
| +        | 7.92±0.3μs                 | 11.5±0.6μs                |    1.45 | bench_ufunc.CustomScalarFloorDivideInt.time_floor_divide_int(<class 'numpy.int64'>, 43)       |
| +        | 7.91±0.3μs                 | 11.5±0.6μs                |    1.45 | bench_ufunc.CustomScalarFloorDivideInt.time_floor_divide_int(<class 'numpy.int64'>, 8)        |
| +        | 7.96±0.9μs                 | 11.5±0.6μs                |    1.45 | bench_ufunc.CustomScalarFloorDivideInt.time_floor_divide_int(<class 'numpy.longlong'>, 43)    |
| +        | 7.91±0.3μs                 | 11.5±0.6μs                |    1.45 | bench_ufunc.CustomScalarFloorDivideInt.time_floor_divide_int(<class 'numpy.longlong'>, 8)     |
| +        | 2.01±0.07μs                | 2.88±0.01μs               |    1.43 | bench_ufunc.CustomScalarFloorDivideUInt.time_floor_divide_uint(<class 'numpy.uint16'>, 43)    |
| +        | 2.02±0.06μs                | 2.88±0.02μs               |    1.43 | bench_ufunc.CustomScalarFloorDivideUInt.time_floor_divide_uint(<class 'numpy.uint16'>, 8)     |
| +        | 7.39±0.3μs                 | 8.91±0.6μs                |    1.21 | bench_ufunc.CustomScalarFloorDivideUInt.time_floor_divide_uint(<class 'numpy.uint64'>, 8)     |
| +        | 7.38±0.3μs                 | 8.91±0.6μs                |    1.21 | bench_ufunc.CustomScalarFloorDivideUInt.time_floor_divide_uint(<class 'numpy.ulonglong'>, 8)  |
| +        | 7.39±0.3μs                 | 8.89±0.6μs                |    1.2  | bench_ufunc.CustomScalarFloorDivideUInt.time_floor_divide_uint(<class 'numpy.uint64'>, 43)    |
| +        | 7.38±0.3μs                 | 8.89±0.6μs                |    1.2  | bench_ufunc.CustomScalarFloorDivideUInt.time_floor_divide_uint(<class 'numpy.ulonglong'>, 43) |

SOME BENCHMARKS HAVE CHANGED SIGNIFICANTLY.
PERFORMANCE DECREASED.

I will update the commit message accordingly.

@mattip
Member

mattip commented Feb 11, 2025

So the benchmarks got significantly slower?

@abhishek-iitmadras
Contributor Author

So the benchmarks got significantly slower?

Yes. After addressing these 9 CI failures, I will proceed with optimization.

@abhishek-iitmadras abhishek-iitmadras force-pushed the HWY2 branch 2 times, most recently from cc9e074 to 795cc9b on February 19, 2025 05:34
@seiko2plus
Member

@abhishek-iitmadras,

I am extremely sorry for the huge delay in replying to you regarding benchmarking performance on AVX512 and AVX2, as I was busy with some other framework optimization.
Below are the detailed performance measurements:

In my opinion, some axioms typically don’t require explicit testing. My intention was to guide you toward a more efficient approach to save time and effort. However, these benchmark results don’t seem accurate to me. The AVX2 and AVX512 benchmarks appear equivalent. Are you certain that AVX512 was correctly disabled before running the AVX2 benchmark? Also, have you tried increasing the array length used in the test to improve stability? see:
https://github.com/numpy/numpy/blob/main/benchmarks/benchmarks/bench_ufunc.py#L454

Yes. After addressing these 9 CI failures, I will proceed with optimization.

What kind of optimization are you planning to perform? Could you elaborate?

@abhishek-iitmadras
Contributor Author

abhishek-iitmadras commented Mar 3, 2025

Are you certain that AVX512 was correctly disabled before running the AVX2 benchmark?

I used the following method to build for AVX2 only:

1. git clone https://github.com/abhishek-iitmadras/numpy.git
2. cd numpy
3. git checkout HWY2
4. git submodule update --init
5. python -m pip install -r requirements/all_requirements.txt
6. python -m build --wheel \
   -Csetup-args=-Dcpu-dispatch="max \
                                -avx512f  \
                                -avx512cd \
                                -avx512_knl \
                                -avx512_knm \
                                -avx512_skx \
                                -avx512_clx \
                                -avx512_cnl \
                                -avx512_icl \
                                -avx512_spr"
7.  pip install dist/numpy****.whl

Below is a screenshot of the build configuration:
[screenshot: build configuration output showing the AVX512 features disabled]

Now for benchmarking:

1. pip install asv
2. pip install virtualenv
3. spin bench -c main -t bench_ufunc.CustomScalarFloorDivide

Correct me if I am doing something wrong. @seiko2plus

Also, have you tried increasing the array length used in the test to improve stability? see: https://github.com/numpy/numpy/blob/main/benchmarks/benchmarks/bench_ufunc.py#L454

No

What kind of optimization are you planning to perform? Could you elaborate?

  1. Replace the Div operation with a Mul by a precomputed reciprocal.
  2. Replace IfThenElse with MaskedSubOr, as per masked arithmetic.
  3. Try to understand the various suggestions from @jan-wassenberg and replicate them.

I have already tried points 1 and 2; the build is OK afterwards, but a few tests fail even on my aarch64 machine. Maybe I need to understand this better.

@r-devulap r-devulap self-assigned this Mar 3, 2025
@abhishek-iitmadras abhishek-iitmadras force-pushed the HWY2 branch 5 times, most recently from c3ecde7 to 4ac89ba on April 14, 2025 17:04
@abhishek-iitmadras abhishek-iitmadras force-pushed the HWY2 branch 5 times, most recently from 5fdd529 to 8699aa9 on May 5, 2025 11:54
const auto different_signs = hn::Xor(src_sign, scalar_sign);

auto adjustment = hn::And(different_signs, has_remainder);
vec_div = hn::IfThenElse(adjustment, hn::Sub(vec_div, hn::Set(d, static_cast<T>(1))), vec_div);
Contributor

Instead of IfThenElse(adjust, Sub(div, one), div) we can use MaskedSubOr(div, adjust, div, one). This would be faster on AVX-512.

Contributor Author

Done
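
For context, a hedged sketch of that change as a small standalone helper (hypothetical helper name FloorAdjust; the PR's real code is only partially quoted above):

```cpp
#include <hwy/highway.h>

namespace hn = hwy::HWY_NAMESPACE;

// Applies the floor-division correction: subtract 1 where `adjustment` is set.
template <class D>
hn::Vec<D> FloorAdjust(D d, hn::Vec<D> vec_div, hn::Mask<D> adjustment) {
    const auto one = hn::Set(d, static_cast<hn::TFromD<D>>(1));
    // Before: hn::IfThenElse(adjustment, hn::Sub(vec_div, one), vec_div)
    // After: one masked subtract-or-pass-through, which AVX-512 can emit as a
    // single masked instruction instead of a subtract plus a blend.
    return hn::MaskedSubOr(vec_div, adjustment, vec_div, one);
}
```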

const auto has_remainder = hn::Ne(vec_src, vec_mul);
const auto src_sign = hn::Lt(vec_src, vec_zero);
const auto scalar_sign = hn::Lt(vec_scalar, vec_zero);
const auto different_signs = hn::Xor(src_sign, scalar_sign);
Contributor

Instead of Lt, Lt, Xor, we can Xor src and scalar, and then test only that one sign bit.

Contributor Author

Done
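
For context, a hedged sketch of what the suggestion amounts to for signed integer lanes (assumed helper name; names otherwise mirror the excerpt above):

```cpp
#include <hwy/highway.h>

namespace hn = hwy::HWY_NAMESPACE;

// Returns a mask that is true where src and scalar have different signs.
template <class D>
hn::Mask<D> DifferentSigns(D d, hn::Vec<D> vec_src, hn::Vec<D> vec_scalar) {
    const auto vec_zero = hn::Zero(d);
    // Before: hn::Xor(hn::Lt(vec_src, vec_zero), hn::Lt(vec_scalar, vec_zero))
    // After: the sign bit of (src ^ scalar) is set exactly when the signs
    // differ, so one Xor plus a single signed compare suffices.
    return hn::Lt(hn::Xor(vec_src, vec_scalar), vec_zero);
}
```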

const auto vec_src = hn::LoadU(d, src + i);
auto vec_div = hn::Div(vec_src, vec_scalar);
const auto vec_mul = hn::Mul(vec_div, vec_scalar);
const auto has_remainder = hn::Ne(vec_src, vec_mul);
Contributor

Instead of And(diff, Ne(src, mul)) consider AndNot(Eq(src, mul), diff) - bit faster on x86.

Contributor Author

Done
thanks @jan-wassenberg
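
Again for context, a hedged sketch of that rewrite (assumed helper name; the And-based form it replaces is the one quoted earlier in this review):

```cpp
#include <hwy/highway.h>

namespace hn = hwy::HWY_NAMESPACE;

// Mask of lanes needing the floor correction: signs differ AND the division
// left a remainder (i.e. src != div * scalar).
template <class V, class M>
M NeedsAdjustment(V vec_src, V vec_mul, M different_signs) {
    // Before: hn::And(different_signs, hn::Ne(vec_src, vec_mul))
    // After: Eq + AndNot; on x86, Ne is a compare plus a mask inversion, and
    // AndNot absorbs that inversion.
    return hn::AndNot(hn::Eq(vec_src, vec_mul), different_signs);
}
```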

@abhishek-iitmadras abhishek-iitmadras force-pushed the HWY2 branch 2 times, most recently from 218f319 to 512e512 on May 6, 2025 22:02
Labels
  component: SIMD (Issues in SIMD (fast instruction sets) code or machinery)
  HWY (features related to google Highway)
6 participants