Adds support for accelerated sorting with x86-simd-sort #127936
Conversation
🔗 Helpful Links
🧪 See artifacts and rendered test results at hud.pytorch.org/pr/127936
Note: Links to docs will display an error until the docs builds have been completed.
✅ No Failures as of commit faddfa4 with merge base 6fc63b4.
This comment was automatically generated by Dr. CI and updates every 15 minutes.
May I know if the benchmark was collected with a single core or multiple cores? Can you share the results for both configurations?
This was on a Skylake machine (7900x), set to 8 cores. Here are the equivalent benchmarks for a single core:
Single Core Performance
The perf numbers LGTM. It seems the CI failures are related. Could you please fix them first?
I've spent a while looking at the test failure, and I think the test itself is wrong. The input array has some duplicate elements, so I assume it is okay for these indices not to match exactly, since stable is set to false in this test.
Basically we need a special way of checking result equality. Instead of adding the special is_sort flag here, would it be more general to pass a functor for checking the equality? Then you don't have to check is_sort and assume the result type in the general code.
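For illustration, here is a minimal sketch of the functor approach (run_and_check and sort_equal_checker are hypothetical stand-ins, not PyTorch's actual check_model machinery):

```python
import torch

# Hypothetical sketch: the test harness accepts an equality-checking callable
# instead of an is_sort flag, so sort-specific logic stays out of the generic
# comparison code. The default checker is assigned when none is passed.
def run_and_check(fn, inputs, equal_checker=None):
    if equal_checker is None:
        equal_checker = torch.testing.assert_close  # default: exact comparison
    expected = fn(*inputs)  # reference run
    actual = fn(*inputs)    # stand-in for the run under test
    equal_checker(actual, expected)

def sort_equal_checker(actual, expected):
    # For an unstable sort the values must match, but the indices of duplicate
    # elements may legitimately come back in a different order.
    torch.testing.assert_close(actual.values, expected.values)

x = torch.tensor([3.0, 1.0, 2.0, 1.0])
run_and_check(torch.sort, (x,), equal_checker=sort_equal_checker)
run_and_check(lambda t: t * 2, (x,))  # generic ops still use the default checker
```

Assigning the default inside the harness when the argument is None also matches the suggestion made later in this thread.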
if (stable) return false;

auto type = values.scalar_type();
if (not (type == ScalarType::Long || type == ScalarType::Int || type == ScalarType::Double || type == ScalarType::Float)) return false;
Some of the Windows CI failures:
C:/actions-runner/_work/pytorch/pytorch/aten/src/ATen/native/cpu/SortingKernel.cpp(168): error C3861: 'not': identifier not found
third_party/x86-simd-sort
Could you update x86-simd-sort to the latest commit? This version has a build problem on OpenBSD. See intel/x86-simd-sort#157.
I've made the changes suggested by Raghuveer, and I changed the previous fix for the test. Now it passes a comparator function to determine how to check for equality; does this seem good to you? Let me know if you want further changes, as I'm not sure what best fits PyTorch's testing code.
test/inductor/test_torchinductor.py
@@ -363,12 +363,14 @@ def check_model(
    assert_equal=True,
    check_gradient=False,
    check_has_compiled=True,
    comparator_assert=None,
Perhaps equal_checker is more explicit?
test/inductor/test_torchinductor.py
        exact_dtype=exact_dtype,
    )

    # In case of input mutations, check that inputs are the same
    self.assertEqual(
Should you move this into the default checker too?
I think the input mutation check should be run regardless of which checker is being used, so I moved it outside of the checker functions to avoid duplicating it. Would you prefer it be moved into the comparison functions to simplify that part?
Does it also apply to sort_equal_checker? I saw you have special logic for input checks there.
@@ -319,6 +319,78 @@ def wrapper_noop_set_seed(op, *args, **kwargs):
    )


def default_comparator_assert(
Better to move this into the check_model function and assign equal_check to this default function if equal_check is None.
I've made those modifications, except for moving the input checker into the default comparator. I think that should be run regardless of which comparator is used, so I left it out to avoid duplicating it in each comparator.
Pinging @jgong5: I think my newest commit should address the suggestions you had; could you take a look at it?
You may re-request my review to send me a notification.
test/inductor/test_torchinductor.py
        exact_dtype=exact_dtype,
    )

    # In case of input mutations, check that inputs are the same
    self.assertEqual(
Does it also apply to sort_equal_checker? I saw you have special logic for input checks there.
If that's the case, what this code is doing is checking that the actual and correct indices are 'argsort equivalent', in the sense that if we reorder the original inputs with the indices from the argsort, we get identical results. Currently the only check on the inputs is the shared check that the inputs have not been modified.
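Concretely, a small illustration of that 'argsort equivalent' check (the tensors here are made up for the example):

```python
import torch

# Two index tensors are accepted as "argsort equivalent" when gathering the
# original input with either of them produces the same reordered values.
x = torch.tensor([2.0, 1.0, 1.0, 3.0])

correct = torch.tensor([1, 2, 0, 3])  # one valid set of sort indices for x
actual = torch.tensor([2, 1, 0, 3])   # swaps the two duplicate 1.0 entries

assert not torch.equal(actual, correct)    # the raw indices differ...
assert torch.equal(x[actual], x[correct])  # ...but they reorder x identically
```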
@sterrettm2 Please resolve the conflict. Also, you might have to get the CLA signed...
Force-pushed from 0bf8bdf to 98fbc70
I've got the CLA figured out and rebased on the current branch. I think it should all be good once the tests have been run again.
Hopefully that resolved all of the issues with the CI; could you rerun it?
    }
} else {
    std::vector<scalar_t> tmp_values(dim_size);
    std::vector<index_t> tmp_indices(dim_size);
std::vector zero-fills; it would be better to use c10::SmallBuffer.
Done!
if not is_scalar:
    self.assertEqual(
        ref_inputs[0][actual],
        ref_inputs[0][correct],
This is too loose for a stable sort. Also, can you explain why an equality check fails? What is different between inductor and eager that causes the returned indices to be different?
That comparison definitely isn't right for stable sorts, I agree. Are any sorts with stable=True being tested by that code? I wasn't able to tell.
A simple equality check of the indices can fail if there are repeated values, since x86-simd-sort might give the indices of repeats in a different order compared to the reference. I'm not familiar with the inductor code, but from my inspection it seems to use this Triton kernel
def sort_with_index(
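For reference, the stable/unstable distinction under discussion, as a small sketch:

```python
import torch

x = torch.tensor([1.0, 2.0, 1.0])

# stable=True guarantees that equal elements keep their original relative
# order, so the result is fully determined: values [1, 1, 2], indices [0, 2, 1].
stable = torch.sort(x, stable=True)

# stable=False makes no such promise: indices [0, 2, 1] and [2, 0, 1] are both
# correct answers, and different kernels (e.g. the default sort vs. the
# x86-simd-sort path) may break the tie differently.
unstable = torch.sort(x, stable=False)
```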
A simple equality check of the indices can fail if there are repeated values, since x86-simd-sort might give the indices of repeats in a different order compared to the reference.
Triton is only used for CUDA; on CPU we should be comparing at::sort to at::sort here, so I would expect the indices of repeated values to be the same. Otherwise it suggests the sort is non-deterministic.
Okay, I've determined why this seemed so strange. In the check_model code, reference_in_float defaults to True, so it was converting the fp16 values to fp32 to generate the reference results. XSS sort can sort fp32 values, but not fp16, so the fp16 values were sorted using the normal sorting method while the fp32 reference values were sorted using XSS, which orders equivalent values differently, causing the fp16 test failures seen before. I'll change reference_in_float to False in my next patch.
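As a rough sketch of that divergence (illustrative only; which kernel actually handles each dtype depends on the build and hardware):

```python
import torch

# reference_in_float=True means the reference is computed on an fp32 copy of
# the fp16 input, so the reference and actual results can go through different
# sort implementations (fp32 may take the x86-simd-sort path, fp16 cannot).
x_f16 = torch.tensor([1.0, 2.0, 1.0], dtype=torch.float16)

actual = torch.sort(x_f16, stable=False)             # fp16 path
reference = torch.sort(x_f16.float(), stable=False)  # fp32 path

# The sorted values agree once cast to a common dtype...
torch.testing.assert_close(actual.values.float(), reference.values)
# ...but nothing guarantees the indices of the duplicate 1.0 entries match,
# which is how the fp16 index mismatches described above can arise.
```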
@pytorchbot merge
Merge started. Your change will be merged once all checks pass (ETA 0-4 hours). Learn more about merging in the wiki. Questions? Feedback? Please reach out to the PyTorch DevX Team.
Adds x86-simd-sort as a submodule to accelerate sorting for 32-bit and 64-bit datatypes when AVX2 or AVX512 are available. For contiguous data, this can be over a 10x speedup for large arrays. For discontiguous data, it can give over a 4x speedup with larger arrays. These benchmarks were gathered on a Skylake system (7900x), limited to 8 threads.

<details>
<summary><b>Contiguous Benchmarks</b></summary>

```
float32, normally distributed (in microseconds)
size      Default      AVX2         AVX512       Default/AVX2  Default/AVX512
16        7.150844336  6.886271477  7.132277489  1.038420335   1.002603214
128       9.208030939  8.478154898  7.846915245  1.086089019   1.173458697
1024      37.79037627  23.60707456  16.44122627  1.600807257   2.298513241
10000     714.7355628  203.9921844  105.5683001  3.503739934   6.770361577
100000    8383.074408  721.6333354  465.3709247  11.61680593   18.01374766
1000000   97124.31945  5632.054572  3920.148401  17.24491803   24.77567416
10000000  1161974.907  86070.48988  71533.82301  13.50027063   16.24371323

int32_t, uniformly distributed (in microseconds)
size      Default      AVX2         AVX512       Default/AVX2  Default/AVX512
16        7.203208685  6.92212224   7.014458179  1.040606975   1.026908779
128       8.972388983  8.195516348  7.592543125  1.094792396   1.18173698
1024      32.77489477  23.6874548   15.36617105  1.383639359   2.132925285
10000     607.8824128  193.3402024  99.25090471  3.144107667   6.124703997
100000    523.9384684  608.1836536  442.3166784  0.861480682   1.184532472
1000000   5211.348627  5271.598405  3518.861883  0.988570871   1.480975611
10000000  133853.6263  81463.05084  67852.97394  1.643120714   1.972700952
```

</details>

Note that the int32_t sort is accelerated by FBGEMM's radix sort for larger arrays, but this only handles contiguous data and in one sorting direction.

<details>
<summary><b>Discontiguous Benchmarks</b></summary>

```
float, normal distributed, discontiguous in sorted dimension (in microseconds)
size      Default      AVX2         AVX512       Default/AVX2  Default/AVX512
16        3.836543679  4.011214256  3.84376061   0.956454439   0.99812243
128       5.755310194  5.755723127  4.820394962  0.999928257   1.193949923
1024      49.46946019  24.78790785  15.47874362  1.995709379   3.195960952
10000     665.2505291  236.6165959  143.9490662  2.811512551   4.621429974
100000    4328.002203  1329.001212  818.3516414  3.256582586   5.288682743
1000000   47651.5018   16693.72045  11827.39551  2.854456677   4.028909133
10000000  556655.1288  236252.6258  184215.9828  2.356185998   3.021752621

int32_t, uniformly distributed, discontiguous in sorted dimension (in microseconds)
size      Default      AVX2         AVX512       Default/AVX2  Default/AVX512
16        3.817994356  3.878117442  3.770039797  0.984496837   1.012719908
128       5.578731397  5.577152082  4.716770534  1.000283176   1.182743862
1024      43.3412619   23.61275801  14.55446819  1.835501887   2.977866408
10000     634.3997478  224.4322851  133.9518324  2.826686667   4.736028889
100000    4084.358152  1292.363303  781.7867576  3.16037924    5.22438902
1000000   46262.20465  16608.35284  11367.51817  2.785478192   4.06968381
10000000  541231.9104  235185.1861  180249.9294  2.301301028   3.002674742
```

</details>

Pull Request resolved: pytorch#127936
Approved by: https://github.com/jgong5, https://github.com/peterbell10, https://github.com/sanchitintel
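For anyone who wants to sanity-check numbers like these, a minimal sketch of a comparable measurement with torch.utils.benchmark (the sizes, dtype, and thread count below are assumptions, not the exact harness used for the tables above):

```python
import torch
from torch.utils import benchmark

# Time torch.sort on contiguous and discontiguous 1-D float32 data.
torch.set_num_threads(8)
for n in (1024, 100_000, 1_000_000):
    contiguous = torch.randn(n)
    discontiguous = torch.randn(n, 2)[:, 0]  # stride 2 along the sorted dim
    for label, x in (("contiguous", contiguous), ("discontiguous", discontiguous)):
        timer = benchmark.Timer(stmt="torch.sort(x)", globals={"torch": torch, "x": x})
        print(f"{label:>14} n={n}: {timer.blocked_autorange().median * 1e6:.1f} us")
```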
…rch#127936)" This reverts commit 7e65060.
@pytorchmergebot revert -c nosignal -m "Looks like its the cause of the std::bad_alloc failure"
@pytorchbot successfully started a revert job. Check the current status here.
Reverting PR 127936 failed. Reason: Command
Details for Dev Infra team: Raised by workflow job
…d-sort (#127936) (#141901)
Looks like the original PR caused: #140590
Please see comment: #140590 (comment)
Pull Request resolved: #141901
Approved by: https://github.com/andrewor14, https://github.com/malfet
Hi @sterrettm2, what are the memory requirements of this technique? Thanks
…d-sort (pytorch#127936) (pytorch#141901)
Looks like the original PR caused: pytorch#140590
Please see comment: pytorch#140590 (comment)
Pull Request resolved: pytorch#141901
Approved by: https://github.com/andrewor14, https://github.com/malfet
The amount of memory it uses shouldn't be particularly high: depending on the situation, it either doesn't allocate, or allocates two temporary buffers, both the length of the dimension being sorted. I'm still trying to replicate this failure locally, so I haven't been able to figure out what's going on yet.
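As a rough back-of-the-envelope figure (assuming float32 values and int64 indices; the actual element types depend on the tensor being sorted):

```python
# Worst case described above: two temporary buffers, each the length of the
# sorted dimension, assumed here to hold float32 values and int64 indices.
dim_size = 1_000_000
temp_bytes = dim_size * 4 + dim_size * 8  # values buffer + indices buffer
print(f"~{temp_bytes / 2**20:.1f} MiB of scratch per sorted row")  # ~11.4 MiB
```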
Hi @sterrettm2, may I know the status of this PR? Did you replicate this failure?
I have finally managed to replicate the failure locally, and have begun working on debugging it. Hopefully I should have more news relatively soon.
@sterrettm2 could you please ping me on Teams internally; my email is [email protected]. Thanks! We need to track the status of this issue: #140590
Hello, as far as I can tell these crashes are caused by a compiler bug in certain GCC versions, where it generates MMX instructions without the EMMS instruction (MMX registers alias the x87 floating-point register stack, so skipping EMMS corrupts later floating-point code). I am quite confident this fixes the issue, since the PyTorch bad_alloc trace
Reproduction bad_alloc trace
@mingfeima @leslie-fang-intel @sanchitintel The issue should be fixed now, are you able to reopen this PR? Or would it be better for me to just make a new pull request for this patch?
Hi @sterrettm2, thanks for the detailed analysis. I would also like to understand more about why this issue only happens in Torch Library but was not found by the Torch Core CI; it may help us prevent similar issues from happening again. Do you have any clue?
I am not able to re-open this PR, cc @atalman for whether a new PR is needed here.
So I'm not certain why this problem appears in the nightly builds but not the CI. I've actually not been able to find any version of GCC that generates the
Hi @atalman, could you help with this question, or share any comments on how to add a test case in PyTorch core to guard against this failure?
Okay, the compiler seems to be
@leslie-fang-intel we have seen this kind of problem in NumPy too (see numpy/numpy@b3eed14 for example). I would recommend adding
Adds x86-simd-sort as a submodule to accelerate sorting for 32-bit and 64-bit datatypes when AVX2 or AVX512 are available.
For contiguous data, this can be over a 10x speedup for large arrays. For discontiguous data, it can give over a 4x speedup with larger arrays. These benchmarks were gathered on a Skylake system (7900x), limited to 8 threads.
Contiguous Benchmarks
Note that the int32_t sort is accelerated by FBGEMM's radix sort for larger arrays, but this only handles contiguous data and in one sorting direction.
Discontiguous Benchmarks
cc @jgong5 @mingfeima @XiaobingSuper @sanchitintel @ashokei @jingxu10 @voznesenskym @penguinwu @EikanWang @Guobing-Chen @zhuhaozhe @blzheng @wenzhe-nrv @jiayisunx @ipiszy @yf225 @chenyang78 @kadeng @muchulee8 @ColinPeppler @amjames @desertfire @chauhang @aakhundov @peterbell10