Adds support for accelerated sorting with x86-simd-sort #127936


Closed
wants to merge 6 commits

Conversation

sterrettm2
Contributor

@sterrettm2 sterrettm2 commented Jun 4, 2024

Adds x86-simd-sort as a submodule to accelerate sorting for 32-bit and 64-bit datatypes when AVX2 or AVX512 are available.

For contiguous data, this can be over a 10x speedup for large arrays. For discontiguous data, it can give over a 4x speedup with larger arrays. These benchmarks were gathered on a Skylake system (7900x), limited to 8 threads.
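For clarity, here is a minimal sketch (not the actual benchmark script; shapes and names are illustrative) of what contiguous versus discontiguous in the sorted dimension means for these numbers:

```cpp
#include <ATen/ATen.h>

int main() {
  // Contiguous case: the dimension being sorted has stride 1.
  auto contiguous = at::randn({8, 100000});
  auto [cv, ci] = at::sort(contiguous, /*dim=*/-1);

  // Discontiguous case: after the transpose, the sorted dimension is strided,
  // which is the layout measured by the second set of benchmarks.
  auto strided = at::randn({100000, 8}).transpose(0, 1);
  auto [sv, si] = at::sort(strided, /*dim=*/-1);
  return 0;
}
```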

Contiguous Benchmarks
float32, normally distributed (in microseconds)
size           Default        AVX2           AVX512         Default/AVX2   Default/AVX512 
16             7.150844336    6.886271477    7.132277489    1.038420335    1.002603214    
128            9.208030939    8.478154898    7.846915245    1.086089019    1.173458697    
1024           37.79037627    23.60707456    16.44122627    1.600807257    2.298513241    
10000          714.7355628    203.9921844    105.5683001    3.503739934    6.770361577    
100000         8383.074408    721.6333354    465.3709247    11.61680593    18.01374766    
1000000        97124.31945    5632.054572    3920.148401    17.24491803    24.77567416    
10000000       1161974.907    86070.48988    71533.82301    13.50027063    16.24371323
    
int32_t, uniformly distributed (in microseconds)
size           Default        AVX2           AVX512         Default/AVX2   Default/AVX512 
16             7.203208685    6.92212224     7.014458179    1.040606975    1.026908779    
128            8.972388983    8.195516348    7.592543125    1.094792396    1.18173698     
1024           32.77489477    23.6874548     15.36617105    1.383639359    2.132925285    
10000          607.8824128    193.3402024    99.25090471    3.144107667    6.124703997    
100000         523.9384684    608.1836536    442.3166784    0.861480682    1.184532472    
1000000        5211.348627    5271.598405    3518.861883    0.988570871    1.480975611    
10000000       133853.6263    81463.05084    67852.97394    1.643120714    1.972700952 

Note that the int32_t sort is accelerated by FBGEMM's radix sort for larger arrays, but that path only handles contiguous data and only one sorting direction.

Discontiguous Benchmarks
float, normal distributed, discontiguous in sorted dimension (in microseconds)
size           Default        AVX2           AVX512         Default/AVX2   Default/AVX512 
16             3.836543679    4.011214256    3.84376061     0.956454439    0.99812243     
128            5.755310194    5.755723127    4.820394962    0.999928257    1.193949923    
1024           49.46946019    24.78790785    15.47874362    1.995709379    3.195960952    
10000          665.2505291    236.6165959    143.9490662    2.811512551    4.621429974    
100000         4328.002203    1329.001212    818.3516414    3.256582586    5.288682743    
1000000        47651.5018     16693.72045    11827.39551    2.854456677    4.028909133    
10000000       556655.1288    236252.6258    184215.9828    2.356185998    3.021752621   
 
int32_t, uniformly distributed, discontiguous in sorted dimension  (in microseconds)
size           Default        AVX2           AVX512         Default/AVX2   Default/AVX512 
16             3.817994356    3.878117442    3.770039797    0.984496837    1.012719908    
128            5.578731397    5.577152082    4.716770534    1.000283176    1.182743862    
1024           43.3412619     23.61275801    14.55446819    1.835501887    2.977866408    
10000          634.3997478    224.4322851    133.9518324    2.826686667    4.736028889    
100000         4084.358152    1292.363303    781.7867576    3.16037924     5.22438902     
1000000        46262.20465    16608.35284    11367.51817    2.785478192    4.06968381     
10000000       541231.9104    235185.1861    180249.9294    2.301301028    3.002674742 

cc @jgong5 @mingfeima @XiaobingSuper @sanchitintel @ashokei @jingxu10 @voznesenskym @penguinwu @EikanWang @Guobing-Chen @zhuhaozhe @blzheng @wenzhe-nrv @jiayisunx @ipiszy @yf225 @chenyang78 @kadeng @muchulee8 @ColinPeppler @amjames @desertfire @chauhang @aakhundov @peterbell10


pytorch-bot bot commented Jun 4, 2024

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/127936

Note: Links to docs will display an error until the docs builds have been completed.

✅ No Failures

As of commit faddfa4 with merge base 6fc63b4:
💚 Looks good so far! There are no failures yet. 💚

This comment was automatically generated by Dr. CI and updates every 15 minutes.


linux-foundation-easycla bot commented Jun 4, 2024

CLA Signed

The committers listed above are authorized under a signed CLA.

@pytorch-bot pytorch-bot bot added the module: cpu CPU specific problem (e.g., perf, algorithm) label Jun 4, 2024
@janeyx99 janeyx99 requested a review from jgong5 June 6, 2024 17:21
@janeyx99 janeyx99 added the triaged This issue has been looked at a team member, and triaged and prioritized into an appropriate module label Jun 6, 2024
@jgong5 jgong5 requested a review from mingfeima June 7, 2024 10:02
@jgong5
Collaborator

jgong5 commented Jun 7, 2024

May I know if the benchmark was collected with single core or multi-core? Can you share the result for both configurations?

@sterrettm2
Contributor Author

May I know if the benchmark was collected with single core or multi-core? Can you share the result for both configurations?

This was on a Skylake machine (7900x), set to 8 cores. Here are the equivalent benchmarks for a single core:

Single Core Performance

float32, normally distributed  (in microseconds)
size           default        avx2           avx512         Default/AVX2   Default/AVX512
16             7.113132954    7.125889063    6.855771542    0.998209892    1.03753938     
128            9.120340586    8.584395647    7.56901145     1.06243246     1.204957959    
1024           36.27155249    24.53012899    15.79697341    1.478653149    2.296107714    
10000          711.9155329    200.382199     108.2926268    3.552788305    6.573998194    
100000         8399.78071     2366.537676    1330.463447    3.54939657     6.313424639    
1000000        100915.9743    28517.82126    17892.53366    3.538698604    5.640116497    
10000000       1204376.316    372791.338     258797.0257    3.230698231    4.653748678    

int32_t, uniformly distributed  (in microseconds)
size           default        avx2           avx512         Default/AVX2   Default/AVX512
16             6.839853764    6.9264884      6.681355715    0.987492272    1.023722438    
128            8.356203556    8.445468426    7.25971818     0.989430442    1.151036907    
1024           30.88020962    23.73411948    14.40595382    1.30108933     2.143572721    
10000          598.6316072    191.3458307    99.9496872     3.128532276    5.989329471    
100000         1971.655619    2248.225125    1253.185778    0.87698318     1.57331471     
1000000        24533.7907     27625.80853    16539.86351    0.888075029    1.483312766    
10000000       361025.8579    358125.9727    248421.4783    1.008097389    1.453279565    
                                                                                                                                                       
                                                                                          
float, normal distributed discontiguous in sorted dimension (in microseconds)
size           default        avx2           avx512         Default/AVX2   Default/AVX512
16             3.9883219      3.897530437    3.803153276    1.023294613    1.048688183    
128            5.797074333    5.687333666    4.795829393    1.019295627    1.208774095    
1024           49.77498938    25.21366438    16.05679234    1.974127546    3.099933556    
10000          670.7694155    244.0156184    145.6871839    2.748879026    4.604175863    
100000         8045.512319    2731.892052    1707.214788    2.945033027    4.712653836    
1000000        96954.93258    32101.35607    21151.68938    3.020275292    4.583791433    
10000000       1159710.248    427844.882     316131.2342    2.710585769    3.668445642    

int32_t, uniformly distributed discontiguous in sorted dimension  (in microseconds)
size           default        avx2           avx512         Default/AVX2   Default/AVX512
16             3.780948997    3.872428179    3.718787193    0.97637679     1.016715612    
128            5.341575543    5.529783332    4.779936273    0.965964708    1.117499322    
1024           39.1874838     23.01476824    15.89414877    1.702710337    2.465528942    
10000          555.9280075    225.5575979    137.2813291    2.464683135    4.049552922    
100000         6663.585735    2620.158211    1609.420934    2.543199761    4.140362284    
1000000        79281.4539     31679.51566    20372.97304    2.502609407    3.891501439    
10000000       961423.1586    417279.1243    305512.3885    2.304028893    3.146920369

Collaborator

@jgong5 jgong5 left a comment

The perf numbers LGTM. Seems CI failures are related. Could you please fix them first?

@sterrettm2
Copy link
Contributor Author

The perf numbers LGTM. Seems CI failures are related. Could you please fix them first?

I've spent a while looking at the test failure, and I think the test itself is wrong.

The input array has some duplicate elements.
The test is then comparing the indices from the argsort for exact equality, but isn't using stable=True.
Since stability isn't requested, the sort then uses x86-simd-sort, which isn't stable and which orders identical elements differently from the current implementation.
This causes the test to fail, even though both argsorts seem to give valid results.
To test this, I tried reordering the input array using the indices from both the reference sort and x86-simd-sort, and the resulting arrays match exactly.

I assume it is okay for these indices to not match exactly, since stable is set to false in this test.
I modified the test to handle testing sort/argsort differently; I'm not super familiar with this testing code, so I'm not sure whether this change is reasonable.
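To make the "argsort equivalent" idea concrete, here is a small, hypothetical C++ sketch using the public ATen API (not the actual test code): with duplicate values, two valid sorts may return different index tensors, and the check is that gathering the input with either index tensor yields the same sorted sequence.

```cpp
#include <ATen/ATen.h>

// Two index tensors are "argsort equivalent" for x if reordering x by either
// one produces the same sorted sequence, even when the indices themselves
// differ on runs of duplicate values.
static bool argsort_equivalent(const at::Tensor& x,
                               const at::Tensor& idx_ref,
                               const at::Tensor& idx_new) {
  return at::equal(x.index_select(0, idx_ref), x.index_select(0, idx_new));
}

int main() {
  auto x = at::randint(0, 4, {16}, at::kInt);  // many repeated values
  auto [v1, i1] = at::sort(x);  // e.g. indices from the reference sort
  auto [v2, i2] = at::sort(x);  // e.g. indices from the accelerated sort
  return argsort_equivalent(x, i1, i2) ? 0 : 1;
}
```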

Collaborator

@jgong5 jgong5 left a comment

Basically we need a special way of checking result equality. Instead of adding the special is_sort flag here, would it be more general to pass a functor for checking the equality? Then you wouldn't have to check is_sort and assume the result type in the general code.

if (stable) return false;

auto type = values.scalar_type();
if (not (type == ScalarType::Long || type == ScalarType::Int || type == ScalarType::Double || type == ScalarType::Float)) return false;


Some of the win CI failures:

C:/actions-runner/_work/pytorch/pytorch/aten/src/ATen/native/cpu/SortingKernel.cpp(168): error C3861: 'not': identifier not found
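For reference, a portable form of that check avoids the `not` alternative token, which MSVC rejects in its default mode; a minimal sketch mirroring the snippet above (the helper name is made up for illustration, this is not the PR's final code):

```cpp
#include <ATen/ATen.h>

// Same check as in the snippet above, spelled with `!` and `||` so it also
// compiles on MSVC (which rejects the `not` alternative token by default).
static bool xss_supported_dtype(const at::Tensor& values) {
  auto type = values.scalar_type();
  if (!(type == at::ScalarType::Long || type == at::ScalarType::Int ||
        type == at::ScalarType::Double || type == at::ScalarType::Float)) {
    return false;
  }
  return true;
}
```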


Could you update x86-simd-sort to the latest commit? This version has a build problem on OpenBSD. See intel/x86-simd-sort#157.

@sterrettm2
Contributor Author

I've made the changes suggested by Raghuveer, and I changed the previous fix for the test. Now it passes a comparator function to determine how to check for equality; does this seem good to you? Let me know if you want further changes; I'm not sure what best fits PyTorch's testing code.

@@ -363,12 +363,14 @@ def check_model(
assert_equal=True,
check_gradient=False,
check_has_compiled=True,
comparator_assert=None,
Collaborator

Perhaps equal_checker is more explicit?

exact_dtype=exact_dtype,
)

# In case of input mutations, check that inputs are the same
self.assertEqual(
Collaborator

Should you move this into the default checker too?

Contributor Author

I think the input mutation check should be run regardless of which checker is being used, so I moved it outside of the checker functions to avoid duplicating it. Would you prefer it be moved into the comparison functions to simplify that part?

Collaborator

Does it also apply to sort_equal_checker? I saw you have special logic for input checks there.

@@ -319,6 +319,78 @@ def wrapper_noop_set_seed(op, *args, **kwargs):
)


def default_comparator_assert(
Collaborator

Better to move this into the check_model function and assign equal_check to this default function if equal_check is None.

@sterrettm2
Contributor Author

I've made those modifications, except for moving the input checker into the default comparator. I think that should be run regardless of which comparator is used, so I left it out to avoid duplicating it in each comparator.

@sterrettm2
Contributor Author

Pinging @jgong5: I think my newest commit should address the suggestions you had; could you take a look at it?

Collaborator

@jgong5 jgong5 left a comment

You may re-request my review as a notification.

exact_dtype=exact_dtype,
)

# In case of input mutations, check that inputs are the same
self.assertEqual(
Collaborator

Does it also apply to sort_equal_checker? I saw you have special logic for input checks there.

@sterrettm2
Contributor Author

equal_sort_checker does not have special logic for checking the inputs. I assume you are referring to this code:

self.assertEqual(
    ref_inputs[0][actual.indices],
    ref_inputs[0][correct.indices],
    atol=atol,
    rtol=rtol,
    equal_nan=True,
    exact_dtype=exact_dtype,
) 

If that's the case, what this code is doing is checking that the actual and correct indices are 'argsort equivalent', in the sense that reordering the original inputs with either set of argsort indices gives identical results. Currently the only check on the inputs is the shared check that the inputs have not been modified.

@sterrettm2 sterrettm2 requested a review from jgong5 August 22, 2024 16:42
jgong5
jgong5 previously approved these changes Aug 23, 2024
@jgong5
Collaborator

jgong5 commented Aug 23, 2024

@sterrettm2 Please resolve the conflict. Also, you might have to get the CLA signed...

@sterrettm2
Contributor Author

I've got the CLA figured out and rebased on the current branch. I think it should all be good once the tests have been run again.

@sterrettm2 sterrettm2 requested a review from jgong5 August 27, 2024 17:38
@jgong5 jgong5 requested a review from peterbell10 August 28, 2024 02:24
@sterrettm2
Contributor Author

Hopefully that resolved all of the issues with the CI; could you rerun it?

@jgong5 jgong5 added the ciflow/trunk Trigger trunk jobs on your pull request label Aug 29, 2024
}
} else {
  std::vector<scalar_t> tmp_values(dim_size);
  std::vector<index_t> tmp_indices(dim_size);
Collaborator

std::vector zero-fills; it would be better to use c10::SmallBuffer.

Contributor Author

Done!
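For context, a rough sketch of the kind of change suggested above, assuming the c10::SmallBuffer<T, N>(size) helper from c10/util/SmallBuffer.h behaves as described in the review comment (no zero-fill, optional stack storage for small sizes); this is illustrative, not the exact patch:

```cpp
#include <c10/util/SmallBuffer.h>
#include <cstdint>

int main() {
  const int64_t dim_size = 1024;
  // Unlike std::vector, these temporaries are not zero-filled before the
  // sort writes into them, so the extra initialization pass goes away.
  c10::SmallBuffer<float, 64> tmp_values(dim_size);
  c10::SmallBuffer<int64_t, 64> tmp_indices(dim_size);
  return (tmp_values.data() != nullptr && tmp_indices.data() != nullptr) ? 0 : 1;
}
```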

if not is_scalar:
    self.assertEqual(
        ref_inputs[0][actual],
        ref_inputs[0][correct],
Collaborator

@peterbell10 peterbell10 Sep 3, 2024

This is too loose for stable sort. Also can you explain why an equality check fails? What is different between inductor and eager that causes the returned indices to be different?

Contributor Author

That comparison definitely isn't right for stable sorts, I agree. Are any sorts with stable=True being tested by that code? I wasn't able to tell.
A simple equality check of the indices can fail if there are repeated values, since x86-simd-sort might give the indices of repeats in a different order compared to the reference. I'm not familiar with the inductor code, but from my inspection it seems to use this Triton kernel.

Collaborator

A simple equality check of the indices can fail if there are repeated values, since x86-simd-sort might give the indices of repeats in a different order compared to the reference.

Triton is only used for CUDA; on CPU we should be comparing at::sort to at::sort here, so I would expect the indices of repeated values to be the same. Otherwise it suggests the sort is non-deterministic.

Contributor Author

@sterrettm2 sterrettm2 Sep 6, 2024

Okay, I've determined why this seemed so strange. In the check_model code, reference_in_float defaults to True, so the fp16 values were converted to fp32 to generate the reference results. XSS sort can sort fp32 values but not fp16, so the fp16 values were sorted with the normal sorting method while the fp32 reference values were sorted with XSS, which orders equivalent values differently; that caused the fp16 test failures seen before. I'll change reference_in_float to False in my next patch.

@peterbell10
Collaborator

@pytorchbot merge

@pytorchmergebot
Collaborator

Merge started

Your change will be merged once all checks pass (ETA 0-4 Hours).

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

Advanced Debugging
Check the merge workflow status here

rahulsingh-intel pushed a commit to rahulsingh-intel/pytorch that referenced this pull request Nov 5, 2024

Pull Request resolved: pytorch#127936
Approved by: https://github.com/jgong5, https://github.com/peterbell10, https://github.com/sanchitintel
atalman added a commit to atalman/pytorch that referenced this pull request Nov 29, 2024
@atalman
Contributor

atalman commented Dec 2, 2024

@pytorchmergebot revert -c nosignal -m "Looks like its the cause of the std::bad_alloc failure"

@pytorchmergebot
Collaborator

@pytorchbot successfully started a revert job. Check the current status here.
Questions? Feedback? Please reach out to the PyTorch DevX Team

@pytorchmergebot
Collaborator

Reverting PR 127936 failed

Reason: Command git -C /home/runner/work/pytorch/pytorch revert --no-edit 7e650604105ee74529f6352f9a81cc68f32c2d56 returned non-zero exit code 1

Auto-merging aten/src/ATen/native/cpu/SortingKernel.cpp
CONFLICT (content): Merge conflict in aten/src/ATen/native/cpu/SortingKernel.cpp
Auto-merging cmake/Dependencies.cmake
Auto-merging cmake/Summary.cmake
Auto-merging test/inductor/test_torchinductor_opinfo.py
error: could not revert 7e650604105... Adds support for accelerated sorting with x86-simd-sort (#127936)
hint: After resolving the conflicts, mark them with
hint: "git add/rm <pathspec>", then run
hint: "git revert --continue".
hint: You can instead skip this commit with "git revert --skip".
hint: To abort and get back to the state before "git revert",
hint: run "git revert --abort".
hint: Disable this message with "git config advice.mergeConflict false"

atalman added a commit to atalman/pytorch that referenced this pull request Dec 2, 2024
atalman added a commit to atalman/pytorch that referenced this pull request Dec 2, 2024
pytorchmergebot pushed a commit that referenced this pull request Dec 3, 2024
@sanchitintel
Collaborator

Hi @sterrettm2, what are the memory requirements of this technique? Thanks

pobin6 pushed a commit to pobin6/pytorch that referenced this pull request Dec 5, 2024
@sterrettm2
Contributor Author

The amount of memory it uses shouldn't be particularly high; depending on the situation, it either doesn't allocate or allocates two temporary buffers, each the length of the dimension being sorted. I'm still trying to replicate this failure locally, so I haven't been able to figure out what's going on yet.

@leslie-fang-intel
Collaborator

leslie-fang-intel commented Feb 17, 2025

The amount of memory it uses shouldn't be particularly high; depending on the situation, it either doesn't allocate or allocates two temporary buffers, each the length of the dimension being sorted. I'm still trying to replicate this failure locally, so I haven't been able to figure out what's going on yet.

Hi @sterrettm2, may I know the status of this PR? Did you replicate this failure?

@sterrettm2
Contributor Author

I have finally managed to replicate the failure locally, and have begun working on debugging it. Hopefully I should have more news relatively soon.

@mingfeima
Collaborator

@sterrettm2 could you please ping me on Teams internally; my email is [email protected]. Thanks! We need to track the status of this issue: #140590

@sterrettm2
Contributor Author

Hello,
My new patch updates x86-simd-sort, which should fix the std::bad_alloc crashes that were observed previously.

As far as I can tell, these crashes are caused by a compiler bug in certain GCC versions, where it generates MMX instructions without the EMMS instruction.
It seems that variants of this bug have appeared multiple times in GCC (bugs 117926, 91533, and 80799 to list a few).
As a workaround, we simply add the EMMS intrinsic to the end of the functions if MMX is enabled.
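For illustration, a minimal sketch of that kind of workaround (not the exact x86-simd-sort change): the MMX registers alias the x87 floating-point stack, so stale MMX state can corrupt later x87 floating-point math, and executing EMMS on the way out restores the x87 state.

```cpp
#include <immintrin.h>

// Sketch only: if the compiler may have emitted MMX instructions for this
// function, issue EMMS before returning so the x87 floating-point state is
// usable again for whatever runs next (the sorting work itself is elided).
static void simd_sort_kernel_example() {
  // ... vectorized sorting work ...
#if defined(__MMX__)
  _mm_empty();  // EMMS
#endif
}

int main() {
  simd_sort_kernel_example();
  return 0;
}
```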

I am quite confident this fixes the issue, since the std::bad_alloc occurs in the function std::__detail::_Hashtable_alloc, which is exactly the same as the reproduction I have for the MMX/EMMS issue in GCC.

Pytorch bad_alloc trace
#0  __cxxabiv1::__cxa_throw (obj=0x55555f9292e0, tinfo=0x7ffff3beefb8 <typeinfo for std::bad_alloc>,
    dest=0x7ffff3ad3020 <std::bad_alloc::~bad_alloc()>) at ../../../../libstdc++-v3/libsupc++/eh_throw.cc:80
#1  0x00007ffff3ace38a in std::__throw_bad_alloc () at ../../../../../libstdc++-v3/src/c++11/functexcept.cc:54
#2  0x00007fff4c72dcd6 in std::__detail::_Hashtable_alloc<std::allocator<std::__detail::_Hash_node<std::pair<stddex const, std::vector<bool (*)(_object*, void*&), std::allocator<bool (*)(_object*, void*&)> > >, false> > >::_e_buckets(unsigned long) [clone .isra.0] ()
   from /home/sterrettm/miniforge3/envs/build_binary/lib/python3.13/site-packages/torch/lib/libtorch_python.so
#3  0x00007fff4c784f8f in std::_Hashtable<std::basic_string<char, std::char_traits<char>, std::allocator<char> >ir<std::basic_string<char, std::char_traits<char>, std::allocator<char> > const, pybind11::object>, std::allocatair<std::basic_string<char, std::char_traits<char>, std::allocator<char> > const, pybind11::object> >, std::__delect1st, std::equal_to<std::basic_string<char, std::char_traits<char>, std::allocator<char> > >, std::hash<std::ing<char, std::char_traits<char>, std::allocator<char> > >, std::__detail::_Mod_range_hashing, std::__detail::_Dnged_hash, std::__detail::_Prime_rehash_policy, std::__detail::_Hashtable_traits<true, false, true> >::_M_rehash long, unsigned long const&) ()
Reproduction bad_alloc trace
#0  0x00007ffff7cbb35a in __cxa_throw () from /lib/x86_64-linux-gnu/libstdc++.so.6
#1  0x00007ffff7ca90db in std::__throw_bad_alloc() () from /lib/x86_64-linux-gnu/libstdc++.so.6
#2  0x0000555555555476 in __gnu_cxx::new_allocator<std::__detail::_Hash_node_base*>::allocate (
    this=<synthetic pointer>, __n=18446744073709551557) at /usr/include/c++/9/ext/new_allocator.h:105
#3  std::allocator_traits<std::allocator<std::__detail::_Hash_node_base*> >::allocate (__a=<synthetic pointer>...,
    __n=18446744073709551557) at /usr/include/c++/9/bits/alloc_traits.h:443
#4  std::__detail::_Hashtable_alloc<std::allocator<std::__detail::_Hash_node<std::pair<unsigned long const, unsigned long>, false> > >::_M_allocate_buckets (this=0x7fffffffdef0, __n=18446744073709551557)
    at /usr/include/c++/9/bits/hashtable_policy.h:2134

@mingfeima @leslie-fang-intel @sanchitintel The issue should be fixed now, are you able to reopen this PR? Or would it be better for me to just make a new pull request for this patch?

@leslie-fang-intel
Collaborator

Hi @sterrettm2, thanks for the detailed analysis. I would also like to understand why this issue only happens in the Torch library builds but was not found by the Torch core CI; it may help us prevent similar issues from happening again. Do you have any clue?

@mingfeima @leslie-fang-intel @sanchitintel The issue should be fixed now, are you able to reopen this PR? Or would it be better for me to just make a new pull request for this patch?

I am not able to re-open this PR, cc @atalman for whether a new PR is needed here.

@sterrettm2
Contributor Author

So I'm not certain why this problem appears in the nightly builds but not the CI. I've actually not been able to find any version of GCC that generates the pxor instructions while building my patched version of PyTorch, yet the libtorch_cpu.so in the nightly wheel does have pxor without emms, causing the bad_alloc. Is it known exactly what compiler version is being used to generate the nightly wheels, and in particular which compiler was being used at the time of this issue?

@leslie-fang-intel
Collaborator

Is it known exactly what compiler version is being used to generate the nightly wheels, and in particular which compiler was being used at the time of this issue?

Hi @atalman, could you help with this question, or comment on how to add a test case in PyTorch core to guard against this failure?

@sterrettm2
Contributor Author

Okay, the compiler seems to be GCC: (GNU) 9.3.1 20200408 (Red Hat 9.3.1-2) from inspecting the built files in the wheel. Trying with GCC 9.3.0 built from source seems to show correct code generation though (and no use of pxor), so I wonder if there is some strange configuration on the systems building the wheels.

@r-devulap

@leslie-fang-intel we have seen this kind of problem in NumPy too (see numpy/numpy@b3eed14 for example). I would recommend adding -fno-mmx to the build flags if you can.

Labels
ciflow/slow, ciflow/trunk, Merged, module: cpu, module: inductor, open source, Reverted, topic: improvements, triaged