Adds support for accelerated sorting with x86-simd-sort #127936
Conversation
🔗 Helpful Links
🧪 See artifacts and rendered test results at hud.pytorch.org/pr/127936
Note: Links to docs will display an error until the docs builds have been completed.
✅ No Failures as of commit faddfa4 with merge base 6fc63b4.
This comment was automatically generated by Dr. CI and updates every 15 minutes.
May I know if the benchmark was collected with a single core or multiple cores? Can you share the results for both configurations?
This was on a Skylake machine (7900x), set to 8 cores. Here are the equivalent benchmarks for a single core:
Single Core Performance
The perf numbers LGTM. It seems the CI failures are related. Could you please fix them first?
I've spent a while looking at the test failure, and I think the test itself is wrong. The input array has some duplicate elements, so I assume it is okay for these indices not to match exactly, since stable is set to false in this test.
Basically we need a special way of checking result equality. Instead of adding the special is_sort flag here, would it be more general to pass a functor for checking the equality? Then you don't have to check is_sort and assume the result type in the general code.
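For illustration, here is a minimal sketch of the functor approach (run_and_check and sort_equal_checker are hypothetical stand-ins, not PyTorch's actual check_model machinery):

```python
import torch

# Hypothetical sketch: the test harness accepts an equality-checking callable
# instead of an is_sort flag, so sort-specific logic stays out of the generic
# comparison code. The default checker is assigned when none is passed.
def run_and_check(fn, inputs, equal_checker=None):
    if equal_checker is None:
        equal_checker = torch.testing.assert_close  # default: exact comparison
    expected = fn(*inputs)  # reference run
    actual = fn(*inputs)    # stand-in for the run under test
    equal_checker(actual, expected)

def sort_equal_checker(actual, expected):
    # For an unstable sort the values must match, but the indices of duplicate
    # elements may legitimately come back in a different order.
    torch.testing.assert_close(actual.values, expected.values)

x = torch.tensor([3.0, 1.0, 2.0, 1.0])
run_and_check(torch.sort, (x,), equal_checker=sort_equal_checker)
run_and_check(lambda t: t * 2, (x,))  # generic ops still use the default checker
```

Assigning the default inside the harness when the argument is None also matches the suggestion made later in this thread.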
if (stable) return false;

auto type = values.scalar_type();
if (not (type == ScalarType::Long || type == ScalarType::Int || type == ScalarType::Double || type == ScalarType::Float)) return false;
Some of the Windows CI failures:
C:/actions-runner/_work/pytorch/pytorch/aten/src/ATen/native/cpu/SortingKernel.cpp(168): error C3861: 'not': identifier not found
third_party/x86-simd-sort
Could you update x86-simd-sort to the latest commit? This version has a build problem on OpenBSD. See intel/x86-simd-sort#157.
I've made the changes suggested by Raghuveer, and I changed the previous fix for the test. Now it passes a comparator function to determine how to check for equality; does this seem good to you? Let me know if you want further changes, as I'm not sure what best fits PyTorch's testing code.
test/inductor/test_torchinductor.py
@@ -363,12 +363,14 @@ def check_model(
    assert_equal=True,
    check_gradient=False,
    check_has_compiled=True,
    comparator_assert=None,
Perhaps equal_checker is more explicit?
test/inductor/test_torchinductor.py
        exact_dtype=exact_dtype,
    )

    # In case of input mutations, check that inputs are the same
    self.assertEqual(
Should you move this into the default checker too?
I think the input mutation check should be run regardless of which checker is being used, so I moved it outside of the checker functions to avoid duplicating it. Would you prefer it be moved into the comparison functions to simplify that part?
Does it also apply to sort_equal_checker? I saw you have special logic for input checks there.
@@ -319,6 +319,78 @@ def wrapper_noop_set_seed(op, *args, **kwargs):
    )


def default_comparator_assert(
Better to move this into the check_model function and assign equal_check to this default function if equal_check is None.
I've made those modifications, except for moving the input checker into the default comparator. I think that should be run regardless of which comparator is used, so I left it out to avoid duplicating it in each comparator.
Pinging @jgong5: I think my newest commit should address the suggestions you had; could you take a look at it?
You may re-request my review to send me a notification.
test/inductor/test_torchinductor.py
        exact_dtype=exact_dtype,
    )

    # In case of input mutations, check that inputs are the same
    self.assertEqual(
Does it also apply to sort_equal_checker? I saw you have special logic for input checks there.
If that's the case, what this code is doing is checking that the actual and correct indices are 'argsort equivalent', in the sense that if we reorder the original inputs with the indices from the argsort, we get identical results. Currently the only check on the inputs is the shared check that the inputs have not been modified.
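Concretely, a small illustration of that 'argsort equivalent' check (the tensors here are made up for the example):

```python
import torch

# Two index tensors are accepted as "argsort equivalent" when gathering the
# original input with either of them produces the same reordered values.
x = torch.tensor([2.0, 1.0, 1.0, 3.0])

correct = torch.tensor([1, 2, 0, 3])  # one valid set of sort indices for x
actual = torch.tensor([2, 1, 0, 3])   # swaps the two duplicate 1.0 entries

assert not torch.equal(actual, correct)    # the raw indices differ...
assert torch.equal(x[actual], x[correct])  # ...but they reorder x identically
```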
@sterrettm2 Please resolve the conflict. Also, you might have to get the CLA signed...
Force-pushed from 0bf8bdf to 98fbc70
I've got the CLA figured out and rebased on the current branch. I think it should all be good once the tests have been run again.
Hopefully that resolved all of the issues with the CI; could you rerun it?
    }
} else {
    std::vector<scalar_t> tmp_values(dim_size);
    std::vector<index_t> tmp_indices(dim_size);
std::vector zero-fills; it would be better to use c10::SmallBuffer.
Done!
if not is_scalar:
    self.assertEqual(
        ref_inputs[0][actual],
        ref_inputs[0][correct],
This is too loose for a stable sort. Also, can you explain why an equality check fails? What is different between inductor and eager that causes the returned indices to be different?
That comparison definitely isn't right for stable sorts, I agree. Are any sorts with stable=True being tested by that code? I wasn't able to tell.
A simple equality check of the indices can fail if there are repeated values, since x86-simd-sort might give the indices of repeats in a different order compared to the reference. I'm not familiar with the inductor code, but from my inspection it seems to use this Triton kernel
def sort_with_index(
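For reference, the stable/unstable distinction under discussion, as a small sketch:

```python
import torch

x = torch.tensor([1.0, 2.0, 1.0])

# stable=True guarantees that equal elements keep their original relative
# order, so the result is fully determined: values [1, 1, 2], indices [0, 2, 1].
stable = torch.sort(x, stable=True)

# stable=False makes no such promise: indices [0, 2, 1] and [2, 0, 1] are both
# correct answers, and different kernels (e.g. the default sort vs. the
# x86-simd-sort path) may break the tie differently.
unstable = torch.sort(x, stable=False)
```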
A simple equality check of the indices can fail if there are repeated values, since x86-simd-sort might give the indices of repeats in a different order compared to the reference.
Triton is only used for CUDA; on CPU we should be comparing at::sort to at::sort here, so I would expect the indices of repeated values to be the same. Otherwise it suggests the sort is non-deterministic.
Okay, I've determined why this seemed so strange. In the check_model code, reference_in_float defaults to True, so it was converting the fp16 values to fp32 to generate the reference results. XSS sort can sort fp32 values, but not fp16, so the fp16 values were sorted using the normal sorting method while the fp32 reference values were sorted using XSS, which orders equivalent values differently, causing the fp16 test failures seen before. I'll change reference_in_float to False in my next patch.
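As a rough sketch of that divergence (illustrative only; which kernel actually handles each dtype depends on the build and hardware):

```python
import torch

# reference_in_float=True means the reference is computed on an fp32 copy of
# the fp16 input, so the reference and actual results can go through different
# sort implementations (fp32 may take the x86-simd-sort path, fp16 cannot).
x_f16 = torch.tensor([1.0, 2.0, 1.0], dtype=torch.float16)

actual = torch.sort(x_f16, stable=False)             # fp16 path
reference = torch.sort(x_f16.float(), stable=False)  # fp32 path

# The sorted values agree once cast to a common dtype...
torch.testing.assert_close(actual.values.float(), reference.values)
# ...but nothing guarantees the indices of the duplicate 1.0 entries match,
# which is how the fp16 index mismatches described above can arise.
```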
@pytorchbot merge
Merge started. Your change will be merged once all checks pass (ETA 0-4 hours). Learn more about merging in the wiki. Questions? Feedback? Please reach out to the PyTorch DevX Team.
Adds x86-simd-sort as a submodule to accelerate sorting for 32-bit and 64-bit datatypes when AVX2 or AVX512 are available. For contiguous data, this can be over a 10x speedup for large arrays. For discontiguous data, it can give over a 4x speedup with larger arrays. These benchmarks were gathered on a Skylake system (7900x), limited to 8 threads.

<details>
<summary><b>Contiguous Benchmarks</b></summary>

```
float32, normally distributed (in microseconds)
size      Default      AVX2         AVX512       Default/AVX2  Default/AVX512
16        7.150844336  6.886271477  7.132277489  1.038420335   1.002603214
128       9.208030939  8.478154898  7.846915245  1.086089019   1.173458697
1024      37.79037627  23.60707456  16.44122627  1.600807257   2.298513241
10000     714.7355628  203.9921844  105.5683001  3.503739934   6.770361577
100000    8383.074408  721.6333354  465.3709247  11.61680593   18.01374766
1000000   97124.31945  5632.054572  3920.148401  17.24491803   24.77567416
10000000  1161974.907  86070.48988  71533.82301  13.50027063   16.24371323

int32_t, uniformly distributed (in microseconds)
size      Default      AVX2         AVX512       Default/AVX2  Default/AVX512
16        7.203208685  6.92212224   7.014458179  1.040606975   1.026908779
128       8.972388983  8.195516348  7.592543125  1.094792396   1.18173698
1024      32.77489477  23.6874548   15.36617105  1.383639359   2.132925285
10000     607.8824128  193.3402024  99.25090471  3.144107667   6.124703997
100000    523.9384684  608.1836536  442.3166784  0.861480682   1.184532472
1000000   5211.348627  5271.598405  3518.861883  0.988570871   1.480975611
10000000  133853.6263  81463.05084  67852.97394  1.643120714   1.972700952
```

</details>

Note that the int32_t sort is accelerated by FBGEMM's radix sort for larger arrays, but this only handles contiguous data and in one sorting direction.

<details>
<summary><b>Discontiguous Benchmarks</b></summary>

```
float, normal distributed, discontiguous in sorted dimension (in microseconds)
size      Default      AVX2         AVX512       Default/AVX2  Default/AVX512
16        3.836543679  4.011214256  3.84376061   0.956454439   0.99812243
128       5.755310194  5.755723127  4.820394962  0.999928257   1.193949923
1024      49.46946019  24.78790785  15.47874362  1.995709379   3.195960952
10000     665.2505291  236.6165959  143.9490662  2.811512551   4.621429974
100000    4328.002203  1329.001212  818.3516414  3.256582586   5.288682743
1000000   47651.5018   16693.72045  11827.39551  2.854456677   4.028909133
10000000  556655.1288  236252.6258  184215.9828  2.356185998   3.021752621

int32_t, uniformly distributed, discontiguous in sorted dimension (in microseconds)
size      Default      AVX2         AVX512       Default/AVX2  Default/AVX512
16        3.817994356  3.878117442  3.770039797  0.984496837   1.012719908
128       5.578731397  5.577152082  4.716770534  1.000283176   1.182743862
1024      43.3412619   23.61275801  14.55446819  1.835501887   2.977866408
10000     634.3997478  224.4322851  133.9518324  2.826686667   4.736028889
100000    4084.358152  1292.363303  781.7867576  3.16037924    5.22438902
1000000   46262.20465  16608.35284  11367.51817  2.785478192   4.06968381
10000000  541231.9104  235185.1861  180249.9294  2.301301028   3.002674742
```

</details>

Pull Request resolved: pytorch#127936
Approved by: https://github.com/jgong5, https://github.com/peterbell10, https://github.com/sanchitintel
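For anyone who wants to sanity-check numbers like these, a minimal sketch of a comparable measurement with torch.utils.benchmark (the sizes, dtype, and thread count below are assumptions, not the exact harness used for the tables above):

```python
import torch
from torch.utils import benchmark

# Time torch.sort on contiguous and discontiguous 1-D float32 data.
torch.set_num_threads(8)
for n in (1024, 100_000, 1_000_000):
    contiguous = torch.randn(n)
    discontiguous = torch.randn(n, 2)[:, 0]  # stride 2 along the sorted dim
    for label, x in (("contiguous", contiguous), ("discontiguous", discontiguous)):
        timer = benchmark.Timer(stmt="torch.sort(x)", globals={"torch": torch, "x": x})
        print(f"{label:>14} n={n}: {timer.blocked_autorange().median * 1e6:.1f} us")
```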
…rch#127936)" This reverts commit 7e65060.
@pytorchmergebot revert -c nosignal -m "Looks like its the cause of the std::bad_alloc failure"
@pytorchbot successfully started a revert job. Check the current status here.
Reverting PR 127936 failed. Reason: Command
Details for Dev Infra team: Raised by workflow job
…d-sort (#127936) (#141901)
Looks like the original PR caused: #140590
Please see comment: #140590 (comment)
Pull Request resolved: #141901
Approved by: https://github.com/andrewor14, https://github.com/malfet
Hi @sterrettm2, what are the memory requirements of this technique? Thanks
…d-sort (pytorch#127936) (pytorch#141901)
Looks like the original PR caused: pytorch#140590
Please see comment: pytorch#140590 (comment)
Pull Request resolved: pytorch#141901
Approved by: https://github.com/andrewor14, https://github.com/malfet
The amount of memory it uses shouldn't be particularly high: depending on the situation, it either doesn't allocate, or allocates two temporary buffers, both the length of the dimension being sorted. I'm still trying to replicate this failure locally, so I haven't been able to figure out what's going on yet.
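As a rough back-of-the-envelope figure (assuming float32 values and int64 indices; the actual element types depend on the tensor being sorted):

```python
# Worst case described above: two temporary buffers, each the length of the
# sorted dimension, assumed here to hold float32 values and int64 indices.
dim_size = 1_000_000
temp_bytes = dim_size * 4 + dim_size * 8  # values buffer + indices buffer
print(f"~{temp_bytes / 2**20:.1f} MiB of scratch per sorted row")  # ~11.4 MiB
```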
Hi @sterrettm2, may I know the status of this PR? Did you replicate this failure?
I have finally managed to replicate the failure locally, and have begun working on debugging it. Hopefully I should have more news relatively soon.
@sterrettm2 could you please ping me on Teams internally; my email is [email protected]. Thanks! We need to track the status of this issue: #140590
Hello, as far as I can tell these crashes are caused by a compiler bug in certain GCC versions, where it generates MMX instructions without the EMMS instruction (MMX registers alias the x87 floating-point register stack, so skipping EMMS corrupts later floating-point code). I am quite confident this fixes the issue, since the PyTorch bad_alloc trace
Reproduction bad_alloc trace
@mingfeima @leslie-fang-intel @sanchitintel The issue should be fixed now, are you able to reopen this PR? Or would it be better for me to just make a new pull request for this patch?
Hi @sterrettm2, thanks for the detailed analysis. I would also like to understand more about why this issue only happens in Torch Library but was not found by the Torch Core CI; it may help us prevent similar issues from happening again. Do you have any clue?
I am not able to re-open this PR, cc @atalman for whether a new PR is needed here.
So I'm not certain why this problem appears in the nightly builds but not the CI. I've actually not been able to find any version of GCC that generates the
Hi @atalman, could you help with this question, or share any comments on how to add a test case in PyTorch core to guard against this failure?
Okay, the compiler seems to be
@leslie-fang-intel we have seen this kind of problem in NumPy too (see numpy/numpy@b3eed14 for example). I would recommend adding
Adds x86-simd-sort as a submodule to accelerate sorting for 32-bit and 64-bit datatypes when AVX2 or AVX512 are available.
For contiguous data, this can be over a 10x speedup for large arrays. For discontiguous data, it can give over a 4x speedup with larger arrays. These benchmarks were gathered on a Skylake system (7900x), limited to 8 threads.
Contiguous Benchmarks
Note that the int32_t sort is accelerated by FBGEMM's radix sort for larger arrays, but this only handles contiguous data and in one sorting direction.
Discontiguous Benchmarks
cc @jgong5 @mingfeima @XiaobingSuper @sanchitintel @ashokei @jingxu10 @voznesenskym @penguinwu @EikanWang @Guobing-Chen @zhuhaozhe @blzheng @wenzhe-nrv @jiayisunx @ipiszy @yf225 @chenyang78 @kadeng @muchulee8 @ColinPeppler @amjames @desertfire @chauhang @aakhundov @peterbell10