NTT #340
Merged
…code. Updated maxBpw tables. Changed -tune to handle new FFT spec code. Added TABMUL_CHAIN option (it did not deserve to be a WMH code because it has little impact on Z).
…at most 500 kernels, a marker, and 500 more kernels. When prpll tries to add another kernel to the queue, we loop, checking whether the marker has been reached and otherwise performing a lengthy sleep (knowing there are 500 kernels to execute after the marker).
…ktodo files. Autoprimenet needs this change.
…useful FFT spec could be found in tune.txt
…pported by nVidia GPUs.
…ce condition where a launched thread accesses uninitialized variables.
…nup work remains.
…mentations that are not faster (at least on TitanV).
…word / little word range. Carryutil sloppy routines may or may not use this feature in the future.
…r more sloppy carries which gives a tiny performance boost for M31+M61 NTTs.
…TitanV. Explored alternate weakMul and csq implementations.
…need to expose the MIDDLE_CHAINMUL option to the end user.
…iant,TAIL_TRIGS32,TABMUL_CHAIN32 settings.
… command line argument -smallest
…good size L2 cache.
… an RTX 5080 but 1% slower on a Titan V.
…HAIN31=1 which is rarely set.
…e I'll figure out a way to use them profitably in the future.
…han unsigned long
…t. Now we just need to automatically detect the GPU's actual level of CUDA support.
…e fails if file is open.
… does not support the builtins required for variant zero. As a poor workaround, this change lets the user specify NO_ASM to bypass tuning FP64 variant zero.
…od AMD memory layout. I have been unable to find a memory layout nearly as good as the INPLACE=0 layout.
…ve roundoff errors.
…ng on use of fma function on floats). Some minor changes on wording of tune output.
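The queue throttling described in the commit about "at most 500 kernels, a marker, and 500 more kernels" can be illustrated with a toy simulation. This is only a sketch of the idea in plain Python, not prpll's code: the class `ThrottledQueue`, the `MARKER` sentinel, and the `run_per_sleep` parameter are hypothetical names, and the "device" here only makes progress while the host sleeps, which is a simplifying assumption.

```python
from collections import deque

MARKER = object()  # hypothetical sentinel standing in for a queue marker


class ThrottledQueue:
    """Toy model of host-side throttling; names and structure are assumptions."""

    def __init__(self, group=500, run_per_sleep=100):
        self.group = group                  # kernels per group (500 in the commit message)
        self.run_per_sleep = run_per_sleep  # entries the fake device retires per sleep
        self.pending = deque()              # queued kernels and markers, oldest first
        self.since_marker = 0               # kernels queued after the newest marker
        self.marker_pending = False         # True until the device reaches the marker
        self.executed = []                  # kernels the fake device has run, in order
        self.sleeps = 0                     # number of host sleeps
        self.max_pending = 0                # high-water mark of the queue length

    def _sleep(self):
        # Simplification: the device only advances while the host is asleep.
        self.sleeps += 1
        for _ in range(self.run_per_sleep):
            if not self.pending:
                break
            item = self.pending.popleft()
            if item is MARKER:
                self.marker_pending = False
            else:
                self.executed.append(item)

    def enqueue(self, kernel):
        if self.since_marker == self.group:
            # A full group sits behind the current marker, so the device still
            # has work queued after it. Poll the marker, sleeping in between.
            while self.marker_pending:
                self._sleep()
            self.pending.append(MARKER)
            self.marker_pending = True
            self.since_marker = 0
        self.pending.append(kernel)
        self.since_marker += 1
        self.max_pending = max(self.max_pending, len(self.pending))

    def finish(self):
        # Drain whatever is still queued.
        while self.pending:
            self._sleep()


# Small demo with a group size of 5 instead of 500.
demo = ThrottledQueue(group=5, run_per_sleep=3)
for k in range(100):
    demo.enqueue(k)
demo.finish()
print(demo.max_pending, demo.sleeps)
```

In this model the queue never holds more than one group, one marker, and one more group. Because a full group is always queued behind the marker being polled, a long sleep between polls should not leave the device idle.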
Owner
Thank you for this great work!
I think it's ready for merging!