CUDA_permutations_large

next_permutation for 13! and up

NOTE: Updated code! The implementation is now at least 30% faster on compute capability 3.5.

Two tables follow. The first shows the total GPU time for only generating all permutations of an n-element array in local memory. The second shows the total GPU time for generating each permutation, evaluating it, and performing a reduction/scan that saves the optimal answer and a permutation associated with that answer:

You win, and I am going to delete this account. Hooray!

Generate All Permutations of Local Array Timing table:

Total elements	Number of permutations	Tesla K20c GPU time
13	6,227,020,800	12.17s
14	87,178,291,200	188.07s
15	1,307,674,368,000	3115.0s
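
The timings above can be turned into a throughput figure by dividing n! by the measured time. The sketch below is not part of the project; `factorial` and `perms_per_second` are hypothetical helper names used only to show the arithmetic.

```cpp
#include <cassert>
#include <cstdint>

// n! as a 64-bit unsigned integer. Exact only for n <= 20,
// because 21! no longer fits in 64 bits.
std::uint64_t factorial(unsigned n) {
    assert(n <= 20);
    std::uint64_t f = 1;
    for (unsigned i = 2; i <= n; ++i) f *= i;
    return f;
}

// Permutations generated per second, given n and a measured time in seconds.
double perms_per_second(unsigned n, double seconds) {
    assert(seconds > 0.0);
    return static_cast<double>(factorial(n)) / seconds;
}
```

For example, `perms_per_second(13, 12.17)` works out to roughly 512 million permutations per second for the 13-element row.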

Generate All Permutations of Local Array with full Evaluation of Permutation, Scan and reduction table:

Total elements	Num permutations x evaluation steps	Tesla K20c GPU time	Tesla K40c GPU time
13	8,418,932,121,600	17.06s	13.95s
14	136,695,560,601,600	263.1s	216.8s
15	2,353,813,862,400,000	4332s	NA
16	42,849,873,690,624,000	NA	62968s (17.49 hours)

NOTE: The GPU was not overclocked; it runs at the stock 706 MHz.

No CPU times are shown because I do not have that much free time (a run would take many hours even with a parallel CPU implementation).

This is an adjusted version of my CUDA implementation of the STL std::next_permutation() function. It generates all n! permutations of an array in local GPU memory. There are two versions: one only generates the permutations of the array; the other evaluates each generated permutation, calculates the optimal answer and a permutation responsible for that answer, caches them in GPU memory, reduces over all thread blocks, and returns the optimal answer and a corresponding optimal permutation to host memory.
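
This README does not describe how the n! range is divided among GPU threads. One common approach, shown here as plain C++ and only as an assumption about how such a split could work, is to decode a starting rank into a permutation with the factorial number system and then advance with `std::next_permutation`. The names `kth_permutation` and `walk_slice` are hypothetical and do not come from the project.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <vector>

// Hypothetical helper: returns the k-th (0-based) lexicographic permutation
// of {0, 1, ..., n-1} by decoding k in the factorial number system.
std::vector<int> kth_permutation(int n, std::uint64_t k) {
    assert(n >= 1 && n <= 20);  // 20! is the largest factorial that fits in 64 bits
    std::vector<std::uint64_t> fact(n, 1);
    for (int i = 1; i < n; ++i) fact[i] = fact[i - 1] * i;    // fact[i] = i!
    assert(k / fact[n - 1] < static_cast<std::uint64_t>(n));  // requires k < n!

    std::vector<int> pool(n);
    for (int i = 0; i < n; ++i) pool[i] = i;  // unused elements, ascending

    std::vector<int> perm;
    perm.reserve(n);
    for (int i = n - 1; i >= 0; --i) {
        int digit = static_cast<int>(k / fact[i]);  // index of the next element in pool
        k %= fact[i];
        perm.push_back(pool[digit]);
        pool.erase(pool.begin() + digit);
    }
    return perm;
}

// Hypothetical helper: visits up to `count` consecutive permutations starting
// at rank `start`, the way one worker could process its own slice of the range.
// Returns how many permutations were actually visited.
std::uint64_t walk_slice(int n, std::uint64_t start, std::uint64_t count) {
    std::vector<int> perm = kth_permutation(n, start);
    std::uint64_t visited = 0;
    while (visited < count) {
        ++visited;  // a real kernel would evaluate `perm` here
        // next_permutation returns false after the last permutation; stop there.
        if (!std::next_permutation(perm.begin(), perm.end())) break;
    }
    return visited;
}
```

With this scheme, worker `w` out of `W` could call `walk_slice(n, w * chunk, chunk)` with `chunk` chosen so that the slices cover all n! ranks. The `erase` call makes decoding O(n^2), which is negligible when it runs once per slice.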

I would be very interested in seeing Python, Java, Ruby, C#, or other 'higher level' language implementations of the same function, in particular any multithreaded CPU version.

Note: the test evaluation uses a very simple max-DAG test, which can be solved faster than O(n!) by using bitmasks for the dependencies. This version is just for testing; other permutation problems do need every permutation generated and evaluated, and this code does that very quickly for a single-GPU/CPU setup.

For a given value/cost data set associated with each index it is possible that more than one permutation maps to an optimal answer. In such a case the GPU version may return a different permutation than the CPU version, but the value answer should be the same.
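
The tie behaviour can be seen in a small sequential reference. The sketch below is not the project's evaluation: the `score` function, the `Best` struct, and `best_permutation` are hypothetical and exist only to show that several permutations can reach the same optimum, so the permutation returned depends on the order in which candidates are reduced.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <numeric>
#include <vector>

struct Best {
    long long value;        // best score seen
    std::vector<int> perm;  // one permutation that achieves it
    int ties;               // how many permutations achieve it
};

// Hypothetical scoring function, for illustration only:
// position i (0-based) contributes (i + 1) * values[perm[i]].
long long score(const std::vector<int>& perm, const std::vector<long long>& values) {
    long long s = 0;
    for (std::size_t i = 0; i < perm.size(); ++i)
        s += static_cast<long long>(i + 1) * values[perm[i]];
    return s;
}

// Sequential reference: scans all n! permutations in lexicographic order and
// keeps the first permutation that reaches the maximum score.
Best best_permutation(const std::vector<long long>& values) {
    assert(!values.empty());
    std::vector<int> perm(values.size());
    std::iota(perm.begin(), perm.end(), 0);  // identity, the lexicographic start
    Best best{score(perm, values), perm, 0};
    do {
        long long s = score(perm, values);
        if (s > best.value) {
            best = Best{s, perm, 1};  // strictly better: new optimum
        } else if (s == best.value) {
            ++best.ties;              // equal: keep the earlier permutation
        }
    } while (std::next_permutation(perm.begin(), perm.end()));
    return best;
}
```

For `values = {3, 3, 1}` the maximum score is 16, reached by both `{2, 0, 1}` and `{2, 1, 0}`. This sequential scan keeps `{2, 0, 1}`; a parallel reduction that combines partial results in a different order could just as validly return `{2, 1, 0}`.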

For the earlier version, see my other project, CUDA_next_permutation. The full evaluation version only works on GPUs of compute capability 3.0 or higher (GTX 660 or better). It performs better on the Tesla line (or Titan) because of the higher number of 64-bit double-precision units.
