Please follow the instructions in the contribution guidelines. To build with CUDA support:

```sh
python ./setup.py bdist_wheel -- -DAER_THRUST_BACKEND=CUDA -DCMAKE_CUDA_COMPILER=${YOUR_NVIDIA_COMPILER_PATH}
```

To run experiments, please see this repo.
To enable multi-GPU simulation, first set this environment variable:

```sh
export AER_MULTI_GPU=1
```

Setting the following environment variable enables some debug output:

```sh
export QCDEBUG=1
```

| Branch name | Description |
|---|---|
| naive | Naive approach |
| overlap | Proactive data transfer using cuda streams |
| pruning | Pruning transfers of unnecessary zero-valued amplitudes |
| reorder | Reordering operations to enlarge the pruning potential |
| compression | Reducing non-zero amplitude transfers using data compression |
| multi-gpu | Multi-GPU version |
| master | Same as multi-gpu |
| implementation-XXX | Draft branches, can be deleted |
Q-GPU includes the following optimizations:

- Revised `set_num_qubits`: it allocates all memory on the CPU (to store all state amplitudes) and buffers on the GPU (for computation); a sketch follows this list.
- Reconstructed `apply_function`: instead of on-demand data transfer, it iterates through all chunks for each operation. In general, the workflow of this function is (1) copy in, (2) decompression, (3) computation, (4) compression, (5) copy back; see the second sketch below.
- Added a new function `reorder_circuit`: it traverses the circuit in topological order and reorders the execution of operations by greedily selecting an operation that involves the fewest qubits; see the third sketch below.
- Other minor necessary revisions to `QubitVectorChunkContainer`.
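A minimal C++/CUDA sketch of the allocation scheme described for `set_num_qubits`. All names here (`ChunkedState`, `host_amps`, `dev_buf`, the `chunk_qubits` parameter) are illustrative assumptions, not the actual Q-GPU classes:

```cpp
// Illustrative sketch only: hypothetical names, not the actual Q-GPU code.
#include <complex>
#include <vector>
#include <cuda_runtime.h>

using amp_t = std::complex<double>;

struct ChunkedState {
  std::vector<amp_t> host_amps;  // full state vector, kept on the CPU
  amp_t* dev_buf = nullptr;      // working chunk buffer on the GPU
  size_t chunk_amps = 0;         // amplitudes per chunk

  void set_num_qubits(int num_qubits, int chunk_qubits) {
    // All amplitudes live in host memory ...
    host_amps.assign(1ULL << num_qubits, amp_t(0.0, 0.0));
    host_amps[0] = amp_t(1.0, 0.0);  // initialize to |0...0>
    // ... while the GPU only holds a chunk-sized working buffer.
    chunk_amps = 1ULL << chunk_qubits;
    cudaMalloc(reinterpret_cast<void**>(&dev_buf),
               chunk_amps * sizeof(amp_t));
  }

  ~ChunkedState() { cudaFree(dev_buf); }
};
```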
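A hedged sketch of the five-stage loop that `apply_function` runs per operation, continuing the hypothetical `ChunkedState` above. `decompress`, `compress`, and `apply_op` are placeholder declarations for the real kernels; CUDA streams (overlap) and variable-sized compressed transfers (compression) are omitted for clarity:

```cpp
// Placeholders standing in for Q-GPU's actual operation type and kernels.
struct Op;
void decompress(amp_t* buf, size_t n);
void apply_op(amp_t* buf, const Op& op, size_t n);
void compress(amp_t* buf, size_t n);

// Iterate through all chunks for one operation, following the five stages
// described in the list above.
void apply_function(ChunkedState& st, const Op& op, size_t num_chunks) {
  for (size_t c = 0; c < num_chunks; ++c) {
    amp_t* host_chunk = st.host_amps.data() + c * st.chunk_amps;
    cudaMemcpy(st.dev_buf, host_chunk, st.chunk_amps * sizeof(amp_t),
               cudaMemcpyHostToDevice);           // (1) copy in
    decompress(st.dev_buf, st.chunk_amps);        // (2) decompression
    apply_op(st.dev_buf, op, st.chunk_amps);      // (3) computation
    compress(st.dev_buf, st.chunk_amps);          // (4) compression
    cudaMemcpy(host_chunk, st.dev_buf, st.chunk_amps * sizeof(amp_t),
               cudaMemcpyDeviceToHost);           // (5) copy back
  }
}
```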
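A standalone sketch of the greedy reordering idea: a Kahn-style topological traversal that always schedules the ready operation touching the fewest qubits. `OpNode` and its predecessor encoding are assumptions for illustration, not the real circuit data structure:

```cpp
// Standalone sketch; OpNode and the predecessor encoding are assumptions.
#include <algorithm>
#include <vector>

struct OpNode {
  std::vector<int> qubits;  // qubits this operation acts on
  std::vector<int> preds;   // indices of operations it depends on
};

std::vector<int> reorder_circuit(const std::vector<OpNode>& ops) {
  const int n = static_cast<int>(ops.size());
  std::vector<int> indegree(n), order, ready;
  for (int i = 0; i < n; ++i) {
    indegree[i] = static_cast<int>(ops[i].preds.size());
    if (indegree[i] == 0) ready.push_back(i);
  }
  while (!ready.empty()) {
    // Greedy choice: among the ready operations, schedule the one
    // involving the fewest qubits.
    auto it = std::min_element(ready.begin(), ready.end(),
        [&](int a, int b) {
          return ops[a].qubits.size() < ops[b].qubits.size();
        });
    int cur = *it;
    ready.erase(it);
    order.push_back(cur);
    // Release operations whose last unscheduled predecessor was `cur`.
    for (int j = 0; j < n; ++j)
      if (std::find(ops[j].preds.begin(), ops[j].preds.end(), cur) !=
              ops[j].preds.end() &&
          --indegree[j] == 0)
        ready.push_back(j);
  }
  return order;  // a valid topological order biased toward small operations
}
```

Per the branch table above, the point of biasing toward few-qubit operations is presumably to enlarge the pruning potential, i.e. to keep more chunks zero-valued for longer so their transfers can be skipped.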
Known issue: the local vector `buf2chunk` needs to be corrected, because in Q-GPU the layout of chunks on the GPU differs from the original layout on the CPU. As a result, Q-GPU currently outputs wrong simulation results (final amplitudes) for large circuits. This does not affect the performance results (i.e., the execution time) of the naive, overlap, pruning, and reorder branches, but it does affect the performance results of the compression branch. Hopefully this will be fixed by October.