feat: tracegen on GPU #2034

jonathanpwang · 2025-08-21T21:44:52Z

No description provided.

* histogram ready and tested * var_range tracegen * half of the test with dummy chip * new tracegen with warp primitives * Buffer -> Matrix * DeviceProofInput * test passed * -1 allocation

* feat: RangeTupleChecker tracegen * Addressed PR comments * Addressed more PR comments --------- Co-authored-by: Christian Altamirano <bdiehs>

* chore: WIP on is_equal CUDA tracegen impl * feat: CUDA tracegen of is_equal, WIP: tests * fix: extensive testing for is_equal * fix: reimplemented is_equal as a helper __device__ function * feat: added is_equal_array, WIP: tests * WIP: fixing tests * feat: is_equal and is_equal_array with tests * chore: resolved pr comments * chore: resolved iterator PR comments * fix: inputs to subairs are now Fp, fixed is_equal and is_zero subair structure

- Moves cuda/kernels/backend into crates/backend by a) moving CUDA files into crates/backend/cuda and b) moving wrapper .rs files to their appropriate locations in crates/backend/src
- Moves cuda/kernels/tracegen into crates/tracegen analogously to above
- Add .clang-format file and a script to generate .clangd, which allows devs to use Intellisense properly
- Move cuda/fields and cuda/utils into crates/backend/cuda/include and crates/backend/src respectively
- Add tracegen documentation (i.e. README and comments)

* auipc chip init * cpu test passed * auipc trace generated * review-based changes

* feat: deviceBuffer fill zero * all histograms are zeroed

* new-execution-e4 > new-execution * return back benchmarks * test fix * kitchen_sink fix * disable halo2 in tests * larger machine for kitchen_sink

* chore: WIP on is_equal CUDA tracegen impl * feat: CUDA tracegen of is_equal, WIP: tests * fix: extensive testing for is_equal * fix: reimplemented is_equal as a helper __device__ function * feat: added is_equal_array, WIP: tests * WIP: fixing tests * feat: is_equal and is_equal_array with tests * chore: resolved pr comments * WIP: poseidon2 cuda tracegen * WIP: poseidon2 cuda tracegen - missing tests * fix: minor reference fixes * WIP: poseidon2 tracegen, need to debug, and fix linear layers and find what round constants are used * fix, wip: changed memory layout for tracegen * feat: GPU tracegen matches CPU tracegen - need to cleanup code for PR * feat: poseidon2 cuda tracegen + tests * chore: cleaned up code * chore: resolved PR comments with weak definitions * chore: hardcoded constants into backend header * chore: renamed header to constants * chore: changed poseidon2 tracegen input to be rowmaj, refactored test

* Cuda tracegen + tests for Rv32HintStore * fix: MemoryWriteAuxAdapter * fill zero for some fields * reviewer comments

* WIP: jalr adapter * feat: finished JALR core and adapter, waiting on GPU harness for tests * chore: renaming fix * Cuda tracegen + tests for Rv32HintStore * fix: MemoryWriteAuxAdapter * fill zero for some fields * resolved PR --------- Co-authored-by: Arayi <[email protected]>

* feat: init Rv32MultAdapterChip tracegen * feat: mul chip + tests (passing) * refactor: move mod.rs + tests into one file * refactor: address pr comments * chore: remove stray constant * refactor: incorporate cuda.rs * refactor: move auipc test to cuda.rs

* WIP: jalr adapter * feat: finished JALR core and adapter, waiting on GPU harness for tests * chore: renaming fix * feat: jal_lui core + adapter cuda tracegen, no tests * fix: minor import fixes * Cuda tracegen + tests for Rv32HintStore * fix: MemoryWriteAuxAdapter * fill zero for some fields * fix: removed mem aux * chore: resolved PR comments --------- Co-authored-by: Arayi <[email protected]>

* Cuda tracegen + tests for Rv32HintStore * fix: MemoryWriteAuxAdapter * fill zero for some fields

* feat: init rv32im less than tracegen impl (broken) * feat: full less_than and base alu tracegen, test passing * chore: make tests less verbose * fix: use set_trace_buffer_height for dense chip * feat: init rv32im shift tracegen (broken) * refactor: remove duplicate imports * fix: make test compile * refactor: readability * feat: use various opcodes in test * feat: test both SLT and SLTU * chore: revert trace comparing util function * chore: clean up unused imports * chore: delete old test file * feat: use generic test harness * feat: use generic test harness * chore: clean up unused imports * feat: bring over less than test * refactor: some pr comments + proper test setup * refactor: pr comments * refactor: pr comments * chore: revert mul fixes, do them in other branch * refactor: minor pr comments * fix: debug * fix: try old method for zeroing out extra rows again * fix: pass width argument to zero out extra rows correctly * fix: make test actually match CPU equivalent * chore: remove excessive imports * feat: rv32im ALU chip tracegen (#104) * feat: init rv32im alu chip tracegen * fix: make test actually match CPU equivalent

* WIP: jalr adapter * feat: finished JALR core and adapter, waiting on GPU harness for tests * chore: renaming fix * wip: blt tracegen, kernel done * Cuda tracegen + tests for Rv32HintStore * fix: MemoryWriteAuxAdapter * fill zero for some fields * feat: blt tracegen + tests * chore: style * feat: beq tracegen + tests * chore: minor fixes * fix: removed memwrite memeread aux adapters * chore: resolved PR comments, optimized code a bit --------- Co-authored-by: Arayi <[email protected]>

* wip: mulh tracegen * feat: mulh tracegen + tests * chore: small fix * fix: minor import fixes from OpenVM * fix: pass in tuple size by value * fix: rangetuple * test: remove rangetuple * fix: initialize run to zero * fix: pass in range tuple by value

* Cuda tracegen + tests for Rv32DivRem * review comments

* chore: lint workflow CI + codespell ignore file * chore: codespell fixes pt. 1 * chore: clippy fixes pt. 1 * chore: lints.yml revert * chore: lints workflow working directory * chore: rust fmt and cargo clippy fixes * chore: rebase lints * chore: linter needs to run on GPU-compatible device * chore: custom GPU image needs to install codespell * chore: try non-custom image * chore: try docker install * chore: separate clippy to different job * chore: rename lints jobs

* cuda tracegen + tests for castf * cuda tracegen + test for native branch eq

* fix: write random values to tester for mul * feat: use rangetuple checker in fill_trace_row * fix: debug * fix: revert debug * fix: fill range checker with zeros * fix: make tester.execute actually match CPU * fix: pass range tuple sizes by value * fix: [debug] revert range tuple checker, all arguments as u32 * fix: [debug] revert d_records to u8 * fix: use device buffer for range tuple sizes * fix: use UInt2 for range tuple sizes * chore: lint

* Cuda tracegen + tests for Rv32HintStore * fix: MemoryWriteAuxAdapter * fill zero for some fields * cuda tracegen and tests for load sign extend * cuda tracegen + tests for loadstore * fix the loadstore tests and add volatile constructor for GpuTestBuilder * fix merge * fix lints * cuda tracegen for rv32divrem * fix size buffer * reviewer comments * remove unnecessary diff * review comments * cuda tracegen + tests for castf * remove unnecessary dependency * cuda tracegen + test for native branch eq * lints * feat: init native field arithmetic tracegen impl * feat: impl alu_native_adapter tracegen (broken) * lints * fix: make trace match * refactor: readability * refactor: lint * chore: remove unnecessary import * refactor: format with fmt * refactor: minor pr comments * feat: impl generic MemoryWriteAuxRecord instead of bytes --------- Co-authored-by: Arayi <[email protected]>

Towards INT-4744. This moves the files from `tracegen/{cuda/,}src/system` to somewhere in openvm. This currently compiles both with and without `--features cuda` (well on a machine without cuda it won't compile with `--features cuda`). The tests don't compile, but it's because the testing utilities are missing.

---

todo list:
- [x] move stuff to `cuda/system/`
- [x] make tests compile (except the "undefined test utility" thing)
- [x] feature gate cuda dependencies
- [x] feature gate build script

---------

Co-authored-by: Alexander Golovanov <Sample text>
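The "feature gate build script" item can be illustrated with a minimal sketch. It relies on the documented Cargo behaviour of exposing each enabled feature to build scripts as `CARGO_FEATURE_<NAME>`. The helper names and the `cuda` directory below are assumptions made for illustration and are not taken from the OpenVM build scripts.

```rust
use std::env;

/// Cargo exposes every enabled feature to a build script as an environment
/// variable named `CARGO_FEATURE_<NAME>`, so the `cuda` feature appears as
/// `CARGO_FEATURE_CUDA`.
fn cuda_feature_enabled() -> bool {
    env::var_os("CARGO_FEATURE_CUDA").is_some()
}

/// Returns the directives a build script would print for Cargo.
/// The `cuda` directory name is a hypothetical placeholder.
fn build_directives(cuda_enabled: bool) -> Vec<String> {
    let mut directives = vec!["cargo:rerun-if-changed=build.rs".to_string()];
    if cuda_enabled {
        // Only track (and, in a real script, compile) CUDA sources when the
        // feature is on, so machines without a CUDA toolkit skip this step.
        directives.push("cargo:rerun-if-changed=cuda".to_string());
    }
    directives
}

/// What `fn main()` of a `build.rs` would do with the helpers above.
fn run_build_script() {
    for directive in build_directives(cuda_feature_enabled()) {
        println!("{directive}");
    }
}
```

In an actual `build.rs`, the body of `run_build_script` would live in `fn main()`, and the CUDA compilation step would sit inside the same `if cuda_enabled` branch.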

Towards INT-4700. Migrates `GpuTestBuilder` together with the related CUDA testing files, and adds a new trait, `TestBuilder`, for use in the `set_and_execute` functions to make them general. Moves the `memory`, `phantom`, and `public values` system GPU tests into the corresponding CPU test files so that they can share some code.
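A rough sketch of the idea of a shared `TestBuilder` trait follows. Only the names `TestBuilder` and `set_and_execute` come from the description above. The trait methods, the two toy testers, and the single ADD operation are hypothetical simplifications and do not reflect the real OpenVM signatures.

```rust
/// Hypothetical, simplified stand-in for the `TestBuilder` trait.
/// The method names are illustrative only.
trait TestBuilder {
    fn write(&mut self, addr: usize, value: u32);
    fn read(&self, addr: usize) -> u32;
    fn execute_add(&mut self, dst: usize, lhs: usize, rhs: usize);
}

/// Toy CPU tester backed by a plain vector.
struct CpuTester {
    mem: Vec<u32>,
}

/// Toy "GPU" tester: same semantics, but it also counts simulated launches.
struct GpuTester {
    mem: Vec<u32>,
    launches: usize,
}

impl TestBuilder for CpuTester {
    fn write(&mut self, addr: usize, value: u32) {
        self.mem[addr] = value;
    }
    fn read(&self, addr: usize) -> u32 {
        self.mem[addr]
    }
    fn execute_add(&mut self, dst: usize, lhs: usize, rhs: usize) {
        let sum = self.mem[lhs].wrapping_add(self.mem[rhs]);
        self.mem[dst] = sum;
    }
}

impl TestBuilder for GpuTester {
    fn write(&mut self, addr: usize, value: u32) {
        self.mem[addr] = value;
    }
    fn read(&self, addr: usize) -> u32 {
        self.mem[addr]
    }
    fn execute_add(&mut self, dst: usize, lhs: usize, rhs: usize) {
        // Stands in for a kernel launch; no real device work happens here.
        self.launches += 1;
        let sum = self.mem[lhs].wrapping_add(self.mem[rhs]);
        self.mem[dst] = sum;
    }
}

/// One test body, written once against the trait and reused by both testers.
fn set_and_execute<T: TestBuilder>(tester: &mut T, a: u32, b: u32) -> u32 {
    tester.write(1, a);
    tester.write(2, b);
    tester.execute_add(0, 1, 2);
    tester.read(0)
}
```

Writing the test body against a trait means the CPU and GPU test files can call the same helper and differ only in which tester they construct.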

Relates to INT-4744

towards INT-4744 --------- Co-authored-by: Alexander Golovanov <Sample text>

Relates to INT-4744 update workflow and test function to run riscv test vectors on gpu --------- Co-authored-by: Jonathan Wang <[email protected]>

Relates to INT-4744

- [x] guest-libs/ruint
- [x] guest-libs/keccak
- [x] guest-libs/sha
- [x] guest-libs/k256
- [x] guest-libs/p256
- [x] guest-libs/pairing
- [x] guest-libs/ff_derive
- [x] guest-libs/verify_stark

---------

Co-authored-by: stephenh-axiom-xyz <[email protected]> Co-authored-by: Jonathan Wang <[email protected]>

Resolves INT-4845

Resolves INT-4699 --------- Co-authored-by: Jonathan Wang <[email protected]>

Closes INT-4844

- [x] extensions
- [x] sdk
- [x] benchmarks
- [x] guest-libs

---------

Co-authored-by: Jonathan Wang <[email protected]>

codspeed-hq · 2025-08-24T07:52:30Z

CodSpeed WallTime Performance Report

Merging #2034 will degrade performance by 11.29%

Comparing `feat/tracegen-gpu` (be3bbb9) with `main` (fd362bc)

⚠️ Unknown Walltime execution environment detected

Using the Walltime instrument on standard Hosted Runners will lead to inconsistent data.

For the most accurate results, we recommend using CodSpeed Macro Runners: bare-metal machines fine-tuned for performance measurement consistency.

Summary

⚡ 1 improvement
❌ 1 regression
✅ 28 untouched benchmarks

⚠️ Please fix the performance issues or acknowledge them on CodSpeed.

Benchmarks breakdown

| | Benchmark | `BASE` | `HEAD` | Change |
|---|---|---|---|---|
| ❌ | `benchmark_execute[bubblesort]` | 21.5 ms | 24.2 ms | -11.29% |
| ⚡ | `benchmark_execute[sha256_iter]` | 60.5 ms | 54.3 ms | +11.26% |

Resolves INT-4847 --------- Co-authored-by: Alexander Golovanov <Sample text> Co-authored-by: Jonathan Wang <[email protected]>

Copilot

Pull Request Overview

This PR introduces GPU-accelerated trace generation capabilities to the OpenVM zero-knowledge proof system, enabling CUDA-based trace generation for various system components.

Key changes:

Adds CUDA kernel bindings and GPU chip implementations for system components (memory, phantom, public values, etc.)
Implements hybrid CPU/GPU chip architecture with specialized GPU trace generation
Adds comprehensive test infrastructure for GPU vs CPU trace equivalence validation
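As an illustration of what a GPU vs CPU trace equivalence check can look like, here is a minimal sketch. It assumes both traces have already been copied to host memory as row-major `u32` slices of equal length. The function name and the layout are assumptions for this example, not the helpers used in the PR.

```rust
/// Compares two row-major traces of the same shape and returns the
/// (row, column) of the first differing cell, or `None` if they match.
fn first_mismatch(cpu: &[u32], gpu: &[u32], width: usize) -> Option<(usize, usize)> {
    assert!(width > 0, "trace width must be non-zero");
    assert_eq!(cpu.len(), gpu.len(), "traces must have the same number of cells");
    assert_eq!(cpu.len() % width, 0, "trace length must be a multiple of the width");
    cpu.iter()
        .zip(gpu)
        .position(|(a, b)| a != b)
        // Convert the flat index back into (row, column) coordinates.
        .map(|i| (i / width, i % width))
}
```

Reporting the coordinates of the first mismatch, rather than a bare pass/fail, makes it easier to see which column of which row a kernel filled incorrectly, for example padding rows that were not zeroed.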

Reviewed Changes

Copilot reviewed 157 out of 392 changed files in this pull request and generated 2 comments.

| File | Description |
|---|---|
| extensions/algebra/circuit/build.rs | Adds CUDA build configuration for algebra circuit extension |
| extensions/algebra/circuit/Cargo.toml | Adds CUDA feature dependencies and build requirements |
| crates/vm/src/utils/stark_utils.rs | Implements conditional GPU/CPU engine selection based on CUDA feature |
| crates/vm/src/system/*/tests.rs | Adds GPU trace generation tests with CPU equivalence validation |
| crates/vm/src/system/cuda/* | Implements GPU-accelerated system components and CUDA kernel interfaces |
| crates/vm/src/arch/testing/* | Adds GPU testing infrastructure and hybrid chip test harnesses |
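Conditional engine selection based on a Cargo feature is commonly written with `#[cfg(feature = "...")]` type aliases. The sketch below shows that general pattern only. `CpuEngine`, `GpuEngine`, `SelectedEngine`, and the `Engine` trait are hypothetical names and are not the types defined in `stark_utils.rs`.

```rust
/// Hypothetical engine markers used only for this illustration.
#[allow(dead_code)]
struct CpuEngine;
#[allow(dead_code)]
struct GpuEngine;

/// Hypothetical trait giving each engine a printable name.
trait Engine {
    const NAME: &'static str;
}

impl Engine for CpuEngine {
    const NAME: &'static str = "cpu";
}

impl Engine for GpuEngine {
    const NAME: &'static str = "gpu";
}

// Exactly one alias is compiled in, depending on the `cuda` feature.
#[cfg(feature = "cuda")]
type SelectedEngine = GpuEngine;
#[cfg(not(feature = "cuda"))]
type SelectedEngine = CpuEngine;

/// Callers use the alias and never name the concrete engine directly.
fn selected_engine_name() -> &'static str {
    <SelectedEngine as Engine>::NAME
}
```

Because the choice is made at compile time, downstream code that refers only to the alias needs no changes when the crate is built with or without `--features cuda`.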

Comments suppressed due to low confidence (2)

crates/vm/src/system/cuda/memory.rs:29

[nitpick] The comment mentions taking 'extra care not to use memory we don't own' but doesn't specify what precautions are actually taken. Consider documenting the specific safety measures or ownership constraints.

    pub boundary: BoundaryChipGPU,

crates/vm/src/system/cuda/boundary.rs:202

The TODO suggests avoiding a copy operation which could impact performance. Consider implementing a zero-copy approach or using move semantics to eliminate the unnecessary copy.

    use rand::Rng;

crates/vm/src/system/cuda/merkle_tree/mod.rs

crates/vm/src/arch/testing/cuda.rs

github-actions · 2025-08-24T09:12:22Z

| group | app.proof_time_ms | app.cycles | app.cells_used | leaf.proof_time_ms | leaf.cycles | leaf.cells_used |
|---|---|---|---|---|---|---|
| verify_fibair | (-1810 [-86.1%]) 293 | 322,610 | (-16691166 [-89.0%]) 2,058,654 | - | - | - |
| fibonacci | (-1347 [-56.9%]) 1,020 | 1,500,210 | (-50444275 [-97.9%]) 1,060,232 | - | - | - |
| regex | (-4497 [-60.0%]) 2,997 | 4,108,586 | (-151328688 [-91.9%]) 13,406,304 | - | - | - |
| ecrecover | (-224 [-16.1%]) 1,169 | 140,497 | (-6591598 [-74.3%]) 2,275,056 | - | - | - |
| pairing | (-2228 [-57.7%]) 1,632 | 1,882,939 | (-75111671 [-76.0%]) 23,722,622 | - | - | - |
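The cells of the form `(delta [percent]) value` appear to report the change relative to the baseline, with the new value after the parentheses. That reading is inferred from the numbers and is not documented on this page. Under that assumption, the baseline and percentage can be recomputed with a short sketch such as the following, where both function names are hypothetical.

```rust
/// Recovers the baseline from a `(delta [pct%]) head` cell,
/// assuming `head = baseline + delta`.
fn baseline(delta: i64, head: i64) -> i64 {
    head - delta
}

/// Recomputes the percentage change relative to the recovered baseline.
fn percent_change(delta: i64, head: i64) -> f64 {
    100.0 * delta as f64 / baseline(delta, head) as f64
}
```

For the `verify_fibair` row, a delta of -1810 ms and a new value of 293 ms give a baseline of 2103 ms, and -1810 / 2103 rounds to the reported -86.1%.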

Commit: be3bbb9

Benchmark Workflow

gaxiom and others added 30 commits August 19, 2025 08:56

feat: Restructuring backend + tracegen (#59)

8e34cc4

feat: RangeChecker tracegen (#61)

799462c

* histogram ready and tested * var_range tracegen * half of the test with dummy chip * new tracegen with warp primitives * Buffer -> Matrix * DeviceProofInput * test passed * -1 allocation

feat: RangeTupleChecker tracegen (#63)

79264ca

* feat: RangeTupleChecker tracegen * Addressed PR comments * Addressed more PR comments --------- Co-authored-by: Christian Altamirano <bdiehs>

feat: BitwiseOperationLookupChip CUDA trace generation and tests (#66)

6fd76bd

feat: encoder SubAir CUDA tracegen + tests (#69)

65f4c13

feat: auipc chip & test (#74)

5cff136

* auipc chip init * cpu test passed * auipc trace generated * review-based changes

feat: deviceBuffer fills zero (#78)

b50e0cd

* feat: deviceBuffer fill zero * all histograms are zeroed

chore: new-execution-e4 -> new-execution (#82)

83a6a4b

* new-execution-e4 > new-execution * return back benchmarks * test fix * kitchen_sink fix * disable halo2 in tests * larger machine for kitchen_sink

fix: all columns should be filled in IsEqualArray tracegen (#83)

ae4a03f

fix: write array indexing (#87)

6afcee3

fix: use aux len for mem write adapter (#88)

c9e2ce6

feat: GPU tracegen test harness (#80)

cfd5f0c

feat: system Poseidon2 GPU tracegen and buffer (#85)

6f89c57

feat: cuda tracegen + tests for Rv32HintStore (#92)

c38ac88

* Cuda tracegen + tests for Rv32HintStore * fix: MemoryWriteAuxAdapter * fill zero for some fields * reviewer comments

feat: cuda tracegen+tests for rv32 loadstore and load_sign_extend (#95)

092dcb3

* Cuda tracegen + tests for Rv32HintStore * fix: MemoryWriteAuxAdapter * fill zero for some fields

feat: cuda tracegen + tests for divrem (#108)

b216694

* Cuda tracegen + tests for Rv32DivRem * review comments

fix: VariableRangeChecker number of bins should be buffer length (#109)

c97a386

feat: cuda tracegen + tests for native castf and branch eq (#112)

346863c

* cuda tracegen + tests for castf * cuda tracegen + test for native branch eq

jonathanpwang and others added 19 commits August 19, 2025 08:59

chore: migrate rv32im CUDA code (#2006)

3ae22da

chore: migrate bigint CUDA code (#2021)

e7cb1d6

chore: migrate sha256 CUDA code (#2009)

4b71e7c

chore: migrate keccak256 to gpu (#2024)

07ab284

Relates to INT-4744

chore(cuda): migrate native extensions (#2008)

4183e3e

towards INT-4744 --------- Co-authored-by: Alexander Golovanov <Sample text>

chore: migrate algebra, ecc, pairing extensions (#2030)

0bb4350

chore: migrate SDK CUDA extensions (#2029)

fb00c98

chore: migrate riscv test vectors (#2032)

9fdb765

Relates to INT-4744 update workflow and test function to run riscv test vectors on gpu --------- Co-authored-by: Jonathan Wang <[email protected]>

chore(ci): add runson runner and cleanup yml (#2035)

3e9dda0

fix: test_moduli_setup (#2039)

d13c692

Resolves INT-4845

chore: CUDA benchmarks and other CI (#2033)

f6788b8

Resolves INT-4699 --------- Co-authored-by: Jonathan Wang <[email protected]>

ci: consolidate extension tests on one gpu machine (#2042)

8a75e23

chore(cuda): cpu gpu types cleanup (#2041)

ba16ce7

Closes INT-4844

- [x] extensions
- [x] sdk
- [x] benchmarks
- [x] guest-libs

---------

Co-authored-by: Jonathan Wang <[email protected]>

Merge branch 'main' into feat/tracegen-gpu

c67dcd8

chore(sdk): re-export DefaultStarkEngine (#2045)

f0fb658

chore(ci): consolidate guest lib tests to less gpu runners (#2044)

a0d2a6b

chore(cuda): native recursion tests (#2043)

5604de4

Resolves INT-4847 --------- Co-authored-by: Alexander Golovanov <Sample text> Co-authored-by: Jonathan Wang <[email protected]>

jonathanpwang marked this pull request as ready for review August 24, 2025 08:59

Copilot AI review requested due to automatic review settings August 24, 2025 08:59

Copilot AI reviewed Aug 24, 2025

crates/vm/src/system/cuda/merkle_tree/mod.rs

crates/vm/src/arch/testing/cuda.rs

ci: add cuda feature to benchmarks (#2047)

be3bbb9

jonathanpwang merged commit 7cce464 into main Aug 24, 2025
46 checks passed

jonathanpwang deleted the feat/tracegen-gpu branch August 24, 2025 16:44

jonathanpwang restored the feat/tracegen-gpu branch August 24, 2025 16:59

10 participants
