Conversation

@jonathanpwang
Contributor

No description provided.

gaxiom and others added 30 commits August 19, 2025 08:56
* histogram ready and tested

* var_range tracegen

* half of the test with dummy chip

* new tracegen with warp primitives

* Buffer -> Matrix

* DeviceProofInput

* test passed

* -1 allocation
* feat: RangeTupleChecker tracegen

* Addressed PR comments

* Addressed more PR comments

---------

Co-authored-by: Christian Altamirano <bdiehs>
* chore: WIP on is_equal CUDA tracegen impl

* feat: CUDA tracegen of is_equal, WIP: tests

* fix: extensive testing for is_equal

* fix: reimplemented is_equal as a helper __device__ function

* feat: added is_equal_array, WIP: tests

* WIP: fixing tests

* feat: is_equal and is_equal_array with tests

* chore: resolved pr comments

* chore: resolved iterator PR comments

* fix: inputs to subairs are now Fp, fixed is_equal and is_zero subair structure
- Moves cuda/kernels/backend into crates/backend by a) moving CUDA files into crates/backend/cuda and b) moving wrapper .rs files to their appropriate locations in crates/backend/src
- Moves cuda/kernels/tracegen into crates/tracegen analogously to above
- Add .clang-format file and a script to generate .clangd, which allows devs to use Intellisense properly
- Move cuda/fields and cuda/utils into crates/backend/cuda/include and crates/backend/src respectively
- Add tracegen documentation (i.e. README and comments)
* auipc chip init

* cpu test passed

* auipc trace generated

* review-based changes
* feat: deviceBuffer fill zero

* all histograms are zeroed
* new-execution-e4 > new-execution

* return back benchmarks

* test fix

* kitchen_sink fix

* disable halo2 in tests

* larger machine for kitchen_sink
* chore: WIP on is_equal CUDA tracegen impl

* feat: CUDA tracegen of is_equal, WIP: tests

* fix: extensive testing for is_equal

* fix: reimplemented is_equal as a helper __device__ function

* feat: added is_equal_array, WIP: tests

* WIP: fixing tests

* feat: is_equal and is_equal_array with tests

* chore: resolved pr comments

* WIP: poseidon2 cuda tracegen

* WIP: poseidon2 cuda tracegen - missing tests

* fix: minor reference fixes

* WIP: poseidon2 tracegen, need to debug, and fix linear layers and find what round constants are used

* fix, wip: changed memory layout for tracegen

* feat: GPU tracegen matches CPU tracegen - need to cleanup code for PR

* feat: poseidon2 cuda tracegen + tests

* chore: cleaned up code

* chore: resolved PR comments with weak definitions

* chore: hardcoded constants into backend header

* chore: renamed header to constants

* chore: changed poseidon2 tracegen input to be rowmaj, refactored test
* Cuda tracegen + tests for Rv32HintStore

* fix: MemoryWriteAuxAdapter

* fill zero for some fields

* reviewer comments
* WIP: jalr adapter

* feat: finished JALR core and adapter, waiting on GPU harness for tests

* chore: renaming fix

* Cuda tracegen + tests for Rv32HintStore

* fix: MemoryWriteAuxAdapter

* fill zero for some fields

* resolved PR

---------

Co-authored-by: Arayi <[email protected]>
* feat: init Rv32MultAdapterChip tracegen

* feat: mul chip + tests (passing)

* refactor: move mod.rs + tests into one file

* refactor: address pr comments

* chore: remove stray constant

* refactor: incorporate cuda.rs

* refactor: move auipc test to cuda.rs
* WIP: jalr adapter

* feat: finished JALR core and adapter, waiting on GPU harness for tests

* chore: renaming fix

* feat: jal_lui core + adapter cuda tracegen, no tests

* fix: minor import fixes

* Cuda tracegen + tests for Rv32HintStore

* fix: MemoryWriteAuxAdapter

* fill zero for some fields

* fix: removed mem aux

* chore: resolved PR comments

---------

Co-authored-by: Arayi <[email protected]>
* Cuda tracegen + tests for Rv32HintStore

* fix: MemoryWriteAuxAdapter

* fill zero for some fields
* feat: init rv32im less than tracegen impl (broken)

* feat: full less_than and base alu tracegen, test passing

* chore: make tests less verbose

* fix: use set_trace_buffer_height for dense chip

* feat: init rv32im shift tracegen (broken)

* refactor: remove duplicate imports

* fix: make test compile

* refactor: readability

* feat: use various opcodes in test

* feat: test both SLT and SLTU

* chore: revert trace comparing util function

* chore: clean up unused imports

* chore: delete old test file

* feat: use generic test harness

* feat: use generic test harness

* chore: clean up unused imports

* feat: bring over less than test

* refactor: some pr comments + proper test setup

* refactor: pr comments

* refactor: pr comments

* chore: revert mul fixes, do them in other branch

* refactor: minor pr comments

* fix: debug

* fix: try old method for zeroing out extra rows again

* fix: pass width argument to zero out extra rows correctly

* fix: make test actually match CPU equivalent

* chore: remove excessive imports

* feat: rv32im ALU chip tracegen (#104)

* feat: init rv32im alu chip tracegen

* fix: make test actually match CPU equivalent
* WIP: jalr adapter

* feat: finished JALR core and adapter, waiting on GPU harness for tests

* chore: renaming fix

* wip: blt tracegen, kernel done

* Cuda tracegen + tests for Rv32HintStore

* fix: MemoryWriteAuxAdapter

* fill zero for some fields

* feat: blt tracegen + tests

* chore: style

* feat: beq tracegen + tests

* chore: minor fixes

* fix: removed memwrite memread aux adapters

* chore: resolved PR comments, optimized code a bit

---------

Co-authored-by: Arayi <[email protected]>
* wip: mulh tracegen

* feat: mulh tracegen + tests

* chore: small fix

* fix: minor import fixes from OpenVM

* fix: pass in tuple size by value

* fix: rangetuple

* test: remove rangetuple

* fix: initialize run to zero

* fix: pass in range tuple by value
* Cuda tracegen + tests for Rv32DivRem

* review comments
* chore: lint workflow CI + codespell ignore file

* chore: codespell fixes pt. 1

* chore: clippy fixes pt. 1

* chore: lints.yml revert

* chore: lints workflow working directory

* chore: rust fmt and cargo clippy fixes

* chore: rebase lints

* chore: linter needs to run on GPU-compatible device

* chore: custom GPU image needs to install codespell

* chore: try non-custom image

* chore: try docker install

* chore: separate clippy to different job

* chore: rename lints jobs
* cuda tracegen + tests for castf

* cuda tracegen + test for native branch eq
* fix: write random values to tester for mul

* feat: use rangetuple checker in fill_trace_row

* fix: debug

* fix: revert debug

* fix: fill range checker with zeros

* fix: make tester.execute actually match CPU

* fix: pass range tuple sizes by value

* fix: [debug] revert range tuple checker, all arguments as u32

* fix: [debug] revert d_records to u8

* fix: use device buffer for range tuple sizes

* fix: use UInt2 for range tuple sizes

* chore: lint
* Cuda tracegen + tests for Rv32HintStore

* fix: MemoryWriteAuxAdapter

* fill zero for some fields

* cuda tracegen and tests for load sign extend

* cuda tracegen + tests for loadstore

* fix the loadstore tests and add volatile constructor for GpuTestBuilder

* fix merge

* fix lints

* cuda tracegen for rv32divrem

* fix size buffer

* reviewer comments

* remove unnecessary diff

* review comments

* cuda tracegen + tests for castf

* remove unnecessary dependency

* cuda tracegen + test for native branch eq

* lints

* feat: init native field arithmetic tracegen impl

* feat: impl alu_native_adapter tracegen (broken)

* lints

* fix: make trace match

* refactor: readability

* refactor: lint

* chore: remove unnecessary import

* refactor: format with fmt

* refactor: minor pr comments

* feat: impl generic MemoryWriteAuxRecord instead of bytes

---------

Co-authored-by: Arayi <[email protected]>
jonathanpwang and others added 19 commits August 19, 2025 08:59
Towards INT-4744.

This moves the files from `tracegen/{cuda/,}src/system` into openvm. It
currently compiles both with and without `--features cuda` (although on a
machine without CUDA it won't compile with `--features cuda`). The tests
don't compile yet, but only because the testing utilities are missing.

---
todo list:

- [x] move stuff to `cuda/system/`
- [x] make tests compile (except the "undefined test utility" thing)
- [x] feature gate cuda dependencies
- [x] feature gate build script
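
For readers unfamiliar with Cargo feature gating, the pattern behind the checklist above can be sketched roughly as follows. This is a minimal illustration, not the code from this PR: the names `Backend` and `selected_backend` are hypothetical, and it assumes a `cuda` feature is declared in the crate's `Cargo.toml` (for example `[features] cuda = []`).

```rust
// Hypothetical sketch of feature-gated backend selection.
// `Backend` and `selected_backend` are illustrative names only;
// they are not taken from the OpenVM codebase.

/// Which trace generation backend this build was compiled for.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Backend {
    Cpu,
    Cuda,
}

/// Compiled only when the crate is built with `--features cuda`.
#[cfg(feature = "cuda")]
pub fn selected_backend() -> Backend {
    Backend::Cuda
}

/// Fallback compiled when the `cuda` feature is disabled, so the crate
/// still builds on machines without a CUDA toolchain.
#[cfg(not(feature = "cuda"))]
pub fn selected_backend() -> Backend {
    Backend::Cpu
}
```

Exactly one of the two definitions exists in any given build, so callers do not need their own `cfg` attributes. A build script can be gated in a similar spirit by making the CUDA compilation dependencies optional and tying them to the same feature.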

---------

Co-authored-by: Alexander Golovanov <Sample text>
Towards INT-4700
Migrates GpuTestBuilder along with the related CUDA testing files, and adds
a new trait called `TestBuilder` for use in `set_and_execute` functions to
make them general.
Moves the `memory`, `phantom`, and `public values` system GPU tests into the
corresponding CPU test files so that they can share some code.
Towards INT-4744

---------

Co-authored-by: Alexander Golovanov <Sample text>
Relates to INT-4744

Update the workflow and test function to run the RISC-V test vectors on GPU.

---------

Co-authored-by: Jonathan Wang <[email protected]>
Relates to INT-4744

- [x] - guest-libs/ruint
- [x] - guest-libs/keccak
- [x] - guest-libs/sha
- [x] - guest-libs/k256
- [x] - guest-libs/p256
- [x] - guest-libs/pairing
- [x] - guest-libs/ff_derive
- [x] - guest-libs/verify_stark

---------

Co-authored-by: stephenh-axiom-xyz <[email protected]>
Co-authored-by: Jonathan Wang <[email protected]>
Resolves INT-4699

---------

Co-authored-by: Jonathan Wang <[email protected]>
Closes INT-4844

- [x] extensions 
- [x] sdk
- [x] benchmarks 
- [x] guest-libs

---------

Co-authored-by: Jonathan Wang <[email protected]>
@codspeed-hq

codspeed-hq bot commented Aug 24, 2025

CodSpeed WallTime Performance Report

Merging #2034 will degrade performances by 11.29%

Comparing feat/tracegen-gpu (be3bbb9) with main (fd362bc)

⚠️ Unknown Walltime execution environment detected

Using the Walltime instrument on standard Hosted Runners will lead to inconsistent data.

For the most accurate results, we recommend using CodSpeed Macro Runners: bare-metal machines fine-tuned for performance measurement consistency.

Summary

⚡ 1 improvements
❌ 1 regressions
✅ 28 untouched benchmarks

⚠️ Please fix the performance issues or acknowledge them on CodSpeed.

Benchmarks breakdown

| Benchmark | BASE | HEAD | Change |
|---|---|---|---|
| benchmark_execute[bubblesort] | 21.5 ms | 24.2 ms | -11.29% |
| benchmark_execute[sha256_iter] | 60.5 ms | 54.3 ms | +11.26% |

Resolves INT-4847

---------

Co-authored-by: Alexander Golovanov <Sample text>
Co-authored-by: Jonathan Wang <[email protected]>
@jonathanpwang jonathanpwang marked this pull request as ready for review August 24, 2025 08:59
Copilot AI review requested due to automatic review settings August 24, 2025 08:59
Contributor

Copilot AI left a comment

Pull Request Overview

This PR introduces GPU-accelerated trace generation capabilities to the OpenVM zero-knowledge proof system, enabling CUDA-based trace generation for various system components.

Key changes:

  • Adds CUDA kernel bindings and GPU chip implementations for system components (memory, phantom, public values, etc.)
  • Implements hybrid CPU/GPU chip architecture with specialized GPU trace generation
  • Adds comprehensive test infrastructure for GPU vs CPU trace equivalence validation
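
As a rough illustration of what a GPU vs CPU trace equivalence check involves, the sketch below compares two row-major trace matrices cell by cell and reports the first difference. It is not the test harness added in this PR: `Trace`, `TraceDiff`, and `compare_traces` are hypothetical names, and field elements are assumed to be stored as canonical `u32` values already copied back to the host.

```rust
// Hypothetical sketch: none of these types come from the OpenVM codebase.

/// A row-major trace matrix whose cells are canonical u32 field values.
#[derive(Debug, Clone, PartialEq, Eq)]
pub struct Trace {
    width: usize,
    values: Vec<u32>,
}

impl Trace {
    /// Builds a trace; panics if `values` does not form whole rows.
    pub fn new(width: usize, values: Vec<u32>) -> Self {
        assert!(width > 0, "trace width must be positive");
        assert!(values.len() % width == 0, "values must fill whole rows");
        Trace { width, values }
    }

    pub fn height(&self) -> usize {
        self.values.len() / self.width
    }
}

/// The first difference found between a CPU trace and a GPU trace.
#[derive(Debug, PartialEq, Eq)]
pub enum TraceDiff {
    /// Shapes are reported as (height, width).
    Shape { cpu: (usize, usize), gpu: (usize, usize) },
    Cell { row: usize, col: usize, cpu: u32, gpu: u32 },
}

/// Returns `Ok(())` when both traces have the same shape and contents.
pub fn compare_traces(cpu: &Trace, gpu: &Trace) -> Result<(), TraceDiff> {
    let cpu_shape = (cpu.height(), cpu.width);
    let gpu_shape = (gpu.height(), gpu.width);
    if cpu_shape != gpu_shape {
        return Err(TraceDiff::Shape { cpu: cpu_shape, gpu: gpu_shape });
    }
    for (i, (&c, &g)) in cpu.values.iter().zip(gpu.values.iter()).enumerate() {
        if c != g {
            // Convert the flat index back into (row, col) for the report.
            return Err(TraceDiff::Cell {
                row: i / cpu.width,
                col: i % cpu.width,
                cpu: c,
                gpu: g,
            });
        }
    }
    Ok(())
}
```

Reporting the row and column of the first mismatch, rather than a bare boolean, tends to make it easier to tell a kernel bug from a padding or layout issue such as rows that were not zeroed.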

Reviewed Changes

Copilot reviewed 157 out of 392 changed files in this pull request and generated 2 comments.

| File | Description |
|---|---|
| `extensions/algebra/circuit/build.rs` | Adds CUDA build configuration for algebra circuit extension |
| `extensions/algebra/circuit/Cargo.toml` | Adds CUDA feature dependencies and build requirements |
| `crates/vm/src/utils/stark_utils.rs` | Implements conditional GPU/CPU engine selection based on CUDA feature |
| `crates/vm/src/system/*/tests.rs` | Adds GPU trace generation tests with CPU equivalence validation |
| `crates/vm/src/system/cuda/*` | Implements GPU-accelerated system components and CUDA kernel interfaces |
| `crates/vm/src/arch/testing/*` | Adds GPU testing infrastructure and hybrid chip test harnesses |

Comments suppressed due to low confidence (2)

crates/vm/src/system/cuda/memory.rs:29

  • [nitpick] The comment mentions taking 'extra care not to use memory we don't own' but doesn't specify what precautions are actually taken. Consider documenting the specific safety measures or ownership constraints.
    pub boundary: BoundaryChipGPU,

crates/vm/src/system/cuda/boundary.rs:202

  • The TODO suggests avoiding a copy operation which could impact performance. Consider implementing a zero-copy approach or using move semantics to eliminate the unnecessary copy.
    use rand::Rng;

@github-actions

| group | app.proof_time_ms | app.cycles | app.cells_used | leaf.proof_time_ms | leaf.cycles | leaf.cells_used |
|---|---|---|---|---|---|---|
| verify_fibair | (-1810 [-86.1%]) 293 | 322,610 | (-16691166 [-89.0%]) 2,058,654 | - | - | - |
| fibonacci | (-1347 [-56.9%]) 1,020 | 1,500,210 | (-50444275 [-97.9%]) 1,060,232 | - | - | - |
| regex | (-4497 [-60.0%]) 2,997 | 4,108,586 | (-151328688 [-91.9%]) 13,406,304 | - | - | - |
| ecrecover | (-224 [-16.1%]) 1,169 | 140,497 | (-6591598 [-74.3%]) 2,275,056 | - | - | - |
| pairing | (-2228 [-57.7%]) 1,632 | 1,882,939 | (-75111671 [-76.0%]) 23,722,622 | - | - | - |

Commit: be3bbb9

Benchmark Workflow

@jonathanpwang jonathanpwang merged commit 7cce464 into main Aug 24, 2025
46 checks passed
@jonathanpwang jonathanpwang deleted the feat/tracegen-gpu branch August 24, 2025 16:44
@jonathanpwang jonathanpwang restored the feat/tracegen-gpu branch August 24, 2025 16:59
10 participants