FEAT: Adding Garak Remote Datasets #2063

Merged: rlundeen2 merged 3 commits into microsoft:main from rlundeen2:rlundeen2-garak-dataset-loader on Jun 22, 2026

@rlundeen2

Contributor

PyRIT is porting garak's probing techniques, and many of those techniques depend on reference data (package-name registries, system-prompt libraries, audio jailbreak clips) published under the garak-llm HuggingFace org. This PR adds native seed-dataset loaders for that data so garak techniques can be wired into PyRIT scenarios without bespoke download code.

It introduces a shared _GarakRemoteDataset base (reusing the existing _RemoteDatasetLoader primitives) plus ten registered loaders: seven package-hallucination registries (pypi, npm, crates, rubygems, dart, perl, raku), two system-prompt libraries, and the audio_achilles_heel clip set. All dataset names are prefixed garak_, each row maps to one SeedPrompt with source metadata preserved, and the change ships unit tests (mocked HF data), docs/notebook updates, and the citation.
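The row-to-seed mapping described above can be pictured with a small, self-contained sketch. It does not use PyRIT's actual classes: `SeedSketch`, `rows_to_seeds`, the `text` column name, the resulting `garak_pypi` name, and the `garak-llm/example-registry` source string are all hypothetical names chosen for illustration, and the rows are mocked rather than downloaded.

```python
from dataclasses import dataclass, field


@dataclass
class SeedSketch:
    """Hypothetical stand-in for a seed prompt; not PyRIT's SeedPrompt class."""

    value: str
    dataset_name: str
    metadata: dict = field(default_factory=dict)


def rows_to_seeds(rows, *, short_name, source_repo, text_column="text"):
    """Map each row to exactly one seed and record where the row came from."""
    # Every dataset name carries the garak_ prefix.
    dataset_name = f"garak_{short_name}"
    return [
        SeedSketch(
            value=row[text_column],
            dataset_name=dataset_name,
            metadata={"source": source_repo, "row_index": index},
        )
        for index, row in enumerate(rows)
    ]


# Mocked rows, in the spirit of the PR's unit tests (no network access).
mock_rows = [{"text": "requests"}, {"text": "numpy"}]
seeds = rows_to_seeds(
    mock_rows,
    short_name="pypi",
    source_repo="garak-llm/example-registry",  # hypothetical source identifier
)
```

Keeping the prefixing and metadata handling in one shared helper is what lets each concrete loader reduce to a name and a source, which is roughly the role the PR assigns to the shared base class.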

rlundeen2 and others added 2 commits on June 21, 2026 at 13:22
Adds remote seed-dataset loaders for the datasets hosted under the garak-llm HuggingFace org so garak techniques are easier to use in PyRIT scenarios.

Co-authored-by: Copilot <[email protected]>

@romanlutz left a comment

Contributor

Fair to assume you ran the integration tests for all these?

@rlundeen2

Contributor Author

> Fair to assume you ran the integration tests for all these?

I had, but I noticed they run close to the timeout, so I'm adding a param to fetch a smaller number of examples for the tests.

The garak package registries are very large lists (npm ~3.3M rows, pypi ~555k rows); building every row as a SeedPrompt can exceed the e2e per-test timeout even with cached data. Add an optional keyword-only max_examples parameter (default None = full list) to the garak base loader, honor it in both the text and audio fetch paths, and cap npm/pypi at 6 examples in the e2e suite.

Co-authored-by: Copilot <[email protected]>
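The capping behavior in this commit message can be sketched as follows. This is a minimal illustration, not the loader's actual code: `build_seeds` and `fake_registry` are hypothetical names, and the seed is reduced to a plain dict. Only the keyword-only `max_examples` parameter and its `None` default come from the commit message.

```python
from itertools import islice


def build_seeds(rows, *, max_examples=None):
    """Build one seed per row, stopping early when max_examples is set.

    max_examples is keyword-only; None (the default) means the full list.
    """
    if max_examples is not None and max_examples < 0:
        raise ValueError("max_examples must be non-negative or None")
    # islice(rows, None) yields every row, so one code path covers both cases.
    capped = islice(rows, max_examples)
    return [{"value": row} for row in capped]


def fake_registry(size):
    """Hypothetical lazy row source standing in for a very large registry."""
    for i in range(size):
        yield f"package-{i}"


# Only the first 6 rows are ever generated, so the cap keeps the test fast.
few = build_seeds(fake_registry(3_300_000), max_examples=6)
# Without the cap, every row is materialized.
everything = build_seeds(fake_registry(10))
```

Applying the cap before any seed objects are constructed is the point of the change: the cost of the call then scales with `max_examples` rather than with the size of the registry.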
Comment thread doc/code/datasets/1_loading_datasets.py
@rlundeen2 added this pull request to the merge queue on Jun 22, 2026
@github-merge-queue (bot) removed this pull request from the merge queue due to failed status checks on Jun 22, 2026
@rlundeen2 added this pull request to the merge queue on Jun 22, 2026
Merged via the queue into microsoft:main with commit 2068f66 on Jun 22, 2026
53 checks passed
@rlundeen2 deleted the rlundeen2-garak-dataset-loader branch on June 22, 2026 at 21:37
3 participants