rlundeen2 · 2026-06-22T22:07:11Z

DatasetConfiguration as-is worked fairly well for the first scenarios, but it ran into several issues as we added more: garak.encoding needed seedPrompts; jailbreak needed two types of datasets (both the harms and the datasets themselves); psychosocial had datasets tied to techniques; and other Garak scenarios need different, more flexible types. This PR refactors DatasetConfiguration so that it is a better fit for these diverse scenarios.

One big addition: if a dataset isn't in memory, DatasetConfiguration now uses the DatasetProvider to fetch it and put it in memory. This means you don't need the load_default_datasets initializer on startup unless you want to preload things.

This refactors DatasetConfiguration from a single catch-all class into a generic base plus typed subclasses (DatasetObjectiveConfiguration, DatasetPromptConfiguration, and DatasetAttackConfiguration, the default most scenarios use). Constraints are now expressed through composable validators that run against the fully resolved dataset before max_dataset_size sampling, so they describe the dataset itself rather than a sampled subset. Each resolved dataset carries a DatasetSourceKind (inline vs. memory), which lets a scenario require or forbid inline seeds — useful for CLI flags such as --objectives. Non-emptiness is enforced as a default validator on every configuration, and typed subclasses layer a seed-type check on top.
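The validator flow described above can be illustrated with a minimal sketch. It is not taken from this PR's code: the field names on `ResolvedDataset`, the enum values, the validator signature, and the helpers `require_non_empty`, `forbid_inline`, and `validate_then_sample` are all assumptions made for illustration. The only point it demonstrates is ordering: validators see the full resolved dataset, and `max_dataset_size` sampling happens afterwards.

```python
import random
from dataclasses import dataclass
from enum import Enum
from typing import Callable, Optional, Sequence


class DatasetSourceKind(Enum):
    # Mirrors the "inline vs. memory" split in the description; values are illustrative.
    INLINE = "inline"
    MEMORY = "memory"


@dataclass(frozen=True)
class ResolvedDataset:
    # Hypothetical shape: the real class may carry different fields.
    seeds: tuple[str, ...]
    source_kind: DatasetSourceKind


# Assumed validator contract: raise ValueError when a constraint is violated.
Validator = Callable[[ResolvedDataset], None]


def require_non_empty(resolved: ResolvedDataset) -> None:
    """Stand-in for the default non-emptiness validator."""
    if not resolved.seeds:
        raise ValueError("dataset resolved to zero seeds")


def forbid_inline(resolved: ResolvedDataset) -> None:
    """Stand-in for a validator that rejects inline seeds."""
    if resolved.source_kind is DatasetSourceKind.INLINE:
        raise ValueError("inline seeds are not allowed for this scenario")


def validate_then_sample(
    resolved: ResolvedDataset,
    validators: Sequence[Validator],
    max_dataset_size: Optional[int] = None,
    rng: Optional[random.Random] = None,
) -> list[str]:
    # Validators run first, against the complete resolved dataset.
    for validate in validators:
        validate(resolved)

    # Sampling happens only after every validator has passed.
    seeds = list(resolved.seeds)
    if max_dataset_size is not None and len(seeds) > max_dataset_size:
        source = rng if rng is not None else random
        seeds = source.sample(seeds, max_dataset_size)
    return seeds
```

Because validation precedes sampling, a constraint such as "no inline seeds" holds for the dataset as configured, not just for whichever subset a given run happens to draw.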

Resolution is also more predictable: when a configured dataset name is not yet in memory and auto_fetch is set, it is fetched from the registered SeedDatasetProvider on demand, and any failure now raises loudly with the chained root cause instead of silently warning. The redundant pre-pass that eagerly pre-populated memory has been removed in favor of this on-demand path, and all scenario callers and tests have been migrated to the new methods (get_seeds_async, get_seed_attack_groups_async, get_attack_groups_by_dataset_async). Legacy getters remain but emit deprecation warnings. All scenario unit tests pass.
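As a rough sketch of the on-demand path described above (again, not this PR's implementation): the `memory` dict, the `fetch_async` callable, the `DatasetFetchError` class, and the `resolve_dataset_async` helper are hypothetical stand-ins for the real memory store and registered provider. It shows the two behaviors the description calls out: fetch only when the name is missing and `auto_fetch` is set, and re-raise failures with the root cause chained via `raise ... from`.

```python
import asyncio
from typing import Awaitable, Callable


class DatasetFetchError(RuntimeError):
    """Hypothetical error type for a dataset that could not be resolved."""


async def resolve_dataset_async(
    name: str,
    memory: dict[str, list[str]],
    fetch_async: Callable[[str], Awaitable[list[str]]],
    auto_fetch: bool = True,
) -> list[str]:
    # Already in memory: no provider call is made.
    if name in memory:
        return memory[name]

    if not auto_fetch:
        raise DatasetFetchError(
            f"dataset {name!r} is not in memory and auto_fetch is disabled"
        )

    try:
        seeds = await fetch_async(name)
    except Exception as exc:
        # Fail loudly and keep the original exception as __cause__.
        raise DatasetFetchError(f"failed to fetch dataset {name!r}") from exc

    # Cache the result so later resolutions skip the provider.
    memory[name] = seeds
    return seeds
```

With this shape, a caller that catches the error can still inspect `__cause__` to see whether the provider failed because of a network problem, a missing dataset, or something else.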

Restructure DatasetConfiguration into a generic base plus typed subclasses (DatasetObjectiveConfiguration, DatasetPromptConfiguration, DatasetAttackConfiguration). Constraints are expressed through composable validators run against the fully resolved dataset (pre-sampling), and the resolved set carries a DatasetSourceKind (inline vs memory) so validators can require or forbid inline seeds. Auto-fetch missing datasets from the provider on demand and raise loudly (with chained root cause) instead of silently warning. Non-emptiness is enforced as a default validator on every config. Co-authored-by: Copilot <[email protected]>

Scenarios now fetch their datasets from the registered provider on demand the first time they run, so the load_default_datasets initializer is no longer required for everyday runs or in the recommended default config. Remove it from the default config examples and per-run scanner command examples, and add a note explaining it is now an optional preload step (useful for warming memory or populating a database for offline use). Co-authored-by: Copilot <[email protected]>

Add a dataset_names tuple to ResolvedDataset plus a restrict_dataset_names validator so scenarios that pair techniques with specific datasets (e.g. psychosocial, jailbreak) can constrain which datasets a configuration may draw from. Replace the EXPLICIT_SEED_GROUPS_KEY dict-overload with an explicit inline-vs-named split. Named resolution (_collect_named_seeds_async) now returns only real dataset names, and get_seeds_async resolves inline data through a dedicated branch, removing the reserved-key collision guards. Inline data keeps a single honest INLINE_DATASET_NAME label in the by-dataset views (used for atomic-attack naming), so user-facing labels read 'technique_inline' instead of leaking the old sentinel. Co-authored-by: Copilot <[email protected]>
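One way a `restrict_dataset_names` validator could work is sketched below. The factory signature, the error message, and the reduced `ResolvedDataset` shape (only `seeds` and `dataset_names`) are assumptions for illustration, not the code in this commit, and the dataset names in the usage lines are made up.

```python
from dataclasses import dataclass
from typing import Callable, Iterable


@dataclass(frozen=True)
class ResolvedDataset:
    # Hypothetical, reduced shape: only the fields this sketch needs.
    seeds: tuple[str, ...]
    dataset_names: tuple[str, ...]


def restrict_dataset_names(allowed: Iterable[str]) -> Callable[[ResolvedDataset], None]:
    """Build a validator that rejects datasets outside the allowed set."""
    allowed_set = frozenset(allowed)

    def _validate(resolved: ResolvedDataset) -> None:
        unexpected = sorted(set(resolved.dataset_names) - allowed_set)
        if unexpected:
            raise ValueError(
                f"datasets {unexpected} are not permitted; allowed: {sorted(allowed_set)}"
            )

    return _validate


# Usage with made-up dataset names:
validator = restrict_dataset_names(["example_allowed_a", "example_allowed_b"])
validator(ResolvedDataset(seeds=("s1",), dataset_names=("example_allowed_a",)))  # passes
```

Returning a closure keeps the validator composable with others that share the same one-argument contract.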

Remove DatasetObjectiveConfiguration, DatasetPromptConfiguration, and the generic DatasetConfiguration[SeedT] flat resolver (get_seeds_async). None had production callers; the objectives-only constraint that motivated the typed subclasses is enforced at runtime per-technique, not at the dataset level. DatasetConfiguration is now a plain (non-generic) base of resolution/fetch/validate/sample plumbing plus the deprecated legacy getters, and DatasetAttackConfiguration remains the one concrete resolver scenarios use. Tests that exercised the removed flat resolver now drive the same base resolution through DatasetAttackConfiguration. Co-authored-by: Copilot <[email protected]>

rlundeen2 changed the title ~~MAINT: DatasetConfiguration Refactor~~ FEAT: DatasetConfiguration Refactor Jun 22, 2026

rlundeen2 mentioned this pull request Jun 22, 2026

[BREAKING] FEAT: Standardize Jailbreak scenario defaults #2045

Open

rlundeen2 and others added 3 commits June 22, 2026 15:44

FEAT: DatasetConfiguration Refactor #2071

rlundeen2 wants to merge 4 commits into microsoft:main from rlundeen2:rlundeen2-redesign-dataset-configuration
