FEAT: DatasetConfiguration Refactor #2071

Open
rlundeen2 wants to merge 4 commits into microsoft:main from rlundeen2:rlundeen2-redesign-dataset-configuration

rlundeen2 commented Jun 22, 2026

DatasetConfiguration as-is worked fairly well for the first scenarios, but it ran into several issues as we added more: garak.encoding needed seedPrompts; jailbreak needed two types of datasets (both the harms and the datasets themselves); psychosocial had datasets tied to techniques; and other Garak scenarios need different, more flexible types. This PR refactors DatasetConfiguration so that it is a better fit for these diverse scenarios.

One big addition: if a dataset isn't in memory, the configuration uses the DatasetProvider to fetch it and put it in memory. This means you don't need the load_default_datasets initializer on startup unless you want to preload things.

This refactors DatasetConfiguration from a single catch-all class into a generic base plus typed subclasses (DatasetObjectiveConfiguration, DatasetPromptConfiguration, and DatasetAttackConfiguration, the default most scenarios use). Constraints are now expressed through composable validators that run against the fully resolved dataset before max_dataset_size sampling, so they describe the dataset itself rather than a sampled subset. Each resolved dataset carries a DatasetSourceKind (inline vs. memory), which lets a scenario require or forbid inline seeds — useful for CLI flags such as --objectives. Non-emptiness is enforced as a default validator on every configuration, and typed subclasses layer a seed-type check on top.
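
To illustrate why validators run before sampling, here is a minimal, self-contained sketch. It is not the PR's code: `SourceKind`, `Resolved`, `require_non_empty`, `forbid_inline`, `min_size`, and `validate_then_sample` are hypothetical stand-ins chosen for this example, and seeds are modeled as plain strings.

```python
import random
from dataclasses import dataclass
from enum import Enum
from typing import Callable, List, Optional, Sequence, Tuple


class SourceKind(Enum):
    """Hypothetical stand-in for the inline-vs-memory source marker."""
    INLINE = "inline"
    MEMORY = "memory"


@dataclass(frozen=True)
class Resolved:
    """Hypothetical stand-in for a fully resolved dataset (seeds as strings)."""
    seeds: Tuple[str, ...]
    source_kind: SourceKind


Validator = Callable[[Resolved], None]


def require_non_empty(resolved: Resolved) -> None:
    # Default check applied to every configuration in this sketch.
    if not resolved.seeds:
        raise ValueError("dataset resolved to zero seeds")


def forbid_inline(resolved: Resolved) -> None:
    # Example of a scenario-specific constraint on the source kind.
    if resolved.source_kind is SourceKind.INLINE:
        raise ValueError("inline seeds are not allowed for this scenario")


def min_size(n: int) -> Validator:
    # Validator factory: composable with any other validator.
    def _check(resolved: Resolved) -> None:
        if len(resolved.seeds) < n:
            raise ValueError(f"expected at least {n} seeds, got {len(resolved.seeds)}")
    return _check


def validate_then_sample(
    resolved: Resolved,
    validators: Sequence[Validator] = (),
    max_dataset_size: Optional[int] = None,
    rng: Optional[random.Random] = None,
) -> List[str]:
    # Validators see the full resolved dataset, never the sampled subset.
    for validator in (require_non_empty, *validators):
        validator(resolved)
    seeds = list(resolved.seeds)
    if max_dataset_size is not None and len(seeds) > max_dataset_size:
        seeds = (rng or random.Random()).sample(seeds, max_dataset_size)
    return seeds
```

In this sketch, `min_size(3)` passes for a four-seed dataset even when `max_dataset_size=2`, because the check describes the dataset rather than the sample drawn from it.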

Resolution is also more predictable: when a configured dataset name is not yet in memory and auto_fetch is set, it is fetched from the registered SeedDatasetProvider on demand, and any failure now raises loudly with the chained root cause instead of silently warning. The redundant pre-pass that eagerly pre-populated memory has been removed in favor of this on-demand path, and all scenario callers and tests have been migrated to the new methods (get_seeds_async, get_seed_attack_groups_async, get_attack_groups_by_dataset_async). Legacy getters remain but emit deprecation warnings. All scenario unit tests pass.
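
The on-demand path described above can be sketched roughly as follows. This is an illustration under stated assumptions, not the PR's implementation: `resolve_dataset_async`, `DatasetResolutionError`, the dict used as "memory", and the `fetch` callable are all hypothetical, and the real provider interface may differ.

```python
import asyncio
from typing import Awaitable, Callable, Dict, List

# Hypothetical fetch signature: dataset name in, list of seed strings out.
FetchFn = Callable[[str], Awaitable[List[str]]]


class DatasetResolutionError(RuntimeError):
    """Hypothetical error type used only in this sketch."""


async def resolve_dataset_async(
    name: str,
    *,
    memory: Dict[str, List[str]],
    fetch: FetchFn,
    auto_fetch: bool = True,
) -> List[str]:
    # Fast path: the dataset is already in memory.
    if name in memory:
        return memory[name]
    if not auto_fetch:
        raise DatasetResolutionError(
            f"dataset {name!r} is not in memory and auto_fetch is disabled"
        )
    try:
        seeds = await fetch(name)
    except Exception as exc:
        # Fail loudly and keep the root cause attached via __cause__.
        raise DatasetResolutionError(f"failed to fetch dataset {name!r}") from exc
    memory[name] = seeds  # cache so later runs skip the fetch
    return seeds
```

Using `raise ... from exc` keeps the provider's original exception reachable as `__cause__`, so a traceback shows both the resolution failure and what triggered it.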

Restructure DatasetConfiguration into a generic base plus typed subclasses
(DatasetObjectiveConfiguration, DatasetPromptConfiguration,
DatasetAttackConfiguration). Constraints are expressed through composable
validators run against the fully resolved dataset (pre-sampling), and the
resolved set carries a DatasetSourceKind (inline vs memory) so validators can
require or forbid inline seeds. Auto-fetch missing datasets from the provider
on demand and raise loudly (with chained root cause) instead of silently
warning. Non-emptiness is enforced as a default validator on every config.

Co-authored-by: Copilot <[email protected]>
rlundeen2 changed the title from "MAINT: DatasetConfiguration Refactor" to "FEAT: DatasetConfiguration Refactor" Jun 22, 2026
rlundeen2 and others added 3 commits June 22, 2026 15:44
Scenarios now fetch their datasets from the registered provider on demand
the first time they run, so the load_default_datasets initializer is no
longer required for everyday runs or in the recommended default config.
Remove it from the default config examples and per-run scanner command
examples, and add a note explaining it is now an optional preload step
(useful for warming memory or populating a database for offline use).

Co-authored-by: Copilot <[email protected]>
Add a dataset_names tuple to ResolvedDataset plus a restrict_dataset_names
validator so scenarios that pair techniques with specific datasets (e.g.
psychosocial, jailbreak) can constrain which datasets a configuration may
draw from.
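
A name-restricting validator of this kind might look like the sketch below. The names `ResolvedDataset` and `restrict_dataset_names` mirror the commit message, but the fields, signature, and error type shown here are assumptions made for illustration, not the PR's actual definitions.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, Tuple


@dataclass(frozen=True)
class ResolvedDataset:
    """Simplified stand-in; the real class likely carries more fields."""
    seeds: Tuple[str, ...]
    dataset_names: Tuple[str, ...]


def restrict_dataset_names(
    allowed: Iterable[str],
) -> Callable[[ResolvedDataset], None]:
    # Build a validator that rejects any dataset outside the allowed set.
    allowed_set = frozenset(allowed)

    def _validate(resolved: ResolvedDataset) -> None:
        unexpected = sorted(set(resolved.dataset_names) - allowed_set)
        if unexpected:
            raise ValueError(
                f"datasets not allowed here: {unexpected}; allowed: {sorted(allowed_set)}"
            )

    return _validate
```

A scenario could then build one validator per technique, for example `restrict_dataset_names(["dataset_a", "dataset_b"])` (placeholder names), and run it with its other validators.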

Replace the EXPLICIT_SEED_GROUPS_KEY dict-overload with an explicit
inline-vs-named split. Named resolution (_collect_named_seeds_async) now
returns only real dataset names, and get_seeds_async resolves inline data
through a dedicated branch, removing the reserved-key collision guards. Inline
data keeps a single honest INLINE_DATASET_NAME label in the by-dataset views
(used for atomic-attack naming), so user-facing labels read 'technique_inline'
instead of leaking the old sentinel.
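
As a rough sketch of the inline-vs-named split: the value `"inline"` for `INLINE_DATASET_NAME` is inferred from the `'technique_inline'` label above and is an assumption, as are the helper names `group_seeds_by_dataset` and `atomic_attack_name` and the choice that inline seeds take precedence over named datasets.

```python
from typing import Dict, List, Mapping, Optional, Sequence

# Assumed value, inferred from the 'technique_inline' label.
INLINE_DATASET_NAME = "inline"


def group_seeds_by_dataset(
    *,
    inline_seeds: Optional[Sequence[str]] = None,
    named: Optional[Mapping[str, Sequence[str]]] = None,
) -> Dict[str, List[str]]:
    # Inline branch: a single group under one honest label, no reserved
    # key mixed into the dict of real dataset names.
    if inline_seeds is not None:
        return {INLINE_DATASET_NAME: list(inline_seeds)}
    # Named branch: keys are only real dataset names.
    return {name: list(seeds) for name, seeds in (named or {}).items()}


def atomic_attack_name(technique: str, dataset_name: str) -> str:
    # Hypothetical naming helper: '<technique>_<dataset>'.
    return f"{technique}_{dataset_name}"
```

Because the inline case never shares a dictionary with named datasets in this sketch, a real dataset that happened to be called `inline` could not collide with a reserved key during named resolution.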

Co-authored-by: Copilot <[email protected]>
Remove DatasetObjectiveConfiguration, DatasetPromptConfiguration, and the
generic DatasetConfiguration[SeedT] flat resolver (get_seeds_async). None had
production callers; the objectives-only constraint that motivated the typed
subclasses is enforced at runtime per-technique, not at the dataset level.

DatasetConfiguration is now a plain (non-generic) base of resolution/fetch/
validate/sample plumbing plus the deprecated legacy getters, and
DatasetAttackConfiguration remains the one concrete resolver scenarios use.
Tests that exercised the removed flat resolver now drive the same base
resolution through DatasetAttackConfiguration.

Co-authored-by: Copilot <[email protected]>