FEAT: DatasetConfiguration Refactor #2071
Open
rlundeen2 wants to merge 4 commits into
Restructure DatasetConfiguration into a generic base plus typed subclasses (DatasetObjectiveConfiguration, DatasetPromptConfiguration, DatasetAttackConfiguration). Constraints are expressed through composable validators run against the fully resolved dataset (pre-sampling), and the resolved set carries a DatasetSourceKind (inline vs memory) so validators can require or forbid inline seeds. Auto-fetch missing datasets from the provider on demand and raise loudly (with chained root cause) instead of silently warning. Non-emptiness is enforced as a default validator on every config. Co-authored-by: Copilot <[email protected]>
Scenarios now fetch their datasets from the registered provider on demand the first time they run, so the load_default_datasets initializer is no longer required for everyday runs or in the recommended default config. Remove it from the default config examples and per-run scanner command examples, and add a note explaining it is now an optional preload step (useful for warming memory or populating a database for offline use). Co-authored-by: Copilot <[email protected]>
Add a dataset_names tuple to ResolvedDataset plus a restrict_dataset_names validator so scenarios that pair techniques with specific datasets (e.g. psychosocial, jailbreak) can constrain which datasets a configuration may draw from. Replace the EXPLICIT_SEED_GROUPS_KEY dict-overload with an explicit inline-vs-named split. Named resolution (_collect_named_seeds_async) now returns only real dataset names, and get_seeds_async resolves inline data through a dedicated branch, removing the reserved-key collision guards. Inline data keeps a single honest INLINE_DATASET_NAME label in the by-dataset views (used for atomic-attack naming), so user-facing labels read 'technique_inline' instead of leaking the old sentinel. Co-authored-by: Copilot <[email protected]>
Remove DatasetObjectiveConfiguration, DatasetPromptConfiguration, and the generic DatasetConfiguration[SeedT] flat resolver (get_seeds_async). None had production callers; the objectives-only constraint that motivated the typed subclasses is enforced at runtime per-technique, not at the dataset level. DatasetConfiguration is now a plain (non-generic) base of resolution/fetch/validate/sample plumbing plus the deprecated legacy getters, and DatasetAttackConfiguration remains the one concrete resolver scenarios use. Tests that exercised the removed flat resolver now drive the same base resolution through DatasetAttackConfiguration. Co-authored-by: Copilot <[email protected]>
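For readers skimming the commits, the composable-validator idea can be sketched roughly as below. This is a standalone toy, not the code in this PR: the names ResolvedDataset, DatasetSourceKind, dataset_names, and restrict_dataset_names come from the commit messages, but every field, signature, and helper here (seeds, source_kind, require_non_empty, forbid_inline, run_validators) is an assumption made for illustration.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Callable, Iterable, Sequence


class DatasetSourceKind(Enum):
    # Assumed members; the PR only says "inline vs memory".
    INLINE = "inline"
    MEMORY = "memory"


@dataclass(frozen=True)
class ResolvedDataset:
    # Hypothetical shape: seeds are plain strings here for brevity.
    seeds: tuple[str, ...]
    source_kind: DatasetSourceKind
    dataset_names: tuple[str, ...] = ()


# A validator inspects the fully resolved dataset and raises on violation.
Validator = Callable[[ResolvedDataset], None]


def require_non_empty(resolved: ResolvedDataset) -> None:
    """Default check: a configuration must resolve to at least one seed."""
    if not resolved.seeds:
        raise ValueError("resolved dataset is empty")


def forbid_inline(resolved: ResolvedDataset) -> None:
    """Hypothetical validator that rejects inline seeds."""
    if resolved.source_kind is DatasetSourceKind.INLINE:
        raise ValueError("inline seeds are not allowed for this scenario")


def restrict_dataset_names(allowed: Sequence[str]) -> Validator:
    """Build a validator that only accepts datasets from an allow-list."""
    allowed_set = frozenset(allowed)

    def _validate(resolved: ResolvedDataset) -> None:
        unexpected = sorted(set(resolved.dataset_names) - allowed_set)
        if unexpected:
            raise ValueError(f"datasets not allowed here: {unexpected}")

    return _validate


def run_validators(resolved: ResolvedDataset, extra: Iterable[Validator] = ()) -> None:
    """Run the default non-emptiness check, then any scenario-specific validators."""
    for validate in (require_non_empty, *extra):
        validate(resolved)
```

A scenario that pairs techniques with specific datasets would, under these assumptions, pass something like `[restrict_dataset_names(["jailbreak"]), forbid_inline]` as its extra validators.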
DatasetConfiguration as-is worked fairly well for the first scenarios, but it ran into several issues as we added more. garak.encoding needed seedPrompts; jailbreak needed two types of datasets (both the harms and the datasets themselves). Psychosocial had datasets tied to techniques, and other Garak scenarios need different, more flexible types. This PR refactors DatasetConfiguration so that it is a better fit for these diverse scenarios.

One big addition: if a dataset isn't in memory, the configuration uses the DatasetProvider to fetch it and put it in memory. This means you don't need the load_default_datasets initializer on startup unless you want to preload things.

This refactors DatasetConfiguration from a single catch-all class into a generic base plus typed subclasses (DatasetObjectiveConfiguration, DatasetPromptConfiguration, and DatasetAttackConfiguration, the default most scenarios use). Constraints are now expressed through composable validators that run against the fully resolved dataset before max_dataset_size sampling, so they describe the dataset itself rather than a sampled subset. Each resolved dataset carries a DatasetSourceKind (inline vs. memory), which lets a scenario require or forbid inline seeds. This is useful for CLI flags such as --objectives. Non-emptiness is enforced as a default validator on every configuration, and typed subclasses layer a seed-type check on top.

Resolution is also more predictable: when a configured dataset name is not yet in memory and auto_fetch is set, it is fetched from the registered SeedDatasetProvider on demand, and any failure now raises loudly with the chained root cause instead of silently warning. The redundant pre-pass that eagerly pre-populated memory has been removed in favor of this on-demand path, and all scenario callers and tests have been migrated to the new methods (get_seeds_async, get_seed_attack_groups_async, get_attack_groups_by_dataset_async). Legacy getters remain but emit deprecation warnings. All scenario unit tests pass.
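As a rough illustration of the resolution order described above (look in memory, fetch on demand, fail loudly with the root cause chained, validate, then sample), here is a minimal synchronous sketch. It is not the PR's implementation: the real methods are async, and everything here (resolve_dataset, validate_then_sample, DatasetResolutionError, the dict standing in for memory, and the fetch callable standing in for the provider) is hypothetical. Only the auto_fetch and max_dataset_size names are taken from the description.

```python
import random
from typing import Callable, Iterable, Optional


class DatasetResolutionError(RuntimeError):
    """Hypothetical error type raised when a dataset cannot be resolved."""


def resolve_dataset(
    name: str,
    memory: dict[str, list[str]],
    fetch: Callable[[str], list[str]],
    auto_fetch: bool = True,
) -> list[str]:
    """Return seeds for `name`, fetching into memory on demand if allowed."""
    if name in memory:
        return memory[name]
    if not auto_fetch:
        raise DatasetResolutionError(
            f"dataset {name!r} is not in memory and auto_fetch is disabled"
        )
    try:
        seeds = fetch(name)
    except Exception as exc:
        # `from exc` keeps the provider's original failure as __cause__,
        # so the root cause shows up in the traceback instead of a warning.
        raise DatasetResolutionError(f"failed to fetch dataset {name!r}") from exc
    memory[name] = seeds  # later lookups are served from memory
    return seeds


def validate_then_sample(
    seeds: list[str],
    validators: Iterable[Callable[[list[str]], None]],
    max_dataset_size: Optional[int],
    rng: random.Random,
) -> list[str]:
    """Validate the full dataset first, then sample down to the size cap."""
    for validate in validators:
        validate(seeds)  # sees every seed, not just the sampled subset
    if max_dataset_size is None or len(seeds) <= max_dataset_size:
        return list(seeds)
    return rng.sample(seeds, max_dataset_size)
```

Running validators before sampling means a constraint such as non-emptiness or a seed-type check cannot pass or fail depending on which subset happened to be drawn.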