FEAT capability-aware multimodal feedback loop for Crescendo/RedTeaming/TAP #1377
Open
fitzpr wants to merge 15 commits into
When the objective target returns non-text content (images, video, etc.), the adversarial chat now receives a multimodal message containing both the scorer's textual feedback AND the actual generated media. This enables vision-capable adversarial LLMs (e.g. GPT-4o) to see what the target produced and craft more informed follow-up prompts.

Changes:
- _handle_adversarial_file_response: returns (feedback_text, media_piece) tuple instead of just the feedback string
- _build_adversarial_prompt: returns Union[str, tuple] to propagate media
- _generate_next_prompt_async: constructs multimodal Message with text + media pieces when file response detected; text-only path unchanged

Tests:
- Updated 2 existing tests for new tuple return type
- Added 5 new tests in TestMultimodalFeedbackLoop:
  - image response produces multimodal message to adversarial chat
  - video response produces multimodal message to adversarial chat
  - text response stays text-only (no regression)
  - _build_adversarial_prompt returns tuple for image
  - _build_adversarial_prompt returns str for text
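The tuple-returning handler described in this commit can be sketched roughly as follows. This is only an illustration: the Piece dataclass and both function names are hypothetical stand-ins, not PyRIT's actual classes or signatures, and the sketch always uses the scorer feedback as the text part.

```python
from __future__ import annotations

from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass(frozen=True)
class Piece:
    """Hypothetical stand-in for one message piece (not PyRIT's MessagePiece)."""

    data_type: str  # e.g. "text", "image_path", "video_path"
    value: str      # text content, or a path to the media file


def handle_file_response(response: Piece, scorer_feedback: str) -> Tuple[str, Optional[Piece]]:
    """Return the feedback text and, for non-text responses, the media piece itself."""
    if response.data_type == "text":
        return scorer_feedback, None
    return scorer_feedback, response


def build_adversarial_pieces(response: Piece, scorer_feedback: str) -> list[Piece]:
    """Text-only path yields one piece; media responses append the media after the text."""
    feedback_text, media_piece = handle_file_response(response, scorer_feedback)
    pieces = [Piece("text", feedback_text)]
    if media_piece is not None:
        pieces.append(media_piece)
    return pieces
```

With this shape, the caller can tell from the second tuple element alone whether a multimodal message is needed.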
When a target response has data_type='error' (e.g. content filter block), treat it as text in OpenAIChatTarget's multimodal message builder instead of raising ValueError. This prevents crashes when conversation history contains error responses from prior turns.
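A simplified sketch of the fallback described above, assuming pieces are plain dicts with data_type and value keys. The function name to_chat_content is hypothetical, and the real builder in OpenAIChatTarget likely differs (for example in how it encodes image paths); only the OpenAI-style content entry shapes are standard.

```python
def to_chat_content(pieces: list[dict]) -> list[dict]:
    """Convert message pieces into OpenAI-style chat content entries (simplified)."""
    content = []
    for piece in pieces:
        data_type = piece["data_type"]
        if data_type in ("text", "error"):
            # Error responses (e.g. a content-filter block) are forwarded as plain text
            # instead of raising, so history with prior errors can still be replayed.
            content.append({"type": "text", "text": piece["value"]})
        elif data_type == "image_path":
            # Assumption: the value is passed through as-is; a real builder would
            # typically convert the local path to a data URL first.
            content.append({"type": "image_url", "image_url": {"url": piece["value"]}})
        else:
            raise ValueError(f"Unsupported data type: {data_type}")
    return content
```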
romanlutz reviewed Feb 19, 2026
Co-authored-by: Roman Lutz <[email protected]>
fitzpr pushed a commit to fitzpr/PyRIT that referenced this pull request Feb 19, 2026
- Add SUPPORTED_INPUT_MODALITIES class attribute to PromptTarget base class
- Add input_modality_supported() and supports_multimodal_input() methods
- Add supported_input_modalities property that returns list of supported modalities
- Add supported_input_modalities and supports_conversation_history fields to TargetIdentifier
- Update PromptTarget._create_identifier() to populate new fields
- Implement modality declarations in OpenAIChatTarget (text, image_path), TextTarget (text), and HuggingFaceChatTarget (text)
- Add comprehensive tests for modality support detection

This system enables attacks to detect whether targets support multimodal input (text + other modalities) and route accordingly, addressing the limitation mentioned in PR microsoft#1377 where multimodal attacks need to know target capabilities.
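One way to picture the declarations listed above is the following sketch. The attribute and method names come from the commit message, but the class names, method bodies, return types, and the use of plain strings for modalities are assumptions made for illustration.

```python
class PromptTargetSketch:
    """Hypothetical base class; text-only unless a subclass declares more."""

    SUPPORTED_INPUT_MODALITIES: frozenset = frozenset({"text"})

    def input_modality_supported(self, modality: str) -> bool:
        """True if this target declares support for the given modality."""
        return modality in self.SUPPORTED_INPUT_MODALITIES

    def supports_multimodal_input(self) -> bool:
        """True if the target accepts at least one modality besides text."""
        return len(self.SUPPORTED_INPUT_MODALITIES - {"text"}) > 0

    @property
    def supported_input_modalities(self) -> list:
        """Sorted list of declared modalities."""
        return sorted(self.SUPPORTED_INPUT_MODALITIES)


class ChatTargetSketch(PromptTargetSketch):
    """Hypothetical vision-capable target declaring text and image_path."""

    SUPPORTED_INPUT_MODALITIES = frozenset({"text", "image_path"})
```

Declaring modalities as a class attribute lets an attack inspect a target's capabilities without sending any request.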
Address Roman's feedback items #2 and #3:
- Change _build_adversarial_prompt to return Message instead of Union type
- Extract message construction logic into separate helper methods
- Add _build_text_message() for simple text prompts
- Add _build_multimodal_message() for media responses
- Simplify caller code by removing tuple handling logic
- Improve logging to work with Message objects

These architectural improvements prepare the code to integrate with the modality support detection system from separate PR.
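The refactor above replaces a Union return type with a single Message type. A minimal sketch of that shape, using hypothetical Piece and Message dataclasses rather than PyRIT's real models, might look like this:

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class Piece:
    """Hypothetical message piece: a data type plus its value."""

    data_type: str
    value: str


@dataclass(frozen=True)
class Message:
    """Hypothetical message: an ordered tuple of pieces."""

    pieces: tuple


def build_text_message(text: str) -> Message:
    """Wrap a plain text prompt in a single-piece message."""
    return Message((Piece("text", text),))


def build_multimodal_message(feedback_text: str, media: Piece) -> Message:
    """Combine textual feedback with the media piece the target produced."""
    return Message((Piece("text", feedback_text), media))


def build_adversarial_prompt(response: Piece, feedback_text: str) -> Message:
    """Always return a Message, so callers need no tuple or isinstance handling."""
    if response.data_type == "text":
        return build_text_message(response.value)
    return build_multimodal_message(feedback_text, response)
```

Because both branches return the same type, the caller can pass the result straight to the adversarial chat and log it uniformly.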
fitzpr pushed a commit to fitzpr/PyRIT that referenced this pull request Feb 20, 2026
Addresses all Roman's feedback from PR microsoft#1377:
- Uses set[frozenset[PromptDataType]] instead of tuples
- Exact frozenset matching prevents ordering issues
- Implemented across all target types (OpenAI, HuggingFace, TextTarget)
- Future-proof pattern matching for new OpenAI models
- Optional verification utility for runtime testing
- Comprehensive test suite with 8 passing tests
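To illustrate why exact frozenset matching avoids ordering issues, here is a small sketch. It uses plain strings in place of PromptDataType, and the names SUPPORTED_COMBINATIONS and combination_supported are hypothetical.

```python
from __future__ import annotations

from typing import Iterable

# Each entry is one complete combination of input types the target accepts.
SUPPORTED_COMBINATIONS: set[frozenset[str]] = {
    frozenset({"text"}),
    frozenset({"text", "image_path"}),
}


def combination_supported(
    data_types: Iterable[str],
    supported: set[frozenset[str]] = SUPPORTED_COMBINATIONS,
) -> bool:
    """Exact match: the requested set of types must equal a declared combination."""
    return frozenset(data_types) in supported
```

With tuples, ("text", "image_path") and ("image_path", "text") would compare unequal; frozensets compare by membership only, so the order of pieces in a message does not matter. Note that exact matching also means an image alone is rejected here unless {image_path} is declared separately.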
The objective target's TargetCapabilities are now the single source of truth for whether prior media (image, audio, video) is forwarded between the adversarial chat and the objective target across all multi-turn attacks.

A shared ModalityFeedbackRouter is composed into RedTeamingAttack, CrescendoAttack, TreeOfAttacksWithPruningAttack, and PAIRAttack. It decides per turn whether to attach prior response media on either side based on each target's declared input_modalities, and fills MessagePiece.adversarial_placeholder() slots in AttackParameters.next_message so callers can mix seed media (e.g. a base image to edit) with adversarial-generated text on turn 1.

Three usage scenarios fall out naturally:
* default (target advertises text-to-image and text+image-to-image): turn 1 sends text only, turns 2+ pass the previous image back along with adversarial text;
* text-to-image only (narrow the target's input_modalities via custom_configuration): every turn is text-only;
* image-editing only (narrow the target's input_modalities to text+image, pass next_message=Message([MessagePiece.adversarial_placeholder(), seed_image])): turn 1 sends adversarial text plus seed, turns 2+ refine the previous image.

The same logic is generic across image_path / audio_path / video_path.

Notes:
* The PR branch's history was disjoint from current origin/main (an orphaned past commit). Branch was reset to origin/main and the feature rebuilt on top; the prior PR-only red_teaming helpers are superseded by the router, and the prior changes to refusal scorer YAMLs / video target / message.py are already in main via separate merges.
* TreeOfAttacksWithPruningAttack._TreeOfAttacksNode.last_response widens from Optional[str] to Optional[Message] so the router can introspect the data_type of prior pieces; readers were updated accordingly.

Co-authored-by: Copilot <[email protected]>
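The per-turn routing and the three scenarios described in this commit can be approximated with the sketch below. It is not the ModalityFeedbackRouter implementation: Piece, PLACEHOLDER, and the three functions are hypothetical, modalities are plain strings, and only the objective-target side of the routing is modeled.

```python
from __future__ import annotations

from dataclasses import dataclass
from typing import Optional

MEDIA_TYPES = ("image_path", "audio_path", "video_path")
PLACEHOLDER = "adversarial_placeholder"  # hypothetical marker for a slot to fill


@dataclass(frozen=True)
class Piece:
    data_type: str
    value: str


def media_to_forward(prior: Optional[list[Piece]], input_modalities: set[frozenset[str]]) -> list[Piece]:
    """Prior media pieces the target declares it can accept together with text."""
    if prior is None:
        return []
    return [
        p for p in prior
        if p.data_type in MEDIA_TYPES and frozenset({"text", p.data_type}) in input_modalities
    ]


def fill_placeholders(template: list[Piece], adversarial_text: str) -> list[Piece]:
    """Replace each placeholder slot with the adversarial text; keep seed media."""
    return [Piece("text", adversarial_text) if p.data_type == PLACEHOLDER else p for p in template]


def build_turn(
    adversarial_text: str,
    prior: Optional[list[Piece]],
    input_modalities: set[frozenset[str]],
    next_message: Optional[list[Piece]] = None,
) -> list[Piece]:
    """Turn 1 may use a seeded template; later turns attach prior media if supported."""
    if prior is None and next_message is not None:
        return fill_placeholders(next_message, adversarial_text)
    return [Piece("text", adversarial_text)] + media_to_forward(prior, input_modalities)
```

Under these assumptions, a target declaring only {text} never receives media, while one declaring {text, image_path} gets the previous image back from turn 2 onward, which mirrors the scenarios listed above.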
Contributor
@fitzpr heads up, I'll make some substantial updates here shortly.
- Reset doc/ to match origin/main (flat numbered notebook structure)
- Remove old attack/, workflow/, benchmark/, promptgen/ subdirectory notebooks
- Add doc/code/executor/8_modality_feedback.py/.ipynb: two-seed Crescendo modality-feedback example (roakey + sailboat, hybrid capability profile)
- Update 0_executor.md and myst.yml to include notebook #8 in navigation

Co-authored-by: Copilot <[email protected]>
…icter scorer for multi-turn demo
- Section 6 now uses IPyImage(data=bytes) to embed all images directly in the notebook so they render without re-running (no more unresolvable paths).
- Replaced custom adversarial system_prompt with SeedPrompt loaded from the built-in crescendo/image_generation.yaml, which has proper multi-turn escalation (starts simple, builds up); this forces 2-4 turns instead of 1.
- Fixed image_generation.yaml JSON response keys: renamed generated_question -> next_message and rationale_behind_jailbreak -> rationale to match what CrescendoAttack expects.
- Tightened SelfAskTrueFalseScorer true_description to require ALL five visual elements simultaneously, making single-turn success unlikely.
- Added EXECUTOR_SEED_PROMPT_PATH and SeedPrompt imports.
- Removed unused MarkdownConversationMemoryPrinter and IPythonMarkdownSink.

Co-authored-by: Copilot <[email protected]>
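The key rename matters because a reply using the old names would fail validation on the consuming side. The sketch below shows that kind of check; the key names are taken from this PR's text, while parse_adversarial_reply and its error message are hypothetical and not CrescendoAttack's actual parsing code.

```python
import json

# Keys named in this PR's description of the template's JSON response.
EXPECTED_KEYS = ("next_message", "rationale", "last_response_summary")


def parse_adversarial_reply(raw: str) -> dict:
    """Parse the adversarial chat's JSON reply and verify the expected keys exist."""
    data = json.loads(raw)
    missing = [key for key in EXPECTED_KEYS if key not in data]
    if missing:
        raise ValueError(f"Adversarial reply is missing keys: {missing}")
    return data
```

A reply still using generated_question or rationale_behind_jailbreak would be rejected here, which is the kind of mismatch the template fix addresses.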
…loop-v2' into feature/media-feedback-loop-v2
…l naming
- Tighten modality notebook objective + scorer criteria to preserve the seeded raccoon identity.
- Regenerate 8_modality_feedback.ipynb outputs from the updated notebook source.
- Strengthen Crescendo image_generation guidance for seeded non-human anchors and aligned rationale key naming.
- Rename ModalityFeedbackRouter constructor keyword from adversarial_target to adversarial_chat.
- Rename property objective_requires_media_on_first_turn to objective_target_requires_media_on_first_turn.
- Update all affected multi-turn attack callsites and unit tests.

Co-authored-by: Copilot <[email protected]>
Co-authored-by: Copilot <[email protected]>
Summary
This PR makes multimodal feedback routing capability-aware across multi-turn attacks so media can be forwarded end-to-end when targets support it.
Core behavior
- ModalityFeedbackRouter component for multi-turn attacks.
- TargetCapabilities.input_modalities to decide when media should be forwarded.

Attack integrations
Applied the router-driven logic to:
- CrescendoAttack
- RedTeamingAttack
- TreeOfAttacksWithPruningAttack (TAP)

What this enables
- {text, <media_type>}.
- {text, <media_type>}.

Notebook / docs
Added and refreshed the multimodal executor demo:
- doc/code/executor/8_modality_feedback.py
- doc/code/executor/8_modality_feedback.ipynb
- doc/code/executor/assets/three_masted_ship_color.jpg

Notebook now demonstrates:

Prompt-template updates
- pyrit/datasets/executors/crescendo/image_generation.yaml: (next_message, rationale, last_response_summary)

Naming/API consistency
For consistency with existing attack terminology:
- ModalityFeedbackRouter(..., adversarial_target=...) ➜ adversarial_chat=...
- objective_requires_media_on_first_turn ➜ objective_target_requires_media_on_first_turn

Tests
Updated/added tests covering router behavior and multi-turn integration:
- tests/unit/executor/attack/component/test_modality_router.py
- tests/unit/executor/attack/multi_turn/test_crescendo.py
- tests/unit/executor/attack/multi_turn/test_red_teaming.py
- tests/unit/executor/attack/multi_turn/test_supports_multi_turn_attacks.py
- tests/unit/executor/attack/multi_turn/test_tree_of_attacks.py
- tests/unit/executor/attack/test_attack_parameter_consistency.py
- tests/unit/models/test_message_piece.py