Add recipe for Qwen2-Audio-7B-Chat on Dynamic-SUPERB ASR task #6194

Merged
sw005320 merged 51 commits into espnet:master from cyhuang-tw:dynamic-superb
Sep 10, 2025

Conversation

@cyhuang-tw
Contributor

What did you change?

Added a recipe for running Qwen2-Audio-7B-Chat on the ASR task in Dynamic-SUPERB Phase-2.


Why did you make this change?

To provide a ready-to-use baseline for evaluating spoken language models on Dynamic-SUPERB.


Is your PR small enough?

No. This PR modifies 30 files with around 1,360 lines added, primarily within the new recipe directory.


Additional Context

Qwen2-Audio-7B-Chat: [paper], [GitHub]
Dynamic-SUPERB Phase-2: [paper], [GitHub]

@dosubot dosubot bot added the size:XXL This PR changes 1000+ lines, ignoring generated files. label Jul 28, 2025
@mergify mergify bot added the ESPnet2 label Jul 28, 2025
@dosubot dosubot bot added ASR Automatic speech recognition Recipe labels Jul 28, 2025
@sw005320 sw005320 added this to the v.202506 milestone Jul 28, 2025
@sw005320 sw005320 requested a review from Copilot July 28, 2025 07:35


@cyhuang-tw
Contributor Author

Thanks for the suggestion! I've addressed most of it and am now working on the tests.

@Fhrozen
Member

Fhrozen commented Aug 27, 2025

/gemini review


@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request adds a new recipe for Qwen2-Audio-7B-Chat on the Dynamic-SUPERB ASR task. This is a significant contribution, including a new PS2ST (prompt-based speech-to-text) task, a model wrapper for Qwen2-Audio, a custom scorer, a tokenizer, and a preprocessor, along with the necessary recipe scripts. The overall structure is sound and follows ESPnet conventions. I've identified two critical issues that should be addressed. One is related to inefficient and unsafe use of temporary files for audio processing, which can be replaced by direct in-memory processing. The other is a dangerous use of a broad except Exception: pass block that can hide critical errors during decoding. Fixing these will improve the robustness and performance of this new integration.

Comment on lines +68 to +75
        # Fallback to read from KV if step got desynced
        if isinstance(past_kv, tuple) and len(past_kv) > 0 and len(past_kv[0]) > 0:
            try:
                kv_len = past_kv[0][0].size(-2)
                if kv_len != past_len:
                    past_len = kv_len
            except Exception:
                pass

critical

The except Exception: pass block is dangerous as it will silently ignore any and all errors that occur within the try block. This can lead to incorrect decoding results or hide serious bugs in the key-value cache handling. The preceding if condition should be sufficient to guard against invalid past_kv structures. If an error still occurs, it should be allowed to fail loudly so it can be debugged and fixed.

        if isinstance(past_kv, tuple) and len(past_kv) > 0 and len(past_kv[0]) > 0:
            kv_len = past_kv[0][0].size(-2)
            if kv_len != past_len:
                past_len = kv_len
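
To make the reviewer's point concrete outside the PR context: if any guard is kept at all, it should name the specific failures it anticipates instead of swallowing everything. The sketch below is illustrative only — the `sync_past_len` helper and `FakeTensor` stand-in are hypothetical names, not code from this PR:

```python
def sync_past_len(past_kv, past_len):
    """Resync past_len from the KV cache; unexpected errors propagate."""
    if isinstance(past_kv, tuple) and len(past_kv) > 0 and len(past_kv[0]) > 0:
        try:
            kv_len = past_kv[0][0].size(-2)
        except (AttributeError, TypeError):
            # Tolerate only the structural mismatches we anticipate
            # (entry without .size, wrong argument type); anything else
            # fails loudly so it can be debugged.
            return past_len
        if kv_len != past_len:
            return kv_len
    return past_len


class FakeTensor:
    """Stand-in for a torch.Tensor exposing only .size(dim), for testing."""

    def __init__(self, shape):
        self._shape = shape

    def size(self, dim):
        return self._shape[dim]
```

With this shape, a malformed cache entry falls back quietly, while genuinely unexpected bugs still raise.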

Comment on lines +67 to +134
        self._build_processor()

        def create_tempfile(audio_input: Tuple[List[np.ndarray], int]) -> List[str]:
            n_audios, sr = len(audio_input[0]), audio_input[1]
            tempfiles = [
                tempfile.NamedTemporaryFile(suffix=".wav", delete=False)
                for _ in range(n_audios)
            ]
            for temp, wav in zip(tempfiles, audio_input[0]):
                sf.write(temp.name, wav, sr)
            return [tempfile.name for tempfile in tempfiles]

        def delete_tempfile(p_files: List[str]):
            for p_file in p_files:
                os.unlink(p_file)

        # Handle audio input if provided
        if audio_input is not None:
            temp_files = create_tempfile(audio_input)

            try:
                # Prepare multimodal query
                wavs_query = [
                    {"type": "audio", "audio_url": file}
                    for file in temp_files
                    if os.path.exists(file)
                ]

                text_query = [{"type": "text", "text": text_input}]

                query = [
                    {"role": "system", "content": "You are a helpful assistant."},
                    {"role": "user", "content": wavs_query + text_query},
                ]

                # Apply chat template
                text = self.processor.apply_chat_template(
                    query, add_generation_prompt=True, tokenize=False
                )

                # Load audio files
                audios = [
                    librosa.load(
                        file, sr=self.processor.feature_extractor.sampling_rate
                    )[0]
                    for file in temp_files
                    if os.path.exists(file)
                ]

                # Process inputs with both text and audio
                inputs = self.processor(
                    text=text,
                    audios=audios,
                    sampling_rate=self.processor.feature_extractor.sampling_rate,
                    return_tensors="np",
                    padding=True,
                )
                delete_tempfile(temp_files)
            except Exception as e:
                # Make sure to delete temp files if error occurs.
                print(f"Error processing audio input: {e}")
                delete_tempfile(temp_files)
                raise e
        else:
            # Text-only processing
            inputs = self.processor(text=text_input, return_tensors="np", padding=True)

        return inputs

critical

The current implementation of create_multimodal_query uses temporary files to process audio data. This is inefficient due to disk I/O and can be a point of failure (e.g., disk full, permissions). It's also unnecessarily complex, as the transformers processor can handle in-memory numpy arrays directly. This can be simplified to pass the audio arrays directly to the processor, which improves performance and robustness.

        self._build_processor()

        # Handle audio input if provided
        if audio_input is not None:
            audios, sr = audio_input
            # The audio URLs are just placeholders for the chat template.
            # The actual audio data is passed in the `audios` argument below.
            wavs_query = [
                {"type": "audio", "audio_url": f"placeholder_{i}.wav"}
                for i in range(len(audios))
            ]

            text_query = [{"type": "text", "text": text_input}]

            query = [
                {"role": "system", "content": "You are a helpful assistant."},
                {"role": "user", "content": wavs_query + text_query},
            ]

            # Apply chat template
            text = self.processor.apply_chat_template(
                query, add_generation_prompt=True, tokenize=False
            )

            # Process inputs with both text and audio
            inputs = self.processor(
                text=text,
                audios=audios,
                sampling_rate=sr,
                return_tensors="np",
                padding=True,
            )
        else:
            # Text-only processing
            inputs = self.processor(text=text_input, return_tensors="np", padding=True)

        return inputs
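
One caveat worth noting as a reader's aside (not part of the bot's review): the original code resampled every file to `self.processor.feature_extractor.sampling_rate` via `librosa.load(sr=...)`, whereas the suggestion passes the arrays through at their incoming rate `sr`. If those two rates can differ, an explicit in-memory resample (e.g. `librosa.resample`) is still needed before calling the processor. As a dependency-free illustration of where that step belongs — `resample_linear` is a hypothetical helper using crude linear interpolation, not production-quality resampling:

```python
import numpy as np


def resample_linear(wav: np.ndarray, orig_sr: int, target_sr: int) -> np.ndarray:
    """Resample a mono waveform by linear interpolation.

    A rough stand-in for librosa.resample, good enough to show where the
    resampling step belongs in the in-memory pipeline.
    """
    if orig_sr == target_sr:
        return wav
    duration = len(wav) / orig_sr
    n_target = int(round(duration * target_sr))
    old_t = np.linspace(0.0, duration, num=len(wav), endpoint=False)
    new_t = np.linspace(0.0, duration, num=n_target, endpoint=False)
    return np.interp(new_t, old_t, wav)
```

With something like this in place, `audios = [resample_linear(w, sr, target_sr) for w in audios]` before the `self.processor(...)` call would preserve the behaviour of the temp-file version.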

@mergify
Contributor

mergify bot commented Aug 27, 2025

This pull request is now in conflict :(

@mergify mergify bot added the conflicts label Aug 27, 2025
@mergify mergify bot removed the conflicts label Aug 27, 2025
@sw005320
Contributor

@cyhuang-tw, your errors seem to be reproducible.
The CI output is not informative (probably due to some errors).
Can you test it locally and debug it?

@sw005320
Contributor

@cyhuang-tw, again, this PR seems to have an issue in the Python tests, which behave differently from those in the other PRs. Please run the tests locally and track down the problems.

@cyhuang-tw
Contributor Author

cyhuang-tw commented Aug 29, 2025

Yes, I am running the tests locally.
(I've identified some issues on my local machine and am working on fixing them.)

@mergify mergify bot added the Installation label Aug 31, 2025
@sw005320 sw005320 merged commit e8c13a6 into espnet:master Sep 10, 2025
31 of 32 checks passed
@sw005320
Contributor

Thanks a lot!
Finally!

@Fhrozen Fhrozen mentioned this pull request Sep 11, 2025

Labels

ASR Automatic speech recognition ESPnet2 Installation Recipe size:XXL This PR changes 1000+ lines, ignoring generated files.

4 participants