Add recipe for Qwen2-Audio-7B-Chat on Dynamic-SUPERB ASR task #6194

Merged
sw005320 merged 51 commits into espnet:master from cyhuang-tw:dynamic-superb
Sep 10, 2025

Conversation

@cyhuang-tw
Contributor

What did you change?

Added a recipe for running Qwen2-Audio-7B-Chat on the ASR task in Dynamic-SUPERB Phase-2.


Why did you make this change?

To provide a ready-to-use baseline for evaluating spoken language models on Dynamic-SUPERB.


Is your PR small enough?

No. This PR modifies 30 files with around 1,360 lines added, primarily within the new recipe directory.


Additional Context

Qwen2-Audio-7B-Chat: [paper], [GitHub]
Dynamic-SUPERB Phase-2: [paper], [GitHub]

@dosubot dosubot bot added the size:XXL This PR changes 1000+ lines, ignoring generated files. label Jul 28, 2025
@mergify mergify bot added the ESPnet2 label Jul 28, 2025
@dosubot dosubot bot added ASR Automatic speech recognition Recipe labels Jul 28, 2025
@sw005320 sw005320 added this to the v.202506 milestone Jul 28, 2025
@sw005320 sw005320 requested a review from Copilot July 28, 2025 07:35


@cyhuang-tw
Contributor Author

Thanks for the suggestion! I've addressed most of it and am now working on the tests.

@Fhrozen
Member

Fhrozen commented Aug 27, 2025

/gemini review


@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request adds a new recipe for Qwen2-Audio-7B-Chat on the Dynamic-SUPERB ASR task. This is a significant contribution, including a new PS2ST (prompt-based speech-to-text) task, a model wrapper for Qwen2-Audio, a custom scorer, a tokenizer, and a preprocessor, along with the necessary recipe scripts. The overall structure is sound and follows ESPnet conventions. I've identified two critical issues that should be addressed. One is related to inefficient and unsafe use of temporary files for audio processing, which can be replaced by direct in-memory processing. The other is a dangerous use of a broad except Exception: pass block that can hide critical errors during decoding. Fixing these will improve the robustness and performance of this new integration.

Comment on lines +68 to +75
        # Fallback to read from KV if step got desynced
        if isinstance(past_kv, tuple) and len(past_kv) > 0 and len(past_kv[0]) > 0:
            try:
                kv_len = past_kv[0][0].size(-2)
                if kv_len != past_len:
                    past_len = kv_len
            except Exception:
                pass

critical

The except Exception: pass block is dangerous as it will silently ignore any and all errors that occur within the try block. This can lead to incorrect decoding results or hide serious bugs in the key-value cache handling. The preceding if condition should be sufficient to guard against invalid past_kv structures. If an error still occurs, it should be allowed to fail loudly so it can be debugged and fixed.

        if isinstance(past_kv, tuple) and len(past_kv) > 0 and len(past_kv[0]) > 0:
            kv_len = past_kv[0][0].size(-2)
            if kv_len != past_len:
                past_len = kv_len
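
To make the reviewer's point concrete outside the PR context: if any guard is kept at all, it should name the specific failures it anticipates instead of swallowing everything. The sketch below is illustrative only — the `sync_past_len` helper and `FakeTensor` stand-in are hypothetical names, not code from this PR:

```python
def sync_past_len(past_kv, past_len):
    """Resync past_len from the KV cache; unexpected errors propagate."""
    if isinstance(past_kv, tuple) and len(past_kv) > 0 and len(past_kv[0]) > 0:
        try:
            kv_len = past_kv[0][0].size(-2)
        except (AttributeError, TypeError):
            # Tolerate only the structural mismatches we anticipate
            # (entry without .size, wrong argument type); anything else
            # fails loudly so it can be debugged.
            return past_len
        if kv_len != past_len:
            return kv_len
    return past_len


class FakeTensor:
    """Stand-in for a torch.Tensor exposing only .size(dim), for testing."""

    def __init__(self, shape):
        self._shape = shape

    def size(self, dim):
        return self._shape[dim]
```

With this shape, a malformed cache entry falls back quietly, while genuinely unexpected bugs still raise.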

Comment on lines +67 to +134
        self._build_processor()

        def create_tempfile(audio_input: Tuple[List[np.ndarray], int]) -> List[str]:
            n_audios, sr = len(audio_input[0]), audio_input[1]
            tempfiles = [
                tempfile.NamedTemporaryFile(suffix=".wav", delete=False)
                for _ in range(n_audios)
            ]
            for temp, wav in zip(tempfiles, audio_input[0]):
                sf.write(temp.name, wav, sr)
            return [tempfile.name for tempfile in tempfiles]

        def delete_tempfile(p_files: List[str]):
            for p_file in p_files:
                os.unlink(p_file)

        # Handle audio input if provided
        if audio_input is not None:
            temp_files = create_tempfile(audio_input)

            try:
                # Prepare multimodal query
                wavs_query = [
                    {"type": "audio", "audio_url": file}
                    for file in temp_files
                    if os.path.exists(file)
                ]

                text_query = [{"type": "text", "text": text_input}]

                query = [
                    {"role": "system", "content": "You are a helpful assistant."},
                    {"role": "user", "content": wavs_query + text_query},
                ]

                # Apply chat template
                text = self.processor.apply_chat_template(
                    query, add_generation_prompt=True, tokenize=False
                )

                # Load audio files
                audios = [
                    librosa.load(
                        file, sr=self.processor.feature_extractor.sampling_rate
                    )[0]
                    for file in temp_files
                    if os.path.exists(file)
                ]

                # Process inputs with both text and audio
                inputs = self.processor(
                    text=text,
                    audios=audios,
                    sampling_rate=self.processor.feature_extractor.sampling_rate,
                    return_tensors="np",
                    padding=True,
                )
                delete_tempfile(temp_files)
            except Exception as e:
                # Make sure to delete temp files if error occurs.
                print(f"Error processing audio input: {e}")
                delete_tempfile(temp_files)
                raise e
        else:
            # Text-only processing
            inputs = self.processor(text=text_input, return_tensors="np", padding=True)

        return inputs

critical

The current implementation of create_multimodal_query uses temporary files to process audio data. This is inefficient due to disk I/O and can be a point of failure (e.g., disk full, permissions). It's also unnecessarily complex, as the transformers processor can handle in-memory numpy arrays directly. This can be simplified to pass the audio arrays directly to the processor, which improves performance and robustness.

        self._build_processor()

        # Handle audio input if provided
        if audio_input is not None:
            audios, sr = audio_input
            # The audio URLs are just placeholders for the chat template.
            # The actual audio data is passed in the `audios` argument below.
            wavs_query = [
                {"type": "audio", "audio_url": f"placeholder_{i}.wav"}
                for i in range(len(audios))
            ]

            text_query = [{"type": "text", "text": text_input}]

            query = [
                {"role": "system", "content": "You are a helpful assistant."},
                {"role": "user", "content": wavs_query + text_query},
            ]

            # Apply chat template
            text = self.processor.apply_chat_template(
                query, add_generation_prompt=True, tokenize=False
            )

            # Process inputs with both text and audio
            inputs = self.processor(
                text=text,
                audios=audios,
                sampling_rate=sr,
                return_tensors="np",
                padding=True,
            )
        else:
            # Text-only processing
            inputs = self.processor(text=text_input, return_tensors="np", padding=True)

        return inputs
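
One caveat worth noting as a reader's aside (not part of the bot's review): the original code resampled every file to `self.processor.feature_extractor.sampling_rate` via `librosa.load(sr=...)`, whereas the suggestion passes the arrays through at their incoming rate `sr`. If those two rates can differ, an explicit in-memory resample (e.g. `librosa.resample`) is still needed before calling the processor. As a dependency-free illustration of where that step belongs — `resample_linear` is a hypothetical helper using crude linear interpolation, not production-quality resampling:

```python
import numpy as np


def resample_linear(wav: np.ndarray, orig_sr: int, target_sr: int) -> np.ndarray:
    """Resample a mono waveform by linear interpolation.

    A rough stand-in for librosa.resample, good enough to show where the
    resampling step belongs in the in-memory pipeline.
    """
    if orig_sr == target_sr:
        return wav
    duration = len(wav) / orig_sr
    n_target = int(round(duration * target_sr))
    old_t = np.linspace(0.0, duration, num=len(wav), endpoint=False)
    new_t = np.linspace(0.0, duration, num=n_target, endpoint=False)
    return np.interp(new_t, old_t, wav)
```

With something like this in place, `audios = [resample_linear(w, sr, target_sr) for w in audios]` before the `self.processor(...)` call would preserve the behaviour of the temp-file version.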

@mergify
Contributor

mergify bot commented Aug 27, 2025

This pull request is now in conflict :(

@mergify mergify bot added the conflicts label Aug 27, 2025
@mergify mergify bot removed the conflicts label Aug 27, 2025
@sw005320
Contributor

@cyhuang-tw, your errors seem to be reproducible.
The CI output is not informative (probably due to some errors).
Can you test it locally and debug it?

@sw005320
Contributor

@cyhuang-tw, again, this PR seems to have an issue in the Python tests, which behave differently from those in the other PRs. Please run the tests locally and track down the problems.

@cyhuang-tw
Contributor Author

cyhuang-tw commented Aug 29, 2025

Yes, I am running the tests locally.
(I've identified some issues on my local machine and am working on fixing them.)

@mergify mergify bot added the Installation label Aug 31, 2025
@sw005320 sw005320 merged commit e8c13a6 into espnet:master Sep 10, 2025
31 of 32 checks passed
@sw005320
Contributor

Thanks a lot!
Finally!

@Fhrozen Fhrozen mentioned this pull request Sep 11, 2025

Labels

ASR Automatic speech recognition ESPnet2 Installation Recipe size:XXL This PR changes 1000+ lines, ignoring generated files.

4 participants