Add recipe for Qwen2-Audio-7B-Chat on Dynamic-SUPERB ASR task #6194
sw005320 merged 51 commits into espnet:master
Conversation
for more information, see https://pre-commit.ci
Thanks for the suggestion! I've completed most of it; I'm now working on the test part.
/gemini review
Code Review
This pull request adds a new recipe for Qwen2-Audio-7B-Chat on the Dynamic-SUPERB ASR task. This is a significant contribution, including a new PS2ST (prompt-based speech-to-text) task, a model wrapper for Qwen2-Audio, a custom scorer, a tokenizer, and a preprocessor, along with the necessary recipe scripts. The overall structure is sound and follows ESPnet conventions. I've identified two critical issues that should be addressed. One is the inefficient and unsafe use of temporary files for audio processing, which can be replaced by direct in-memory processing. The other is a dangerous use of a broad `except Exception: pass` block that can hide critical errors during decoding. Fixing these will improve the robustness and performance of this new integration.
espnet2/ps2st/qwen2_scorer.py
Outdated
```python
# Fallback to read from KV if step got desynced
if isinstance(past_kv, tuple) and len(past_kv) > 0 and len(past_kv[0]) > 0:
    try:
        kv_len = past_kv[0][0].size(-2)
        if kv_len != past_len:
            past_len = kv_len
    except Exception:
        pass
```
The `except Exception: pass` block is dangerous, as it silently ignores any and all errors that occur within the `try` block. This can lead to incorrect decoding results or hide serious bugs in the key-value cache handling. The preceding `if` condition should be sufficient to guard against invalid `past_kv` structures. If an error still occurs, it should be allowed to fail loudly so it can be debugged and fixed.
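To see why the broad handler is risky, consider a hypothetical sketch (not the ESPnet code; `FakeTensor` stands in for a torch tensor) contrasting the two styles. With a malformed cache entry, the silent version returns a stale length as if nothing happened, while the guarded version raises immediately and points at the bug:

```python
# Hypothetical sketch (not the ESPnet code): FakeTensor stands in for a
# torch tensor so the two error-handling styles can be compared directly.
class FakeTensor:
    def __init__(self, shape):
        self.shape = shape

    def size(self, dim):
        return self.shape[dim]


def kv_len_silent(past_kv, past_len):
    """Original pattern: a broad handler swallows every error."""
    if isinstance(past_kv, tuple) and len(past_kv) > 0 and len(past_kv[0]) > 0:
        try:
            kv_len = past_kv[0][0].size(-2)
            if kv_len != past_len:
                past_len = kv_len
        except Exception:
            pass  # a typo or API change here would go unnoticed
    return past_len


def kv_len_loud(past_kv, past_len):
    """Suggested pattern: the isinstance guard suffices; real bugs propagate."""
    if isinstance(past_kv, tuple) and len(past_kv) > 0 and len(past_kv[0]) > 0:
        kv_len = past_kv[0][0].size(-2)
        if kv_len != past_len:
            past_len = kv_len
    return past_len
```

Given a broken cache such as `((None,),)`, `kv_len_silent` quietly returns the old `past_len`, while `kv_len_loud` raises an `AttributeError` that can actually be debugged.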
```python
if isinstance(past_kv, tuple) and len(past_kv) > 0 and len(past_kv[0]) > 0:
    kv_len = past_kv[0][0].size(-2)
    if kv_len != past_len:
        past_len = kv_len
```

```python
self._build_processor()


def create_tempfile(audio_input: Tuple[List[np.ndarray], int]) -> List[str]:
    n_audios, sr = len(audio_input[0]), audio_input[1]
    tempfiles = [
        tempfile.NamedTemporaryFile(suffix=".wav", delete=False)
        for _ in range(n_audios)
    ]
    for temp, wav in zip(tempfiles, audio_input[0]):
        sf.write(temp.name, wav, sr)
    # Use a loop variable that does not shadow the `tempfile` module.
    return [f.name for f in tempfiles]


def delete_tempfile(p_files: List[str]):
    for p_file in p_files:
        os.unlink(p_file)


# Handle audio input if provided
if audio_input is not None:
    temp_files = create_tempfile(audio_input)
    try:
        # Prepare multimodal query
        wavs_query = [
            {"type": "audio", "audio_url": file}
            for file in temp_files
            if os.path.exists(file)
        ]
        text_query = [{"type": "text", "text": text_input}]
        query = [
            {"role": "system", "content": "You are a helpful assistant."},
            {"role": "user", "content": wavs_query + text_query},
        ]
        # Apply chat template
        text = self.processor.apply_chat_template(
            query, add_generation_prompt=True, tokenize=False
        )
        # Load audio files
        audios = [
            librosa.load(
                file, sr=self.processor.feature_extractor.sampling_rate
            )[0]
            for file in temp_files
            if os.path.exists(file)
        ]
        # Process inputs with both text and audio
        inputs = self.processor(
            text=text,
            audios=audios,
            sampling_rate=self.processor.feature_extractor.sampling_rate,
            return_tensors="np",
            padding=True,
        )
        delete_tempfile(temp_files)
    except Exception as e:
        # Make sure to delete temp files if an error occurs.
        print(f"Error processing audio input: {e}")
        delete_tempfile(temp_files)
        raise e
else:
    # Text-only processing
    inputs = self.processor(text=text_input, return_tensors="np", padding=True)

return inputs
```
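Note that the snippet above has to call `delete_tempfile` twice, once on the success path and once in the `except` handler. If temp files are kept at all, a context manager expresses that cleanup once. A hypothetical helper (not from the PR; `write_wav` and `process` are caller-supplied stand-ins) sketches this with `tempfile.TemporaryDirectory`:

```python
import os
import tempfile


# Hypothetical helper (not from the PR): a TemporaryDirectory context
# manager performs cleanup exactly once, on success and on error alike,
# so the delete call need not be duplicated across code paths.
def process_with_tempdir(wavs, write_wav, process):
    """`write_wav(path, wav)` and `process(paths)` are caller-supplied stubs."""
    with tempfile.TemporaryDirectory() as tmpdir:
        paths = []
        for i, wav in enumerate(wavs):
            path = os.path.join(tmpdir, f"{i}.wav")
            write_wav(path, wav)
            paths.append(path)
        # tmpdir (and every file in it) is deleted when the block exits,
        # even if write_wav or process raises.
        return process(paths)
```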
The current implementation of `create_multimodal_query` uses temporary files to process audio data. This is inefficient due to disk I/O and can be a point of failure (e.g., disk full, missing permissions). It is also unnecessarily complex: the `transformers` processor can handle in-memory numpy arrays directly. Passing the audio arrays straight to the processor simplifies the code and improves performance and robustness.
```python
self._build_processor()

# Handle audio input if provided
if audio_input is not None:
    audios, sr = audio_input
    # The audio URLs are just placeholders for the chat template.
    # The actual audio data is passed in the `audios` argument below.
    wavs_query = [
        {"type": "audio", "audio_url": f"placeholder_{i}.wav"}
        for i in range(len(audios))
    ]
    text_query = [{"type": "text", "text": text_input}]
    query = [
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": wavs_query + text_query},
    ]
    # Apply chat template
    text = self.processor.apply_chat_template(
        query, add_generation_prompt=True, tokenize=False
    )
    # Process inputs with both text and audio
    inputs = self.processor(
        text=text,
        audios=audios,
        sampling_rate=sr,
        return_tensors="np",
        padding=True,
    )
else:
    # Text-only processing
    inputs = self.processor(text=text_input, return_tensors="np", padding=True)

return inputs
```
This pull request is now in conflict :(
for more information, see https://pre-commit.ci
@cyhuang-tw, I think your errors seem to be reproducible.
@cyhuang-tw, again, this PR seems to have an issue in
Yes, I am running the tests locally.
Force-pushed from 8733f77 to 98755b6
Thanks a lot!
What did you change?
Added a recipe for running Qwen2-Audio-7B-Chat on the ASR task in Dynamic-SUPERB Phase-2.
Why did you make this change?
To provide a ready-to-use baseline for evaluating spoken language models on Dynamic-SUPERB.
Is your PR small enough?
No. This PR modifies 30 files with around 1,360 lines added, primarily within the new recipe directory.
Additional Context
Qwen2-Audio-7B-Chat: [paper], [GitHub]
Dynamic-SUPERB Phase-2: [paper], [GitHub]