[ESPnet-3] Merge master into espnet3 branch #6263
Masao-Someki merged 138 commits into espnet:espnet3
for more information, see https://pre-commit.ci
Get forced alignments from CTC model
SpeechLM Data Infra: dataset management
Fix HF tests by switching them to upstream testing models
Code Review
This pull request merges a significant number of changes from the master branch, introducing new features like Language Identification (LID) and SpeechLM tasks, along with forced alignment capabilities and new data samplers. The updates also include robustness improvements, such as better dependency handling and support for Apple's MPS devices. While the majority of the changes appear solid and well-integrated with new tests, I've identified a critical issue in the new force_align.py script that could lead to incorrect results by silently truncating input audio. My review focuses on this critical point to ensure the script's correctness.
def prepare_speech(speech, model, device):
    """
    Prepare speech tensor for model input.

    Args:
        speech: Audio waveform (numpy array or torch tensor)
        model: Speech2Text model instance
        device: Device to place tensor on

    Returns:
        Tuple of (speech_tensor, speech_lengths)
    """
    if isinstance(speech, np.ndarray):
        speech = torch.tensor(speech)

    if speech.dim() > 1:
        assert (
            speech.dim() == 2 and speech.size(1) == 1
        ), f"Speech of size {speech.size()} is not supported"
        speech = speech.squeeze(1)

    speech_length = int(
        model.preprocessor_conf["fs"] * model.preprocessor_conf["speech_length"]
    )
    original_length = speech.size(-1)

    if original_length >= speech_length:
        speech = speech[:speech_length]
    else:
        speech = F.pad(speech, (0, speech_length - original_length))
    speech = speech.unsqueeze(0).to(getattr(torch, model.dtype))
    speech_lengths = speech.new_full([1], dtype=torch.long, fill_value=speech.shape[1])
    return speech, speech_lengths
The current implementation of prepare_speech truncates or pads the input audio to a fixed length derived from the model's training configuration. This will cause any audio file longer than the configured speech_length to be silently truncated, leading to incomplete and incorrect forced alignment results. An alignment utility should process the entire audio file to be useful in a general context.
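To make the effect of the fixed-length step concrete, here is a minimal NumPy sketch of the same truncate-or-pad logic. It is an illustration only, not code from the PR: the helper name `fit_to_fixed_length` is hypothetical, and the 16 kHz sampling rate and 30-second window are assumed values standing in for `model.preprocessor_conf["fs"]` and `model.preprocessor_conf["speech_length"]`.

```python
import numpy as np


def fit_to_fixed_length(speech: np.ndarray, fs: int, speech_length_sec: float) -> np.ndarray:
    """Truncate or zero-pad a 1-D waveform to fs * speech_length_sec samples.

    Hypothetical helper mirroring the truncate/pad branch under review.
    """
    target = int(fs * speech_length_sec)
    original = speech.shape[-1]
    if original >= target:
        # Everything past the fixed window is dropped without any warning.
        return speech[:target]
    # Shorter inputs are right-padded with zeros up to the fixed window.
    return np.pad(speech, (0, target - original))


fs = 16000          # assumed sampling rate (Hz)
window_sec = 30.0   # assumed configured speech_length (seconds)

long_audio = np.zeros(45 * fs, dtype=np.float32)   # 45 s of audio
short_audio = np.zeros(10 * fs, dtype=np.float32)  # 10 s of audio

fixed_long = fit_to_fixed_length(long_audio, fs, window_sec)
fixed_short = fit_to_fixed_length(short_audio, fs, window_sec)

# Seconds of audio that never reach the model for the long input.
dropped_sec = (long_audio.shape[-1] - fixed_long.shape[-1]) / fs
# Seconds of silence appended to the short input.
padded_sec = (fixed_short.shape[-1] - short_audio.shape[-1]) / fs

print(f"long input: {dropped_sec:.1f} s dropped")
print(f"short input: {padded_sec:.1f} s of padding added")
```

Under these assumed values, any transcript text that falls in the dropped portion of the long input has no audio frames left to be aligned against.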
The suggested change removes this fixed-length processing, ensuring that the entire audio waveform is passed to the model for alignment.
def prepare_speech(speech, model, device):
    """
    Prepare speech tensor for model input.

    Args:
        speech: Audio waveform (numpy array or torch tensor)
        model: Speech2Text model instance
        device: Device to place tensor on

    Returns:
        Tuple of (speech_tensor, speech_lengths)
    """
    if isinstance(speech, np.ndarray):
        speech = torch.tensor(speech)

    if speech.dim() > 1:
        assert (
            speech.dim() == 2 and speech.size(1) == 1
        ), f"Speech of size {speech.size()} is not supported"
        speech = speech.squeeze(1)

    speech = speech.unsqueeze(0).to(getattr(torch, model.dtype))
    speech_lengths = speech.new_full([1], dtype=torch.long, fill_value=speech.shape[1])
    return speech, speech_lengths

Replace espnet -> espnet2.legacy
Codecov Report
❌ Patch coverage is
Additional details and impacted files
@@ Coverage Diff @@
## espnet3 #6263 +/- ##
============================================
+ Coverage 0 70.14% +70.14%
============================================
Files 0 751 +751
Lines 0 69057 +69057
============================================
+ Hits 0 48441 +48441
- Misses 0 20616 +20616
What did you change?
Merged master branch into espnet3 branch
Why did you make this change?
To fix the CI issue.
See #6178 for details
Is your PR small enough?
No, but this is just a merge PR.
Additional Context