Update the Enh framework to support training with variable numbers of speakers #5414

sw005320 merged 13 commits into espnet:master from
Conversation
Codecov Report

```
@@            Coverage Diff             @@
##           master    #5414      +/-   ##
==========================================
- Coverage   77.17%   75.29%   -1.89%
==========================================
  Files         684      708      +24
  Lines       62686    65085    +2399
==========================================
+ Hits        48379    49006     +627
- Misses      14307    16079    +1772
==========================================
```

Flags with carried forward coverage won't be shown.

... and 51 files with indirect coverage changes.
espnet2/train/preprocessor.py (Outdated)

```python
]
length = speech_refs[0].shape[0]
tgt_length = self.speech_segment
assert length >= self.speech_segment, (length, tgt_length)
```
One thing worth discussing is whether we should allow the input signal to be shorter than the specified speech_segment.
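One possible relaxation, sketched here as a hypothetical helper (not the PR's actual code), would be to zero-pad inputs shorter than `speech_segment` instead of asserting:

```python
import numpy as np

def extract_segment(speech, segment, rng=None):
    """Randomly crop `speech` to `segment` samples.

    Hypothetical sketch: if the input is shorter than `segment`,
    it is zero-padded on the right instead of raising an assertion.
    """
    rng = rng or np.random.default_rng(0)
    length = speech.shape[0]
    if length < segment:
        # Alternative to `assert length >= segment`: pad short inputs.
        return np.pad(speech, (0, segment - length))
    start = int(rng.integers(0, length - segment + 1))
    return speech[start : start + segment]
```

Whether zero-padding is acceptable here depends on the loss function, which is exactly the point raised above.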
baf5d67 to 792329b
@simpleoier, can you review this PR?

@LiChenda, can you also review this PR?

Sure, I'll review it.
```python
Note:
    All audios in each sample must have the same sampling rates.
    Audios of different lengths in each sample will be right-padded with np.nan
```
Why padded with np.nan here?
It is just in case the signals defined in multiple columns have different lengths. These NaN values will be removed, or are ensured to be absent, in the preprocessor, e.g.,

espnet/espnet2/train/preprocessor.py
Lines 1162 to 1171 in 1bba669
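The pad-then-strip idea can be sketched as follows (hypothetical helper functions for illustration, not the actual ESPnet implementation):

```python
import numpy as np

def pad_refs_with_nan(refs):
    """Right-pad references of different lengths with NaN (collation step)."""
    max_len = max(r.shape[0] for r in refs)
    return [
        np.pad(r, (0, max_len - r.shape[0]), constant_values=np.nan)
        for r in refs
    ]

def strip_nan_padding(ref):
    """Drop the NaN padding again before the signal is actually used
    (preprocessor step)."""
    return ref[~np.isnan(ref)]
```

Using NaN rather than zero as the pad value makes the padding unambiguous: real audio can contain zeros, but never NaN.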
espnet2/tasks/enh.py (Outdated)

```python
"--speech_segment",
type=int_or_none,
default=None,
help="Truncate the audios to the specified length if not None",
```
Truncate the audio to the specified length (If I understand correctly, the length is in sampling points?)
Yes, it's the number of samples. I have updated the help string.
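For example, with illustrative numbers (not taken from the PR), a 4-second segment at 8 kHz corresponds to:

```python
# speech_segment is expressed in samples, not seconds.
fs = 8000              # sampling rate in Hz (example value)
segment_seconds = 4    # desired segment duration (example value)
speech_segment = segment_seconds * fs
print(speech_segment)  # 32000 samples
```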
LiChenda left a comment:

I left some comments on this PR. Do you have a plan to add some related recipes that have variable numbers of speakers?
```python
help="Truncate the audios to the specified length if not None",
)
group.add_argument(
    "--avoid_allzero_segment",
```
Is it necessary to provide this option? I feel the default True is fine.
Sometimes we may want to train the model to generate silence, and sometimes we may want to avoid silence as the training target, depending on the loss function we use. So I added this argument to enable both possibilities.
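The `avoid_allzero_segment` behavior could be sketched like this (a simplified, hypothetical helper, not the actual ESPnet implementation):

```python
import numpy as np

def choose_segment_start(speech_refs, segment, avoid_allzero_segment=True,
                         max_retries=10, rng=None):
    """Pick a random segment start; if requested, re-draw while any
    reference signal is all-zero inside the chosen window.

    Best-effort: gives up and returns the last draw after max_retries.
    """
    rng = rng or np.random.default_rng(0)
    length = speech_refs[0].shape[0]
    start = int(rng.integers(0, length - segment + 1))
    if avoid_allzero_segment:
        for _ in range(max_retries):
            if all(np.abs(r[start : start + segment]).sum() > 0
                   for r in speech_refs):
                break  # every reference has some energy in this window
            start = int(rng.integers(0, length - segment + 1))
    return start
```

With `avoid_allzero_segment=False`, the first draw is kept unconditionally, so all-silence targets remain possible.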
espnet2/train/dataset.py (Outdated)

```python
        f"Sampling rates are mismatched: {self.rate} != {rate}"
    )
if not self.allow_multi_rates:
    if self.rate is not None and self.rate != rate:
```
How about reducing the two `if`s into one?

```python
if not self.allow_multi_rates and self.rate is not None and self.rate != rate:
```
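The suggested combined condition can be exercised as a standalone sketch (a simplified, hypothetical helper; the real check lives inside espnet2/train/dataset.py):

```python
def check_rate(current_rate, rate, allow_multi_rates=False):
    """Validate a newly seen sampling rate against the one seen so far.

    Mirrors the reviewer's suggestion of collapsing the nested ifs
    into a single condition.
    """
    if not allow_multi_rates and current_rate is not None and current_rate != rate:
        raise RuntimeError(
            f"Sampling rates are mismatched: {current_rate} != {rate}"
        )
    return rate
```

Short-circuit evaluation keeps the semantics identical to the nested version: the rate comparison is only reached when multi-rate loading is disabled and a previous rate exists.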
Yes, the recipe will be added in the follow-up PRs soon.

Can we merge this PR?

Thanks a lot for your great efforts!
What?
This PR adds support for variable numbers of speakers when training speech separation models.
Major updates added in this PR:

- `--variable_num_refs` is added to specify whether we are training models with variable numbers of speakers in each mixture sample.
- A new format of `spk1.scp`, `dereverb1.scp`, and `enroll_spk1.scp` is introduced. Each row may contain a variable number of columns, where each column starting from the 2nd represents a different reference speaker.
- A new data type `variable_columns_sound` is added to espnet2/train/dataset.py and espnet2/train/iterable_dataset.py. It is used for `spk1.scp` and `dereverb1.scp` (if provided) when `variable_num_refs` is true.
- `enroll_spk1.scp` is loaded as the `text` type, so there is no problem if the enrollment audios of the same sample have different lengths.
- `speech_segment: Optional[int] = None` is added to `EnhPreprocessor` and `TSEPreprocessor` to allow loading only a fixed-length segment from each audio. This is useful if we want to control the balance between data of different lengths.
- `avoid_allzero_segment: bool = True` is added to `EnhPreprocessor` and `TSEPreprocessor`. This argument is only useful when `speech_segment` is specified. If this argument is True, the segmentation process will ensure that all reference audios in the selected segment are not all-zero.
- `flexible_numspk: bool = False` is added to `EnhPreprocessor` and `TSEPreprocessor` to handle variable-speaker-number inputs in `speech_ref1`, `dereverb_ref1`, and `enroll_ref1`.
- `flexible_numspk` is added to handle training with variable numbers of speakers.

In addition, some updates are added for future extension (no test is available now):
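For illustration, a variable-column `spk1.scp` under the new format might look like this (utterance IDs and file names are made up):

```
utt1 /data/utt1_spk1.wav /data/utt1_spk2.wav /data/utt1_spk3.wav
utt2 /data/utt2_spk1.wav /data/utt2_spk2.wav
```

Here `utt1` has three reference speakers and `utt2` has two, so different rows can carry different numbers of references.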
- `--allow_multi_rates` is added to allow loading audios with different sampling rates. This is useful if we want to build a universal model that can handle different sampling rates.
- `flexible_numspk` is added to handle training with variable numbers of speakers.
- A new argument `additional` is added to the `AbsExtractor` forward method. This makes it consistent with `AbsSeparator`, which is flexible for future extension.

Why?
Speech separation with an unknown number of speakers is a popular research topic, but it has not been supported in ESPnet due to the variable data size it entails.
Note
Currently, the inference and evaluation stages are not updated to support variable numbers of speakers. This part needs further discussion to reach a good design.