Support SOT training on LibriMix data. #4861
Conversation
simpleoier left a comment
Thanks @pengchengguo !
@@ -0,0 +1,90 @@
# network architecture
The PIT model may use a different data preparation from SOT. It may be a bit surprising to put this config here, especially since this recipe folder is under `sot_asr1`. Maybe we can explicitly add some notes about this config, e.g. which data preparation it requires, its purpose, how it differs, and how to use it.
The motivation for including PIT is to compare PIT and SOT on the same dataset (overlapped speech with a certain time delay). It is ok for me to remove it and only keep the SOT file here~
```yaml
frontend_conf:
  frontend_conf:
    upstream: wavlm_local
    path_or_url: "/home/work_nfs6/pcguo/asr/librimix/hub/wavlm_large.pt"
```
This path is for your local experiments.
```yaml
frontend_conf:
  frontend_conf:
    upstream: wavlm_local
    path_or_url: "/home/work_nfs6/pcguo/asr/librimix/hub/wavlm_large.pt"
```
The path in this config is specific to your own environment.
This pull request is now in conflict :(
Codecov Report
```
@@           Coverage Diff            @@
##           master    #4861    +/-  ##
=======================================
  Coverage   75.86%   75.86%
=======================================
  Files         615      615
  Lines       54684    54689      +5
=======================================
+ Hits        41487    41491      +4
- Misses      13197    13198      +1
```
sw005320 left a comment
- Does it have to be `sot_asr1`? Since it just changes the config, I think we can move this to `asr1` and change `local/data.sh` to `local/data_sot.sh`, which is then called from the original `local/data.sh`. If this makes conflicts with the existing data directory, your current idea is better.
- It is better to add some descriptions to README.md, for example, why we need CSV files, what the SOT text format is, and how `<sc>` symbols are treated in the tokenizer through the `--add_nonsplit_symbol` option.
egs2/TEMPLATE/asr1/asr.sh (Outdated)

```shell
_opts="--non_linguistic_symbols ${nlsyms_txt}"

if ${sot_asr} && [ "${token_type}" = char ]; then
```
It is better to briefly explain the SOT-based text format with an example here
only char case?
Then, again, it's better to explain it.
- The main difference between `sot_asr1` and `asr1` is the data simulation process and the training data. If merging these two directories, we may need to introduce additional files (e.g. `data/{train_sot,dev_sot,test_sot}`, `run_sot.sh`). I think it is OK to add some files, but the results in README.md cannot be compared with each other (due to the different training data, full-overlapped vs. partial-overlapped).
- I will add more descriptions about SOT.
- "only char case?": in our experiments, SOT only converges well on char units, so we only considered this case. I will check the compatibility with other units like BPE.
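To make the SOT target format concrete, here is a small hypothetical sketch (not the recipe's actual preparation code; the `utterances` structure is assumed for illustration): per-speaker transcripts are serialized in first-in-first-out order of speaker start time and joined by the `<sc>` speaker-change symbol, as in the SOT paper.

```python
def build_sot_reference(utterances, sep="<sc>"):
    """Serialize per-speaker transcripts into a single SOT target.

    Hypothetical helper: each utterance is a dict with 'start'
    (seconds) and 'text'; transcripts are concatenated in order of
    start time (FIFO), with the speaker-change symbol in between.
    """
    ordered = sorted(utterances, key=lambda u: u["start"])
    return f" {sep} ".join(u["text"] for u in ordered)

print(build_sot_reference([
    {"start": 1.2, "text": "how are you"},
    {"start": 0.0, "text": "hello world"},
]))
# hello world <sc> how are you
```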
only char case?
When we prepare the text, `<sc>` is inserted between speakers, with a space on the left and right.
- For the word case, `<sc>` is automatically regarded as a single word unit, so no special processing is needed.
- For the BPE case, we use `_opts_spm+=" --user_defined_symbols=<sc>"` to take care of it. It will not be mixed into other BPE subwords.
- For the char case, we need to add this special argument to let the text processing know this is a single unit, so it will not be split into `<`, `s`, `c`, `>`.
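A minimal sketch of what the non-split behavior amounts to for the char case (a hypothetical helper for illustration, not ESPnet's actual tokenizer code): listed symbols survive as single tokens while everything else is split character by character.

```python
import re

def char_tokenize(text, nonsplit_symbols=("<sc>",)):
    """Split text into characters, keeping each non-split symbol whole.

    Hypothetical illustration of the --add_nonsplit_symbol idea:
    "<sc>" stays one token instead of becoming <, s, c, >.
    """
    pattern = "|".join(re.escape(s) for s in nonsplit_symbols)
    tokens = []
    # A capturing group makes re.split keep the matched symbols.
    for piece in re.split(f"({pattern})", text):
        if piece in nonsplit_symbols:
            tokens.append(piece)
        else:
            tokens.extend(piece)
    return tokens

print(char_tokenize("HI <sc> YO"))
# ['H', 'I', ' ', '<sc>', ' ', 'Y', 'O']
```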
@@ -0,0 +1,154 @@
842M Libri2Mix/wav16k/max/dev/s2
I will remove it.
@@ -0,0 +1,3001 @@
mixture_ID,source_1_path,source_1_gain,source_2_path,source_2_gain,noise_path,noise_gain,offset
Can you explain why this is needed?
Is it to avoid the full overlap in the original LibriMix?
Yes, the updated CSV file includes an additional column, "offset", which is used as the time delay when simulating overlapped data. Here we randomly generate offsets from 1 s to 1.5 s.
The motivation for fixing the offsets is to allow people to do a fair comparison among different models on the same simulated data.
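As a rough sketch of how such an offset column could be applied during simulation (an assumed illustration, not the actual LibriMix generation script): the second source is delayed by the offset before the two signals are summed, so the overlap becomes partial rather than full.

```python
def mix_with_offset(source1, source2, offset_sec, sr=16000):
    """Sum two mono signals, delaying the second by offset_sec seconds.

    Assumed illustration of the 'offset' CSV column: with a 1.0-1.5 s
    delay the second speaker starts inside the first one's utterance,
    producing partially overlapped speech instead of full overlap.
    """
    delay = int(round(offset_sec * sr))
    length = max(len(source1), delay + len(source2))
    mixture = [0.0] * length
    for i, sample in enumerate(source1):
        mixture[i] += sample
    for i, sample in enumerate(source2):
        mixture[delay + i] += sample
    return mixture

# e.g. with sr=4 and a 0.5 s offset, the second source starts at sample 2
print(mix_with_offset([1.0, 1.0, 1.0], [1.0, 1.0], 0.5, sr=4))
# [1.0, 1.0, 2.0, 1.0]
```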
This pull request is now in conflict :(
@pengchengguo, can we try to finish this PR?
Sorry for the delay, will finish it this weekend.
There is a conflict due to the recent refactoring of
Sure, I am checking it now.
LGTM, thanks a lot!
Sorry, it seems the change (remove useless
OK!
Xuankai and I would like to include SOT training (https://arxiv.org/pdf/2003.12687.pdf) in ESPnet.
We modify the CSV files of the original LibriMix repo by adding random overlap offsets of 1-1.5 s. This PR provides a unified simulation protocol and a comparable benchmark for non-fully-overlapped multi-speaker ASR.
Note that it is not ready for merging yet. We will update some results and pre-trained models later.
TODO: