Conversation
|
@sw005320 can I merge this PR? |
|
Would it change the results? |
|
Sure thing! I will check whether this leads to a change in the results! |
|
Hi @sw005320, I ran the previously trained model (https://huggingface.co/pyf98/swbd_e_branchformer) with the updated evaluation after the data preparation fix and observed improvements across all test sets and metrics (as mentioned in the README https://github.com/espnet/espnet/pull/5941/files#diff-4bfb1426c3d1b57fa5c48dc2f0a0112aaef20b084b99cd61acde09cd343d9094). I also addressed a minor bug in local/score.sh related to incorrect paths. Given these results, I believe it would be beneficial to proceed with merging this PR! |
|
Cool! BTW, this will affect training. |
|
I'll merge this PR after CI, but if you have the bandwidth, please also work on training. |
|
Thanks a lot Shinji! Currently, in local/data.sh, the Perl command is applied only to the eval2000 transcription, so I initially thought this fix wouldn’t impact model training. I’ll take a closer look and discuss with @pyf98 to check if there’s a similar issue with generating the training and validation transcripts. If I find any related issues, I’ll open another PR to update the training accordingly. Thanks again for your feedback! |
Oh, I see. |
|
@sw005320, all the CI checks have passed and I will merge this PR soon! |
|
Thanks! |
Add SWBD text processing fix
What?
This is in reference to issues #726 and #5936. Based on these discussions, I have made the necessary adjustment to the Perl command used in the SWBD text processing.
Why?
The Perl command currently executed on transcript files is:
perl -pe 's| \(\%.*\)||g'
Because .* is greedy, it had a problem where, if a transcription line contained two groups of parentheses with text inside, the match ran from the first opening parenthesis to the last closing one. The command therefore treated the two groups as a single block, removing more content than intended.
For example, when running the script, this line:
"This is huh (%HESITATION) a not so (%HESITATION) inspired example"
was processed into:
"This is huh inspired example"
To address this, I updated the command as suggested, adding a question mark ? after the asterisk * in the regular expression:
perl -pe 's| \(\%.*?\)||g'
Impact of the Change
After implementing this fix, I ran the data processing and observed two types of changes:
Correction of Errors:
This demonstrates that there was a bug in the original SWBD data processing, and the fix effectively resolves this issue by preserving the intended content.
Handling Disfluencies:
The previous version of the SWBD data processing tended to remove repetitions and disfluencies, which could be useful in some cases. However, with the current fix, we retain the disfluencies as they appear in the original audio, which I believe is preferable since it aligns more closely with the original transcripts.
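For anyone who wants to reproduce the difference between the old and the fixed command without running the full recipe, here is a minimal sketch in Python. It is not the code used in local/data.sh (that remains the perl -pe one-liner). It assumes that Python's re module treats the greedy .* and the lazy .*? the same way Perl does for this pattern, and the helper name strip_tags is hypothetical.

```python
import re

# Python mirrors of the two Perl substitutions from this PR.
# Assumption: for this simple pattern, Python's `re` and Perl agree on
# greedy (.*) versus lazy (.*?) matching. The recipe itself uses perl -pe.
GREEDY = re.compile(r" \(%.*\)")   # old command: s| \(\%.*\)||g
LAZY = re.compile(r" \(%.*?\)")    # fixed command: s| \(\%.*?\)||g


def strip_tags(line: str, pattern: re.Pattern) -> str:
    """Remove every ' (%...)' span matched by `pattern` (hypothetical helper)."""
    return pattern.sub("", line)


line = "This is huh (%HESITATION) a not so (%HESITATION) inspired example"

old = strip_tags(line, GREEDY)  # swallows everything between the two tags
new = strip_tags(line, LAZY)    # removes each tag separately

print(old)  # This is huh inspired example
print(new)  # This is huh a not so inspired example
```

Lines with zero or one tag come out identical under both patterns, so only lines containing two or more parenthesized groups should change after the fix.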