Add SWBD text processing fix #5941

Merged
sw005320 merged 5 commits into espnet:master from siddhu001:Fix_SWBD_bug
Nov 12, 2024

Conversation

@siddhu001
Copy link
Collaborator

What?

This is in reference to issues #726 and #5936. Based on those discussions, I have made the necessary adjustment to the Perl command used in the SWBD text processing.

Why?

The current Perl command executed on transcript files:

perl -pe 's| \(\%.*\)||g'

had a problem: because `.*` matches greedily, a line containing two parenthesized groups was treated as a single match spanning from the first opening parenthesis to the last closing one, removing more content than intended.

For example, when running the script, this line:

"This is huh (%HESITATION) a not so (%HESITATION) inspired example"
was processed into:

"This is huh inspired example"
To address this, I updated the command as suggested, adding a question mark ? after the asterisk * in the regular expression:

perl -pe 's| \(\%.*?\)||g'

Impact of the Change

After implementing this fix, I ran the data processing and observed two types of changes:

Correction of Errors:

Original Text: BUT (%HESITATION) USED TO GO OUT YOU KNOW (WITH) SEVEN EIGHT TEN THAT RANGE
Before the Fix: "but seven eight ten that range"
After the Fix: "but used to go out you know with seven eight ten that range"

This demonstrates that there was a bug in the original SWBD data processing, and the fix effectively resolves this issue by preserving the intended content.

Handling Disfluencies:

Original Text: (%HESITATION) I (JU-) I JUST SURVIVED LOS ANGELES BUT I WAS JUST DOWN THERE LAST WEEK SO
Before the Fix: "i just survived los angeles but i was just down there last week so"
After the Fix: "i ju- i just survived los angeles but i was just down there last week so"

The previous version of the SWBD data processing tended to remove repetitions and disfluencies, which could be useful in some cases. However, with the current fix, we retain the disfluencies as they appear in the original audio, which I believe is preferable since it aligns more closely with the original transcripts.

@siddhu001
Copy link
Collaborator Author

@sw005320 can I merge this PR?

@sw005320
Copy link
Contributor

Would it change the results?
If so, I think it's better to include them.
Can you do that?

@siddhu001
Copy link
Collaborator Author

Sure thing! I will check if this leads to a change in results!

@mergify mergify bot added the README label Nov 11, 2024
@siddhu001
Copy link
Collaborator Author

siddhu001 commented Nov 11, 2024

Hi @sw005320,

I ran the previously trained model (https://huggingface.co/pyf98/swbd_e_branchformer) with the updated evaluation after the data preparation fix and observed improvements across all test sets and metrics (as mentioned in the README https://github.com/espnet/espnet/pull/5941/files#diff-4bfb1426c3d1b57fa5c48dc2f0a0112aaef20b084b99cd61acde09cd343d9094). I also addressed a minor bug in local/score.sh related to incorrect paths. Given these results, I believe it would be beneficial to proceed with merging this PR!
Thanks!

@sw005320 sw005320 added this to the v.202412 milestone Nov 11, 2024
@sw005320
Copy link
Contributor

Cool!
Maybe, we reached the SOTA number.

BTW, this will affect training.
So, can you also run training (with @pyf98?)?
We might get further improvements.

@sw005320
Copy link
Contributor

I'll merge this PR after CI, but if you have the bandwidth, please also work on training.

@siddhu001
Copy link
Collaborator Author

Thanks a lot Shinji!

Currently, in local/data.sh, the Perl command is applied only to the eval2000 transcription, so I initially thought this fix wouldn’t impact model training. I’ll take a closer look and discuss with @pyf98 to check if there’s a similar issue with generating the training and validation transcripts. If I find any related issues, I’ll open another PR to update the training accordingly.

Thanks again for your feedback!

@sw005320
Copy link
Contributor

Currently, in local/data.sh, the Perl command is applied only to the eval2000 transcription, so I initially thought this fix wouldn’t impact model training.

Oh, I see.
Then, you may not have to do it.

@siddhu001
Copy link
Collaborator Author

@sw005320, all the CI checks have passed and I will merge this PR soon!

@sw005320 sw005320 merged commit 19bd8f1 into espnet:master Nov 12, 2024
@sw005320
Copy link
Contributor

Thanks!

Shikhar-S pushed a commit to Shikhar-S/espnet that referenced this pull request Mar 13, 2025