Add SWBD text processing fix #5941

Merged
sw005320 merged 5 commits into espnet:master from siddhu001:Fix_SWBD_bug
Nov 12, 2024

Conversation

@siddhu001
Copy link
Collaborator

What?

This is in reference to issues #726 and #5936. Based on those discussions, I have made the necessary adjustment to the Perl command used in the SWBD text processing.

Why?

The current Perl command executed on transcript files:

perl -pe 's| \(\%.*\)||g'

had a problem: because `.*` matches greedily, a line containing two parenthesized groups was treated as a single match spanning from the first opening parenthesis to the last closing one, removing more content than intended.

For example, when running the script, this line:

"This is huh (%HESITATION) a not so (%HESITATION) inspired example"
was processed into:

"This is huh inspired example"
To address this, I updated the command as suggested, adding a question mark ? after the asterisk * in the regular expression:

perl -pe 's| \(\%.*?\)||g'

Impact of the Change

After implementing this fix, I ran the data processing and observed two types of changes:

Correction of Errors:

Original Text: BUT (%HESITATION) USED TO GO OUT YOU KNOW (WITH) SEVEN EIGHT TEN THAT RANGE
Before the Fix: "but seven eight ten that range"
After the Fix: "but used to go out you know with seven eight ten that range"

This demonstrates that there was a bug in the original SWBD data processing, and the fix effectively resolves this issue by preserving the intended content.

Handling Disfluencies:

Original Text: (%HESITATION) I (JU-) I JUST SURVIVED LOS ANGELES BUT I WAS JUST DOWN THERE LAST WEEK SO
Before the Fix: "i just survived los angeles but i was just down there last week so"
After the Fix: "i ju- i just survived los angeles but i was just down there last week so"

The previous version of the SWBD data processing tended to remove repetitions and disfluencies, which could be useful in some cases. However, with the current fix, we retain the disfluencies as they appear in the original audio, which I believe is preferable since it aligns more closely with the original transcripts.

@siddhu001
Copy link
Collaborator Author

@sw005320 can I merge this PR?

@sw005320
Copy link
Contributor

Would it change the results?
If so, I think it's better to include them.
Can you do that?

@siddhu001
Copy link
Collaborator Author

Sure thing! I will check if this leads to a change in results!

@mergify mergify bot added the README label Nov 11, 2024
@siddhu001
Copy link
Collaborator Author

siddhu001 commented Nov 11, 2024

Hi @sw005320,

I ran the previously trained model (https://huggingface.co/pyf98/swbd_e_branchformer) with the updated evaluation after the data preparation fix and observed improvements across all test sets and metrics (as mentioned in the README https://github.com/espnet/espnet/pull/5941/files#diff-4bfb1426c3d1b57fa5c48dc2f0a0112aaef20b084b99cd61acde09cd343d9094). I also addressed a minor bug in local/score.sh related to incorrect paths. Given these results, I believe it would be beneficial to proceed with merging this PR!
Thanks!

@sw005320 sw005320 added this to the v.202412 milestone Nov 11, 2024
@sw005320
Copy link
Contributor

Cool!
Maybe, we reached the SOTA number.

BTW, this will affect training.
So, can you also run training (with @pyf98?)?
We might get further improvements.

@sw005320
Copy link
Contributor

I'll merge this PR after CI, but if you have the bandwidth, please also work on training.

@siddhu001
Copy link
Collaborator Author

Thanks a lot Shinji!

Currently, in local/data.sh, the Perl command is applied only to the eval2000 transcription, so I initially thought this fix wouldn’t impact model training. I’ll take a closer look and discuss with @pyf98 to check if there’s a similar issue with generating the training and validation transcripts. If I find any related issues, I’ll open another PR to update the training accordingly.

Thanks again for your feedback!

@sw005320
Copy link
Contributor

Currently, in local/data.sh, the Perl command is applied only to the eval2000 transcription, so I initially thought this fix wouldn’t impact model training.

Oh, I see.
Then, you may not have to do it.

@siddhu001
Copy link
Collaborator Author

@sw005320, all the CI checks have passed and I will merge this PR soon!

@sw005320 sw005320 merged commit 19bd8f1 into espnet:master Nov 12, 2024
@sw005320
Copy link
Contributor

Thanks!

Shikhar-S pushed a commit to Shikhar-S/espnet that referenced this pull request Mar 13, 2025