doc: update OWSM data preparation instructions #6026
Codecov Report

All modified and coverable lines are covered by tests ✅

Additional details and impacted files

| Coverage Diff | master | #6026 | +/- |
|---|---|---|---|
| Coverage | 20.64% | 12.70% | -7.95% |
| Files | 93 | 858 | +765 |
| Lines | 10195 | 80496 | +70301 |
| Hits | 2105 | 10226 | +8121 |
| Misses | 8090 | 70270 | +62180 |
@pyf98, is this PR OK to merge?
egs2/TEMPLATE/s2t1/README.md
Outdated
  - This should not contain any special tokens except for `<na>`. In the example above, take the text between `<sop>` and `<sos>` and put it here.
- `text.ctc` contains the ASR transcript without any special token, which is used for the CTC loss. For ASR utterances, this can be derived from `text`, but for ST utterances, this is in a different language. If the ASR transcription is not available, `<na>` will be used.
  - This should not contain any special tokens. Just take the text between `<task>` and `<eos>` and put it here (no timestamps).
> Just take the text between `<task>` and `<eos>` and put it here (no timestamps).
For ASR, yes. But for ST, `text.ctc` is a different text: it is the ASR transcript, not the translation.
good catch! just fixed
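The rule discussed in this thread can be sketched roughly as follows. This is only an illustration, not the recipe's actual preparation code: the helper names (`between`, `strip_special`, `make_text_prev`, `make_text_ctc`), the task tokens `<asr>` and `<st_deu>`, and the sample lines are all assumptions made for the example, and utterance IDs are left out for brevity.

```python
import re

NA = "<na>"  # placeholder used when a field is not available


def between(line, start, end):
    """Return the text between the first `start` and the next `end`, or None."""
    i = line.find(start)
    if i < 0:
        return None
    i += len(start)
    j = line.find(end, i)
    if j < 0:
        return None
    return line[i:j].strip()


def strip_special(text):
    """Drop anything that looks like a <...> token (e.g. timestamps)."""
    return " ".join(re.sub(r"<[^<>\s]+>", " ", text).split())


def make_text_prev(line):
    """text.prev: the text between <sop> and <sos>, or <na> if empty."""
    prev = between(line, "<sop>", "<sos>")
    return prev if prev else NA


def make_text_ctc(line, task, asr_transcript=None):
    """text.ctc: always the ASR transcript, never the translation.

    `task` is the (hypothetical) task token found in `line`.
    """
    if task == "<asr>":
        # ASR: the target text is already the transcript, so derive it from `text`.
        body = between(line, task, "<eos>")
        cleaned = strip_special(body) if body else ""
        return cleaned if cleaned else NA
    # ST: the target text is in a different language, so the transcript
    # must be supplied separately; fall back to <na> if it is missing.
    return asr_transcript if asr_transcript else NA


# Hypothetical example lines, assumed for illustration only.
asr_line = "<sop> hello there <sos><eng><asr><0.00> how are you<1.20><eos>"
st_line = "<sop> <na> <sos><eng><st_deu><notimestamps> wie geht es dir<eos>"

print(make_text_prev(asr_line))
print(make_text_ctc(asr_line, "<asr>"))
print(make_text_ctc(st_line, "<st_deu>", asr_transcript="how are you"))
print(make_text_ctc(st_line, "<st_deu>"))
```

For the ST line, the text between the task token and `<eos>` is the translation, so it is deliberately not used for `text.ctc`.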
Is it OK to merge this PR?
LGTM!