Issue if train_set or valid_set are included in test sets#4944

Closed
kamo-naoyuki wants to merge 1 commit into espnet:master from kamo-naoyuki:filtering

Conversation

@kamo-naoyuki (Collaborator) commented Feb 17, 2023

Issue:

If a test_set is also used as the train_set or valid_set in asr.sh, the test set is modified by stage 4 (Remove long/short utt).

Modify:

- Current behaviour: stage 3 writes ${data_feats}/org/${dset}; stage 4 writes the filtered copy to ${data_feats}/${dset}
- In this PR: stage 3 writes ${data_feats}/${dset}; stage 4 writes the filtered copy to ${data_feats}/${dset}_flt

I only modified asr.sh in this PR, but all templates have the same problem (due to my original template script...)
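The directory flow described above can be sketched roughly as follows. This is a minimal illustration with made-up paths, and a toy keep-list stands in for the real long/short-utterance check; it is not the actual asr.sh code:

```shell
# Toy illustration of the stage 3 -> stage 4 flow in this PR:
# stage 3 writes ${data_feats}/${dset}, and stage 4 writes the filtered
# copy to ${data_feats}/${dset}_flt, leaving the original set untouched.
data_feats=dump/raw
dset=test_clean

# Stage 3 (simplified): the unfiltered set.
mkdir -p "${data_feats}/${dset}"
printf 'IDa a.wav\nIDb b.wav\n' > "${data_feats}/${dset}/wav.scp"

# Stage 4 (simplified): a keep-list stands in for the long/short-utt filter.
mkdir -p "${data_feats}/${dset}_flt"
printf 'IDa\n' > keep_ids.txt
awk 'NR==FNR{keep[$1]=1; next} $1 in keep' keep_ids.txt \
    "${data_feats}/${dset}/wav.scp" > "${data_feats}/${dset}_flt/wav.scp"
```

Because stage 4 writes to a separate _flt directory, a set that doubles as a test set is no longer overwritten by the filtering.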

@sw005320

@mergify mergify bot added the ESPnet2 label Feb 17, 2023
@codecov codecov bot commented Feb 17, 2023

Codecov Report

Merging #4944 (1835a92) into master (9c7bde4) will increase coverage by 0.01%.
The diff coverage is 86.88%.

@@            Coverage Diff             @@
##           master    #4944      +/-   ##
==========================================
+ Coverage   76.63%   76.65%   +0.01%     
==========================================
  Files         604      604              
  Lines       53934    53992      +58     
==========================================
+ Hits        41334    41385      +51     
- Misses      12600    12607       +7     
Flag                       Coverage          Δ
test_integration_espnet1   66.33% <ø>        (ø)
test_integration_espnet2   47.42% <49.18%>   (+<0.01%) ⬆️
test_python                66.57% <81.96%>   (+0.01%) ⬆️
test_utils                 23.35% <ø>        (ø)

Flags with carried forward coverage won't be shown.

Impacted Files                                    Coverage           Δ
espnet2/samplers/build_batch_sampler.py           92.85% <ø>         (ø)
espnet2/train/iterable_dataset.py                 84.67% <75.00%>    (-0.80%) ⬇️
espnet2/tasks/abs_task.py                         75.90% <85.71%>    (+0.22%) ⬆️
espnet2/samplers/sorted_batch_sampler.py          87.50% <87.50%>    (ø)
espnet2/samplers/unsorted_batch_sampler.py        83.33% <87.50%>    (+0.83%) ⬆️
espnet2/samplers/folded_batch_sampler.py          85.55% <88.88%>    (+0.37%) ⬆️
espnet2/samplers/length_batch_sampler.py          87.80% <88.88%>    (+0.13%) ⬆️
espnet2/samplers/num_elements_batch_sampler.py    87.64% <88.88%>    (+0.14%) ⬆️
espnet2/main_funcs/collect_stats.py               90.90% <100.00%>   (+0.28%) ⬆️


@mergify mergify bot added the CI Travis, Circle CI, etc label Feb 17, 2023
@kamo-naoyuki kamo-naoyuki force-pushed the filtering branch 5 times, most recently from 03fd1f1 to 6c5d459 Compare February 19, 2023 05:37
@kamo-naoyuki (Collaborator, Author) commented

I changed my mind.

I implemented --filtered_train_key_text and --filtered_valid_key_text for espnet2/bin/*_train.py.

Given a text file containing the IDs to be filtered, the samples specified by this option are excluded from training.

e.g.

  • wav.scp:
    IDa a.wav
    IDb b.wav
    IDc b.wav
  • filtered_key.txt:
    IDb

In this case, IDb is excluded, and IDa and IDc remain for the training.

I also changed asr.sh to create filtered_key.txt at stage 4 instead of creating a new dataset.
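The exclusion described above can be sketched with a small awk one-liner over the example files from this comment. This is only an illustration of the semantics; it is not the actual espnet2 implementation behind --filtered_train_key_text:

```shell
# Recreate the example files from the comment above.
printf 'IDa a.wav\nIDb b.wav\nIDc b.wav\n' > wav.scp
printf 'IDb\n' > filtered_key.txt

# Read the IDs to exclude from filtered_key.txt (first file), then print
# only the wav.scp entries whose utterance ID is NOT in that set.
awk 'NR==FNR{skip[$1]=1; next} !($1 in skip)' filtered_key.txt wav.scp
# -> IDa a.wav
#    IDc b.wav
```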

@kamo-naoyuki (Collaborator, Author) commented

I changed my mind again.

Filtering short/long utterances via an option of the Python tool is the cleaner way from a recipe standpoint, but it could add some overhead at startup.

Creating another dataset is a dirtier approach, but it is actually more efficient for training speed.

I'll think about it.

@kamo-naoyuki kamo-naoyuki mentioned this pull request Mar 16, 2023

Labels: CI Travis, Circle CI, etc; ESPnet2
