Adding general data augmentation methods for speech preprocessing #5370
sw005320 merged 25 commits into espnet:master from
Conversation
This pull request is now in conflict :(
Codecov Report
@@ Coverage Diff @@
## master #5370 +/- ##
==========================================
+ Coverage 77.13% 77.19% +0.05%
==========================================
Files 678 679 +1
Lines 61537 61703 +166
==========================================
+ Hits 47465 47630 +165
- Misses 14072 14073 +1
@Jungjee, can you review this PR?
espnet2/layers/augmentation.py (Outdated)
    -4 for shifting pitch down by 4/`bins_per_octave` octaves
    4 for shifting pitch up by 4/`bins_per_octave` octaves
    bins_per_octave (int): number of steps per octave
    n_fft (int): length of FFT (in second)
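For clarity, the `n_steps`/`bins_per_octave` convention in the docstring above maps to a multiplicative frequency ratio. A standalone sketch (illustrative only, not ESPnet code):

```python
def pitch_ratio(n_steps: int, bins_per_octave: int = 12) -> float:
    """Frequency ratio for a pitch shift of n_steps out of
    bins_per_octave steps per octave: 2 ** (n_steps / bins_per_octave)."""
    return 2.0 ** (n_steps / bins_per_octave)

# +12 steps with 12 bins per octave is one full octave up (ratio 2.0);
# -4 steps shifts down by 4/12 of an octave (ratio ~0.794).
up_octave = pitch_ratio(12)
down_third = pitch_ratio(-4)
```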
Thanks for noticing that!
    source_sample_rate = source_sample_rate // gcd
    target_sample_rate = target_sample_rate // gcd

    ret = torchaudio.functional.resample(
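Dividing both rates by their GCD, as in the snippet above, keeps the resampling ratio identical while making the internal filter much cheaper. A minimal stdlib-only sketch of the reduction step:

```python
import math

def reduce_rates(source_sr: int, target_sr: int) -> tuple:
    """Reduce a resampling ratio to lowest terms, e.g. 48000/16000 -> 3/1."""
    g = math.gcd(source_sr, target_sr)
    return source_sr // g, target_sr // g

# The reduced pair describes the same ratio, so the resampled output
# is unchanged while the polyphase filter bank is much smaller.
reduce_rates(48000, 16000)  # -> (3, 1)
reduce_rates(44100, 16000)  # -> (441, 160)
```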
Just questions.
Did you consider applying one without pitch shift? Would it cause significantly more computation?
Also, how is the training speed with this augmentation (any bottlenecks in data loading)?
Would time_stretch be equal to speed_perturb with a factor > 1, except for the pitch?
speed_perturb and time_stretch are two different time-scaling methods. The former changes the pitch while the latter does not. I think it depends on the use case, so I just provide both for the user to choose from.
I haven't strictly tested the speed difference yet. Will run some tests later.
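A small NumPy illustration of the difference (illustrative only, not the ESPnet implementation): naive decimation, as in speed perturbation, both shortens the signal and raises its pitch when played back at the original rate, whereas a phase-vocoder time stretch would change only the duration.

```python
import numpy as np

fs = 16000
t = np.arange(fs) / fs
tone = np.sin(2 * np.pi * 440.0 * t)  # 1 s of a 440 Hz tone

# "Speed perturbation" by factor 2 via naive decimation: the signal is
# half as long, and at the same playback rate its pitch doubles.
fast = tone[::2]

def dominant_hz(x, fs):
    """Frequency of the largest magnitude bin in the real FFT."""
    spec = np.abs(np.fft.rfft(x))
    return np.fft.rfftfreq(len(x), 1 / fs)[np.argmax(spec)]

dominant_hz(tone, fs)  # ~440 Hz
dominant_hz(fast, fs)  # ~880 Hz: duration and pitch both changed
```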
        waveform, n_fft, hop_length, win_length, window=window, return_complex=True
    )
    freq = spec.size(-2)
    phase_advance = torch.linspace(0, math.pi * hop_length, freq)[..., None]
is [..., None] equivalent to .unsqueeze(-1) here?
Yes, they are the same operation.
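A quick check of that equivalence, shown here with NumPy, which uses the same `...`/`None` indexing convention as PyTorch (`x[..., None]` appends a trailing axis of size 1, exactly like `unsqueeze(-1)`):

```python
import numpy as np

x = np.zeros((3, 5))

# Indexing with [..., None] appends a trailing axis of size 1,
# equivalent to torch.Tensor.unsqueeze(-1) / np.expand_dims(x, -1).
a = x[..., None]
b = np.expand_dims(x, -1)

assert a.shape == b.shape == (3, 5, 1)
```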
    Returns:
        ret (torch.Tensor): compressed signal (..., time)
    """
    ret = torchaudio.functional.apply_codec(
How about adding a warning or exception so this is not called with an unsupported torch version? (Or in a different place, because if you put it here, it could be called too often.)
For now, I think I can just raise NotImplementedError for this function.
    if rir_path is not None:
-       rir, _ = soundfile.read(rir_path, dtype=np.float64, always_2d=True)
+       rir, fs = soundfile.read(rir_path, dtype=np.float64, always_2d=True)
+       if tgt_fs and fs != tgt_fs:
maybe better to warn or raise something, because a sample rate mismatch may not be intended
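One possible shape for that check (hypothetical helper name; the actual resampling call is left to the caller, e.g. via `torchaudio.functional.resample`):

```python
import warnings

def check_rir_fs(fs, tgt_fs):
    """Warn when the RIR sample rate differs from the target rate.

    Returns True when resampling is needed, so the caller can then
    resample the RIR before convolving it with the speech signal.
    """
    if tgt_fs is not None and fs != tgt_fs:
        warnings.warn(
            "RIR sample rate (%d Hz) != target (%d Hz); resampling"
            % (fs, tgt_fs)
        )
        return True
    return False
```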
Jungjee left a comment
LGTM, mostly added suggestions/questions, not mandatory.
Thanks a lot!

Strangely, I can locally pass the test in

This pull request is now in conflict :(

Thanks, @Emrys365!
What?
This PR adds a series of data augmentation techniques for preprocessing speech data in various tasks:
The supported data augmentation techniques include:
The data augmentation methods can be easily configured via the yaml file:
Why?
Current preprocessors are not flexible enough to support applying multiple data augmentation methods at the same time.