Description
Hello! Thank you for your excellent research.
I designed the architecture so that the output of the WavLM-Large model is concatenated with the encoder output of a speech enhancement (SE) model and passed through the SE model's decoder. During training I used randomly cropped 2-second utterances, and performance was very good. However, when I ran inference on real speech of various lengths, performance degraded badly. When I instead cut the input into 2-second chunks and processed each chunk separately, it worked properly (performance was good again).
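For context, this is the kind of chunked inference I mean. It is only a minimal sketch, assuming the SE model maps a waveform to an equal-length waveform; the helper name `enhance_long` and the zero-padding of the final chunk are illustrative, not my exact code:

~~~
import torch

def enhance_long(model, wav, chunk_size=32000):
    # Hypothetical helper: run a model trained on 2 s crops over a long
    # waveform by processing fixed 2 s chunks (32000 samples at 16 kHz)
    # and concatenating the outputs.
    outs = []
    for start in range(0, wav.size(-1), chunk_size):
        chunk = wav[..., start:start + chunk_size]
        pad = chunk_size - chunk.size(-1)
        if pad > 0:  # zero-pad the final short chunk to a full 2 s
            chunk = torch.nn.functional.pad(chunk, (0, pad))
        with torch.no_grad():
            outs.append(model(chunk))
    return torch.cat(outs, dim=-1)[..., :wav.size(-1)]
~~~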
I greatly appreciate your time and expertise in helping me understand this issue.
-
The SE model uses a 512-sample window, a 256-sample hop, and a 512-point FFT. For the same input speech, WavLM produces fewer time frames than the SE model's STFT, so I linearly interpolate the WavLM features along the time axis to match the SE model's number of frames.
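To make the mismatch concrete, here is a rough sketch of the frame-count arithmetic, assuming 16 kHz audio and WavLM's usual 20 ms convolutional stride; the exact counts depend on padding, and the numbers are illustrative:

~~~
import torch
from torch.nn import functional

sr = 16000
t = 2 * sr  # 2-second clip = 32000 samples

# SE STFT: hop 256 -> one frame every 16 ms; with centered padding,
# roughly t // 256 + 1 frames.
se_frames = t // 256 + 1  # 126

# WavLM conv front-end: ~400-sample receptive field, 320-sample (20 ms)
# stride, so fewer frames for the same clip.
wavlm_frames = (t - 400) // 320 + 1  # 99

# Align the time axes by linear interpolation, as in the snippet below.
hs = torch.randn(1, wavlm_frames, 1024)  # dummy WavLM-Large features
hs = hs.permute(0, 2, 1)                 # (batch, dim, time)
hs = functional.interpolate(hs, size=se_frames, mode='linear')
print(hs.shape)                          # torch.Size([1, 1024, 126])
~~~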
-
code
~~~
import torch
from torch.nn import functional

from s3prl.nn import S3PRLUpstream, Featurizer

# WavLM-Large upstream plus a layer-weighted featurizer over its hidden states.
ssl_model = S3PRLUpstream("wavlm_large")
featurizer = Featurizer(ssl_model)
~~~
...
~~~
with torch.no_grad():
    # All hidden states from WavLM; inputs are fixed-length chunks.
    all_hs, all_hs_len = ssl_model(inputs, torch.LongTensor([opt.chunk_size] * opt.batch_size))

# Weighted sum of the hidden layers -> (batch, time, dim).
hs, _ = featurizer(all_hs, all_hs_len)
hs = hs.permute(0, 2, 1)  # (batch, dim, time)
# Stretch the WavLM time axis to match the SE encoder output `out`.
hs = functional.interpolate(hs, out.size(-1), mode='linear')
# Concatenate along channels and pass to the SE decoder.
out = torch.cat([hs, out], dim=1)
~~~
...