Description
Hello! Thank you for your excellent research.
I designed the architecture so that the output of the WavLM-Large model is concatenated with the encoder output of a speech enhancement (SE) model and passed through the SE model's decoder. During training I used randomly cropped 2-second utterances, and performance was very good. However, when I ran inference on real speech of various lengths, performance degraded badly. When I instead cut the input into 2-second chunks and processed each chunk separately, it worked properly (performance was good again).
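For context, this is the kind of chunked inference I mean. It is only a minimal sketch, assuming the SE model maps a waveform to an equal-length waveform; the helper name `enhance_long` and the zero-padding of the final chunk are illustrative, not my exact code:

~~~
import torch

def enhance_long(model, wav, chunk_size=32000):
    # Hypothetical helper: run a model trained on 2 s crops over a long
    # waveform by processing fixed 2 s chunks (32000 samples at 16 kHz)
    # and concatenating the outputs.
    outs = []
    for start in range(0, wav.size(-1), chunk_size):
        chunk = wav[..., start:start + chunk_size]
        pad = chunk_size - chunk.size(-1)
        if pad > 0:  # zero-pad the final short chunk to a full 2 s
            chunk = torch.nn.functional.pad(chunk, (0, pad))
        with torch.no_grad():
            outs.append(model(chunk))
    return torch.cat(outs, dim=-1)[..., :wav.size(-1)]
~~~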
I greatly appreciate your time and expertise in helping me understand this issue.
-
The SE model uses a 512-sample window, a 256-sample hop, and a 512-point FFT. For the same input speech, WavLM produces fewer time frames than the SE model's STFT, so I linearly interpolate the WavLM features along the time axis to match the SE model's number of frames.
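To make the mismatch concrete, here is a rough sketch of the frame-count arithmetic, assuming 16 kHz audio and WavLM's usual 20 ms convolutional stride; the exact counts depend on padding, and the numbers are illustrative:

~~~
import torch
from torch.nn import functional

sr = 16000
t = 2 * sr  # 2-second clip = 32000 samples

# SE STFT: hop 256 -> one frame every 16 ms; with centered padding,
# roughly t // 256 + 1 frames.
se_frames = t // 256 + 1  # 126

# WavLM conv front-end: ~400-sample receptive field, 320-sample (20 ms)
# stride, so fewer frames for the same clip.
wavlm_frames = (t - 400) // 320 + 1  # 99

# Align the time axes by linear interpolation, as in the snippet below.
hs = torch.randn(1, wavlm_frames, 1024)  # dummy WavLM-Large features
hs = hs.permute(0, 2, 1)                 # (batch, dim, time)
hs = functional.interpolate(hs, size=se_frames, mode='linear')
print(hs.shape)                          # torch.Size([1, 1024, 126])
~~~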
-
code
~~~
import torch
from torch.nn import functional

from s3prl.nn import S3PRLUpstream, Featurizer

# WavLM-Large upstream plus a layer-weighted featurizer over its hidden states.
ssl_model = S3PRLUpstream("wavlm_large")
featurizer = Featurizer(ssl_model)
~~~
...
~~~
with torch.no_grad():
    # All hidden states from WavLM; inputs are fixed-length chunks.
    all_hs, all_hs_len = ssl_model(inputs, torch.LongTensor([opt.chunk_size] * opt.batch_size))

# Weighted sum of the hidden layers -> (batch, time, dim).
hs, _ = featurizer(all_hs, all_hs_len)
hs = hs.permute(0, 2, 1)  # (batch, dim, time)
# Stretch the WavLM time axis to match the SE encoder output `out`.
hs = functional.interpolate(hs, out.size(-1), mode='linear')
# Concatenate along channels and pass to the SE decoder.
out = torch.cat([hs, out], dim=1)
~~~
...