
Performance mismatch between the training and inference phases (Downstream task: Speech Enhancement) #552

@seorim0

Description

Hello! Thank you for your excellent research.

I designed a structure in which the output of the WavLM-Large model is combined with the encoder output of a speech enhancement (SE) model and passed through the SE model's decoder. During training I used randomly cropped 2-second speech segments, and the model performed very well. However, when I ran inference on real speech of various lengths, performance degraded dramatically. When I instead cut the input into 2-second chunks at inference time, the model worked properly and again showed good performance.

I greatly appreciate your time and expertise in helping me understand this issue.
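
For reference, this is a minimal sketch of the 2-second chunked inference that works for me; `enhance_chunk` is a hypothetical stand-in for the full forward pass shown further below:

~~~python
import torch
import torch.nn.functional as F

CHUNK = 2 * 16000  # 2-second training crop length at 16 kHz

def enhance_long(wave: torch.Tensor, enhance_chunk) -> torch.Tensor:
    """Enhance a (1, samples) waveform in fixed 2-second windows."""
    outs = []
    for start in range(0, wave.size(-1), CHUNK):
        seg = wave[..., start:start + CHUNK]
        orig_len = seg.size(-1)
        if orig_len < CHUNK:
            # Zero-pad the last window up to the training length,
            # then trim the enhanced output back to the true length
            seg = F.pad(seg, (0, CHUNK - orig_len))
        outs.append(enhance_chunk(seg)[..., :orig_len])
    return torch.cat(outs, dim=-1)
~~~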

  • The SE model uses a 512-sample window, a 256-sample hop, and a 512-point FFT. For the same input speech, WavLM produces fewer output time frames than the SE encoder (WavLM's convolutional front end strides 320 samples at 16 kHz, i.e., 50 frames/s, versus the SE model's 256-sample hop). To align them, the WavLM features are linearly interpolated along the time axis to match the SE frame count; a rough frame-count comparison is sketched after the code below.

  • Code:

~~~python
import torch
from torch.nn import functional
from s3prl.nn import S3PRLUpstream, Featurizer

# Frozen WavLM-Large upstream plus a learnable weighted-sum featurizer
ssl_model = S3PRLUpstream("wavlm_large")
featurizer = Featurizer(ssl_model)

...

with torch.no_grad():
    # inputs: (batch, samples); the length tensor assumes every utterance
    # is exactly opt.chunk_size samples (true for the 2-second training crops)
    all_hs, all_hs_len = ssl_model(
        inputs, torch.LongTensor([opt.chunk_size] * opt.batch_size)
    )

# Weighted sum over the hidden states of all layers -> (batch, frames, dim)
hs, _ = featurizer(all_hs, all_hs_len)

# (batch, frames, dim) -> (batch, dim, frames) for 1-D interpolation
hs = hs.permute(0, 2, 1)
# Stretch the WavLM frame axis to match the SE encoder output `out`
hs = functional.interpolate(hs, size=out.size(-1), mode='linear')

# Concatenate SSL features with the SE encoder output along the channel axis
out = torch.cat([hs, out], dim=1)

...
~~~
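
To make the frame-rate mismatch concrete, here is a rough frame-count comparison (assuming 16 kHz audio, a centered STFT for the SE model, and WavLM's 320-sample front-end stride; exact WavLM counts may differ by a frame or two at the boundaries):

~~~python
def se_frames(num_samples: int, hop: int = 256) -> int:
    # Centered STFT: one frame per hop, plus one
    return num_samples // hop + 1

def wavlm_frames(num_samples: int, stride: int = 320) -> int:
    # ~50 frames/s; the conv receptive field trims a frame or two
    return num_samples // stride

for seconds in (2, 5, 10):
    n = seconds * 16000
    print(f"{seconds:>2} s: SE={se_frames(n)}  WavLM~={wavlm_frames(n)}")
# 2 s: SE=126  WavLM~=100
# 5 s: SE=313  WavLM~=250
# 10 s: SE=626  WavLM~=500
~~~

The ratio stays near 1.25 regardless of input length, so the interpolation factor itself is essentially the same at training and inference time.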
