Conversation
Can you add a result and model link to README.md?

OK. After I finish the training, I will do that.
simpleoier left a comment:

Thanks! I left some comments.
```diff
  batch_size: 16
  iterator_type: chunk
- chunk_length: 24000
+ chunk_length: 48000
```

Do you think it may be useful to mention the sampling rate used for this parameter?

Yes, that's true. I think I'd better mention the sample rate in the file name.
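Since `chunk_length` is given in samples, the same value corresponds to different durations at different sampling rates. A tiny hypothetical helper (not part of the ESPnet recipe) makes the arithmetic explicit:

```python
# Hypothetical helper (not part of the ESPnet recipe): converts a chunk
# duration in seconds to a chunk length in samples for a given rate.
def chunk_length_samples(duration_sec: float, sample_rate: int) -> int:
    return int(duration_sec * sample_rate)

# 3 s of audio is 24000 samples at 8 kHz but 48000 samples at 16 kHz,
# which is why encoding the rate in the config file name avoids ambiguity.
print(chunk_length_samples(3, 8000))   # 24000
print(chunk_length_samples(3, 16000))  # 48000
```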
```diff
  train_spk2enroll: data/train-100/spk2enroll.json
- enroll_segment: 24000
+ enroll_segment: 48000
```
```diff
  channel: 256
- kernel_size: 16
+ kernel_size: 32
  stride: 8
```
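For context on why `kernel_size` scales with the sampling rate: the encoder window spans `kernel_size / sample_rate` seconds, so doubling both keeps the analysis-window duration fixed. A small sketch with hypothetical helper names (not ESPnet code), assuming a no-padding 1-D conv encoder:

```python
# Hypothetical helpers, not ESPnet code, assuming a Conv1d encoder
# with no padding.
def window_ms(kernel_size: int, sample_rate: int) -> float:
    # Duration of one analysis window in milliseconds.
    return 1000.0 * kernel_size / sample_rate

def num_frames(num_samples: int, kernel_size: int, stride: int) -> int:
    # Number of encoder frames for a num_samples-long input.
    return (num_samples - kernel_size) // stride + 1

print(window_ms(16, 8000))   # 2.0 ms window at 8 kHz
print(window_ms(32, 16000))  # 2.0 ms window at 16 kHz
```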
```python
elif input_aux.size(-2) == 1:
    aux_feature = input_aux.moveaxis(-2, -1)
else:
    aux_feature = aux_feature.transpose(1, 2)  # B, N, L'
```

Suggested change:

```diff
- aux_feature = aux_feature.transpose(1, 2)  # B, N, L'
+ assert aux_feature.dim() == 3
+ aux_feature = aux_feature.transpose(1, 2)  # B, N, L'
```

```python
aux_feature = aux_feature.transpose(1, 2)  # B, N, L'
if self.use_spk_emb:
    # B, N, L'=1
    if input_aux.dim() == 2:
```
Is this expected to use input_aux instead of aux_feature here and after?
Oh, I think it should be aux_feature, although input_aux is usually equivalent here.
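An illustrative sketch of the shape handling under discussion (hypothetical function name and shapes, not the exact ESPnet code): every branch should end up writing to `aux_feature`, which is why referring to `input_aux` afterwards only works by coincidence in some cases.

```python
import torch

# Illustrative only (not the exact ESPnet code): normalize the auxiliary
# input to a (B, N, L') tensor regardless of how it was provided.
def normalize_aux(input_aux: torch.Tensor) -> torch.Tensor:
    if input_aux.dim() == 2:
        # (B, N) speaker embedding -> (B, N, L'=1)
        aux_feature = input_aux.unsqueeze(-1)
    elif input_aux.size(-2) == 1:
        # (B, 1, N) -> (B, N, 1)
        aux_feature = input_aux.moveaxis(-2, -1)
    else:
        # (B, L', N) -> (B, N, L')
        aux_feature = input_aux.transpose(1, 2)
    assert aux_feature.dim() == 3
    return aux_feature
```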
```python
feature = feature.transpose(1, 2)  # B, N, L
aux_feature = aux_feature.transpose(1, 2)  # B, N, L'
if self.use_spk_emb:
```
It's a bit difficult to follow in what cases use_spk_emb=True for someone who doesn't know SpeakerBeam. Can you add an introductory comment here or above?
I will add a comment above:

```python
# NOTE(wangyou): When `self.use_spk_emb` is True, `aux_feature` is assumed to be
# a speaker embedding; otherwise, it is assumed to be an enrollment audio.
if self.use_spk_emb:
    ...
```

```python
    layer_norm, bottleneck_conv1x1, temporal_conv_net, mask_conv1x1
)
if pre_mask_nonlinear == "linear":
    self.network = nn.Sequential(
```
A minor question: if the parts of self.network are used separately in forward() (e.g., bottleneck, TCN, and mask net called individually), what is the benefit of defining them in a Sequential()? I found the naming of the subnet in forward() a bit ambiguous.
This is just for backward compatibility with Conv-TasNet. Previously, the TCN implementation also used Sequential for speech separation, so the module could be used as a whole. But for TD-SpeakerBeam, we cannot use the Sequential module directly because of the input/output mismatch between sub-modules.
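The trade-off can be sketched as follows (illustrative class name, module sizes, and layer choices, not the actual TD-SpeakerBeam code): registering the sub-modules in an `nn.Sequential` preserves `network.<idx>.*` state_dict keys compatible with an older checkpoint layout, while `forward()` indexes into the container so an auxiliary input can be injected between stages.

```python
import torch
import torch.nn as nn

class MaskNetSketch(nn.Module):
    """Illustrative only: Sequential used as a named container."""

    def __init__(self):
        super().__init__()
        bottleneck_conv1x1 = nn.Conv1d(8, 4, 1)
        temporal_conv_net = nn.Conv1d(4, 4, 3, padding=1)
        mask_conv1x1 = nn.Conv1d(4, 8, 1)
        # Registering the layers in a Sequential keeps checkpoint keys
        # such as "network.0.weight" stable across model variants.
        self.network = nn.Sequential(
            bottleneck_conv1x1, temporal_conv_net, mask_conv1x1
        )

    def forward(self, x, aux):
        # Sub-modules are called individually: the auxiliary feature
        # `aux` must be fused in the middle, so calling self.network(x)
        # as a whole would not work (input/output mismatch).
        h = self.network[0](x)
        h = h * aux
        h = self.network[1](h)
        return self.network[2](h)
```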
Hmm. It seems this issue has been reported since Jan 17, but they have not fixed it yet. Probably we should use
Codecov Report

```
@@            Coverage Diff             @@
##           master    #5155      +/-   ##
==========================================
- Coverage   74.99%   74.99%   -0.01%
==========================================
  Files         618      618
  Lines       55588    55603      +15
==========================================
+ Hits        41689    41700      +11
- Misses      13899    13903       +4
```
The model has also been uploaded to HuggingFace: https://huggingface.co/espnet/Wangyou_Zhang_librimix_train_enh_tse_td_speakerbeam_raw |
This PR mainly updates the implementation of TD-SpeakerBeam for target speaker extraction: