Update TD-SpeakerBeam #5155

Merged
mergify[bot] merged 5 commits into espnet:master from Emrys365:tse on May 15, 2023

Emrys365 (Collaborator) commented May 3, 2023

This PR mainly updates the implementation of TD-SpeakerBeam for target speaker extraction:

  1. It now also supports the speaker embedding as an auxiliary input.
  2. A pre-mask activation is added by default to make the training easier.

sw005320 (Contributor) commented May 3, 2023

Can you add a result and model link to README.md?

sw005320 requested a review from simpleoier May 3, 2023 23:51
Emrys365 (Collaborator, Author) commented May 4, 2023

OK. After I finish the training, I will do that.

simpleoier (Collaborator) left a comment

Thanks! I left some comments.

  batch_size: 16
  iterator_type: chunk
- chunk_length: 24000
+ chunk_length: 48000

Collaborator

Do you think it may be useful to mention the sampling rate used for this parameter?

Collaborator Author

Yes, this is true. I think I'd better mention the sample rate in the file name.
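
To make the sampling-rate point concrete, here is a minimal sketch of how a length given in samples maps to a duration. The helper `chunk_seconds` is hypothetical, and the sampling rates (8 kHz and 16 kHz) are illustrative assumptions; the thread does not state which rate each config targets.

```python
def chunk_seconds(num_samples: int, sample_rate_hz: int) -> float:
    """Convert a length given in samples to a duration in seconds."""
    return num_samples / sample_rate_hz


# The sampling rates below are illustrative assumptions, not values from the PR.
for num_samples, sample_rate_hz in [(24000, 8000), (48000, 16000), (48000, 8000)]:
    seconds = chunk_seconds(num_samples, sample_rate_hz)
    print(f"{num_samples} samples at {sample_rate_hz} Hz = {seconds:.1f} s")
```

The same sample count covers a different duration at a different rate, which is why tying the sampling rate to the config (for example in its file name) removes the ambiguity.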


  train_spk2enroll: data/train-100/spk2enroll.json
- enroll_segment: 24000
+ enroll_segment: 48000

Collaborator

ditto.

  channel: 256
- kernel_size: 16
- stride: 8
+ kernel_size: 32

Collaborator

ditto.

elif input_aux.size(-2) == 1:
aux_feature = input_aux.moveaxis(-2, -1)
else:
aux_feature = aux_feature.transpose(1, 2) # B, N, L'

Collaborator

Suggested change
- aux_feature = aux_feature.transpose(1, 2) # B, N, L'
+ assert aux_feature.dim() == 3
+ aux_feature = aux_feature.transpose(1, 2) # B, N, L'

aux_feature = aux_feature.transpose(1, 2) # B, N, L'
if self.use_spk_emb:
    # B, N, L'=1
    if input_aux.dim() == 2:

Collaborator

Is this expected to use input_aux instead of aux_feature here and after?

Collaborator Author

Oh I think it should be aux_feature, although usually input_aux is equivalent here.


feature = feature.transpose(1, 2) # B, N, L
aux_feature = aux_feature.transpose(1, 2) # B, N, L'
if self.use_spk_emb:

Collaborator

It's a bit difficult to understand in which cases use_spk_emb=True without knowing SpeakerBeam. Can you add an introductory comment here or above?

Collaborator Author

I will add a comment above:

# NOTE(wangyou): When `self.use_spk_emb` is True, `aux_feature` is assumed to be
# a speaker embedding; otherwise, it is assumed to be an enrollment audio.
if self.use_spk_emb:
    ...
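
As a rough illustration of the two cases named in this NOTE, the sketch below brings an auxiliary input into the (B, N, L') layout. The function name `to_bnl` and the assumed input layouts are hypothetical and simplified; this is not the code merged in this PR.

```python
import torch


def to_bnl(aux: torch.Tensor, use_spk_emb: bool) -> torch.Tensor:
    """Return the auxiliary feature laid out as (B, N, L')."""
    if use_spk_emb:
        # Speaker embedding: one vector per utterance, so L' = 1.
        if aux.dim() == 2:  # assumed layout (B, N)
            return aux.unsqueeze(-1)
        if aux.dim() == 3 and aux.size(-2) == 1:  # assumed layout (B, 1, N)
            return aux.transpose(1, 2)
        raise ValueError(f"Unexpected speaker embedding shape: {tuple(aux.shape)}")
    # Enrollment audio features: assumed layout (B, L', N).
    assert aux.dim() == 3, aux.shape
    return aux.transpose(1, 2)


print(to_bnl(torch.zeros(4, 256), use_spk_emb=True).shape)
print(to_bnl(torch.zeros(4, 100, 256), use_spk_emb=False).shape)
```

In both cases the result is a 3-D tensor with the feature dimension N in the middle; the embedding case simply has a time axis of length 1.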

    layer_norm, bottleneck_conv1x1, temporal_conv_net, mask_conv1x1
)
if pre_mask_nonlinear == "linear":
    self.network = nn.Sequential(

Collaborator

A minor question: if the sub-modules of self.network are called separately in forward() (e.g. bottleneck, TCN, and mask net individually), what is the benefit of defining them in a Sequential()? I find the naming of the sub-networks in forward() a bit ambiguous.

Collaborator Author

This is just for backward compatibility with Conv-TasNet: the previous TCN implementation also used Sequential for speech separation, where it can be called as a whole. For TD-SpeakerBeam, however, we cannot call the Sequential module directly because of the input/output mismatch between sub-modules.
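
The pattern discussed here can be sketched roughly as follows: sub-modules are stored in an `nn.Sequential` (so the container and its parameter names stay the same) but are indexed and called one at a time in `forward()` so that a speaker vector can be injected in between. The class `TinyExtractor`, its layer sizes, the single convolution standing in for the TCN, and the multiplicative conditioning are all simplifying assumptions for illustration, not the ESPnet implementation.

```python
import torch
from torch import nn


class TinyExtractor(nn.Module):
    """Toy extractor; layer choices are illustrative stand-ins only."""

    def __init__(self, n_feat: int = 8, n_bottleneck: int = 4):
        super().__init__()
        self.network = nn.Sequential(
            nn.GroupNorm(1, n_feat),                              # layer norm
            nn.Conv1d(n_feat, n_bottleneck, 1),                   # bottleneck 1x1 conv
            nn.Conv1d(n_bottleneck, n_bottleneck, 3, padding=1),  # stand-in for the TCN
            nn.Conv1d(n_bottleneck, n_feat, 1),                   # mask 1x1 conv
        )

    def forward(self, feature: torch.Tensor, spk_vec: torch.Tensor) -> torch.Tensor:
        # nn.Sequential supports integer indexing, so each stage can be called alone.
        layer_norm, bottleneck, tcn, mask_conv = (self.network[i] for i in range(4))
        x = bottleneck(layer_norm(feature))  # (B, n_bottleneck, L)
        x = x * spk_vec                      # inject speaker information (assumed form)
        return torch.relu(mask_conv(tcn(x)))  # (B, n_feat, L)


model = TinyExtractor()
mask = model(torch.randn(2, 8, 50), torch.ones(2, 4, 1))
print(mask.shape)
```

In this toy version, calling `model.network(feature)` as a whole would still run, but it would skip the speaker conditioning step, which is the kind of mismatch that motivates calling the stages individually.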

simpleoier (Collaborator) left a comment

LGTM!

sw005320 (Contributor) commented May 4, 2023

Hmm
https://github.com/espnet/espnet/actions/runs/4879274957/jobs/8705692129?pr=5155#step:8:35
Not sure. Should we wait for the fairseq to fix this?

Emrys365 (Collaborator, Author) commented May 4, 2023

Hmm https://github.com/espnet/espnet/actions/runs/4879274957/jobs/8705692129?pr=5155#step:8:35 Not sure. Should we wait for the fairseq to fix this?

It seems this issue has been reported since Jan 17, but they have not fixed it yet.

Probably we should use numpy<=1.23.3 before fairseq is updated.

codecov bot commented May 11, 2023

Codecov Report

Merging #5155 (a93775c) into master (84f3bde) will decrease coverage by 0.01%.
The diff coverage is 80.95%.

@@            Coverage Diff             @@
##           master    #5155      +/-   ##
==========================================
- Coverage   74.99%   74.99%   -0.01%     
==========================================
  Files         618      618              
  Lines       55588    55603      +15     
==========================================
+ Hits        41689    41700      +11     
- Misses      13899    13903       +4     
Flag                        Coverage Δ
test_integration_espnet1    66.28% <ø> (ø)
test_integration_espnet2    47.60% <61.90%> (-0.01%) ⬇️
test_python                 65.45% <80.95%> (+<0.01%) ⬆️
test_utils                  23.28% <ø> (ø)

Impacted Files                                        Coverage Δ
espnet2/enh/extractor/td_speakerbeam_extractor.py     90.24% <73.33%> (-9.76%) ⬇️
espnet2/enh/layers/tcn.py                             95.63% <100.00%> (+0.06%) ⬆️

sw005320 added the auto-merge (Enable auto-merge) label May 15, 2023
mergify bot merged commit 6e35c14 into espnet:master May 15, 2023
Emrys365 (Collaborator, Author) commented Jun 5, 2023

The model has also been uploaded to HuggingFace: https://huggingface.co/espnet/Wangyou_Zhang_librimix_train_enh_tse_td_speakerbeam_raw

Labels

auto-merge (Enable auto-merge), ESPnet2, Recipe, SE (Speech enhancement)
