ESPnet-codec Training and Setup #5732

Merged
ftshijt merged 42 commits into espnet:codec from ftshijt:codec
Apr 19, 2024

Conversation


@ftshijt (Collaborator) commented Apr 4, 2024

What?

New ESPnet codec project

The initial PR (to the local codec branch) supports:

  • The library set up of the codec task
  • The recipe setup for mini_an4 (for debugging and future CI support) and the LibriTTS recipe
  • A base training framework with a SoundStream model
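The core of a SoundStream-style codec is an encoder → residual vector quantizer (RVQ) → decoder pipeline. As a rough illustration of the RVQ idea only, here is a minimal scalar sketch: each stage quantizes the residual left over by the previous stage. Real implementations use learned codebooks of embedding vectors; the function names and codebooks below are hypothetical, not ESPnet's API.

```python
# Minimal scalar sketch of residual vector quantization (RVQ).
# Illustrative only: real codecs quantize embedding vectors with
# learned codebooks, not scalars with hand-picked ones.

def rvq_encode(x, codebooks):
    """Quantize x in stages; each codebook encodes the residual left
    over by the previous stage. Returns the chosen code indices."""
    indices = []
    residual = x
    for codebook in codebooks:
        # Pick the codeword closest to the current residual.
        idx = min(range(len(codebook)), key=lambda i: abs(codebook[i] - residual))
        indices.append(idx)
        residual -= codebook[idx]
    return indices

def rvq_decode(indices, codebooks):
    """Reconstruct by summing the selected codeword from each stage."""
    return sum(cb[i] for cb, i in zip(codebooks, indices))

codebooks = [
    [-1.0, 0.0, 1.0],                  # coarse stage
    [-0.5, -0.25, 0.0, 0.25, 0.5],     # finer stage
    [-0.2, 0.0, 0.2],                  # finest stage
]
codes = rvq_encode(0.8, codebooks)     # -> [2, 1, 1]
approx = rvq_decode(codes, codebooks)  # -> 0.75
```

Each extra stage refines the approximation, which is what lets such codecs trade bitrate (number of stages used) against reconstruction quality.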

References

https://github.com/alibaba-damo-academy/FunCodec/tree/master
https://github.com/facebookresearch/audiocraft
https://github.com/facebookresearch/encodec
https://github.com/yangdongchao/AcademiCodec

TODO in the following PRs:

  • Decoding
  • Evaluation
  • Deployment
  • Add additional models


ftshijt commented Apr 4, 2024

@jctian98 @Jungjee @wyh2000 Hi guys, may I request your review of this PR? Since it involves a few framework-level design choices, it would be difficult to change them at a later stage.

@Jungjee (Contributor) left a comment


Thanks @ftshijt for your cool work!
LGTM in general.

Two comments from me (maybe close to questions):

  1. Having a gan_codec task sounds like there will be other *_codec tasks upcoming. How many tasks do you expect? Is the split GAN vs. non-GAN?
  2. Could we have a folder where we could see all the modules used for the codec tasks (gan_codec and others)? I initially suggested espnet2/codec/layer, but after reviewing everything, maybe we could put all fundamental (reusable) modules into espnet2/codec/shared? The idea comes from my assumption that some modules could be used across different *_codec tasks. But maybe this is not the optimal choice.

Comment on lines +131 to +133

```yaml
##########################################################
#                 OTHER TRAINING SETTING                 #
##########################################################
```
Contributor:

[question] is this commenting style a common thing in ESPnet? I just got curious.

Collaborator Author (ftshijt):

It is applied to TTS-related config, and I found it very clear so I tried to follow it.

(I have always intended to make the style of docstrings and configs in ESPnet consistent but have not found the time yet...)

```python
    return x[..., padding_left:end]


class NormConvTranspose1d(nn.Module):
```
Contributor:

Maybe these layers could be gathered into a separate folder, such as espnet2/codec/layers?

Collaborator Author (ftshijt):

Thanks for the suggestion! I generally agree with your point about better organizing the modules, but I'm not sure these particular layers should be separated out, mostly because they do not go beyond the original torch functions; they are just wrappers for the seanet module itself.

Collaborator Author (ftshijt):

Please let me know if you still think layers would be better in this case~

```python
from einops import rearrange


class ModReLU(nn.Module):
```
Contributor:

Ditto. To me, this layer would be easier for others to find if it were in a layers folder.

Collaborator Author (ftshijt):

Similar to the above, I expect these modules to be tightly coupled to the major component (the STFT discriminator). In that case, I'm leaning toward keeping them here instead of creating another set of layers.

If we do separate them, do you have suggestions on how the layers folder should be organized? It would be great if you could share a bit more of your thoughts here so that I can understand how it could be put into better shape. Again, I appreciate your advice!


ftshijt commented Apr 12, 2024

> Thanks @ftshijt for your cool work! LGTM in general.
>
> Two comments from me (maybe close to questions):
>
>   1. Having a gan_codec task sounds like there will be other *_codec tasks upcoming. How many tasks do you expect? Is the split GAN vs. non-GAN?
>   2. Could we have a folder where we could see all the modules used for the codec tasks (gan_codec and others)? I initially suggested espnet2/codec/layer, but after reviewing everything, maybe we could put all fundamental (reusable) modules into espnet2/codec/shared? The idea comes from my assumption that some modules could be used across different *_codec tasks. But maybe this is not the optimal choice.

For 1, I'm following the convention from tts and svs. The name is meant to indicate that the task is trained with the gan_trainer instead of the existing normal trainer used in many other tasks. I think we will only support the gan_codec task for codec training for now.

For 2, yeah, I tried to put some shared modules into the shared folder.
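What distinguishes the gan_trainer from the normal trainer is the alternating two-optimizer loop. As a rough, purely numerical sketch (not ESPnet's actual trainer API; all names below are illustrative), each iteration first updates the discriminator against real data and then updates the generator against the discriminator:

```python
# Toy numeric sketch of alternating GAN updates. A "generator" scalar g
# tries to imitate real data (value 1.0), while a "discriminator" scalar d
# tracks its own estimate of what real data looks like. This only
# illustrates the two-optimizer alternation a GAN trainer runs.

def train_gan(steps, lr=0.5):
    g, d = 0.0, 0.0   # generator output level, discriminator's "real" estimate
    real = 1.0
    for _ in range(steps):
        # Discriminator step: move d toward the real-data statistic.
        d += lr * (real - d)
        # Generator step: move g toward what the discriminator deems real.
        g += lr * (d - g)
    return g, d

g, d = train_gan(20)   # both converge close to 1.0
```

In the real task both players are neural networks with their own optimizers and schedulers, which is exactly the extra machinery the gan_trainer provides over the normal trainer.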

@jctian98 (Collaborator)

Some comments after a discussion with @ftshijt. Please let me know if I'm wrong.
(1) Since DAC / SoundStream / EnCodec share a very similar structure (SEANet, quantizer, and the overall encoder-decoder-discriminator design), it might be better to have a unified file (the current soundstream.py) for all three rather than separate espnet2/gan_codec/{soundstream, encodec, dac} folders. Instead, we can use if/else to specify the distinctions among these models.
(2) We can also make the losses categorized and more configurable. Specifically, the losses can be categorized (my current thinking) into adversarial loss / reconstruction loss / a loss that measures the similarity between generator (decoder) outputs and real audio in the discriminator's feature space.
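The categorized-loss proposal in (2) could look something like the sketch below: named loss categories combined with configurable weights, with unknown names rejected so config typos fail loudly. The function name, category names, and weights are hypothetical, not ESPnet's actual config schema.

```python
# Sketch of configurable, categorized codec losses: adversarial,
# reconstruction, and a discriminator-feature similarity term, combined
# as a weighted sum. Names and weights are illustrative only.

def combine_codec_losses(losses, weights):
    """Weighted sum over named loss categories; reject unknown names."""
    total = 0.0
    for name, value in losses.items():
        if name not in weights:
            raise KeyError(f"unconfigured loss category: {name}")
        total += weights[name] * value
    return total

weights = {"adversarial": 1.0, "reconstruction": 45.0, "feature_matching": 2.0}
losses = {"adversarial": 0.3, "reconstruction": 0.02, "feature_matching": 0.1}
total = combine_codec_losses(losses, weights)   # -> 1.4
```

Keeping the weights in the training config, rather than hard-coded per model, would let the same categories cover SoundStream-like and DAC-like variants.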


wyh2000 commented Apr 17, 2024

> Some comments after a discussion with @ftshijt. Please let me know if I'm wrong. (1) Since DAC / SoundStream / EnCodec share a very similar structure (SEANet, quantizer, and the overall encoder-decoder-discriminator design), it might be better to have a unified file (the current soundstream.py) for all three rather than separate espnet2/gan_codec/{soundstream, encodec, dac} folders. Instead, we can use if/else to specify the distinctions among these models. (2) We can also make the losses categorized and more configurable. Specifically, the losses can be categorized (my current thinking) into adversarial loss / reconstruction loss / a loss that measures the similarity between generator (decoder) outputs and real audio in the discriminator's feature space.

My concern is that DAC actually modifies some loss and quantizer details from SoundStream. It might be clearer for DAC to have a separate dac.py file.


ftshijt commented Apr 19, 2024

Considering the following PRs (decoding/evaluation), I will merge the current one first. Let's keep discussing better code organization.

@ftshijt added the auto-merge (Enable auto-merge) label Apr 19, 2024
@ftshijt merged commit 755c4d7 into espnet:codec Apr 19, 2024
4 participants