Description
Hi, there, just read about your fantastic paper and tried your open source model. The results seem quite promising! Congrats!
I've seen a zero-shot VC speaker-similarity score of 0.78, which is close to or even better than fine-tuned versions of previous systems. Now I am trying to fine-tune on my own data. Based on some short experiments, fine-tuning only the DiT module does not seem sufficient to push the similarity score above 0.8.
I suspect this may be because I haven't fine-tuned the discriminator or the vocos decoder. However, it seems the discriminator weights are not public yet, is that true? Did I miss something? It would be awesome if you could share those weights.
BTW, what similarity threshold would you recommend for selecting a proper prompt utterance of the same speaker? Currently I am using 0.8. Any suggestions?
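For context, here is a minimal sketch of the prompt-selection step I have in mind: filter candidate prompt utterances by cosine similarity between speaker embeddings. The `embed()` step is hypothetical (any speaker-verification encoder would do); only the thresholding logic is shown.

```python
import numpy as np

def cosine_sim(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity between two speaker embeddings."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def select_prompts(target_emb: np.ndarray,
                   candidate_embs: list,
                   threshold: float = 0.8) -> list:
    """Return indices of candidate prompts whose embedding similarity
    to the target speaker embedding meets the threshold."""
    return [i for i, e in enumerate(candidate_embs)
            if cosine_sim(target_emb, e) >= threshold]
```

So a candidate utterance would only be used as a prompt if its speaker embedding scores at least 0.8 against the target speaker's reference embedding.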
Many thanks!