(Invalid) Overfitting on Seed-TTS Test Set ZH subset? #9

@Plachtaa


[Important Notice] The original comment below contains inappropriate, emotionally driven, and factually incorrect accusations, and does not represent the facts. I am keeping the original text visible to take responsibility for my mistake, accept all criticism, and remind myself not to repeat such behavior. Please do NOT treat the original statements below as accurate or reliable. I have issued a formal apology to the authors and maintainers in a later comment.


The paper presents very impressive evaluation results on its selected test set, the ZH subset of the Seed-TTS Test Set. As shown in the screenshot from the paper below, in the zero-shot voice conversion task with unseen reference speakers, Mean-VC (14M) far surpasses the baseline systems, Seed-VC tiny (25M) and StreamVoice (104M), in both CER and SSIM.

[Screenshot: evaluation results table from the Mean-VC paper]

However, this result clearly contradicts the scaling behavior of zero-shot TTS/VC systems, where a larger parameter count means more timbre diversity is stored in the model as part of its learned knowledge. To dig in further, I ran an evaluation with the released checkpoint of Mean-VC, using the default inference setup (2 NFEs, 200 ms chunk size). The evaluation was conducted on both the EN and ZH subsets. For a fair comparison, I also introduced Vevo and Seed-VC small as baselines. Both baseline models are trained on the Emilia dataset, which is consistent with Mean-VC as described in its paper. In theory, then, the three models should have a similar distribution of timbre diversity and language coverage.
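For reference, here is a minimal sketch of the CER half of such an evaluation, assuming a simple JSONL manifest of converted wavs and ground-truth transcripts; Whisper and jiwer are my stand-ins for illustration, not necessarily the ASR model and scoring tools used in the paper's evaluation:

```python
# Minimal sketch of a CER evaluation loop. The manifest path, file layout,
# and the use of Whisper as the ASR model are assumptions for illustration.
import json

import jiwer
import whisper

asr = whisper.load_model("large-v3")

def average_cer(manifest_jsonl: str) -> float:
    """Average CER over (converted wav, reference transcript) pairs.

    Each line of the (hypothetical) manifest is expected to look like:
    {"converted": "out/0001.wav", "text": "ground-truth transcript"}
    """
    total, n = 0.0, 0
    with open(manifest_jsonl, encoding="utf-8") as f:
        for line in f:
            item = json.loads(line)
            # Real evaluations usually normalize punctuation and casing
            # before scoring; omitted here for brevity.
            hyp = asr.transcribe(item["converted"], language="zh")["text"]
            total += jiwer.cer(item["text"], hyp)
            n += 1
    return total / max(n, 1)

print(f"ZH subset CER: {average_cer('seedtts_zh_manifest.jsonl'):.4f}")
```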

The table screenshot below shows the results I obtained:

[Screenshot: my evaluation results on the EN and ZH subsets]

It is impressive that Mean-VC achieves a very low CER on the ZH subset (3.78) with its compact Fast U2++ content encoder. Although the WER on the EN subset (19.65) is quite high, this is understandable: the paper says Fast U2++ was trained on WenetSpeech only, so it should function well when the source speech is Chinese.

However, there is one suspicious fact regarding the SSIM metrics in the table. Since the three models should have a similar distribution of timbre diversity, why does Mean-VC achieve nearly the highest SSIM (0.73) on the ZH subset while achieving a very poor SSIM (0.40) on the EN subset? Compared to the baseline systems, Mean-VC's speaker similarity performance seems strongly biased towards the distribution of the ZH subset.
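For context on what this metric measures, here is a minimal sketch of how SSIM is commonly computed: cosine similarity between speaker embeddings of the converted output and the reference utterance. Resemblyzer is an assumed stand-in here for whatever speaker encoder the paper's evaluation actually used:

```python
# Sketch of a typical SSIM (speaker similarity) computation: cosine similarity
# between speaker embeddings of the converted output and the reference speech.
# Resemblyzer is an assumed stand-in; published evaluations often use a
# WavLM-based speaker verification model instead.
import numpy as np
from resemblyzer import VoiceEncoder, preprocess_wav

encoder = VoiceEncoder()

def speaker_similarity(converted_path: str, reference_path: str) -> float:
    """Cosine similarity in [-1, 1]; higher means closer timbre."""
    emb_c = encoder.embed_utterance(preprocess_wav(converted_path))
    emb_r = encoder.embed_utterance(preprocess_wav(reference_path))
    return float(np.dot(emb_c, emb_r) /
                 (np.linalg.norm(emb_c) * np.linalg.norm(emb_r)))

# Hypothetical file pair from the converted outputs and reference prompts.
print(speaker_similarity("out/0001.wav", "ref/0001.wav"))
```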

It is hard to believe that the Mean-VC checkpoint used in this test is a fresh one pretrained on the Emilia dataset only, as there seems to be no way to explain why its speaker distribution differs so much from the other baselines trained on the same dataset. Hence, there is a possibility that the released checkpoint has gone through some post-processing or post-training to produce a nice result on the desired test set.

I hope the authors could explain the reason behind this. Thank you 🙏
