Question in the inference #51

The inference example requires the spectrogram to be in [N,C,W] format:

spectrogram = # get your hands on a spectrogram in [N,C,W] format

Could you please explain what these three dimensions are?
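For reference, one common reading of [N,C,W] is batch size, mel channels, and time frames. This is an assumption, not something the inference example states. Under that reading, a single (80, 314) mel spectrogram would be passed as (1, 80, 314). A minimal NumPy sketch, where `to_ncw` is a hypothetical helper and not part of DiffWave:

```python
import numpy as np


def to_ncw(spec: np.ndarray) -> np.ndarray:
    """Return `spec` with a leading batch axis, i.e. shape [N, C, W].

    Assumes (not confirmed by the source) that N is the batch size,
    C the number of mel channels and W the number of time frames.
    """
    if spec.ndim == 2:
        # [C, W] -> [1, C, W]: add a batch axis of size 1.
        return spec[np.newaxis, :, :]
    if spec.ndim == 3:
        # Already batched, leave unchanged.
        return spec
    raise ValueError(f"expected a 2-D or 3-D array, got shape {spec.shape}")


spec = np.zeros((80, 314), dtype=np.float32)  # placeholder for a real mel spectrogram
batched = to_ncw(spec)
print(batched.shape)
```

With a torch tensor, the equivalent step would be `spec.unsqueeze(0)`.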

I use the code from this repo: https://github.com/CorentinJ/Real-Time-Voice-Cloning to produce the mel spectrogram, and DiffWave as the vocoder. However, the generated audio is nothing but noise. Here is my code:

import torch
import torchaudio
from diffwave.inference import predict as diffwave_predict

## Generate the mel spectrogram
specs = synthesizer.synthesize_spectrograms(texts, embeds)  # len(specs) == 1
spec = specs[0]  # numpy.ndarray, float32, shape (80, 314)
spec = torch.tensor(spec)

## Generate the waveform
diffwave_dir = "/hdd/haoran_project/diffwave-master/pretrained_models/diffwave-ljspeech-22kHz-1000578.pt"
generated_wav, sample_rate = diffwave_predict(spec, diffwave_dir, fast_sampling=True)

## Save it to disk
filename = "results/diffwave_Elon.wav"
print(generated_wav.dtype, "  ", generated_wav.shape)  # torch.float32    torch.Size([1, 87040])
torchaudio.save(filename, generated_wav.cpu(), sample_rate=sample_rate)
