Text To Speech Overview - Speech Service

What is text to speech?

Article • 07/15/2024

In this overview, you learn about the benefits and capabilities of the text to speech
feature of the Speech service, which is part of Azure AI services.

Text to speech enables your applications, tools, or devices to convert text into
humanlike synthesized speech. The text to speech capability is also known as speech synthesis.
Use humanlike prebuilt neural voices out of the box, or create a custom neural voice
that's unique to your product or brand. For a full list of supported voices, languages,
and locales, see Language and voice support for the Speech service.

Core features
Text to speech includes the following features:

Feature: Prebuilt neural voice (called Neural on the pricing page)
Summary: Highly natural out-of-the-box voices. Create an Azure account and Speech
service subscription, and then use the Speech SDK or visit the Speech Studio portal and
select prebuilt neural voices to get started. Check the pricing details.
Demo: Check the Voice Gallery and determine the right voice for your business needs.

Feature: Custom neural voice (called Custom Neural on the pricing page)
Summary: Easy-to-use self-service for creating a natural brand voice, with limited
access for responsible use. Create an Azure account and Speech service subscription
(with the S0 tier), and apply to use the custom neural feature. After you're granted
access, visit the Speech Studio portal and select Custom voice to get started. Check
the pricing details.
Demo: Check the voice samples.

More about neural text to speech features

Text to speech uses deep neural networks to make the voices of computers nearly
indistinguishable from the recordings of people. With the clear articulation of words,
neural text to speech significantly reduces listening fatigue when users interact with AI
systems.

The patterns of stress and intonation in spoken language are called prosody. Traditional
text to speech systems break down prosody into separate linguistic analysis and
acoustic prediction steps governed by independent models. That can result in muffled,
buzzy voice synthesis.

Here's more information about neural text to speech features in the Speech service, and
how they overcome the limits of traditional text to speech systems:

Real-time speech synthesis: Use the Speech SDK or REST API to convert text to
speech by using prebuilt neural voices or custom neural voices.

Asynchronous synthesis of long audio: Use the batch synthesis API (Preview) to
asynchronously synthesize text to speech files longer than 10 minutes (for
example, audio books or lectures). Unlike synthesis performed via the Speech SDK
or the text to speech REST API, responses aren't returned in real time. Instead,
you send requests asynchronously, poll for responses, and download the synthesized
audio when the service makes it available.

Prebuilt neural voices: Microsoft neural text to speech capability uses deep neural
networks to overcome the limits of traditional speech synthesis regarding stress
and intonation in spoken language. Prosody prediction and voice synthesis
happen simultaneously, which results in more fluid and natural-sounding outputs.
Each prebuilt neural voice model is available at 24 kHz and high-fidelity 48 kHz.
You can use neural voices to:
Make interactions with chatbots and voice assistants more natural and
engaging.
Convert digital texts such as e-books into audiobooks.
Enhance in-car navigation systems.

For a full list of platform neural voices, see Language and voice support for the
Speech service.

Fine-tuning text to speech output with SSML: Speech Synthesis Markup Language
(SSML) is an XML-based markup language used to customize text to
speech outputs. With SSML, you can adjust pitch, add pauses, improve
pronunciation, change speaking rate, adjust volume, and attribute multiple voices
to a single document.

You can use SSML to define your own lexicons or switch to different speaking
styles. With the multilingual voices, you can also adjust the speaking languages
via SSML. To fine-tune the voice output for your scenario, see Improve synthesis
with Speech Synthesis Markup Language and Speech synthesis with the Audio
Content Creation tool.

Visemes: Visemes are the key poses in observed speech, including the position of
the lips, jaw, and tongue in producing a particular phoneme. Visemes have a
strong correlation with voices and phonemes.

By using viseme events in the Speech SDK, you can generate facial animation data.
This data can be used to animate faces in lip-reading communication, education,
entertainment, and customer service. Viseme is currently supported only for the
en-US (US English) neural voices.
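The SSML adjustments described in the list above (speaking rate, pitch, and pauses) can be illustrated with a small sketch that builds an SSML document using only the Python standard library. The helper name `build_ssml` and its parameters are hypothetical, and the voice name `en-US-JennyNeural` is used only as an example value; check the language and voice support list for the voices available to you. This is a minimal illustration, not the Speech SDK's own API.

```python
import xml.etree.ElementTree as ET

# Namespace defined by the W3C SSML specification.
SSML_NS = "http://www.w3.org/2001/10/synthesis"


def build_ssml(text, voice, rate="medium", pitch="medium", pause_ms=0, lang="en-US"):
    """Return an SSML string that speaks `text` with one voice.

    `build_ssml` is a hypothetical helper written for this example.
    """
    speak = ET.Element(
        "speak", {"version": "1.0", "xmlns": SSML_NS, "xml:lang": lang}
    )
    voice_el = ET.SubElement(speak, "voice", {"name": voice})

    # <prosody> adjusts speaking rate and pitch for the enclosed text.
    prosody = ET.SubElement(voice_el, "prosody", {"rate": rate, "pitch": pitch})
    prosody.text = text

    # <break> adds a pause after the spoken text.
    if pause_ms > 0:
        ET.SubElement(voice_el, "break", {"time": f"{pause_ms}ms"})

    # ElementTree escapes special characters such as & and < in the text.
    return ET.tostring(speak, encoding="unicode")


if __name__ == "__main__":
    print(build_ssml("Hello & welcome", "en-US-JennyNeural", rate="slow", pause_ms=500))
```

The resulting string could then be passed to whichever synthesis entry point you use (Speech SDK, REST API, or Speech CLI); that step is omitted here because it requires a Speech resource key.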

Note

We plan to retire the traditional/standard voices and non-neural custom voice in
2024. After that, we'll no longer support them.

If your applications, tools, or products are using any of the standard voices and
custom voices, you must migrate to the neural version. For more information, see
Migrate to neural voices.

Get started
To get started with text to speech, see the quickstart. Text to speech is available via the
Speech SDK, the REST API, and the Speech CLI.

Tip

To convert text to speech with a no-code approach, try the Audio Content Creation
tool in Speech Studio.

Sample code
Sample code for text to speech is available on GitHub. These samples cover text to
speech conversion in most popular programming languages:

Text to speech samples (SDK)

Text to speech samples (REST)

Custom neural voice

In addition to prebuilt neural voices, you can create and fine-tune custom neural voices
that are unique to your product or brand. All it takes to get started is a handful of audio
files and the associated transcriptions. For more information, see Get started with
custom neural voice.

Pricing note

Billable characters
When you use the text to speech feature, you're billed for each character that's
converted to speech, including punctuation. Although the SSML document itself isn't
billable, optional elements that are used to adjust how the text is converted to speech,
like phonemes and pitch, are counted as billable characters. Here's a list of what's
billable:

Text passed to the text to speech feature in the SSML body of the request
All markup within the text field of the request body in the SSML format, except for
<speak> and <voice> tags

Letters, punctuation, spaces, tabs, markup, and all white-space characters

Every code point defined in Unicode

For detailed information, see Speech service pricing.

Important

Each Chinese character is counted as two characters for billing, including kanji used
in Japanese, hanja used in Korean, or hanzi used in other languages.
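As a rough illustration of the counting rules above, the following sketch estimates the billable characters of an SSML request. It is an approximation under stated assumptions, not the service's actual metering logic: the function name `estimate_billable_characters` is hypothetical, only the `<speak>` and `<voice>` tags themselves are excluded, and "Chinese character" is approximated by the basic CJK Unified Ideographs block (U+4E00 to U+9FFF).

```python
import re

# Assumption: only the <speak> and <voice> tags (opening and closing) are free.
_UNBILLED_TAGS = re.compile(r"</?(?:speak|voice)\b[^>]*>")


def is_han(ch):
    """Approximate check for a Chinese character (basic CJK block only)."""
    return "\u4e00" <= ch <= "\u9fff"


def estimate_billable_characters(ssml):
    """Estimate billable characters for an SSML string.

    Everything that remains after removing <speak> and <voice> tags is
    counted, including other markup, punctuation, and white space.
    Each Han character counts as two.
    """
    billable = _UNBILLED_TAGS.sub("", ssml)
    return sum(2 if is_han(ch) else 1 for ch in billable)


if __name__ == "__main__":
    sample = '<speak version="1.0"><voice name="example">Hi!</voice></speak>'
    print(estimate_billable_characters(sample))
```

For the authoritative count, rely on the usage reported for your Speech resource and on the Speech service pricing page.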

Model training and hosting time for custom neural voice

Custom neural voice training and hosting are both calculated by hour and billed per
second. For the billing unit price, see Speech service pricing.

Custom neural voice (CNV) training time is measured in compute hours (a unit that
measures machine running time). Typically, two computing tasks run in parallel when a
voice model is trained, so the calculated compute hours are longer than the actual
training time. On average, it takes less than one compute hour to train a CNV Lite
voice. For CNV Pro, it usually takes 20 to 40 compute hours to train a single-style
voice, and around 90 compute hours to train a multi-style voice. CNV training time
is billed with a cap of 96 compute hours. For example, if a voice model is trained in
98 compute hours, you're charged for only 96 compute hours.
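The training cap amounts to taking the smaller of the measured compute hours and 96. A minimal sketch, where the function name `billed_training_hours` is hypothetical and the result is in compute hours rather than a currency amount:

```python
# Cap stated in the text above: training is billed for at most 96 compute hours.
TRAINING_CAP_HOURS = 96.0


def billed_training_hours(compute_hours):
    """Return the compute hours that are billed for one training run."""
    if compute_hours < 0:
        raise ValueError("compute_hours must not be negative")
    return min(compute_hours, TRAINING_CAP_HOURS)


if __name__ == "__main__":
    print(billed_training_hours(98))  # capped at 96 compute hours
    print(billed_training_hours(30))  # below the cap, billed as measured
```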

Custom neural voice (CNV) endpoint hosting is measured in actual elapsed hours. The
hosting time for each endpoint is calculated at 00:00 UTC every day for the
previous 24 hours. For example, if the endpoint has been active for 24 hours on day
one, it's billed for 24 hours at 00:00 UTC on day two. If the endpoint is newly created
or suspended during the day, it's billed for its accumulated running time until 00:00 UTC
on the following day. If the endpoint isn't currently hosted, it isn't billed. In addition
to the daily calculation at 00:00 UTC, billing is also triggered immediately when
an endpoint is deleted or suspended. For example, for an endpoint created at 08:00 UTC
on December 1, the hosting time is calculated as 16 hours at 00:00 UTC on
December 2 and 24 hours at 00:00 UTC on December 3. If the user suspends hosting the
endpoint at 16:30 UTC on December 3, the duration from 00:00 to 16:30 UTC on
December 3 (16.5 hours) is calculated for billing.
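The daily calculation in the example above can be reproduced with a short sketch that splits an endpoint's running period at each 00:00 UTC boundary. The function name `daily_hosting_hours` is hypothetical, the year 2024 is an arbitrary choice because the example gives no year, and the code models only the arithmetic described here, not the service's billing system.

```python
from datetime import datetime, timedelta, timezone


def daily_hosting_hours(start, end):
    """Split the hosting period [start, end) at each 00:00 UTC boundary.

    Both arguments must be timezone-aware datetimes. Returns a list of
    (calculation_time, hours) pairs: one per midnight crossed, plus a final
    entry for the moment the endpoint is suspended or deleted.
    """
    start = start.astimezone(timezone.utc)
    end = end.astimezone(timezone.utc)
    if end < start:
        raise ValueError("end must not be earlier than start")

    charges = []
    cursor = start
    while cursor < end:
        midnight = datetime(cursor.year, cursor.month, cursor.day, tzinfo=timezone.utc)
        next_midnight = midnight + timedelta(days=1)
        segment_end = min(next_midnight, end)
        hours = (segment_end - cursor).total_seconds() / 3600
        charges.append((segment_end, hours))
        cursor = segment_end
    return charges


if __name__ == "__main__":
    created = datetime(2024, 12, 1, 8, 0, tzinfo=timezone.utc)
    suspended = datetime(2024, 12, 3, 16, 30, tzinfo=timezone.utc)
    for when, hours in daily_hosting_hours(created, suspended):
        print(when.isoformat(), hours)
```

With the dates from the example, the three entries are 16 hours (calculated at 00:00 UTC on December 2), 24 hours (00:00 UTC on December 3), and 16.5 hours (at suspension, 16:30 UTC on December 3).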

Personal voice
When you use the personal voice feature, you're billed for both profile storage and
synthesis.

Profile storage: After a personal voice profile is created, it's billed until it's
removed from the system. The billing unit is per voice per day. If voice storage
lasts less than 24 hours, it's billed as one full day.

Synthesis: Billed per character. For details, see the Billable characters section
above.

Text to speech avatar

When you use the text to speech avatar feature, charges are based on the
length of the video output and are billed per second. For the real-time avatar, however,
charges are based on the time the avatar is active, regardless of whether it's
speaking or silent, and are also billed per second. To optimize costs for
real-time avatar usage, refer to the tips provided in the sample code (search "Use
Local Video for Idle"). Avatar hosting is billed per second per endpoint. You can suspend
your endpoint to save costs. To suspend an endpoint, delete it directly. To use it
again, redeploy the endpoint.

Reference docs
Speech SDK
REST API: Text to speech

Responsible AI
An AI system includes not only the technology, but also the people who use it, the
people who are affected by it, and the environment in which it's deployed. Read the
transparency notes to learn about responsible AI use and deployment in your systems.

Transparency note and use cases for custom neural voice

Characteristics and limitations for using custom neural voice
Limited access to custom neural voice
Guidelines for responsible deployment of synthetic voice technology
Disclosure for voice talent
Disclosure design guidelines
Disclosure design patterns
Code of Conduct for Text to speech integrations
Data, privacy, and security for custom neural voice

Next steps
Text to speech quickstart
Get the Speech SDK
