A Deep Dive into Amazon Polly

In a previous Medium article, I discussed using Amazon Polly as a method of converting my print articles to audio. While I presented a Lambda script to take a file from S3 and generate an MP3 audio file, I didn’t discuss the why and how of Amazon Polly.

Polly is a text to speech platform capable of handling both plain text and Speech Synthesis Markup Language (SSML). In a sense, Amazon Polly and Amazon Transcribe are opposite services: Polly provides text to speech, while Transcribe provides speech to text. Since Polly takes text and converts it to audio, we might think of Amazon Polly as a media service. Both Amazon Polly and Amazon Transcribe are part of the AI/ML service family provided by AWS.

Polly creates very lifelike audio from the supplied text in a variety of voices and dialects. When using plain text with Polly, the punctuation in the text is used as cues for pauses and breaks, and even then Polly can generate pretty realistic audio. Much more granular control over breaks, intonation, and speed is available using SSML.

Text can be converted using Amazon Polly in the AWS Console, through the AWS Command Line Interface (CLI), or from your application code via an API provided by the AWS Software Development Kit. Let’s have a look at creating a simple audio file using Amazon Polly in the AWS Console.


We can type our text in the text box, select the engine (Standard or Neural), select the voice, and then either listen to the audio immediately, download the MP3 file, or save the file to an S3 bucket.

This is what our audio sounds like.


Once the text has been converted, it can be used over and over again in applications or served from a storage location like Simple Storage Service (S3). Polly also supports real-time audio streaming, so the text sent using the API is returned immediately for use in your application.
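If you want to experiment with the real-time path, here is a minimal sketch using the AWS SDK for Python (boto3); the voice, sample text, and output file name are illustrative choices, not from the examples in this article.

import boto3

# Create a Polly client in a region where Polly is available.
polly = boto3.client("polly", region_name="us-west-2")

# Synchronous synthesis: the audio comes back directly as a byte stream.
response = polly.synthesize_speech(
    Text="Hello from Amazon Polly.",
    OutputFormat="mp3",
    VoiceId="Joanna",
)

# AudioStream is a streaming body; read it and save (or play) the MP3 bytes.
with open("hello.mp3", "wb") as f:
    f.write(response["AudioStream"].read())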

Changing How Polly Behaves

I already mentioned a couple of things we can change about how the speech synthesis is performed. We can adjust the synthesis engine, selecting either Standard, with a sample rate of 22,050 Hz (22.05 kHz), or Neural, with a sample rate of 24,000 Hz (24 kHz). The Neural option also allows using several neural engine voices, including a Newscaster reading style. Let’s hold that thought until later in this article.

We can also change the voice used in the speech synthesis task. Amazon Polly supports 61 different male and female voices, across 29 languages. You can see the full list of voices and languages here.

Amazon Polly supports MP3, OGG, and PCM audio, with sample rates of 8,000 Hz, 16,000 Hz, 22,050 Hz, and 24,000 Hz. We can also output speech marks when we download the audio file, but again, let’s cover that a little later.

Finally, we can synthesize the audio to an S3 bucket. If your application uses small audio files, you can convert and download them in the Console, so long as the volume of text is less than 3,000 characters. If you need more than 3,000 and up to 100,000 characters, the synthesis task must save the file to an S3 bucket.

Standard and Neural Engines

Creating realistic-sounding speech from text is why Amazon Polly is part of the AI/ML service family. We have to consider how we talk to each other, and how we change our speech patterns when talking to a spouse or a coworker, speaking in a public forum, or even working in specific jobs.

As AWS explains in the Amazon Polly documentation, we typically use a conversational style, which is supported by both the Standard and Neural speech engines. The Standard engine uses what is called concatenative synthesis, which takes the generated speech sounds and concatenates them to form the words and phrases.

The Neural engine not only offers a higher sample rate; it creates the speech sounds using a sequence-to-sequence model instead of concatenative synthesis. This model involves not only creating the sounds but also how the sounds fit together. The result is then processed to generate the actual speech waveforms.

Using the neural engine allows us not only to use different voices but to change the speaking style. The Conversational style is more friendly and expressive, but it is only available in American English with the Matthew and Joanna voices. Alternatively, there is the Newscaster style, which is much more formal, like what we hear on radio and television. This style is also only available in American English using the Matthew and Joanna voices.

The Neural engine is not available in all regions, and according to the documentation it does not support all of the Polly features. The Neural engine is available in Northern Virginia (us-east-1), Oregon (us-west-2), Ireland (eu-west-1), and Sydney (ap-southeast-2). Real-time and asynchronous synthesis tasks, speech marks, and the Newscaster and Conversational styles are supported, but not all of the SSML tags.

Let’s try out the difference between the Conversational and Newscaster styles. Here is the command which was executed to generate these two audio samples:

aws polly start-speech-synthesis-task \
    --engine neural \
    --region us-west-2 \
    --endpoint-url "https://polly.us-west-2.amazonaws.com/" \
    --output-format mp3 \
    --output-s3-bucket-name chris.hare \
    --voice-id Joanna \
    --text-type ssml \
    --text file://war1.txt

The section of text being used for these examples is from the October 30, 1938 Orson Welles broadcast of The War of the Worlds. The passage selected is one of the announcer’s bulletins. The <speak> tags indicate this is an SSML document, just as <html> is used for HTML documents, which is why the command above passes --text-type ssml. We are going to discuss SSML later in this article.

Here is the newscaster style audio.


To synthesize this audio, we have to use SSML and specify the speech style. The key indicator that we are using the newscaster style is the <amazon:domain name="news"> tag in the SSML file used to synthesize the speech:

<speak> 
<amazon:domain name="news">
Ladies and gentlemen, we interrupt our program of dance music to bring you a special bulletin from the Intercontinental Radio News. At twenty minutes before eight, central time, Professor Farrell of the Mount Jennings Observatory, Chicago, Illinois, reports observing several explosions of incandescent gas, occurring at regular intervals on the planet Mars.
</amazon:domain>
</speak>

Here is the conversational style audio.


To synthesize this audio, we again use SSML, this time specifying the conversational style. This is the SSML file used to synthesize the speech:

<speak> 
<amazon:domain name="conversational">
Ladies and gentlemen, we interrupt our program of dance music to bring you a special bulletin from the Intercontinental Radio News. At twenty minutes before eight, central time, Professor Farrell of the Mount Jennings Observatory, Chicago, Illinois, reports observing several explosions of incandescent gas, occurring at regular intervals on the planet Mars.
</amazon:domain>
</speak>

The <amazon:domain> tag indicates whether the conversational or newscaster style is used. In the conversational example, we can see we are using this style because of the name="conversational" attribute in the tag.

We can hear there is a difference between the two styles. Here is the newscaster version using the male voice, Matthew. I think there is a stronger newscaster presence with the male voice than the female.


We are going to look at SSML in a little more detail later in the article.

Speech Marks

Speech marks add metadata to the audio, indicating when a word starts and ends. For many applications this isn’t important; however, if you are combining animation and audio, you need to know when each word starts and ends so you can synchronize the facial expressions with the speech.

Another example of using speech marks is when you want to highlight words as they are spoken, such as in subtitles.

Depending upon your situation, there are four possible types of speech marks: sentence, word, viseme, and ssml. The sentence and word speech marks are self-explanatory. The viseme speech mark describes the facial actions as each phoneme is pronounced.

A phoneme is a unit in the phonetic system of a language corresponding to a set of similar speech sounds.

Using speech marks involves performing a speech synthesis task, and requesting speech marks. We are going to look at speech synthesis in more detail a little later in this article.

Adjusting Pronunciation

Amazon Polly is pretty good at getting pronunciations right, however, there may be times when you want to change how words are pronounced. We can accomplish this using a lexicon.

For example, we might put “LOL” in the text, which Polly will read out as “ell oh ell”. We might want Polly to say the phrase “laugh out loud” instead of reading the letters verbatim. Lexicons help us accomplish this.

Defining a Lexicon

Lexicon definitions follow the Pronunciation Lexicon Specification (PLS) Version 1.0 published by the World Wide Web Consortium. This standard describes the layout of the lexicon using a specific structure, illustrated in this example from the specification website.

<?xml version="1.0" encoding="UTF-8"?>
<lexicon version="1.0"
    xmlns="http://www.w3.org/2005/01/pronunciation-lexicon"
    xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
    xsi:schemaLocation="http://www.w3.org/2005/01/pronunciation-lexicon
        http://www.w3.org/TR/2007/CR-pronunciation-lexicon-20071212/pls.xsd"
    alphabet="ipa" xml:lang="en-US">
  <lexeme>
    <grapheme>tomato</grapheme>
    <phoneme>təmei̥ɾou̥</phoneme>
    <!-- IPA string is: "t&#x0259;mei&#x325;&#x027E;ou&#x325;" -->
  </lexeme>
</lexicon>

This example provides the pronunciation of the word tomato. Getting this right means you will need some knowledge of phonemes and how to describe how words sound using the phonetics of the language you are working in. You can read more about phonemes and phonetics in this multi-part series.

Anyone who is not a native English speaker (and even some of us who are) won’t be surprised to learn that English is not phonetic: the sounds we make to say a word are often different from how we write it.

Using our previous example of having Polly say the phrase “laugh out loud” in place of LOL, we would define the lexicon this way:

<lexeme>
  <grapheme>LOL</grapheme>
  <alias>laugh out loud</alias>
</lexeme>

We can apply up to five different lexicons to our speech synthesis task, and we can have the same grapheme appear in multiple lexicons. We can control which lexicon is used for the grapheme by specifying the order the lexicons are applied.
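As a sketch of how that ordering works with the AWS SDK for Python (boto3), assuming two already-uploaded lexicons named LexA and LexB that both define the same grapheme:

import boto3

polly = boto3.client("polly", region_name="us-west-2")

# Lexicons are applied in the order listed; the order controls which
# lexicon's definition is used when both define the same grapheme.
response = polly.synthesize_speech(
    Text="LOL",
    OutputFormat="mp3",
    VoiceId="Joanna",
    LexiconNames=["LexA", "LexB"],  # LexB is a hypothetical second lexicon
)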

Using a Lexicon

When we submit a speech synthesis task, we specify the lexicon(s) we want to use, and the order they should be applied.

For example, if we are executing a speech synthesis task using the AWS CLI and want to specify a lexicon, we would do it this way:

aws polly start-speech-synthesis-task \
    --lexicon-names LexA \
    --voice-id Joanna \
    --output-format mp3 \
    --output-s3-bucket-name BUCKET \
    --text "Hello Chris. This audio sample was created using Amazon Polly and the AWS Command Line Interface. LOL." \
    --region us-west-2

In this example, we are telling Polly to use a lexicon file named “LexA”.

When we execute this command, the response is

An error occurred (LexiconNotFoundException) when calling the StartSpeechSynthesisTask operation: Lexicon not found

That is not what we were expecting to see. Before we can use the lexicon, we have to upload it using the put-lexicon command to Polly. The lexicon file must have either a .pls or .xml extension for the operation to be successful.

$ aws polly put-lexicon \
    --name LexA \
    --content file://LexA.pls
$

We can verify the lexicon has been uploaded using the list-lexicons command.

$ aws polly list-lexicons
{
    "Lexicons": [
        {
            "Name": "LexA",
            "Attributes": {
                "Alphabet": "ipa",
                "LanguageCode": "en-US",
                "LastModified": 1584848595.271,
                "LexiconArn": "arn:aws:polly:us-east-1:548985610555:lexicon/LexA",
                "LexemesCount": 1,
                "Size": 474
            }
        }
    ]
}
$

Let’s try running the same command again.

$ aws polly start-speech-synthesis-task \
    --lexicon-names LexA \
    --voice-id Joanna \
    --output-format mp3 \
    --output-s3-bucket-name BUCKET \
    --text "Hello Chris. This audio sample was created using Amazon Polly and the AWS Command Line Interface. LOL." \
    --region us-west-2

An error occurred (LexiconNotFoundException) when calling the StartSpeechSynthesisTask operation: Lexicon not found
$

Did you catch the problem? Amazon Polly cannot find the lexicon, even after we execute the put-lexicon command. If we look at the output of the put-lexicon command, we see the ARN is arn:aws:polly:us-east-1:548985610555:lexicon/LexA, telling us the lexicon is stored in us-east-1. Now, if you look at the speech synthesis task, you will see we are executing it in us-west-2. So, it makes sense that Polly cannot find the lexicon.

Not to worry, we’ll just re-run the put-lexicon command and specify the region. Well, that gives you an error, because the put-lexicon command doesn’t accept --region as an argument. The put-lexicon command gets the region to store the lexicon from your AWS CLI configuration. This means, for me to be able to execute my speech synthesis task, I have to change the default region to us-west-2 using aws configure.
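As an aside, if you are scripting this with boto3 rather than the CLI, you can sidestep the default-region issue by creating the client in an explicit region. A minimal sketch, using the same LexA.pls file:

import boto3

# Create the client in the region where the synthesis tasks will run,
# so the lexicon is stored where Polly will look for it.
polly = boto3.client("polly", region_name="us-west-2")

with open("LexA.pls") as f:
    polly.put_lexicon(Name="LexA", Content=f.read())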

After I make this change and run the put-lexicon command, my speech synthesis task successfully executes.

$ aws polly start-speech-synthesis-task \
    --lexicon-names LexA \
    --voice-id Joanna \
    --output-format mp3 \
    --output-s3-bucket-name BUCKET \
    --text "Hello Chris. This audio sample was created using Amazon Polly and the AWS Command Line Interface. LOL." \
    --region us-west-2

{
    "SynthesisTask": {
        "TaskId": "d1d2c424-3417-4ac7-a156-b7b7a6b83ee6",
        "TaskStatus": "scheduled",
        "OutputUri": "https://s3.us-west-2.amazonaws.com/BUCKET/d1d2c424-3417-4ac7-a156-b7b7a6b83ee6.mp3",
        "CreationTime": 1584851176.31,
        "RequestCharacters": 102,
        "LexiconNames": [
            "LexA"
        ],
        "OutputFormat": "mp3",
        "TextType": "text",
        "VoiceId": "Joanna"
    }
}

Here is the resulting audio file.


So, the lexicon replaced the text LOL with the phrase laugh out loud.

The lexeme in the lexicon must exactly match the word in the text. lol and LOL are not the same as far as the lexicon engine is concerned.

Now that we have seen how to perform a speech synthesis and control both pronunciation and using aliases to substitute text, how can we make Polly sound even more realistic?

More Realism using SSML

Amazon Polly is capable of generating pretty realistic speech using plain text. However, we can gain significant control when using Speech Synthesis Markup Language (SSML). Like other markup languages, SSML uses tags to change how the speech sounds. In this article, we won’t be discussing every capability of SSML.

We have seen some SSML already in this article. SSML is a markup language that uses tags to instruct the interpreter what needs to be done. Like we use <html></html> to start and end HTML documents, everything in SSML is enclosed within the <speak></speak> tags. Let's go back and look at the SSML for our first example, and examine how we can change the sound of the synthesized speech using different SSML tags.

aws polly start-speech-synthesis-task \
    --voice-id Joanna \
    --output-format mp3 \
    --output-s3-bucket-name BUCKET \
    --text "This audio sample was created using Amazon Polly and the AWS Command Line Interface." \
    --region us-west-2

Wait! This isn’t SSML. So let’s build our SSML document using this sample text.

<speak>
This audio sample was created using Amazon Polly and uses the voice Joanna with the default settings.
</speak>

Amazon Polly supports many SSML tags; let’s look at some of the most useful ones.

Break

If we want to add a pause in the way this speech is synthesized, we can use the <break/> tag and add a duration for the pause using either a predefined value or by specifying a time in milliseconds. We don't have to insert breaks for punctuation, as Polly is smart enough to insert a break with a pre-defined delay when it encounters punctuation. For example, let's add a 1-second pause to our text.

<speak>
This audio sample was created using Amazon Polly and uses the voice Joanna with the default settings.
This sentence has a <break time="1s"/> 1 second pause.
</speak>

Emphasis

We can emphasize words using the <emphasis></emphasis> tag, which can increase or decrease the volume and adjust the speaking rate. It only accepts predefined emphasis keywords.

<speak>
This audio sample was created using Amazon Polly and uses the voice Joanna with the default settings.
This sentence has a <break time="1s"/> 1 second pause.
Spot is being a <emphasis level="strong">bad dog</emphasis>.
</speak>

Sentence and Paragraph Pauses

Typically when we speak, we pause between sentences and paragraphs. While Polly understands adding a pause based upon punctuation, we can use the explicit <s></s> and <p></p> tags to adjust the pause duration.

<speak>
This audio sample was created using Amazon Polly and uses the voice Joanna with the default settings.
This sentence has a <break time="1s"/> 1 second pause.
Spot is being a <emphasis level="strong">bad dog</emphasis>.
<p>This sentence uses an SSML p tag, which inserts a longer delay at the end of the sentence.</p>
<s>This is a sentence enclosed in an SSML s tag.</s>
</speak>

Phoneme Tag

Sometimes, we want words to be pronounced a specific way, possibly to highlight the region where the “person” is from. Remember the phrase “you say potato, I say potato”? Here is an example of adding the <phoneme></phoneme> tag to adjust how the voice pronounces the word "pecan".

<speak>
This audio sample was created using Amazon Polly and uses the voice Joanna with the default settings.
This sentence has a <break time="1s"/> 1 second pause.
Spot is being a <emphasis level="strong">bad dog</emphasis>.
<p>This sentence uses an SSML p tag, which inserts a longer delay at the end of the sentence.</p>
<s>This is a sentence enclosed in an SSML s tag.</s>
You say, <phoneme alphabet="ipa" ph="pɪˈkɑːn">pecan</phoneme>.
I say, <phoneme alphabet="ipa" ph="ˈpi.kæn">pecan</phoneme>.
</speak>

When we synthesize this text, we will hear two different pronunciations for “pecan”. We’ll synthesize the entire example a little later.

Prosody Tag

Probably the most interesting, and challenging tag is <prosody></prosody>, which allows us to control the speaking rate, volume, and pitch of the associated speech.

We can change the volume of the synthesized text using the <prosody volume="level"></prosody> tag. There are a few predefined keywords, or you can specify the volume change in decibels. We can also adjust the speaking rate, using the rate keyword in the tag. For example, when we are excited, we often speak louder, and much faster.

We can also adjust the pitch of the voice by using the pitch keyword. We can use one of the predefined values, or a percentage. Unfortunately, we cannot adjust the pitch by frequency. Here is our SSML file thus far, with some formatting added for clarity.

<speak>
This audio sample was created using Amazon Polly and uses the voice Joanna with the default settings.
This sentence has a <break time="1s"/> 1 second pause.
Spot is being a <emphasis level="strong">bad dog</emphasis>.
<p>
This sentence uses an SSML p tag, which inserts a longer delay at the end of the sentence.
</p>
<s>
This is a sentence enclosed in an SSML s tag.
</s>
<p>
You say, <phoneme alphabet="ipa" ph="pɪˈkɑːn">
pecan
</phoneme>.
I say, <phoneme alphabet="ipa" ph="ˈpi.kæn">
pecan
</phoneme>.
</p>
<p>Sometimes, during a speech, we want to
<prosody volume="-6dB">
emphasize specific phrases by speaking at a lower volume,
</prosody>
<prosody volume="loud" rate="fast">
or by showing excitement about what we are saying.
</prosody>
</p>
<p>
Sometimes when we get really excited, not only do we get
<prosody volume="loud" rate="fast">
louder, and talk faster, but our
</prosody>
<prosody volume="loud" rate="fast" pitch="+20%">
pitch increases.
</prosody>
</p>
</speak>

It is not necessary to format the SSML this way, but it makes it easier to see where you might be missing a tag.

The Amazon Polly documentation says you can nest <prosody> tags, but while writing this article I found this was not the case. Or at least I wasn't able to make it work.

Before we look at several more interesting SSML tags, let’s hear what our example sounds like. To synthesize the speech, we’ll use this command:

$ aws polly start-speech-synthesis-task \
    --region us-west-2 \
    --endpoint-url "https://polly.us-west-2.amazonaws.com/" \
    --output-format mp3 \
    --output-s3-bucket-name YOUR-BUCKET \
    --voice-id Joanna \
    --text-type ssml \
    --output text \
    --text file://ssml-1.ssml

The result is this audio file.


To wrap up this section, it is important to review the tags you wish to use against the voice and the engine. While the tags should be available for all voices, not all of the SSML tags are supported by the neural engine.

If you are experiencing problems with a specific tag with the neural engine, check the tag documentation to verify it is supported by the neural engine.

There are many other SSML tags, not presented in this article, that can further enhance the speech synthesis. If you want to get deeper into SSML, check out the Amazon Polly documentation, specifically the supported SSML tags and how they work.

Polly Operations

Amazon Polly provides several operations through the AWS Console, the CLI, and the SDK. They are listed below, grouped by function, with a sample execution using the AWS CLI.

Speech Synthesis

  • start-speech-synthesis-task — This action takes plain text or valid SSML as input and performs an asynchronous speech synthesis. This action requires specifying an S3 bucket, where the synthesized speech is saved when the task is completed. This action also accepts an SNS topic, allowing you to be notified when the task is complete.
$ aws polly start-speech-synthesis-task \
    --voice-id Joanna \
    --output-format mp3 \
    --output-s3-bucket-name BUCKET \
    --text "This audio sample was created using Amazon Polly and the AWS Command Line Interface."
{
    "SynthesisTask": {
        "TaskId": "27df874d-77fa-469a-bc63-6f54f521276a",
        "TaskStatus": "scheduled",
        "OutputUri": "https://s3-endpoint/bucket/27df874d-77fa-469a-bc63-6f54f521276a.mp3",
        "CreationTime": 1583651313.015,
        "RequestCharacters": 84,
        "OutputFormat": "mp3",
        "TextType": "text",
        "VoiceId": "Joanna"
    }
}
$
  • synthesize-speech — This function takes plain text or SSML and synthesizes the speech output as a stream of bytes. If using SSML, the input must be valid SSML or the request fails. This is a synchronous speech synthesis action, where the audio file can be saved to your local device.
$ aws polly synthesize-speech \
    --voice-id Joanna \
    --output-format mp3 \
    --text "This audio sample was created using Amazon Polly and the AWS Command Line Interface." \
    polly-example-2.mp3
{
    "ContentType": "audio/mpeg",
    "RequestCharacters": "84"
}
$ ls -l *.mp3
-rw-r--r-- 1 roberthare staff 32019 Mar 8 00:14 polly-example-2.mp3
$
  • get-speech-synthesis-task — This function retrieves the information about a specific speech synthesis task, submitted using the start-speech-synthesis-task command. Executing the command requires the TaskId, which is provided when the task is created. Assuming we created a task in us-west-2 and we have the task id, this is the sample output.
$ aws polly get-speech-synthesis-task \
    --task-id e9552b8c-0d94-4063-b858-c8e0b0ae1aa6 \
    --region us-west-2
{
    "SynthesisTask": {
        "TaskId": "e9552b8c-0d94-4063-b858-c8e0b0ae1aa6",
        "TaskStatus": "completed",
        "OutputUri": "https://s3.us-west-2.amazonaws.com/chris.hare/e9552b8c-0d94-4063-b858-c8e0b0ae1aa6.mp3",
        "CreationTime": 1583652226.841,
        "RequestCharacters": 84,
        "OutputFormat": "mp3",
        "TextType": "text",
        "VoiceId": "Joanna"
    }
}
  • list-speech-synthesis-tasks — List all the speech synthesis tasks which have been submitted using the start-speech-synthesis-task command. This function can limit the responses by the status of the task.
{
    "SynthesisTasks": [
        {
            "TaskId": "27df874d-77fa-469a-bc63-6f54f521276a",
            "TaskStatus": "failed",
            "TaskStatusReason": "Error occurred while trying to upload file to S3. Please verify that the bucket exists in this region and you have permission to write objects to the specified bucket.",
            "OutputUri": "https://s3.us-east-1.amazonaws.com/chris.hare/27df874d-77fa-469a-bc63-6f54f521276a.mp3",
            "CreationTime": 1583651313.015,
            "RequestCharacters": 84,
            "OutputFormat": "mp3",
            "TextType": "text",
            "VoiceId": "Joanna"
        }
    ]
}

This sample output illustrates an important point when using the start-speech-synthesis-task command, specifically, that the task request must be made in the same region as the S3 bucket. For example, if the request is submitted in us-east-1 and the bucket was created in us-west-2, Amazon Polly will not be able to save the file to the bucket.

Lexicon Management

  • put-lexicon — This command allows you to store your lexicon in your AWS account in the region you specify. However, the put-lexicon command doesn’t accept the region as an argument. If you want to store your lexicon in a specific region, you need to run the aws configure command first and then execute aws polly put-lexicon. The lexicon file name must end in either .xml or .pls. There is no response from Polly with this command.
aws polly put-lexicon --name LexA --content file://LexA.pls
  • list-lexicons — This command lists the lexicons you have stored. Unlike get-lexicon, list-lexicons does not provide the actual lexicon contents. The information returned is essentially the metadata provided by the get-lexicon command. You cannot use the --region option with this command.
$ aws polly list-lexicons
{
    "Lexicons": [
        {
            "Name": "LexA",
            "Attributes": {
                "Alphabet": "ipa",
                "LanguageCode": "en-US",
                "LastModified": 1584848595.271,
                "LexiconArn": "arn:aws:polly:us-east-1:548985610555:lexicon/LexA",
                "LexemesCount": 1,
                "Size": 474
            }
        }
    ]
}
$
  • get-lexicon — This command allows you to retrieve the lexicon with the specified name, provided it exists. The response includes the lexicon content, along with the associated metadata. If you are sure there should be a lexicon, check the default region in your AWS CLI configuration, as you cannot specify a region with this command.
$ aws polly get-lexicon --name LexA
{
    "Lexicon": {
        "Content": "<?xml version=\"1.0\" encoding=\"UTF-8\"?>\n<lexicon version=\"1.0\" \n xmlns=\"http://www.w3.org/2005/01/pronunciation-lexicon\"\n xmlns:xsi=\"http://www.w3.org/2001/XMLSchema-instance\" \n xsi:schemaLocation=\"http://www.w3.org/2005/01/pronunciation-lexicon \n http://www.w3.org/TR/2007/CR-pronunciation-lexicon-20071212/pls.xsd\"\n alphabet=\"ipa\" xml:lang=\"en-US\">\n <lexeme>\n <grapheme>lol</grapheme>\n <alias>laugh out loud</alias>\n </lexeme>\n</lexicon>\n",
        "Name": "LexA"
    },
    "LexiconAttributes": {
        "Alphabet": "ipa",
        "LanguageCode": "en-US",
        "LastModified": 1584848595.271,
        "LexiconArn": "arn:aws:polly:us-east-1:548985610555:lexicon/LexA",
        "LexemesCount": 1,
        "Size": 474
    }
}
$
  • delete-lexicon — As you can imagine, the delete-lexicon command deletes the specified lexicon. There is no output from the command, and you cannot provide the --region option. You can use the list-lexicons command to verify the lexicon was deleted.
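For completeness, here is a minimal boto3 sketch of the same lexicon lifecycle, using the LexA file from earlier:

import boto3

polly = boto3.client("polly")  # uses the default region from your CLI configuration

# Upload (or replace) the lexicon.
with open("LexA.pls") as f:
    polly.put_lexicon(Name="LexA", Content=f.read())

# List the stored lexicons, then fetch one back with its content.
print([lex["Name"] for lex in polly.list_lexicons()["Lexicons"]])
print(polly.get_lexicon(Name="LexA")["Lexicon"]["Content"])

# Remove the lexicon when it is no longer needed.
polly.delete_lexicon(Name="LexA")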

Other Functions

  • describe-voices — This function returns the list of available voices for a speech synthesis task. Each voice speaks a specific language and is either male or female. When submitting a speech synthesis task, we have to provide the voice ID returned by describe-voices. If you provide the user with the ability to select a gender and language, you can select the matching voice and then submit the synthesis task (see the sketch after this list).
$ aws polly describe-voices --language-code en-US
{
    "Voices": [
        ...
        {
            "Gender": "Female",
            "Id": "Joanna",
            "LanguageCode": "en-US",
            "LanguageName": "US English",
            "Name": "Joanna",
            "SupportedEngines": [
                "neural",
                "standard"
            ]
        },
        {
            "Gender": "Male",
            "Id": "Matthew",
            "LanguageCode": "en-US",
            "LanguageName": "US English",
            "Name": "Matthew",
            "SupportedEngines": [
                "neural",
                "standard"
            ]
        },
        ...
    ]
}
  • help — The help function can be used as the only argument to get a full help listing, or add it as the last argument with a function to get help on that specific function.
$ aws polly help
$ aws polly describe-voices help
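Here is a minimal boto3 sketch of the voice-selection idea mentioned under describe-voices; the language and gender filter values are illustrative.

import boto3

polly = boto3.client("polly", region_name="us-west-2")

def pick_voice(language_code, gender):
    # Return the first voice ID matching the requested language and gender.
    voices = polly.describe_voices(LanguageCode=language_code)["Voices"]
    matches = [v["Id"] for v in voices if v["Gender"] == gender]
    if not matches:
        raise ValueError("No %s voice found for %s" % (gender, language_code))
    return matches[0]

voice_id = pick_voice("en-US", "Female")  # e.g. "Joanna"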

Speech Synthesis and Speech Marks

We discussed what speech marks are earlier in the article. In this section, we are going to look at creating the speech marks for use in animation and other video forms.

Earlier in the article, we looked at using the Neural engine with the Conversational and Newscaster styles using an excerpt from the 1938 “War of the Worlds” broadcast. I am going to include the text again here so you don’t have to refer back to it for this part of the discussion.

<speak> 
<amazon:domain name="conversational">
Ladies and gentlemen, we interrupt our program of dance music to bring you a special bulletin from the Intercontinental Radio News. At twenty minutes before eight, central time, Professor Farrell of the Mount Jennings Observatory, Chicago, Illinois, reports observing several explosions of incandescent gas, occurring at regular intervals on the planet Mars.
</amazon:domain>
</speak>

We have an audio file from the speech synthesis task, but we want the speech marks associated with the synthesized speech. To do this, we execute the command:

aws polly synthesize-speech \
    --engine neural \
    --region us-west-2 \
    --output-format json \
    --voice-id Matthew \
    --text-type ssml \
    --text file://war1.txt \
    --speech-mark-types='["sentence", "word", "viseme"]' \
    war1-speech-marks.txt

The command indicates we want to generate speech marks for each sentence, each word, and each viseme, which will generate a lot of output in the file war1-speech-marks.txt. Our input is SSML (SSML input is required only for ssml speech marks; the other types also support plain text), and the only supported format for speech mark output is JSON. After we execute the command, the API returns the following information:

{
    "ContentType": "application/x-json-stream",
    "RequestCharacters": "358"
}

This indicates the speech synthesis task has completed, with 358 characters in the input. The output looks like this:

{"time":0,"type":"sentence","start":37,"end":168,"value":"Ladies and gentlemen, we interrupt our program of dance music to bring you a special bulletin from the Intercontinental Radio News."}
{"time":62,"type":"word","start":37,"end":43,"value":"Ladies"}
{"time":62,"type":"viseme","value":"t"}
{"time":125,"type":"viseme","value":"e"}
...

We can see a speech mark for the first sentence, the word “Ladies”, and the first few visemes. Each entry in the JSON file follows the structure

{
    "time" : the timestamp in milliseconds from the beginning of the corresponding audio stream,
    "type" : the type of speech mark (sentence, word, viseme, or ssml),
    "start" : the offset in bytes (not characters) of the start of the object in the input text (not present for viseme marks),
    "end" : the offset in bytes (not characters) of the object's end in the input text (not present for viseme marks),
    "value" : varies depending on the type of speech mark
}

The value key can contain one of:

  • an SSML speech mark, which includes the SSML keyword;
  • the viseme name; or,
  • the word or sentence from the input text delimited by the start and end values.

Speech marks generated with one voice are not necessarily the same as speech marks for a different voice. Consequently, if you change the Voice ID, it is necessary to regenerate the speech marks.

With the speech marks generated, you can use them and the synthesized audio output with services like Amazon Sumerian to create a Virtual Reality character.
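If you want to consume speech marks programmatically, remember the output is a stream of JSON objects, one per line, rather than a single JSON document. Here is a minimal boto3 sketch, using the same war1.txt SSML file, that prints word timings such as you might use for subtitle highlighting:

import json
import boto3

polly = boto3.client("polly", region_name="us-west-2")

# Request word-level speech marks; the output format must be json.
response = polly.synthesize_speech(
    Engine="neural",
    VoiceId="Matthew",
    TextType="ssml",
    Text=open("war1.txt").read(),
    OutputFormat="json",
    SpeechMarkTypes=["word"],
)

# The stream is newline-delimited JSON: parse one object per line.
for line in response["AudioStream"].read().decode("utf-8").splitlines():
    if line.strip():
        mark = json.loads(line)
        print(mark["time"], "ms", mark["value"])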

Console vs. CLI vs. API

Amazon Polly can be used in the AWS Console if you have small amounts of text to be converted once, whether it is plain text or SSML. As we have seen, we can synthesize the text and download the file to our device.

Incidentally, using either Safari or Chrome on an iPad to download the synthesized audio file doesn’t work.

We have already seen how to use the AWS Command Line Interface (CLI), and we can also use the AWS Shell to submit speech synthesis requests (which we will briefly see in a few minutes). Finally, there is the AWS Software Development Kit (SDK) for the programming language of your choice; the languages supported by the SDK (and related development tools) are listed in the AWS SDK documentation.

Using the API provides us much more capability in terms of being able to generate audio streams we can integrate directly into an application instead of working with audio files. We can submit a synthesis request, monitor its progress and take action when it is complete or handle an error. The Amazon Polly API contains the same functions as we have seen in the CLI and the Console.
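As a sketch of that submit, monitor, and react pattern with boto3 (YOUR-BUCKET is a placeholder for your own bucket, in the same region as the request):

import time
import boto3

polly = boto3.client("polly", region_name="us-west-2")

# Submit an asynchronous task; the audio will be written to the S3 bucket.
task = polly.start_speech_synthesis_task(
    Text="This audio sample was created using Amazon Polly and the AWS SDK.",
    OutputFormat="mp3",
    VoiceId="Joanna",
    OutputS3BucketName="YOUR-BUCKET",  # placeholder bucket name
)["SynthesisTask"]

# Poll until the task leaves the scheduled/inProgress states.
while task["TaskStatus"] in ("scheduled", "inProgress"):
    time.sleep(5)
    task = polly.get_speech_synthesis_task(TaskId=task["TaskId"])["SynthesisTask"]

if task["TaskStatus"] == "completed":
    print("Audio available at", task["OutputUri"])
else:
    print("Task failed:", task.get("TaskStatusReason"))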

Amazon Polly and Amazon Transcribe

In a previous Medium article, I discussed using Amazon Transcribe to convert audio into text. Like Amazon Polly, Transcribe is part of the AI/ML family of services.

Amazon Transcribe is a highly scalable, on-demand speech to text transcription service. It is capable of analyzing multiple audio formats from files stored in S3 and provides an accurate transcription, including timestamps for each word. Amazon Transcribe can also operate on streaming audio, providing a stream of transcribed text in real-time.

We have seen how Amazon Polly can be used to create almost lifelike speech from text, and my Amazon Transcribe Medium article showed that Transcribe is pretty good at converting spoken audio into text. That got me wondering: since I have the original text I synthesized with Polly, how well will Transcribe do converting the audio back to text?

I am not going to cover all of the capabilities of Amazon Transcribe, but I will demonstrate how we submit a transcription job through the CLI. To submit a transcription job, the audio file must be available in an S3 bucket.

Here is the transcription request using the AWS CLI.

aws transcribe start-transcription-job \
    --transcription-job-name "newscaster-male" \
    --language-code en-US \
    --media-format mp3 \
    --media "MediaFileUri=s3://chris.hare/26fd02f9-b5b4-480f-8e21-e75fc001efef.mp3" \
    --region us-west-2

Once we execute this command, the CLI responds with information about the request.

{
    "TranscriptionJob": {
        "TranscriptionJobName": "newscaster-male",
        "TranscriptionJobStatus": "IN_PROGRESS",
        "LanguageCode": "en-US",
        "MediaFormat": "mp3",
        "Media": {
            "MediaFileUri": "s3://BUCKET/26fd02f9-b5b4-480f-8e21-e75fc001efef.mp3"
        },
        "StartTime": 1585890134.042,
        "CreationTime": 1585890134.012
    }
}

We can then monitor the transcription job until the status is complete, using the list-transcription-jobs command.

$ aws transcribe list-transcription-jobs --region us-west-2
{
    "TranscriptionJobSummaries": [
        {
            "TranscriptionJobName": "newscaster-male",
            "CreationTime": 1585890134.012,
            "StartTime": 1585890134.042,
            "CompletionTime": 1585890189.432,
            "LanguageCode": "en-US",
            "TranscriptionJobStatus": "COMPLETED",
            "OutputLocationType": "SERVICE_BUCKET"
        }
    ]
}

Let’s look at the transcription output. Well, not quite yet. In the start-transcription-job request, I didn’t specify a location to store the transcription. In this case, Amazon Transcribe stores the transcription in a service-managed bucket, and you retrieve it using a pre-signed URL. The only way to get this URL is through the get-transcription-job command.

aws transcribe get-transcription-job --transcription-job-name "newscaster-male" --region us-west-2

After executing this command, we will get the information on the job.

{
    "TranscriptionJob": {
        "TranscriptionJobName": "newscaster-male",
        "TranscriptionJobStatus": "COMPLETED",
        "LanguageCode": "en-US",
        "MediaSampleRateHertz": 22050,
        "MediaFormat": "mp3",
        "Media": {
            "MediaFileUri": "s3://BUCKET/26fd02f9-b5b4-480f-8e21-e75fc001efef.mp3"
        },
        "Transcript": {
            "TranscriptFileUri": "https://s3.us-west-2.amazonaws.com/TRUNCATED..."
        },
        "StartTime": 1585890134.042,
        "CreationTime": 1585890134.012,
        "CompletionTime": 1585890189.432,
        "Settings": {
            "ChannelIdentification": false,
            "ShowAlternatives": false
        }
    }
}

I truncated the TranscriptFileUri. Using the pre-signed URL, we can download the transcription.

The original text is 360 characters in length. The transcript output from Amazon Transcribe is 7,266 characters; there is a lot more in the response than just the text. If we look at the file, we see the transcribed text and more information about the transcription job. The result is a JSON structured file, which, when formatted, looks like this:

{
    "jobName": "newscaster-male",
    "accountId": "548985610555",
    "results": {
        "transcripts": [
            {
                "transcript": "Ladies and gentlemen, we interrupt our program of dance music to bring you a special bulletin from the Intercontinental Radio News at 20 minutes before eight, Central time. Professor Farrell of the Mount Jennings Observatory, Chicago, Illinois, reports observing several explosions of incandescent gas occurring at regular intervals on the planet Mars."
            }
        ],
        "items": [
            {
                "start_time": "0.04",
                "end_time": "0.38",
                "alternatives": [
                    {
                        "confidence": "1.0",
                        "content": "Ladies"
                    }
                ],
                "type": "pronunciation"
            }
            ...

The “transcripts” section provides the transcribed text, while the “items” section provides an analysis of each transcribed word, including the start and end times, the word, possible alternatives and the confidence level of the transcription.
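If you only want the transcript text, here is a minimal boto3 sketch that fetches the job, follows the pre-signed URL, and pulls the text out of the JSON:

import json
import urllib.request
import boto3

transcribe = boto3.client("transcribe", region_name="us-west-2")

# Fetch the job record to obtain the pre-signed transcript URL.
job = transcribe.get_transcription_job(
    TranscriptionJobName="newscaster-male"
)["TranscriptionJob"]

# Download the JSON result and print just the transcript text.
with urllib.request.urlopen(job["Transcript"]["TranscriptFileUri"]) as resp:
    result = json.load(resp)

print(result["results"]["transcripts"][0]["transcript"])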

Now, we have the transcript, and the original text to compare.

The original text is

Ladies and gentlemen, we interrupt our program of dance music to bring you a special bulletin from the Intercontinental Radio News. At twenty minutes before eight, central time, Professor Farrell of the Mount Jennings Observatory, Chicago, Illinois, reports observing several explosions of incandescent gas, occurring at regular intervals on the planet Mars.

The transcribed text looks like this:

Ladies and gentlemen, we interrupt our program of dance music to bring you a special bulletin from the Intercontinental Radio News at 20 minutes before eight, Central time. Professor Farrell of the Mount Jennings Observatory, Chicago, Illinois, reports observing several explosions of incandescent gas occurring at regular intervals on the planet Mars.

Let’s highlight the differences between the source and the transcribed text: the sentence break after “Intercontinental Radio News” was dropped, so the transcript continues with “at” where the source has “. At”; “twenty” became “20”; “central time,” became “Central time.” with the sentence break moved after it; and the comma after “incandescent gas” was lost.

As we can see, there are a small number of errors, primarily around punctuation, which is very difficult to determine from the audio. I think Amazon Transcribe did very well transcribing the audio.

Amazon Polly and AWS-Shell

Using the AWS-Shell is like using the CLI, except you don’t have to type the aws command all the time; you get interactive prompts and help, so you don’t have to remember the syntax for every command.

If you have never seen the AWS-Shell before, you can enter a partial command and the AWS-Shell will provide interactive guidance on your options.


When we execute commands, it looks and feels like the CLI. Some other features make the AWS-Shell useful, such as command completion and command history.


You can read more about the AWS-Shell in my Medium article.

Writing for Speech is different from Print

This article has focused on using Amazon Polly to convert text to speech. However, writing text to read is not quite the same as writing text for audio conversion.

I write my articles using Markdown and then submit them to the Medium platform for publishing. However, simply sending the markdown to Polly would result in the audio output saying “hashtag hashtag there is a Little More to it”, which just isn’t right. Once I am satisfied with the article, I make a copy and edit it to remove the markdown elements and headings, and adjust the text to make it suitable for Polly to read and easier for you to listen to.

The content has to change because a listener wouldn’t want to hear all the code examples read aloud, and pictures have to be converted to text to bring out the message the picture was intended to convey. This makes text to speech conversion harder: you have to carefully plan the message and how it will be presented.

Finishing Up

Amazon Polly is an amazing tool when you consider the challenge of breaking text into the speech elements appropriate for the desired language, and then converting those speech elements into audio. Amazon Polly can be used to create synthesized speech for Virtual Reality projects created using Amazon Sumerian, animation projects, real-time synthesis in applications, and more.

References

Amazon Polly

SSML Tags Supported by Amazon Polly

Amazon Polly Supported Languages and Voices

Neural Text To Speech Speaking Styles

AWS Python SDK

AWS SSML Reference

Python

AWS SDK documentation

AWS Shell

Visemes and Amazon Polly

Amazon Polly Speech Mark Examples

Generating Speech from SSML Documents

Managing Amazon Polly Lexicons

W3C Pronunciation Lexicon Specification (PLS) Version 1.0

Amazon Polly — Applying Multiple Lexicons

War of the Worlds

About the Author

Chris is a highly-skilled Information Technology, AWS Cloud, Training and Security Professional bringing cloud, security, training and process engineering leadership to simplify and deliver high-quality products. He is the co-author of seven books and author of more than 70 articles and book chapters in technical, management and information security publications. His extensive technology, information security, and training experience makes him a key resource who can help companies through technical challenges.

Copyright

This article is Copyright © 2020, Chris Hare.

