A Five Minute Overview of AWS Transcribe

In a previous article, I discussed using AWS Polly to take my article text and convert it to speech. AWS Transcribe is a speech-to-text service, allowing users to create a text transcription of an audio recording. The service is capable of detecting different voices in the conversation and adding timestamps. In this article, we are going to look at what AWS Transcribe does, how we can use it, and a sample implementation.

AWS Transcribe is a highly scalable, on-demand speech-to-text transcription service. It can analyze multiple audio formats from files stored in S3 and provides accurate transcription, including timestamps for each word. AWS Transcribe can also operate on streaming audio, providing a stream of transcribed text in real time.

AWS Transcribe supports 16 languages, including 4 English variants. However, streaming transcription only supports 5 languages as of October 2019.

Because so many of our interactions involve more than one person, AWS Transcribe can identify different speakers. When the job is submitted, you indicate the number of voices AWS Transcribe should recognize. Additionally, if the audio is recorded on separate channels, say the customer on one channel and the call center representative on another, AWS Transcribe can use the channels to tell the voices apart and produce a transcript for each channel as well as for the two channels combined.
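As a rough sketch of how this looks through the Python SDK (the job name and S3 URI below are placeholders, not values from this article), speaker identification is requested through the Settings parameter when the job is submitted:

import boto3

transcribe = boto3.client('transcribe')

# Ask Transcribe to distinguish up to two speakers in the recording.
# For stereo audio with one party per channel, the alternative setting
# Settings={'ChannelIdentification': True} requests per-channel transcripts.
transcribe.start_transcription_job(
    TranscriptionJobName='two-speaker-example',          # placeholder job name
    Media={'MediaFileUri': 's3://my-bucket/call.mp3'},    # placeholder S3 URI
    MediaFormat='mp3',
    LanguageCode='en-US',
    Settings={
        'ShowSpeakerLabels': True,
        'MaxSpeakerLabels': 2})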

AWS Transcribe is driven by the AWS AI/ML platform, so it gets smarter over time.

There are several scenarios where transcription services can be used, such as:

  • Converting audio meeting recordings into text;
  • Creating subtitles/closed captioning based upon the actual speech;
  • Analyzing customer call center interactions;
  • Enabling rich search capabilities of audio and video archives;
  • Targeting advertising based upon content; and,
  • Creating text notes of medical dictation.

Essentially, any situation where audio, or the speech portion of a video recording, is converted to text is a practical application of AWS Transcribe.

AWS Transcribe has four APIs:

  • StartTranscriptionJob, to submit a new transcription request;
  • ListTranscriptionJobs, to get the status of submitted transcription jobs;
  • GetTranscriptionJob, to get the status of a specified transcription job; and,
  • StartStreamTranscription, to process an audio stream in real time instead of a file.

Additional APIs exist for managing custom vocabularies.
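As a quick sketch of the listing call through the Python SDK (the status filter is optional and shown here only for illustration):

import boto3

transcribe = boto3.client('transcribe')

# List jobs that are still being processed; omit Status to list all jobs.
response = transcribe.list_transcription_jobs(Status='IN_PROGRESS')

# Each summary includes the job name and its current status.
for job in response['TranscriptionJobSummaries']:
    print(job['TranscriptionJobName'], job['TranscriptionJobStatus'])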

Let’s look at submitting a transcription request through the console, and then using the Python SDK. It is also possible to start and monitor an AWS Transcribe task using the AWS command-line interface.

To create a new Transcription job, open the AWS Console and navigate to the Transcribe view.

This view shows any previous transcription jobs and their status. In this view, we have no submitted transcription jobs. To create a new transcription job, click on the create button.

When creating the job in the console, you have to know the path to the S3 file. You cannot search for it in the console. After providing the job details and the source file to transcribe, there are a couple of options for the output.

The Output location allows you to choose where the transcription results will be stored. If you choose “Amazon Default”, the transcript is stored in an AWS provided, secured S3 bucket, and a URI is provided to allow you to retrieve the results. Selecting “Customer Specified” allows you to specify the S3 bucket. Your specific security posture and organization policies will determine which option is best in your situation.
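The same choice exists in the SDK. Leaving the output location unset stores the transcript in the AWS managed bucket; a sketch along these lines (the job, media, and bucket names are placeholders) writes it to a bucket you own:

import boto3

transcribe = boto3.client('transcribe')

# With OutputBucketName set, the transcript JSON lands in your own S3 bucket
# rather than the AWS managed one. All names below are placeholders.
transcribe.start_transcription_job(
    TranscriptionJobName='my-transcription-job',
    Media={'MediaFileUri': 's3://my-bucket/audio/recording.mp3'},
    MediaFormat='mp3',
    LanguageCode='en-US',
    OutputBucketName='my-transcript-bucket')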

Once you have provided the details for the transcription, click the “Create” button to submit the transcription request. After the job is submitted, you are returned to the job listing.

We can see the job is in progress. If we had other jobs in progress or completed, they would also be shown in the list. We can view the details of the job by clicking on the job name.

In this view, our transcription job is still in progress, so we cannot see any output as yet. Once the transcription job has been completed, we can see the transcribed text.

The Transcription Preview section shows the first part of the transcribed text. This example is from a Polly generated audio file for one of my Medium articles.

To download the entire transcription, scroll back up the page and click on the “Download full transcript” button. In a web browser, you will see this:

This view isn’t entirely helpful, so you will want to save the transcription as a file for further processing and analysis. The resulting transcription file is a JSON formatted document.

Like many of the AWS services, transcription jobs can be submitted using the Command Line Interface or the SDK for the language of your choice. For this article, we will be using Python3. There are two major parts to our example code, one to submit the job and the other to check the status.

import sys

import boto3

# jobname, job_url, format, language, and sample come from the script's
# command-line arguments (see the sample execution later in the article).
transcribe = boto3.client('transcribe')
try:
    response = transcribe.start_transcription_job(
        TranscriptionJobName=jobname,
        Media={'MediaFileUri': job_url},
        MediaFormat=format,
        LanguageCode=language,
        MediaSampleRateHertz=sample)
except Exception as e:
    print(e)
    sys.exit()
print(f"Request submitted: {response['ResponseMetadata']['RequestId']}")

This section takes the parameters provided and passes them to the start_transcription_job API call. This call needs the name of the job, the URI for the source media file, the media format, the language code (e.g. en-US), and the sample rate. Once the job is submitted, the code prints the RequestId for the job.

The sample code also lets you monitor the status of the job you just submitted and prints the URL for the transcript file once the job is complete.

import time
from datetime import datetime

import boto3

transcribe = boto3.client('transcribe')
start = datetime.now()  # in the full script, recorded when the job is submitted
while True:
    status = transcribe.get_transcription_job(TranscriptionJobName=jobname)
    if status['TranscriptionJob']['TranscriptionJobStatus'] in ['COMPLETED', 'FAILED']:
        end = datetime.now()
        break
    print("Not ready yet...")
    time.sleep(5)
print(f"processing time is {end - start}")
print(f"transcript URL is {status['TranscriptionJob']['Transcript']['TranscriptFileUri']}")

To monitor the status of the job, we just need to pass the job name to the get_transcription_job API. This code segment keeps checking until the job status is COMPLETED or FAILED, then prints the transcript file URL so you can retrieve it. Here is a sample execution.

$ python3 submit.py -j sample9 -m s3://bucket/prefix/audio.mp3 -f mp3 -l en-US -s 22050 -x
Request submitted: eb69985e-5bfa-4b1f-be69-5521631e5a66
Not ready yet...
Not ready yet...
Not ready yet...
processing time is 0:02:18.462067
transcript URL is https://s3.amazonaws.com/aws-transcribe-us-east-1-prod/548985610555/sample9/40fd2d4f-56b9-4c54-9a3f-fe7e205da6f8/asrOutput.json?X-Amz-Security-Token=....&X-Amz-Algorithm=AWS4-HMAC-SHA256&X-Amz-Date=20191012T053901Z&X-Amz-SignedHeaders=host&X-Amz-Expires=900&X-Amz-Credential=...&X-Amz-Signature=005461a53702b667498ec8e12b986601ff0e4bb458efde846877e4fbf70b5dc8
$

A link to the code used to submit the request and monitor the output is included at the end of the article.

We can use Python to get the start and end times, the confidence level, and the detected text. Here is a sample of the transcribed audio.

0.04 0.14 0.9908 a
0.14 0.47000000000000003 1.0 five
0.48 0.77 1.0 minute
0.77 1.21 0.9979 overview
1.21 1.32 1.0 of
1.33 2.08 0.7 AWS
2.09 2.9499999999999997 0.9778 outposts
0 0 0.0 .
3.49 4.38 0.9981 Organizations
4.38 4.75 1.0 across
4.75 4.87 1.0 the
4.87 5.17 1.0 globe
5.17 5.33 0.9988 are
5.33 5.68 1.0 moving
5.68 6.0 1.0 major
6.0 6.32 1.0 parts
6.32 6.41 1.0 of
6.41 6.5600000000000005 0.9902 their

The code used to print this is at the end of the article.

This output shows the start time (measured from the beginning of the audio) at which the word was detected, the end time (from which we can compute how long it takes to say the word), the confidence level (1.0 being 100% confidence), and the transcribed text.

If we look at this section of the transcribed output:

8.02 8.34 1.0 Using
8.34 8.9 0.98 partners
8.9 9.120000000000001 1.0 like
9.13 9.700000000000001 1.0 Amazon
9.7 9.92 0.9986 Web
9.92 10.38 0.9497 service
10.38 10.73 0.9497 is

and compare it with the original text, we would see a discrepancy:

using partners like Amazon Web Services,

Transcribe heard "Service is", not "Services", so the transcription is not 100% correct. This is where the confidence value is important. By looking for words with a less than 100% confidence level, you can focus on fixing those parts of the transcript before use.
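As a rough illustration, assuming the transcript JSON has already been downloaded locally (the filename below is a placeholder), the less-than-certain words can be collected for review:

import json

# Load the downloaded transcript and flag the words Transcribe was not
# fully confident about, along with where they occur in the audio.
with open('asrOutput.json') as f:
    transcript = json.load(f)

flagged = [
    (item.get('start_time'), item['alternatives'][0]['content'])
    for item in transcript['results']['items']
    if item['type'] == 'pronunciation'
    and float(item['alternatives'][0]['confidence']) < 1.0
]
print(flagged)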

AWS Transcribe is charged by the number of transcription seconds per month. The service is eligible for the Free Tier, with 60 minutes of transcription per month. Above the Free Tier, or after the Free Tier expires, the rate is $0.0004 per second of audio, billed in 1 second increments with a minimum of 15 seconds.

This means 15 seconds of transcribed audio costs $0.006. If you have 200 minutes of audio a month, the charge would be $4.80. (These prices are for the us-east-1, North Virginia region; prices in other regions may vary.)
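As a quick check of that arithmetic in Python:

from decimal import Decimal

rate_per_second = Decimal('0.0004')   # us-east-1 price per second of audio

print(rate_per_second * 15)           # 0.0060 -> the 15 second minimum charge
print(rate_per_second * 200 * 60)     # 4.8000 -> 200 minutes of audio per month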

I got interested in this service because it is the opposite of AWS Polly. I can see wide-ranging applications for it, particularly if you have audio recordings of speeches or conference sessions you want to generate transcripts for, but also meetings, phone calls (subject to your local state's wiretapping regulations), and more.

Other examples could include audio recordings of court proceedings and dictated medical audio.

Organizations that process audio files and need a transcription service for analyzing that audio would be well advised to use this service. In a call center context, analysis could show whether agents are saying certain things to customers, whether customers are using abusive language toward call center agents, and possibly identify common problems reported by customers.

Video services can use the service to generate closed captioning and subtitles for video, either as a service (like the captioning you see on TVs in restaurants) or to be compliant with the Americans with Disabilities Act. Like many AWS services, the possible use cases for a service like Transcribe abound.

There are meeting recording and transcription applications, like Otter, which demonstrate the benefits of a transcription service. I don’t know whether Otter uses AWS Transcribe, but it illustrates a possible use case.

This sample code is used to submit a job to AWS Transcribe through the SDK and monitor the status of the submitted job. You can also download this file.

This code was used to convert the resulting JSON document into a start time, end time, confidence level and detected text. You can also download this file.
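A minimal sketch of that conversion, assuming the standard layout of the Transcribe output (a results.items list of pronunciation and punctuation items) and a placeholder filename, looks roughly like this:

import json

# Load the transcript JSON retrieved from the URL printed by the monitoring
# step; the filename is a placeholder.
with open('asrOutput.json') as f:
    transcript = json.load(f)

# Each pronunciation item carries a start time, end time, and the best
# alternative's confidence and text; punctuation items have no timestamps,
# so default them to zero as in the sample output above.
for item in transcript['results']['items']:
    start = item.get('start_time', '0')
    end = item.get('end_time', '0')
    best = item['alternatives'][0]
    print(start, end, best.get('confidence', '0.0'), best['content'])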

AWS Transcribe

AWS Polly

AWS Transcribe — Accurate Speech to Text at Scale

Chris is a highly-skilled Information Technology AWS Cloud, Training and Security Professional bringing cloud, security, training and process engineering leadership to simplify and deliver high-quality products. He is the co-author of more than seven books and author of more than 70 articles and book chapters in technical, management and information security publications. His extensive technology, information security, and training experience makes him a key resource who can help companies through technical challenges.

This article has been cross-posted to LinkedIn and Medium.

This article is Copyright © 2019, Chris Hare.
