AWS Transcription: User Guide

Overview

The Social Sciences Division provides a transcription service through Amazon Web Services (AWS).

S3 is the AWS cloud storage service where input and output files are stored. Transcribe is an AWS service that converts audio files into text files. This pilot version is designed to provide a fast first draft of a transcript. Cleaning and analysis should be completed using other University of Chicago resources.

See the SSCS Research Transcription Service webpage for more information.

Supported Features

Before Beginning: Review Supported Features

Supported Languages
Supported languages and language-specific features – Amazon Transcribe

Supported File Formats
Data input and output – Amazon Transcribe, see Media Formats-Supported Formats

Supported Naming Conventions
Creating object key names – Amazon Simple Storage Service
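
As a rough pre-upload check, the sketch below flags characters in a file name that fall outside the set the S3 key-naming page describes as safe (letters, digits, and ! - _ . * ' ( )). It is an illustration only, not an SSCS tool. The function names are hypothetical, and the linked AWS page remains the authority on what is allowed.

```python
import string

# Characters the S3 key-naming guidance lists as safe for object key names.
# "/" is left out on purpose because this check is for a single file name.
SAFE_CHARACTERS = set(string.ascii_letters + string.digits + "!-_.*'()")


def unsafe_characters(file_name: str) -> list[str]:
    """Return the distinct characters in file_name that are outside the safe set."""
    return sorted(set(file_name) - SAFE_CHARACTERS)


def suggest_safe_name(file_name: str) -> str:
    """Replace every character outside the safe set with an underscore."""
    return "".join(ch if ch in SAFE_CHARACTERS else "_" for ch in file_name)


if __name__ == "__main__":
    name = "interview 01 (final).mp3"  # hypothetical file name
    print(unsafe_characters(name))
    print(suggest_safe_name(name))
```

Renaming a file before upload is usually simpler than working out afterwards how a space or accented character was carried into the output file name.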

Audio File Time Limitations
Audio files longer than three hours will not work in this pilot version. SSCS recommends keeping each audio file to one to two hours at most.
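
To check a recording against these limits before uploading, something like the following sketch could be used. It assumes the file is an uncompressed WAV, because Python's standard library can read WAV headers but not MP3 or M4A. The function names and the status strings are hypothetical.

```python
import wave

MAX_SECONDS = 3 * 60 * 60          # pilot limit stated above: three hours
RECOMMENDED_SECONDS = 2 * 60 * 60  # upper end of the one-to-two-hour recommendation


def wav_duration_seconds(path: str) -> float:
    """Return the duration of a WAV file in seconds (frame count / frame rate)."""
    with wave.open(path, "rb") as wav_file:
        return wav_file.getnframes() / wav_file.getframerate()


def check_duration(seconds: float) -> str:
    """Classify a duration against the pilot limit and the SSCS recommendation."""
    if seconds > MAX_SECONDS:
        return "too long"
    if seconds > RECOMMENDED_SECONDS:
        return "over recommended length"
    return "ok"
```

For example, `check_duration(wav_duration_seconds("interview01.wav"))` would classify a hypothetical local file named interview01.wav. Files reported as "too long" would need to be split before upload.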

Multi-Speaker Limitation
The output file will identify speakers in the transcript. The maximum number of identifiable speakers is nine.

Speaker Identification
The Transcribe model labels speakers numerically, starting from zero by default.
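
Once a transcript has been downloaded and converted to plain text, the numeric labels can be swapped for names during cleaning. The sketch below assumes the labels appear in the text as spk_0, spk_1, and so on. That label format is an assumption, so check your own output file before relying on it. The function name is hypothetical.

```python
import re


def relabel_speakers(text: str, names: dict[int, str]) -> str:
    """Replace labels such as 'spk_0' with names; unmapped labels are left unchanged."""

    def substitute(match: re.Match) -> str:
        number = int(match.group(1))
        return names.get(number, match.group(0))

    # ASSUMPTION: speaker labels look like "spk_<number>" in the transcript text.
    return re.sub(r"\bspk_(\d+)\b", substitute, text)


if __name__ == "__main__":
    sample = "spk_0: How long have you lived here?\nspk_1: About ten years."
    print(relabel_speakers(sample, {0: "Interviewer", 1: "Participant"}))
```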

Multi-Language Feature
The SSCS Transcribe model is built to identify multiple languages. It is not yet known whether there is a limit on the number of languages that can be identified per file.

Custom Features
At this time, the pilot version does not offer the additional custom features that may be listed on the AWS website.

Logging Into Your AWS Account

Step 1: Log in to https://uchicago.awsapps.com/start#/

Enter your CNET credentials, then complete 2FA.

Step 2: Access the AWS portal

Click on SSD-SSCS-LowRisk.

Expand the arrow and click on SSD-SSCS-LowRisk-Users.

Upload the Audio File

Step 3: Select Amazon S3 Storage Service

Type S3 in the search bar and select S3 (Scalable storage in the cloud) to open the console.

Step 4: Select your Bucket

AWS uses the term ‘bucket’ for a storage container, much like a directory in a computer's file system.

Clicking on S3 opens a page that displays a list of buckets. You will not have permission to access buckets other than your own.

You will be given access to a specific bucket created by SSCS. The bucket name follows the general naming convention: [cnet]uchicagoedu-[12 random digits provided by AWS].

This bucket is designated for uploading the audio files related to your project.
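
To pick your bucket out of a long list, the naming convention can be expressed as a pattern. The sketch below is a minimal illustration that assumes the CNET portion contains only lowercase letters and digits. The function name and the example bucket name are hypothetical.

```python
import re
from typing import Optional, Tuple

# [cnet]uchicagoedu-[12 digits], per the naming convention above.
# ASSUMPTION: the CNET portion is lowercase letters and digits only.
BUCKET_PATTERN = re.compile(r"^(?P<cnet>[a-z0-9]+)uchicagoedu-(?P<digits>\d{12})$")


def parse_bucket_name(name: str) -> Optional[Tuple[str, str]]:
    """Return (cnet, digits) if name follows the convention, otherwise None."""
    match = BUCKET_PATTERN.match(name)
    if match is None:
        return None
    return match.group("cnet"), match.group("digits")


if __name__ == "__main__":
    print(parse_bucket_name("jdoeuchicagoedu-123456789012"))  # hypothetical bucket name
```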

Step 5: View Folders within your Bucket

Your bucket will have an input folder and an output folder.

The general naming convention for the input folder is: “Audio_Files”

The general naming convention for the output folder is: “Transcription_Output”

The “Transcription_Output” folder is populated automatically with the output files once the audio files have been submitted for transcription.

Step 6: Upload Audio Files

In the “Audio_Files/” folder, click either of the “Upload” buttons. Note that the second “Upload” button appears only when the folder is empty.

To add files, you can either (1) drag and drop files into the indicated field or (2) click “Add Files,” select your audio files, and then click on “Upload.”

Step 7: Upload Success

A successful upload shows a green checkmark in the “Status” column, under the “Files and Folders” section.

Once the upload is successful, click on the “Close” button to navigate back to the Files and Folders section.

There is no notification that the transcription is in progress or complete. You will have to wait and manually check the “Transcription_Output” folder for the output file.
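
The manual check amounts to polling the output folder until a matching file appears. The sketch below shows that logic in general terms. It is not an SSCS-provided tool, and this guide does not confirm that programmatic access to the bucket is available. The listing step is therefore passed in as a function, list_keys, which is a hypothetical stand-in for however you obtain the current object names.

```python
import time
from typing import Callable, Iterable, Optional


def wait_for_output(
    list_keys: Callable[[], Iterable[str]],
    expected_prefix: str,
    interval_seconds: float = 60.0,
    max_checks: int = 30,
    sleep: Callable[[float], None] = time.sleep,
) -> Optional[str]:
    """Poll until a name starting with expected_prefix appears.

    Returns the first matching name, or None if max_checks is reached.
    """
    for attempt in range(max_checks):
        for key in list_keys():
            if key.startswith(expected_prefix):
                return key
        if attempt < max_checks - 1:
            sleep(interval_seconds)  # wait before looking again
    return None
```

Capping the number of checks keeps the loop from running indefinitely if a file fails to transcribe.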

Download the Transcription File

Step 8: Navigate back to your S3 Bucket

(see steps 2-4)

Step 9: Select the Transcription Output Folder

(see step 5)

Click on the “Transcription_Output” folder.

Step 10: Select the Output Object

In the Objects section, select the output file whose name reflects the input file’s name and ends with the Word document extension (.docx).

The standard output file naming convention will be:
autotranscription_[bucket name]_Audio_Files_[audio file name]_[audio file format]_[date with 4 random numbers].[docx or json]
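
When the output folder holds many files, the fixed part of this convention can be used to match outputs to a given audio file. The sketch below builds only the prefix up to the date, because the exact format of the date and random numbers is not specified here. The function names and the example names are hypothetical.

```python
from pathlib import PurePosixPath
from typing import Iterable, List


def expected_output_prefix(bucket_name: str, audio_file: str) -> str:
    """Build the part of the output name that precedes the date and random numbers."""
    path = PurePosixPath(audio_file)
    name = path.stem                      # audio file name without its extension
    audio_format = path.suffix.lstrip(".")  # e.g. "mp3"
    return f"autotranscription_{bucket_name}_Audio_Files_{name}_{audio_format}_"


def find_outputs(output_names: Iterable[str], bucket_name: str, audio_file: str) -> List[str]:
    """Return the .docx and .json outputs whose names match the given audio file."""
    prefix = expected_output_prefix(bucket_name, audio_file)
    return [
        name
        for name in output_names
        if name.startswith(prefix) and name.endswith((".docx", ".json"))
    ]
```

For a hypothetical bucket jdoeuchicagoedu-123456789012 and audio file interview01.mp3, the prefix would be autotranscription_jdoeuchicagoedu-123456789012_Audio_Files_interview01_mp3_.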

Step 11: Download and Save Transcription Output

Select Download.

Save the file to your appropriate research storage space.

The complete text of the transcription is not fully visible in the preview within the AWS environment, so downloading the results is highly recommended.

Close Out Project

Step 12: Close Out Project

When all of your files have been transcribed, please contact SSCS at ssdtnt@uchicago.edu. SSCS will then close out the account and delete all files in it. Files cannot be kept in AWS indefinitely, because AWS cannot be used as long-term storage.

Data Deletion Policy

Input and output files in AWS are automatically deleted after 30 days. Please store your original and output files as specified in your IRB protocol.
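
To keep track of the deadline, the deletion date can be estimated with simple date arithmetic. The sketch below assumes the 30 days are counted from the date a file was uploaded, which this guide does not state explicitly, so treat the result as an estimate and download well before it. The function names are hypothetical.

```python
from datetime import date, timedelta

RETENTION_DAYS = 30  # from the deletion policy above


def deletion_date(upload_date: date) -> date:
    """Estimate the deletion date. ASSUMPTION: the clock starts on the upload date."""
    return upload_date + timedelta(days=RETENTION_DAYS)


def days_remaining(upload_date: date, today: date) -> int:
    """Return the estimated number of days left before deletion, never below zero."""
    return max((deletion_date(upload_date) - today).days, 0)


if __name__ == "__main__":
    uploaded = date(2024, 1, 15)  # hypothetical upload date
    print(deletion_date(uploaded), days_remaining(uploaded, date(2024, 2, 1)))
```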

Support

For training or troubleshooting support, please contact SSCS Teaching and Technology. You can contact the T&T team directly at ssdtnt@uchicago.edu.