AWS Transcription: User Guide
Overview
The Social Sciences Division provides a transcription service through Amazon Web Services (AWS).
S3 is the AWS cloud storage service where input and output files are stored. Transcribe is an Amazon service that converts audio files into text files. This pilot version is designed to provide a fast first draft of a transcript. Cleaning and analysis should be completed using other University of Chicago resources.
See the SSCS Research Transcription Service webpage for more information.
Supported Features
Before Beginning: Review Supported Features
Supported Languages
Supported languages and language-specific features – Amazon Transcribe
Supported File Formats
Data input and output – Amazon Transcribe, see Media Formats-Supported Formats
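Before uploading, it can help to confirm that each file has a supported extension. The sketch below is a minimal illustration, not part of the SSCS service. The extension list is an assumption based on the batch formats commonly listed on the Amazon Transcribe page above, so confirm it against that page. The function name is hypothetical.

```python
from pathlib import Path

# Assumed list of batch-transcription media formats; confirm against the
# Amazon Transcribe "Data input and output" page before relying on it.
SUPPORTED_EXTENSIONS = {
    ".amr", ".flac", ".m4a", ".mp3", ".mp4", ".ogg", ".webm", ".wav",
}


def has_supported_extension(filename: str) -> bool:
    """Return True if the file extension is in the assumed supported set."""
    return Path(filename).suffix.lower() in SUPPORTED_EXTENSIONS
```

For example, has_supported_extension("interview01.MP3") returns True, while a Word document would be rejected.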
Supported Naming Conventions
Creating object key names – Amazon Simple Storage Service
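The S3 page above describes a set of characters that are generally safe in object key names: letters, digits, and ! - _ . * ' ( ). As a rough sketch, a file name could be checked against that set before upload. The function name is hypothetical, and the check is deliberately conservative: it flags spaces and other characters that S3 may accept but that can cause trouble in later processing.

```python
import re

# Characters the S3 documentation describes as generally safe in key names:
# letters, digits, and ! - _ . * ' ( )
_SAFE_CHAR = re.compile(r"[A-Za-z0-9!\-_.*'()]")


def unsafe_characters(filename: str) -> list[str]:
    """Return the distinct characters outside the safe set, in order of appearance."""
    seen: list[str] = []
    for ch in filename:
        if not _SAFE_CHAR.fullmatch(ch) and ch not in seen:
            seen.append(ch)
    return seen
```

An empty result means the name uses only the safe characters. Otherwise, consider renaming the file before uploading it.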
Audio File Time Limitations
Audio files longer than three hours will not work in this pilot version. SSCS recommends keeping each audio file to one to two hours at most.
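The limits above can be checked locally before uploading. The following is a minimal sketch using only the Python standard library. It reads the duration of WAV files only; other formats would need a different tool. The function names and the wording of the returned labels are illustrative assumptions.

```python
import wave

MAX_SECONDS = 3 * 60 * 60          # pilot limit: three hours
RECOMMENDED_SECONDS = 2 * 60 * 60  # upper end of the recommended range


def wav_duration_seconds(path: str) -> float:
    """Return the duration of a WAV file in seconds."""
    with wave.open(path, "rb") as wav:
        return wav.getnframes() / wav.getframerate()


def classify_duration(seconds: float) -> str:
    """Compare a duration against the pilot limit and the recommendation."""
    if seconds > MAX_SECONDS:
        return "too long"
    if seconds > RECOMMENDED_SECONDS:
        return "longer than recommended"
    return "ok"
```

A file classified as "too long" would need to be split into shorter segments before upload.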
Multi-Speaker Limitation
The output file will identify speakers in the transcript. The maximum number of identifiable speakers is nine.
Speaker Identification
The Transcribe model labels speakers numerically, starting from zero by default.
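During cleaning, the numeric labels can be replaced with meaningful names. The sketch below assumes the labels appear in the text as spk_0, spk_1, and so on, which is the form used in standard Amazon Transcribe JSON output. The labels in the SSCS output files may look different, so adjust the pattern to match what you see. The function name is hypothetical.

```python
import re


def rename_speakers(text: str, names: dict[int, str]) -> str:
    """Replace labels such as spk_0 with names; unmapped labels are left as-is."""

    def swap(match: re.Match) -> str:
        number = int(match.group(1))
        return names.get(number, match.group(0))

    # Assumed label format: "spk_" followed by the speaker number.
    return re.sub(r"\bspk_(\d+)\b", swap, text)
```

For example, passing {0: "Interviewer", 1: "Participant"} renames the first two speakers and leaves any others untouched.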
Multi-Language Feature
The SSCS transcribe model is built to identify multiple languages. It is not yet known whether there is a limit on the number of languages that can be identified per file.
Custom Features
At this time, the pilot version does not offer the additional custom features that may be listed on the AWS website.
Logging Into Your AWS Account
Step 1: Log in to https://uchicago.awsapps.com/start#/
Enter your CNET credentials, then complete two-factor authentication (2FA).
Step 2: Access the AWS portal
Click on SSD-SSCS-LowRisk
Expand the arrow and click on SSD-SSCS-LowRisk-Users
Upload the Audio File
Step 3: Select Amazon S3 Storage Service
Type S3 in the search bar and select S3 (Scalable storage in the cloud) to open the console.
Step 4: Select your Bucket
AWS uses the term “bucket” in much the same way that a computer file system uses the term “directory.”
Clicking on S3 opens a page that displays a list of buckets. You will not have permission to access buckets other than your own.
You will be given access to a specific bucket created by SSCS. The bucket name follows the general naming convention: [cnet]uchicagoedu-[12 random digits provided by AWS].
This AWS ‘Bucket’ is designated for uploading the audio files related to the respective project.
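When several buckets are listed, the naming convention in Step 4 can be used to recognize your own. The sketch below is illustrative only. It assumes the [cnet] part is your CNetID in lowercase, and the function name is hypothetical.

```python
import re


def looks_like_project_bucket(bucket_name: str, cnet: str) -> bool:
    """Check a bucket name against [cnet]uchicagoedu-[12 digits]."""
    # Assumption: the CNetID appears in lowercase at the start of the name.
    pattern = re.escape(cnet.lower()) + r"uchicagoedu-\d{12}"
    return re.fullmatch(pattern, bucket_name) is not None
```

For example, for the made-up CNetID "jdoe", the name "jdoeuchicagoedu-123456789012" matches, while a name with a different CNetID or fewer digits does not.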
Step 5: View Folders within your Bucket
Your bucket will have an input and output folder.
The general naming convention for the input folder is: “Audio_Files”
The general naming convention for the output folder is: “Transcription_Output”
The “Transcription_Output” folder is populated automatically with the output files once the audio files have been submitted for transcription.
Step 6: Upload Audio Files
In the folder “Audio_Files/” you can click on either of the “Upload” buttons. Note that the second upload option will only appear if there are no files in the folder.
To add files, you can either (1) drag and drop files into the indicated field or (2) click “Add Files,” select your audio files, and then click on “Upload.”
Step 7: Upload Success
A successful upload shows a green checkmark in the “Status” column, under the “Files and Folders” section.
Once the upload is successful, click on the “Close” button to navigate back to the Files and Folders section.
There is no notification that the transcription is in progress. You will have to wait and manually check for status completion.
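Because there is no completion notification, checking amounts to looking at the output folder repeatedly until the file appears. The sketch below shows that polling logic in general terms. It does not connect to AWS itself: list_output_names is a hypothetical function you would supply (for example, one that returns the object names you see in the output folder), and the interval and number of checks are arbitrary assumptions.

```python
import time
from typing import Callable, Iterable, Optional


def wait_for_output(
    list_output_names: Callable[[], Iterable[str]],
    name_fragment: str,
    interval_seconds: float = 60.0,
    max_checks: int = 30,
    sleep: Callable[[float], None] = time.sleep,
) -> Optional[str]:
    """Check the output listing until a name containing name_fragment appears.

    Returns the first matching name, or None if max_checks is reached.
    """
    for attempt in range(max_checks):
        for name in list_output_names():
            if name_fragment in name:
                return name
        # Wait before the next check, but not after the final one.
        if attempt < max_checks - 1:
            sleep(interval_seconds)
    return None
```

The sleep parameter exists only so the waiting can be replaced during testing. In normal use the default is fine.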
Download the Transcription File
Step 8: Navigate back to your S3 Bucket
(see steps 2-4)
Step 9: Select the Transcription Output Folder
(see step 5)
Click on the “Transcription_Output” folder
Step 10: Select the Output Object
In the Objects section, select the output file that reflects the input file’s name and has the Word document (.docx) extension.
The standard output file naming convention will be:
autotranscription_[bucket name]_Audio_Files_[audio file name]_[audio file format]_[date with 4 random numbers].[docx or json]
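When the output folder holds many files, the naming convention above can be used to pick out the outputs for one audio file. The sketch below builds the fixed beginning of the expected name and filters a list of object names with it. It deliberately does not parse the date and random digits, since their exact layout is not specified here. It also assumes the audio file format appears in the output name exactly as in the uploaded file's extension. The function names and example values are hypothetical.

```python
from pathlib import Path


def expected_output_prefix(bucket: str, audio_filename: str) -> str:
    """Build the fixed beginning of the output name for one audio file."""
    audio = Path(audio_filename)
    file_format = audio.suffix.lstrip(".")  # assumed to match the output name as-is
    return f"autotranscription_{bucket}_Audio_Files_{audio.stem}_{file_format}_"


def matching_outputs(object_names, bucket: str, audio_filename: str) -> list[str]:
    """Return the object names that look like outputs for the given audio file."""
    prefix = expected_output_prefix(bucket, audio_filename)
    matches = []
    for name in object_names:
        base = name.rsplit("/", 1)[-1]  # ignore any folder part of the key
        if base.startswith(prefix) and base.endswith((".docx", ".json")):
            matches.append(name)
    return matches
```

Given the bucket "jdoeuchicagoedu-123456789012" and the file "interview01.mp3", the prefix would be "autotranscription_jdoeuchicagoedu-123456789012_Audio_Files_interview01_mp3_".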
Step 11: Download and Save Transcription Output
Select Download
Save the file to your appropriate research storage space.
The complete text from the transcription is not fully visible for preview within the AWS environment. Therefore, it is highly recommended to download the results.
Closeout Project
Step 12: Closeout Project
When all of your files have been transcribed, please contact SSCS at ssdtnt@uchicago.edu. SSCS will then close out the account and delete all files in it. Files cannot be kept in AWS indefinitely, as AWS cannot be used as long-term storage.
Data Deletion Policy
Input and output files in AWS are automatically deleted after 30 days. Please store your original and output files as specified in your IRB protocol.
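To keep track of the 30-day window, the deletion date can be estimated from the upload date. The sketch below assumes the 30 days are counted from the day a file is uploaded. The policy above does not state the starting point, so treat the result as an estimate and download files well ahead of it. The function names are hypothetical.

```python
from datetime import date, timedelta

RETENTION_DAYS = 30  # from the data deletion policy


def estimated_deletion_date(upload_date: date) -> date:
    """Estimate the deletion date, assuming the count starts at upload."""
    return upload_date + timedelta(days=RETENTION_DAYS)


def days_remaining(upload_date: date, today: date) -> int:
    """Return the estimated number of days left, never below zero."""
    return max(0, (estimated_deletion_date(upload_date) - today).days)
```

For a file uploaded on March 1, the estimated deletion date is March 31 of the same year.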
Support
For training or troubleshooting support, please contact SSCS Teaching and Technology. You can contact the T&T team directly at ssdtnt@uchicago.edu.