Building a transcription script for audio and video

Peter Friedlander

After more than a decade of journalism, I have accumulated a rather large archive of audio notes and interviews on a hard drive I bought a few years ago. If I ever needed that material again, it would be a nightmare to search through. We also have old family tapes from the eighties and nineties, which have been converted to digital format. One afternoon, my inner specialist was triggered by a thought: was there an accurate way of transcribing these assets without paying an arm and a leg for thousands of hours' worth of transcription?

I had always used services like Otter and other short-term solutions, but I really wanted three things: automation, accuracy, and a flat-fee or near-free price.

There was the trick of converting the audio files to video, uploading them to YouTube, letting YouTube do its thing, and downloading the timestamped SRT files. But this was a little cumbersome and, as I would find out, less accurate when picking up certain words.

This is where my programming skills came into play. With a little bit of curiosity, I stumbled onto shell scripting, Python, and a repository developed by OpenAI called Whisper.

After several attempts with pre-made tools, including GUI versions like Whisper-mate and Whisper-script, I found my homemade alternative gave me far better results. I was quite happy with it. My wife was puzzled as I tried to explain my discovery, but I loaded a bunch of video and audio files into a directory, let it rip, and it transcribed them without fuss, creating time-coded SRT files. Below is the code. If the terminal reports errors when you run it, you may need to install its dependencies, FFmpeg and Whisper.

My resource: https://github.com/openai/whisper
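
Before running the script, it can help to confirm that both dependencies are actually on your PATH. The check below is only a sketch: the helper name `missing_deps` is made up for illustration, and the install commands it prints are the usual ones (pip for Whisper, as in the Whisper README; Homebrew or apt for FFmpeg), which may differ on your system.

```shell
#!/bin/bash

# Print the name of every command in the argument list that is not on PATH.
# (missing_deps is a hypothetical helper, not part of Whisper or FFmpeg.)
missing_deps() {
    local cmd
    for cmd in "$@"; do
        if ! command -v "$cmd" >/dev/null 2>&1; then
            echo "$cmd"
        fi
    done
}

missing=$(missing_deps ffmpeg whisper)
if [ -n "$missing" ]; then
    # $missing is left unquoted on purpose so the names print on one line
    echo "Missing dependencies:" $missing
    echo "Whisper: pip install -U openai-whisper"
    echo "FFmpeg:  brew install ffmpeg (macOS) or sudo apt install ffmpeg (Debian/Ubuntu)"
fi
```

If nothing is printed, both tools were found.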

#!/bin/bash

# Prompt the user for the input directory
read -r -p "Please enter the directory where your audio/video files are located: " input_dir

# Prompt the user for the output directory
read -r -p "Please enter the directory where you want the Whisper output to be saved: " output_dir

# Check if the input directory exists
if [ ! -d "$input_dir" ]; then
    echo "Input directory does not exist. Exiting..."
    exit 1
fi

# Check if the output directory exists, if not, create it
if [ ! -d "$output_dir" ]; then
    echo "Output directory does not exist. Creating it..."
    mkdir -p "$output_dir"
fi

# Scratch directory for converted audio, removed when the script finishes
tmp_dir=$(mktemp -d)

# Search for audio/video files in the provided directory
find "$input_dir" -type f \( -iname "*.mp4" -o -iname "*.wav" -o -iname "*.flac" -o -iname "*.mp3" -o -iname "*.amr" -o -iname "*.m4a" \) -print0 | while IFS= read -r -d '' file; do
    echo "PROCESSING: $file"

    # Check if the file is already an MP3
    if [[ "$file" != *.mp3 ]]; then
        # Name the temporary MP3 after the source file so that each
        # transcript gets its own SRT instead of overwriting the last one
        base=$(basename "$file")
        temp_mp3="$tmp_dir/${base%.*}.mp3"

        echo "CONVERTING: $file"
        echo "TEMP MP3: $temp_mp3"

        # Convert to MP3 using ffmpeg (-nostdin keeps ffmpeg from
        # swallowing the file list piped into this loop)
        ffmpeg -nostdin -y -i "$file" -q:a 0 -map a "$temp_mp3"

        # Pass the temporary MP3 file to whisper
        whisper "$temp_mp3" --language English --output_format srt --threads 4 --output_dir "$output_dir"

        # Remove the temporary MP3 file after whisper is done
        rm -f "$temp_mp3"
    else
        # If already an MP3, use it directly
        whisper "$file" --language English --output_format srt --threads 4 --output_dir "$output_dir"
    fi
done

# Clean up the scratch directory
rm -rf "$tmp_dir"
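
Once the SRT files exist, the archive is no longer a nightmare to search. A minimal sketch of how that could look, assuming the transcripts sit in the output directory chosen above: `search_transcripts` is a hypothetical helper name of my own, and it simply runs a case-insensitive `grep` over every `.srt` file, printing each match together with the line before it (in an SRT file, that is normally the timestamp).

```shell
#!/bin/bash

# Search every .srt file under a directory for a word or phrase.
# -i ignores case, -H prints the file name, and -B 1 also prints the
# line before each match, which in an SRT file is usually the timestamp.
# (search_transcripts is a hypothetical helper name.)
search_transcripts() {
    local dir=$1 term=$2
    find "$dir" -type f -iname "*.srt" -exec grep -i -H -B 1 -- "$term" {} +
}

# Run only when the script is given a directory and a search term
if [ "$#" -eq 2 ]; then
    search_transcripts "$1" "$2"
fi
```

Saved under a name of your choosing, it would be called with the transcript directory and the search term as its two arguments.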

