Three Ways to Pull a Transcript from Any YouTube Video
April 18, 2026

Yesterday I gave Claude a YouTube URL for a 55-minute interview and asked for the transcript. About twenty seconds later I had it. Clean, timestamped, ready to mine for teaching material.

So I want to walk through how that actually works. There are three methods I reach for, and which one wins depends on what is going on with the video.

The Old Way

Before any of this, you had a few not-great options. Sit and watch the whole thing taking notes. Pay a transcription service. Or copy-paste out of YouTube's built-in transcript panel, which dumps a wall of text with no paragraph breaks and timestamps mashed into every line. Fine for a one-off. Super painful for anything you want to do over and over.

Method 1: yt-dlp (What You Will Use Most Days)

yt-dlp is an open-source command-line tool that can pull just about anything off YouTube. Thumbnails, captions, audio, video, metadata, playlists. The community maintains it, and they ship fixes super fast when YouTube tries to break things.

When a video has auto-captions ready, which is almost always for anything more than a few hours old, this command pulls the transcript without downloading the video at all:

run this in your terminal
cd /tmp && yt-dlp --write-auto-subs --sub-lang en --sub-format vtt \
  --skip-download -o "yt-%(id)s" "https://www.youtube.com/watch?v=VIDEO_ID"

That writes a VTT file. VTT is the caption format used across the web, so it is plain text with timestamps. There is one annoying quirk you need to know about though. YouTube's auto-captions have rolling overlap. Each caption block repeats words from the previous one so the text slides onto the screen smoothly during playback. For reading, that produces heavy repetition, so you have to deduplicate.

That cleanup is a short Python function. Mine groups the captions into sixty-second paragraphs, strips inline styling tags, and skips repeated lines using a seen set. Thirty lines of code, and the output is clean timestamped paragraphs.

If this is your first time, expect five or ten minutes to install yt-dlp (brew install yt-dlp on a Mac, pipx install yt-dlp on Linux) and test it on a random video. Every run after that is seconds.
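If you end up calling this from a script instead of the shell, the same command wraps cleanly in a few lines of Python (a sketch; the helper names are mine, and yt-dlp must already be on your PATH):

```python
import subprocess

def auto_subs_command(video_id, out_dir="/tmp", lang="en"):
    """Build the yt-dlp invocation above as an argument list."""
    return [
        "yt-dlp",
        "--write-auto-subs", "--sub-lang", lang, "--sub-format", "vtt",
        "--skip-download",
        "-o", f"{out_dir}/yt-%(id)s",
        f"https://www.youtube.com/watch?v={video_id}",
    ]

def fetch_auto_subs(video_id, **kwargs):
    """Run it; on success yt-dlp writes out_dir/yt-<id>.<lang>.vtt."""
    subprocess.run(auto_subs_command(video_id, **kwargs), check=True)
```

Passing the argument list directly to subprocess, rather than a shell string, sidesteps quoting problems with URLs that contain `&` or `?`.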

Method 2: youtube-transcript-api (When yt-dlp Has a Bad Day)

Every so often YouTube rolls out a player update that temporarily breaks yt-dlp's extraction. The community usually ships a fix within a day or two, but if you need a transcript right now, the youtube-transcript-api Python library is the next reach.

run this in your terminal
pip install youtube-transcript-api

then in a Python session
from youtube_transcript_api import YouTubeTranscriptApi
transcript = YouTubeTranscriptApi.get_transcript("VIDEO_ID")

That returns a list of dictionaries with text, start, and duration for every caption block. Same underlying data, different fetching mechanism. So when one method is broken, the other usually still works.
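Those dictionaries flatten into readable timestamped text in a few lines (a sketch assuming the classic get_transcript output shown above; note that the library's newer 1.x releases moved to an instance-based fetch method, whose result converts back to this same list-of-dicts shape):

```python
def to_timestamped_text(entries):
    """Render youtube-transcript-api entries as one [mm:ss]-prefixed line each."""
    lines = []
    for entry in entries:
        minutes, seconds = divmod(int(entry["start"]), 60)
        lines.append(f"[{minutes:02d}:{seconds:02d}] {entry['text']}")
    return "\n".join(lines)
```

From there the same sixty-second paragraph grouping from Method 1 applies if you want prose rather than one line per caption block.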

Both methods share one downside: YouTube throttles per IP, somewhere in the neighborhood of 400 to 500 requests a day before things slow down. Not an issue from your personal machine. A real issue if you try to batch-process hundreds of videos from a cloud server.
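If you do need to batch from one machine, pacing requests against that rough daily budget is simple enough (a sketch; the 400-per-day figure is the estimate above, not a documented limit, and the helper names are mine):

```python
import time

def fetch_batch(video_ids, fetch, daily_budget=400, sleep=time.sleep):
    """Fetch transcripts while pacing requests under a per-IP daily budget.

    `fetch` is whatever single-video function you use (a yt-dlp subprocess,
    a youtube-transcript-api call, ...). `sleep` is injectable for testing.
    """
    delay = 86400 / daily_budget  # ~216 seconds between requests at 400/day
    results = {}
    for i, video_id in enumerate(video_ids):
        if i:
            sleep(delay)  # wait between requests, not before the first one
        results[video_id] = fetch(video_id)
    return results
```

Spreading a batch across a day like this keeps a scrape under the throttle instead of burning the whole budget in the first hour.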

Method 3: Whisper (When There Are No Captions)

Okay, so what if the video has captions disabled? Or it was uploaded five minutes ago and YouTube has not generated them yet? For long videos that generation can take 30 to 45 minutes.

This is where Whisper comes in. Whisper is OpenAI's speech-to-text model, and it runs entirely on your own machine. No API keys, no usage limits, no internet connection required. You download the audio, feed it to Whisper, and it does its own transcription from scratch.

run this in your terminal
yt-dlp -x --audio-format mp3 -o "/tmp/%(id)s.%(ext)s" "https://www.youtube.com/watch?v=VIDEO_ID"
whisper /tmp/VIDEO_ID.mp3 --model base --output_format txt --output_dir /tmp/

Whisper uses ffmpeg under the hood, which means it handles anything ffmpeg can decode: MP3, MP4, M4A, WAV, FLAC, OGG, WebM, MOV, AVI, MKV. If it plays, Whisper can read it. I extract the audio to MP3 first just for speed, since a 55-minute MP4 is roughly ten times the size of the same audio and all that extra video data would have to be decoded for no reason.

Model sizes matter for accuracy. base is the fastest and fine for most things. small and medium get more accurate. large-v3 is the most accurate and the slowest. On my Mac Studio's GPU, a 55-minute audio file on the base model takes a few minutes.

First-time setup for Whisper is a little longer than yt-dlp's. Between installing ffmpeg, installing Whisper, and downloading a model, expect maybe 15 to 20 minutes. Once everything is in place, a single transcription is one line in the terminal.
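For scripting the whole pipeline, the two commands above can be built and run from Python as well (a sketch; the helper names are mine, and both yt-dlp and Whisper must already be installed):

```python
import subprocess

def whisper_pipeline(url, video_id, model="base", tmp="/tmp"):
    """Return the download and transcribe steps above as argument lists."""
    download = ["yt-dlp", "-x", "--audio-format", "mp3",
                "-o", f"{tmp}/%(id)s.%(ext)s", url]
    transcribe = ["whisper", f"{tmp}/{video_id}.mp3", "--model", model,
                  "--output_format", "txt", "--output_dir", tmp]
    return download, transcribe

def transcribe_video(url, video_id, **kwargs):
    """Run both steps; Whisper leaves a .txt next to the audio file."""
    for cmd in whisper_pipeline(url, video_id, **kwargs):
        subprocess.run(cmd, check=True)
```

Swapping `model="base"` for `"small"` or `"large-v3"` trades speed for accuracy exactly as the CLI flags do.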

One note on legality, since this step downloads the actual audio. YouTube's Terms of Service do not allow downloading unless they show you a button for it, which they only do for YouTube Premium. Copyright is a separate question. Your own videos, or someone else's with their permission, is fine. Transcribing someone else's content for your own private research or notes has a fair-use argument in the US. Redistributing or commercially using that audio is clearly infringement. Be thoughtful about what you pull down and why.

Which One To Reach For

Default: yt-dlp. If it hits a wall, try youtube-transcript-api. If the video has no captions and you cannot wait for YouTube to generate them, Whisper.

Most days you will only touch the first one. The other two exist because the edge cases happen, and when they do it is super nice to have options.

All three of these are free and open source. yt-dlp lives at github.com/yt-dlp/yt-dlp, youtube-transcript-api is at pypi.org/project/youtube-transcript-api, and Whisper is at github.com/openai/whisper. Credit to the maintainers of all three. They do a lot of quiet work so the rest of us can pull a transcript in one command.


This article blends original content, AI-assisted drafting, and human oversight.
