How to Get YouTube, TikTok & Instagram Transcripts with One API (2026)
Pull spoken text from public YouTube, TikTok, and Instagram Reels as clean JSON — caption tracks, auto-generated speech, optional AI fallback for uncaptioned TikToks, and patterns for batch jobs and RAG pipelines.
Video does not embed well. Search engines cannot index spoken words inside a Reel, and most LLM context windows are wasted if you paste a URL and hope the model watched the clip. Transcripts fix that: they turn audio into text you can quote, grep, chunk, and score.
The catch is that YouTube, TikTok, and Instagram each store captions differently — different formats, different fallbacks, different empty states. Social Fetch gives you three GET endpoints behind one auth header. Your job is picking the right one per URL and normalizing the output for whatever comes next (a spreadsheet, a vector index, a summarization job).
The short version
GET /v1/youtube/videos/transcript, GET /v1/tiktok/videos/transcript, and GET /v1/instagram/posts/transcript. Pass a public video URL, read data.lookupStatus, extract text. TikTok supports useAiFallback=true when no caption track exists.
You'll need an API key and curl or the TypeScript SDK. New here? Start with the Quickstart or try lookups in the Playground.
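Picking the endpoint per URL can be reduced to a hostname check. The following is a minimal sketch under stated assumptions: `detectPlatform` and `transcriptRequestUrl` are illustrative helper names, not part of the SDK, and whether the API accepts `youtu.be` short links is an assumption worth verifying against the reference.

```typescript
type Platform = "youtube" | "tiktok" | "instagram";

// Endpoint paths as listed in this guide.
const TRANSCRIPT_PATHS: Record<Platform, string> = {
  youtube: "/v1/youtube/videos/transcript",
  tiktok: "/v1/tiktok/videos/transcript",
  instagram: "/v1/instagram/posts/transcript",
};

// Hypothetical helper: map a public video URL to a platform by hostname.
function detectPlatform(rawUrl: string): Platform | null {
  let host: string;
  try {
    host = new URL(rawUrl).hostname.toLowerCase();
  } catch {
    return null; // not a parseable URL
  }
  const matches = (domain: string) =>
    host === domain || host.endsWith("." + domain);
  // Assumption: youtu.be short links are accepted by the YouTube endpoint.
  if (matches("youtube.com") || matches("youtu.be")) return "youtube";
  if (matches("tiktok.com")) return "tiktok";
  if (matches("instagram.com")) return "instagram";
  return null;
}

// Hypothetical helper: build the full GET URL with an encoded `url` param.
function transcriptRequestUrl(rawUrl: string): string | null {
  const platform = detectPlatform(rawUrl);
  if (!platform) return null;
  const endpoint = new URL(
    TRANSCRIPT_PATHS[platform],
    "https://api.socialfetch.dev",
  );
  endpoint.searchParams.set("url", rawUrl);
  return endpoint.toString();
}
```

Returning `null` for unknown hosts lets a batch job log and skip unsupported links instead of spending a request on them.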
What each platform actually gives you
Before you wire a pipeline, know what comes back — the shapes are not identical:
| Platform | Endpoint | Transcript shape | Timestamps | No-speech fallback |
|---|---|---|---|---|
| YouTube | /v1/youtube/videos/transcript | segments[] + plainText + language | Millisecond offsets per segment | None — transcript may be null |
| TikTok | /v1/tiktok/videos/transcript | WebVTT string in transcript.content | In the VTT cues | useAiFallback=true (+10 credits) |
| Instagram | /v1/instagram/posts/transcript | Plain text per row in transcripts[] | None in the response | None — text: null when no speech detected |
All three share data.lookupStatus, meta.creditsCharged, and meta.requestId. That is enough to build one batch worker; you just need a thin normalization layer on top (covered below).
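Because those three fields are common, a batch worker can validate and log any response before it looks at platform-specific transcript fields. A rough sketch, assuming only the field names shown in this guide; `Envelope`, `isEnvelope`, and `toAuditRow` are hypothetical names introduced for illustration:

```typescript
// Shared part of every transcript response, per the field names above.
type Envelope = {
  data: { lookupStatus: string };
  meta: { creditsCharged: number; requestId: string };
};

function isRecord(value: unknown): value is Record<string, unknown> {
  return typeof value === "object" && value !== null;
}

// Runtime check that a parsed JSON body carries the shared fields.
function isEnvelope(body: unknown): body is Envelope {
  if (!isRecord(body)) return false;
  const data = body.data;
  const meta = body.meta;
  if (!isRecord(data) || !isRecord(meta)) return false;
  return (
    typeof data.lookupStatus === "string" &&
    typeof meta.creditsCharged === "number" &&
    typeof meta.requestId === "string"
  );
}

type AuditRow = {
  platform: string;
  url: string;
  lookupStatus: string;
  creditsCharged: number;
  requestId: string;
};

// Platform-agnostic log row; returns null when the body is not an envelope.
function toAuditRow(
  platform: string,
  url: string,
  body: unknown,
): AuditRow | null {
  if (!isEnvelope(body)) return null;
  return {
    platform,
    url,
    lookupStatus: body.data.lookupStatus,
    creditsCharged: body.meta.creditsCharged,
    requestId: body.meta.requestId,
  };
}
```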
YouTube transcript
YouTube is the easy case. The UI has a transcript panel; the API returns structured segments you can use as-is or flatten.
Pass a public watch URL:
curl -sS \
-H "x-api-key: $SOCIALFETCH_API_KEY" \
-G "https://api.socialfetch.dev/v1/youtube/videos/transcript" \
--data-urlencode "url=https://www.youtube.com/watch?v=dQw4w9WgXcQ"
Full parameters: YouTube transcript reference.
A successful lookup with captions looks like this:
{
"data": {
"lookupStatus": "found",
"video": {
"id": "dQw4w9WgXcQ",
"url": "https://www.youtube.com/watch?v=dQw4w9WgXcQ"
},
"transcript": {
"segments": [
{ "text": "We're no strangers to", "startMs": 18800, "endMs": 25960 }
],
"plainText": "We're no strangers to\nlove. You know the rules...",
"language": "English"
}
},
"meta": {
"requestId": "req_01example",
"creditsCharged": 1,
"version": "v1"
}
}
Fields worth remembering:
- transcript.plainText: ready for embeddings or summarization without parsing.
- transcript.segments: use when you need clip boundaries ("quote the section at 2:14").
- transcript.language: human-readable label from the lookup; pair with the language query param when you request a specific track.
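To make the clip-boundary case concrete, here is a small sketch that finds the segment covering a given timestamp. It assumes segments shaped like the example response (`text`, `startMs`, `endMs`); `timestampToMs` and `segmentAt` are illustrative helpers, not SDK functions.

```typescript
type Segment = { text: string; startMs: number; endMs: number };

// Convert "m:ss" or "h:mm:ss" into milliseconds.
function timestampToMs(stamp: string): number {
  const parts = stamp.split(":").map(Number);
  if (
    parts.length < 2 ||
    parts.length > 3 ||
    parts.some((n) => !Number.isFinite(n))
  ) {
    throw new Error(`Bad timestamp: ${stamp}`);
  }
  // Fold hours/minutes/seconds into a total number of seconds.
  return parts.reduce((total, part) => total * 60 + part, 0) * 1000;
}

// Return the segment whose [startMs, endMs) window contains the offset.
function segmentAt(segments: Segment[], ms: number): Segment | undefined {
  return segments.find((s) => ms >= s.startMs && ms < s.endMs);
}
```

A call such as `segmentAt(segments, timestampToMs("2:14"))` returns `undefined` when no segment covers that moment, which is worth handling because caption tracks can have gaps.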
YouTube can also return lookupStatus: "found" with transcript: null — the video resolved, but no caption track exists (common on music-only uploads or videos where the creator disabled captions). Treat that as "no text available," not as an error.
| What | Cost |
|---|---|
| Base YouTube transcript lookup | 1 credit |
TikTok transcript
TikTok has no transcript export in the app. Auto-captions show on screen during playback, but there is nothing to copy — and a large share of videos have no captions at all. The API returns WebVTT when a caption track exists.
const videoUrl = "https://www.tiktok.com/@mrbeast/video/7596844935442189598";
const response = await fetch(
`https://api.socialfetch.dev/v1/tiktok/videos/transcript?url=${encodeURIComponent(videoUrl)}`,
{
headers: {
"x-api-key": process.env.SOCIALFETCH_API_KEY,
},
}
);
const body = await response.json();
console.log(response.status, body);
See the dedicated TikTok transcript guide for a full walkthrough (WebVTT parsing, screenshots, legal notes). Reference: TikTok transcript.
| What | Cost |
|---|---|
| Base TikTok transcript lookup | 1 credit |
| With useAiFallback=true | +10 credits (11 total on a completed lookup) |
When TikTok has no captions
If the creator never enabled captions, there is nothing to scrape. Enable AI fallback and the audio is transcribed instead:
const videoUrl = "https://www.tiktok.com/@mrbeast/video/7596844935442189598";
const response = await fetch(
`https://api.socialfetch.dev/v1/tiktok/videos/transcript?url=${encodeURIComponent(videoUrl)}&useAiFallback=true`,
{
headers: {
"x-api-key": process.env.SOCIALFETCH_API_KEY,
},
}
);
const body = await response.json();
console.log(response.status, body);
Use fallback when you need text from every video in a batch. Skip it when you only want existing caption tracks and would rather drop uncaptioned clips.
Reading the TikTok response
{
"data": {
"lookupStatus": "found",
"video": {
"id": "7596844935442189598",
"url": "https://www.tiktok.com/@mrbeast/video/7596844935442189598"
},
"transcript": {
"format": "webvtt",
"content": "WEBVTT\n\n00:00:00.060 --> 00:00:03.100\nThis is the world's largest LED floor.\n\n00:00:03.101 --> 00:00:09.433\nAnd now it's the world's largest green screen."
}
},
"meta": {
"requestId": "req_01example",
"creditsCharged": 1,
"version": "v1"
}
}
Strip WebVTT timing cues when you need flat text:
import { SocialFetchClient } from "@socialfetch/sdk";
const client = new SocialFetchClient({
apiKey: process.env.SOCIALFETCH_API_KEY!,
});
const result = await client.tiktok.getVideoTranscript({
url: "https://www.tiktok.com/@mrbeast/video/7596844935442189598",
});
if (!result.ok) {
console.error(result.error.code, result.error.requestId);
process.exit(1);
}
const vtt = result.value.data.transcript?.content ?? "";
const plainText = vtt
.replace(/\r\n/g, "\n")
.split("\n")
.map((line) => line.trim())
.filter(
(line) =>
line &&
line !== "WEBVTT" &&
!line.includes("-->") &&
!/^\d+$/.test(line),
)
.join(" ");
console.log(plainText);
Instagram Reels transcript
Instagram Reels and video posts expose speech as plain text — no WebVTT, no segment array. Carousel posts can return multiple transcripts[] rows (one per video item in the carousel).
curl -sS \
-H "x-api-key: $SOCIALFETCH_API_KEY" \
-G "https://api.socialfetch.dev/v1/instagram/posts/transcript" \
--data-urlencode "url=https://www.instagram.com/reel/DHsD6HGqJhp/"
Reference: Instagram post transcript.
{
"data": {
"lookupStatus": "found",
"post": {
"url": "https://www.instagram.com/reel/DHsD6HGqJhp/"
},
"transcripts": [
{
"id": "3597267389859272809",
"shortcode": "DHsD6HGqJhp",
"text": "Let's fry up the perfect Banh Xeo. Beautiful. Everybody..."
}
]
},
"meta": {
"requestId": "req_01example",
"creditsCharged": 1,
"version": "v1"
}
}
For carousels, loop data.transcripts and concatenate non-null text values. When text is null, the video resolved but no speech was detected (silent Reel, music-only, or very short clip).
| What | Cost |
|---|---|
| Base Instagram transcript lookup | 1 credit |
Instagram has no useAiFallback flag — you get whatever speech detection returns.
Captions vs auto-generated speech
These terms get mixed up constantly. Here is how they map to API behavior:
| Term | What it means on the platform | What you get from the API |
|---|---|---|
| Manual captions | Creator-uploaded or edited SRT/VTT | Returned like any other track — YouTube/TikTok do not label "manual" vs "auto" in the response |
| Auto-generated captions | Platform speech-to-text (YouTube "auto", TikTok on-screen captions) | Same endpoints — you receive the best available track |
| No captions | Creator disabled them, or TikTok never generated them | YouTube: transcript: null. TikTok: not_found unless AI fallback. Instagram: text: null |
You cannot request "manual only" or "auto only." Pass language when multiple tracks exist and inspect transcript.language on YouTube. For quality-sensitive workflows (legal review, quote verification), spot-check a sample against the source video — auto-generated tracks mishear proper nouns and punctuation routinely.
Language handling
YouTube and TikTok accept an optional two-letter language query parameter (ISO 639-1) to prefer a specific track:
# YouTube — prefer Spanish when available
curl -sS \
-H "x-api-key: $SOCIALFETCH_API_KEY" \
-G "https://api.socialfetch.dev/v1/youtube/videos/transcript" \
--data-urlencode "url=https://www.youtube.com/watch?v=dQw4w9WgXcQ" \
--data-urlencode "language=es"
TikTok language selection:
const videoUrl = "https://www.tiktok.com/@mrbeast/video/7596844935442189598";
const response = await fetch(
`https://api.socialfetch.dev/v1/tiktok/videos/transcript?url=${encodeURIComponent(videoUrl)}&language=en`,
{
headers: {
"x-api-key": process.env.SOCIALFETCH_API_KEY,
},
}
);
const body = await response.json();
console.log(response.status, body);
If the requested language is not available, the lookup returns whatever track exists, so check the response body rather than assuming the parameter was honored. Instagram has no language parameter; the returned text is in whatever language was spoken.
For multilingual RAG indexes, store language (from YouTube's field or your own detector) on every document chunk so retrievers can filter by locale.
Normalize to one document shape
Production pipelines rarely want three different parsers in every downstream job. Map each platform response to an internal TranscriptDoc:
type TranscriptDoc = {
platform: "youtube" | "tiktok" | "instagram";
url: string;
videoId: string;
plainText: string;
language?: string;
segments?: { text: string; startMs: number; endMs: number }[];
lookupStatus: string;
creditsCharged: number;
requestId: string;
};
function vttToPlainText(vtt: string): string {
return vtt
.replace(/\r\n/g, "\n")
.split("\n")
.map((line) => line.trim())
.filter(
(line) =>
line &&
line !== "WEBVTT" &&
!line.includes("-->") &&
!/^\d+$/.test(line),
)
.join(" ");
}
function normalizeYouTube(body: {
data: {
lookupStatus: string;
video: { id: string; url: string } | null;
transcript: {
plainText: string;
language: string;
segments: { text: string; startMs: number; endMs: number }[];
} | null;
};
meta: { creditsCharged: number; requestId: string };
}): TranscriptDoc | null {
if (!body.data.video) return null;
return {
platform: "youtube",
url: body.data.video.url,
videoId: body.data.video.id,
plainText: body.data.transcript?.plainText ?? "",
language: body.data.transcript?.language,
segments: body.data.transcript?.segments,
lookupStatus: body.data.lookupStatus,
creditsCharged: body.meta.creditsCharged,
requestId: body.meta.requestId,
};
}
function normalizeTikTok(body: {
data: {
lookupStatus: string;
video: { id: string; url: string } | null;
transcript: { content: string } | null;
};
meta: { creditsCharged: number; requestId: string };
}): TranscriptDoc | null {
if (!body.data.video) return null;
const vtt = body.data.transcript?.content ?? "";
return {
platform: "tiktok",
url: body.data.video.url,
videoId: body.data.video.id,
plainText: vttToPlainText(vtt),
lookupStatus: body.data.lookupStatus,
creditsCharged: body.meta.creditsCharged,
requestId: body.meta.requestId,
};
}
function normalizeInstagram(body: {
data: {
lookupStatus: string;
post: { url: string } | null;
transcripts: { id: string; text: string | null }[];
};
meta: { creditsCharged: number; requestId: string };
}): TranscriptDoc | null {
if (!body.data.post) return null;
const plainText = body.data.transcripts
.map((row) => row.text)
.filter((t): t is string => Boolean(t))
.join("\n\n");
return {
platform: "instagram",
url: body.data.post.url,
videoId: body.data.transcripts[0]?.id ?? "",
plainText,
lookupStatus: body.data.lookupStatus,
creditsCharged: body.meta.creditsCharged,
requestId: body.meta.requestId,
};
}
One normalizer per platform, one schema for everything downstream.
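The TikTok normalizer above leaves `segments` unset even though the WebVTT cues carry timing. If you want timestamps for TikTok too, a cue parser along these lines could fill that field. This is a simplified sketch, not a full WebVTT implementation: it assumes blank-line-separated cue blocks like the sample response, and `vttToSegments` and `vttTimeToMs` are illustrative names.

```typescript
type Segment = { text: string; startMs: number; endMs: number };

// WebVTT timestamps look like "hh:mm:ss.ttt" or "mm:ss.ttt".
function vttTimeToMs(stamp: string): number {
  const [clock = "0", millis = "0"] = stamp.trim().split(".");
  const seconds = clock
    .split(":")
    .map(Number)
    .reduce((total, part) => total * 60 + part, 0);
  return seconds * 1000 + Number(millis.padEnd(3, "0").slice(0, 3));
}

function vttToSegments(vtt: string): Segment[] {
  const segments: Segment[] = [];
  // Cue blocks are separated by one or more blank lines.
  const blocks = vtt.replace(/\r\n/g, "\n").split(/\n{2,}/);
  for (const block of blocks) {
    const lines = block
      .split("\n")
      .map((line) => line.trim())
      .filter(Boolean);
    const timingIndex = lines.findIndex((line) => line.includes("-->"));
    if (timingIndex === -1) continue; // header, NOTE, or STYLE block
    const [start = "", rest = ""] = (lines[timingIndex] ?? "").split("-->");
    // Drop optional cue settings after the end time (e.g. "align:start").
    const end = rest.trim().split(/\s+/)[0] ?? "";
    // Lines before the timing line are cue identifiers and are skipped.
    const text = lines.slice(timingIndex + 1).join(" ");
    if (text) {
      segments.push({
        text,
        startMs: vttTimeToMs(start),
        endMs: vttTimeToMs(end),
      });
    }
  }
  return segments;
}
```

Passing the result as `segments` in `normalizeTikTok` would give TikTok and YouTube documents the same timed shape downstream.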
Batch processing
Research and repurposing jobs usually start with a list of URLs — a CSV export, a search result, a creator's recent uploads. Fetch them with bounded concurrency so you do not hammer the API or your own rate limits:
import { SocialFetchClient } from "@socialfetch/sdk";
const client = new SocialFetchClient({
apiKey: process.env.SOCIALFETCH_API_KEY!,
});
type Job = { platform: "youtube" | "tiktok" | "instagram"; url: string };
const jobs: Job[] = [
{ platform: "youtube", url: "https://www.youtube.com/watch?v=dQw4w9WgXcQ" },
{
platform: "tiktok",
url: "https://www.tiktok.com/@mrbeast/video/7596844935442189598",
},
{
platform: "instagram",
url: "https://www.instagram.com/reel/DHsD6HGqJhp/",
},
];
async function fetchTranscript(job: Job) {
if (job.platform === "youtube") {
return client.youtube.getVideoTranscript({ url: job.url });
}
if (job.platform === "tiktok") {
return client.tiktok.getVideoTranscript({
url: job.url,
useAiFallback: true, // set false to skip uncaptioned videos
});
}
return client.instagram.getPostTranscript({ url: job.url });
}
const CONCURRENCY = 5;
const results: TranscriptDoc[] = [];
for (let i = 0; i < jobs.length; i += CONCURRENCY) {
const batch = jobs.slice(i, i + CONCURRENCY);
const settled = await Promise.all(batch.map(fetchTranscript));
for (const result of settled) {
if (!result.ok) {
console.error(result.error.code, result.error.requestId);
continue;
}
const doc =
result.value.data.lookupStatus === "found"
? normalizeFromEnvelope(result.value) // wire your normalizers here
: null;
if (doc?.plainText) results.push(doc);
}
}
console.log(results.length, "transcripts ready");
Practical batch tips:
- Budget credits upfront. TikTok with AI fallback costs 11 credits per completed lookup. A thousand-video batch is not a rounding error — multiply before you run.
- Persist requestId on every row. When a transcript looks wrong, support can trace the exact lookup.
- Skip empty text, not failed HTTP. lookupStatus: "not_found" still returns HTTP 200 and is charged. Filter in application logic.
- Discover URLs first. Pair transcript fetches with profile/video listing endpoints (TikTok profile videos, YouTube channel) so you are not hand-curating links.
Chunking for RAG and LLMs
Once you have plainText, the embedding step is standard — but video transcripts benefit from a few conventions:
- Chunk by sentence, not arbitrary character count. Caption segments often break mid-phrase; rejoin into sentences before splitting.
- Target 300–600 tokens per chunk with ~50-token overlap. Short-form video (TikTok, Reels) may fit in one chunk; long YouTube uploads need many.
- Attach metadata on every chunk: platform, url, videoId, language, startMs/endMs when you have segments.
- Store the raw timed version separately. Retrieval uses plain text; clip generation and citation UI need timestamps from YouTube segments or TikTok WebVTT.
function chunkForRag(
doc: TranscriptDoc,
maxChars = 2000,
overlapChars = 200,
): { text: string; metadata: Record<string, string> }[] {
const sentences = doc.plainText
.replace(/\s+/g, " ")
.split(/(?<=[.!?])\s+/)
.filter(Boolean);
const chunks: string[] = [];
let current = "";
for (const sentence of sentences) {
if (current.length + sentence.length > maxChars && current) {
chunks.push(current.trim());
current = current.slice(-overlapChars) + " " + sentence;
} else {
current += (current ? " " : "") + sentence;
}
}
if (current.trim()) chunks.push(current.trim());
return chunks.map((text, index) => ({
text,
metadata: {
platform: doc.platform,
url: doc.url,
videoId: doc.videoId,
chunkIndex: String(index),
...(doc.language ? { language: doc.language } : {}),
},
}));
}
For summarization (not retrieval), skip chunking: pass the full plainText in one prompt, or summarize per-chunk and merge. Agent workflows can call transcript endpoints via MCP while building; see Social Fetch with Cursor & Claude.
Billing and lookup status
Every response includes data.lookupStatus and meta.creditsCharged. Branch on status in code — HTTP 200 does not mean you got text.
| Status | Meaning | Charged? |
|---|---|---|
| found | Lookup resolved; check transcript fields for actual content | Yes |
| not_found | Video/post not reachable or no transcript path (platform-dependent) | Yes — upstream ran |
| lookup_failed | Infrastructure could not complete the lookup | No |
| 503 | Temporary unavailability | No |
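The table can be folded into one branching helper so downstream code never treats HTTP 200 as "has text". This is a sketch under assumptions: `classifyLookup` and the `Outcome` labels are invented for illustration, and treating `lookup_failed` and 503 as retryable is a judgment call based on their being uncharged, not documented API behavior.

```typescript
type Outcome = "text" | "empty" | "not_found" | "retry" | "unknown";

// `text` is the already-extracted plain text for the platform
// (plainText, flattened WebVTT, or joined Instagram rows).
function classifyLookup(
  httpStatus: number,
  lookupStatus: string | undefined,
  text: string | null | undefined,
): Outcome {
  // Uncharged failures: assumed safe to retry later.
  if (httpStatus === 503 || lookupStatus === "lookup_failed") return "retry";
  // Charged, but nothing to read.
  if (lookupStatus === "not_found") return "not_found";
  if (lookupStatus === "found") {
    // "found" can still mean transcript: null or text: null.
    return text && text.trim() ? "text" : "empty";
  }
  return "unknown";
}
```

Keeping "empty" separate from "not_found" makes it easy to count how many charged lookups resolved a video but produced no usable text.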
YouTube-specific edge case: found + transcript: null means the video exists but has no captions — you are charged, and you need another source for text.
TikTok-specific: useAiFallback adds 10 credits only on completed lookups where fallback ran. Pre-send validation errors are free.
Details: Credits & billing.
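A quick way to budget a run is to compute a base figure and an upper bound from the job list before sending anything. The sketch below uses the credit costs quoted in this guide (1 credit per lookup, +10 when TikTok AI fallback runs); `estimateCredits` is a hypothetical helper, and actual charges can be lower because `lookup_failed` responses are not billed.

```typescript
type Job = { platform: "youtube" | "tiktok" | "instagram"; url: string };

const BASE_CREDITS = 1; // per lookup, all three platforms
const AI_FALLBACK_CREDITS = 10; // extra, TikTok only, when fallback runs

function estimateCredits(
  jobs: Job[],
  useAiFallback: boolean,
): { base: number; upperBound: number } {
  // Base: every lookup completes and no fallback is triggered.
  const base = jobs.length * BASE_CREDITS;
  // Upper bound: every TikTok video is uncaptioned and fallback runs.
  const tiktokCount = jobs.filter((job) => job.platform === "tiktok").length;
  const upperBound =
    base + (useAiFallback ? tiktokCount * AI_FALLBACK_CREDITS : 0);
  return { base, upperBound };
}
```

For a list of 1,000 TikTok URLs with fallback enabled, this gives a base of 1,000 credits and an upper bound of 11,000, which is the spread to plan for when the share of uncaptioned videos is unknown.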
What you can build
- Cross-platform repurposing — pull a YouTube long-form script, three TikTok hooks, and an Instagram Reel caption block in one nightly job.
- Creator hook research — batch-transcribe top videos in a niche, extract the first sentence of each, compare what is actually getting watched.
- Brand monitoring — transcript + comments on the same URL for full conversation context.
- Knowledge bases — chunk, embed, and retrieve spoken content from webinars, tutorials, and competitor explainers.
- Accessibility — generate on-video captions from WebVTT or segment data for reposts and internal archives.
Next steps: Playground · TikTok transcript deep-dive · Transcript API use case · API reference · Pricing