
How to Get YouTube, TikTok & Instagram Transcripts with One API (2026)

Pull spoken text from public YouTube, TikTok, and Instagram Reels as clean JSON — caption tracks, auto-generated speech, optional AI fallback for uncaptioned TikToks, and patterns for batch jobs and RAG pipelines.

By Social Fetch

Video does not embed well. Search engines cannot index spoken words inside a Reel, and most LLM context windows are wasted if you paste a URL and hope the model watched the clip. Transcripts fix that: they turn audio into text you can quote, grep, chunk, and score.

The catch is that YouTube, TikTok, and Instagram each store captions differently — different formats, different fallbacks, different empty states. Social Fetch gives you three GET endpoints behind one auth header. Your job is picking the right one per URL and normalizing the output for whatever comes next (a spreadsheet, a vector index, a summarization job).

The short version

GET /v1/youtube/videos/transcript, GET /v1/tiktok/videos/transcript, and GET /v1/instagram/posts/transcript. Pass a public video URL, read data.lookupStatus, extract text. TikTok supports useAiFallback=true when no caption track exists.

You'll need an API key and curl or the TypeScript SDK. New here? Start with the Quickstart or try lookups in the Playground.
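
Picking the endpoint per URL can be done by hostname. The sketch below is one way to do it and is not part of the Social Fetch SDK: detectPlatform and transcriptRequestUrl are hypothetical helper names, and the hostname list (youtube.com, tiktok.com, instagram.com and their subdomains) is an assumption. Short-link domains such as youtu.be are left out because this article does not say whether the API accepts them.

```typescript
type Platform = "youtube" | "tiktok" | "instagram";

// Endpoint paths as listed in "The short version".
const TRANSCRIPT_PATHS: Record<Platform, string> = {
  youtube: "/v1/youtube/videos/transcript",
  tiktok: "/v1/tiktok/videos/transcript",
  instagram: "/v1/instagram/posts/transcript",
};

// Hypothetical helper: map a public video URL to a platform by hostname.
function detectPlatform(videoUrl: string): Platform | null {
  let host: string;
  try {
    host = new URL(videoUrl).hostname.toLowerCase();
  } catch {
    return null; // not a parseable URL
  }
  const matches = (domain: string) =>
    host === domain || host.endsWith("." + domain);
  if (matches("youtube.com")) return "youtube";
  if (matches("tiktok.com")) return "tiktok";
  if (matches("instagram.com")) return "instagram";
  return null;
}

// Hypothetical helper: build the GET URL for the matching transcript endpoint.
function transcriptRequestUrl(
  videoUrl: string,
  base = "https://api.socialfetch.dev",
): string | null {
  const platform = detectPlatform(videoUrl);
  if (!platform) return null;
  const params = new URLSearchParams({ url: videoUrl });
  return base + TRANSCRIPT_PATHS[platform] + "?" + params.toString();
}
```

Returning null for unknown hosts lets a batch worker reject bad rows before it spends a request.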

What each platform actually gives you

Before you wire a pipeline, know what comes back — the shapes are not identical:

| Platform | Endpoint | Transcript shape | Timestamps | No-speech fallback |
| --- | --- | --- | --- | --- |
| YouTube | /v1/youtube/videos/transcript | segments[] + plainText + language | Millisecond offsets per segment | None; transcript may be null |
| TikTok | /v1/tiktok/videos/transcript | WebVTT string in transcript.content | In the VTT cues | useAiFallback=true (+10 credits) |
| Instagram | /v1/instagram/posts/transcript | Plain text per row in transcripts[] | None in the response | None; text: null when no speech detected |

All three share data.lookupStatus, meta.creditsCharged, and meta.requestId. That is enough to build one batch worker; you just need a thin normalization layer on top (covered below).

YouTube transcript

YouTube is the easy case. The UI has a transcript panel; the API returns structured segments you can use as-is or flatten.

Pass a public watch URL:

Request
curl -sS \
  -H "x-api-key: $SOCIALFETCH_API_KEY" \
  -G "https://api.socialfetch.dev/v1/youtube/videos/transcript" \
  --data-urlencode "url=https://www.youtube.com/watch?v=dQw4w9WgXcQ"

Full parameters: YouTube transcript reference.

A successful lookup with captions looks like this:

{
  "data": {
    "lookupStatus": "found",
    "video": {
      "id": "dQw4w9WgXcQ",
      "url": "https://www.youtube.com/watch?v=dQw4w9WgXcQ"
    },
    "transcript": {
      "segments": [
        { "text": "We're no strangers to", "startMs": 18800, "endMs": 25960 }
      ],
      "plainText": "We're no strangers to\nlove. You know the rules...",
      "language": "English"
    }
  },
  "meta": {
    "requestId": "req_01example",
    "creditsCharged": 1,
    "version": "v1"
  }
}

Fields worth remembering:

  • transcript.plainText — ready for embeddings or summarization without parsing.
  • transcript.segments — use when you need clip boundaries ("quote the section at 2:14").
  • transcript.language — human-readable label from the lookup; pair with the language query param when you request a specific track.

YouTube can also return lookupStatus: "found" with transcript: null — the video resolved, but no caption track exists (common on music-only uploads or videos where the creator disabled captions). Treat that as "no text available," not as an error.
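
To turn millisecond offsets into the kind of "2:14" reference mentioned above, a pair of small helpers is enough. This is a sketch: formatTimestamp and segmentAt are hypothetical names, and the Segment type simply mirrors the fields shown in the sample response.

```typescript
type Segment = { text: string; startMs: number; endMs: number };

const pad2 = (n: number): string => (n < 10 ? "0" + n : String(n));

// Format a millisecond offset as m:ss, or h:mm:ss past one hour.
function formatTimestamp(ms: number): string {
  const totalSeconds = Math.floor(ms / 1000);
  const hours = Math.floor(totalSeconds / 3600);
  const minutes = Math.floor((totalSeconds % 3600) / 60);
  const seconds = totalSeconds % 60;
  return hours > 0
    ? hours + ":" + pad2(minutes) + ":" + pad2(seconds)
    : minutes + ":" + pad2(seconds);
}

// Return the segment that covers the given offset, if any.
function segmentAt(segments: Segment[], ms: number): Segment | undefined {
  return segments.find((s) => ms >= s.startMs && ms < s.endMs);
}

const sample: Segment[] = [
  { text: "We're no strangers to", startMs: 18800, endMs: 25960 },
];
const hit = segmentAt(sample, 20000);
if (hit) console.log(formatTimestamp(hit.startMs), hit.text);
```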

| What | Cost |
| --- | --- |
| Base YouTube transcript lookup | 1 credit |

TikTok transcript

TikTok has no transcript export in the app. Auto-captions show on screen during playback, but there is nothing to copy — and a large share of videos have no captions at all. The API returns WebVTT when a caption track exists.

Request
const videoUrl = "https://www.tiktok.com/@mrbeast/video/7596844935442189598";

const response = await fetch(
  `https://api.socialfetch.dev/v1/tiktok/videos/transcript?url=${encodeURIComponent(videoUrl)}`,
  {
    headers: {
      "x-api-key": process.env.SOCIALFETCH_API_KEY,
    },
  }
);

const body = await response.json();

console.log(response.status, body);

See the dedicated TikTok transcript guide for a full walkthrough (WebVTT parsing, screenshots, legal notes). Reference: TikTok transcript.

| What | Cost |
| --- | --- |
| Base TikTok transcript lookup | 1 credit |
| With useAiFallback=true | +10 credits (11 total on a completed lookup) |

When TikTok has no captions

If the creator never enabled captions, there is nothing to scrape. Enable AI fallback and the audio is transcribed instead:

Request
const videoUrl = "https://www.tiktok.com/@mrbeast/video/7596844935442189598";

const response = await fetch(
  `https://api.socialfetch.dev/v1/tiktok/videos/transcript?url=${encodeURIComponent(videoUrl)}&useAiFallback=true`,
  {
    headers: {
      "x-api-key": process.env.SOCIALFETCH_API_KEY,
    },
  }
);

const body = await response.json();

console.log(response.status, body);

Use fallback when you need text from every video in a batch. Skip it when you only want existing caption tracks and would rather drop uncaptioned clips.

Reading the TikTok response

Response
{
  "data": {
    "lookupStatus": "found",
    "video": {
      "id": "7596844935442189598",
      "url": "https://www.tiktok.com/@mrbeast/video/7596844935442189598"
    },
    "transcript": {
      "format": "webvtt",
      "content": "WEBVTT\n\n00:00:00.060 --> 00:00:03.100\nThis is the world's largest LED floor.\n\n00:00:03.101 --> 00:00:09.433\nAnd now it's the world's largest green screen."
    }
  },
  "meta": {
    "requestId": "req_01example",
    "creditsCharged": 1,
    "version": "v1"
  }
}

Strip WebVTT timing cues when you need flat text:

Example
import { SocialFetchClient } from "@socialfetch/sdk";

const client = new SocialFetchClient({
  apiKey: process.env.SOCIALFETCH_API_KEY!,
});

const result = await client.tiktok.getVideoTranscript({
  url: "https://www.tiktok.com/@mrbeast/video/7596844935442189598",
});

if (!result.ok) {
  console.error(result.error.code, result.error.requestId);
  process.exit(1);
}

const vtt = result.value.data.transcript?.content ?? "";

const plainText = vtt
  .replace(/\r\n/g, "\n")
  .split("\n")
  .map((line) => line.trim())
  .filter(
    (line) =>
      line &&
      line !== "WEBVTT" &&
      !line.includes("-->") &&
      !/^\d+$/.test(line),
  )
  .join(" ");

console.log(plainText);
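
If you want to keep the timing instead of discarding it, the same WebVTT string can be parsed into cues shaped like YouTube's segments. The parser below is a minimal sketch, not a full WebVTT implementation: parseVtt and vttTimeToMs are hypothetical helpers, it assumes cues are separated by blank lines, and it does not strip inline tags such as <v> or <c>.

```typescript
type Cue = { text: string; startMs: number; endMs: number };

// "00:00:03.101" or "00:03.101" -> milliseconds
function vttTimeToMs(stamp: string): number {
  const parts = stamp.trim().split(":");
  const secondsPart = parts.pop() ?? "0";
  const [sec, frac = "0"] = secondsPart.split(".");
  const minutes = Number(parts.pop() ?? 0);
  const hours = Number(parts.pop() ?? 0);
  const millis = Number((frac + "000").slice(0, 3));
  return ((hours * 60 + minutes) * 60 + Number(sec)) * 1000 + millis;
}

function parseVtt(vtt: string): Cue[] {
  const blocks = vtt.replace(/\r\n/g, "\n").split(/\n{2,}/);
  const cues: Cue[] = [];
  for (const block of blocks) {
    const lines = block
      .split("\n")
      .map((line) => line.trim())
      .filter(Boolean);
    const timingIndex = lines.findIndex((line) => line.includes("-->"));
    if (timingIndex === -1) continue; // header, NOTE, or STYLE block
    const [start, rest] = lines[timingIndex].split("-->");
    const end = rest.trim().split(/\s+/)[0]; // ignore cue settings after the end time
    const text = lines.slice(timingIndex + 1).join(" ");
    if (!text) continue;
    cues.push({ text, startMs: vttTimeToMs(start), endMs: vttTimeToMs(end) });
  }
  return cues;
}
```

Because text is taken only from the lines after the timing line, numeric cue identifiers are skipped without also dropping a caption that happens to be just a number.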

Instagram Reels transcript

Instagram Reels and video posts expose speech as plain text — no WebVTT, no segment array. Carousel posts can return multiple transcripts[] rows (one per video item in the carousel).

curl -sS \
  -H "x-api-key: $SOCIALFETCH_API_KEY" \
  -G "https://api.socialfetch.dev/v1/instagram/posts/transcript" \
  --data-urlencode "url=https://www.instagram.com/reel/DHsD6HGqJhp/"

Reference: Instagram post transcript.

{
  "data": {
    "lookupStatus": "found",
    "post": {
      "url": "https://www.instagram.com/reel/DHsD6HGqJhp/"
    },
    "transcripts": [
      {
        "id": "3597267389859272809",
        "shortcode": "DHsD6HGqJhp",
        "text": "Let's fry up the perfect Banh Xeo. Beautiful. Everybody..."
      }
    ]
  },
  "meta": {
    "requestId": "req_01example",
    "creditsCharged": 1,
    "version": "v1"
  }
}

For carousels, loop data.transcripts and concatenate non-null text values. When text is null, the video resolved but no speech was detected (silent Reel, music-only, or very short clip).

| What | Cost |
| --- | --- |
| Base Instagram transcript lookup | 1 credit |

Instagram has no useAiFallback flag — you get whatever speech detection returns.

Captions vs auto-generated speech

These terms get mixed up constantly. Here is how they map to API behavior:

| Term | What it means on the platform | What you get from the API |
| --- | --- | --- |
| Manual captions | Creator-uploaded or edited SRT/VTT | Returned like any other track; YouTube and TikTok do not label "manual" vs "auto" in the response |
| Auto-generated captions | Platform speech-to-text (YouTube "auto", TikTok on-screen captions) | Same endpoints; you receive the best available track |
| No captions | Creator disabled them, or TikTok never generated them | YouTube: transcript: null. TikTok: not_found unless AI fallback. Instagram: text: null |

You cannot request "manual only" or "auto only." Pass language when multiple tracks exist and inspect transcript.language on YouTube. For quality-sensitive workflows (legal review, quote verification), spot-check a sample against the source video — auto-generated tracks mishear proper nouns and punctuation routinely.

Language handling

YouTube and TikTok accept an optional two-letter language query parameter (ISO 639-1) to prefer a specific track:

# YouTube — prefer Spanish when available
curl -sS \
  -H "x-api-key: $SOCIALFETCH_API_KEY" \
  -G "https://api.socialfetch.dev/v1/youtube/videos/transcript" \
  --data-urlencode "url=https://www.youtube.com/watch?v=dQw4w9WgXcQ" \
  --data-urlencode "language=es"

TikTok language selection:

Request
const videoUrl = "https://www.tiktok.com/@mrbeast/video/7596844935442189598";

const response = await fetch(
  `https://api.socialfetch.dev/v1/tiktok/videos/transcript?url=${encodeURIComponent(videoUrl)}&language=en`,
  {
    headers: {
      "x-api-key": process.env.SOCIALFETCH_API_KEY,
    },
  }
);

const body = await response.json();

console.log(response.status, body);

If the requested language is not available, the lookup returns whatever track exists — check the response body rather than assuming the parameter was honored. Instagram has no language parameter; the returned text is in whatever language was spoken.
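
On YouTube, one way to check whether the parameter was honored is to compare the requested code with transcript.language. The sketch below is only illustrative: languageHonored is a hypothetical helper, the label table is a small hand-written sample, and it assumes the label is an English display name such as "English" or "Spanish" (the sample response suggests this, but the full set of labels is not documented here). It returns null when it cannot tell.

```typescript
// Assumption: labels are English display names. Extend this table as needed.
const LANGUAGE_LABELS: Record<string, string> = {
  en: "english",
  es: "spanish",
  fr: "french",
  de: "german",
  pt: "portuguese",
};

// true = label matches the request, false = a different track came back,
// null = unknown code or missing label, so no conclusion is possible.
function languageHonored(
  requestedCode: string,
  returnedLabel: string | null | undefined,
): boolean | null {
  const expected = LANGUAGE_LABELS[requestedCode.toLowerCase()];
  if (!expected || !returnedLabel) return null;
  return returnedLabel.toLowerCase().startsWith(expected);
}

console.log(languageHonored("es", "English")); // a different track was returned
```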

For multilingual RAG indexes, store language (from YouTube's field or your own detector) on every document chunk so retrievers can filter by locale.

Normalize to one document shape

Production pipelines rarely want three different parsers in every downstream job. Map each platform response to an internal TranscriptDoc:

type TranscriptDoc = {
  platform: "youtube" | "tiktok" | "instagram";
  url: string;
  videoId: string;
  plainText: string;
  language?: string;
  segments?: { text: string; startMs: number; endMs: number }[];
  lookupStatus: string;
  creditsCharged: number;
  requestId: string;
};

function vttToPlainText(vtt: string): string {
  return vtt
    .replace(/\r\n/g, "\n")
    .split("\n")
    .map((line) => line.trim())
    .filter(
      (line) =>
        line &&
        line !== "WEBVTT" &&
        !line.includes("-->") &&
        !/^\d+$/.test(line),
    )
    .join(" ");
}

function normalizeYouTube(body: {
  data: {
    lookupStatus: string;
    video: { id: string; url: string } | null;
    transcript: {
      plainText: string;
      language: string;
      segments: { text: string; startMs: number; endMs: number }[];
    } | null;
  };
  meta: { creditsCharged: number; requestId: string };
}): TranscriptDoc | null {
  if (!body.data.video) return null;
  return {
    platform: "youtube",
    url: body.data.video.url,
    videoId: body.data.video.id,
    plainText: body.data.transcript?.plainText ?? "",
    language: body.data.transcript?.language,
    segments: body.data.transcript?.segments,
    lookupStatus: body.data.lookupStatus,
    creditsCharged: body.meta.creditsCharged,
    requestId: body.meta.requestId,
  };
}

function normalizeTikTok(body: {
  data: {
    lookupStatus: string;
    video: { id: string; url: string } | null;
    transcript: { content: string } | null;
  };
  meta: { creditsCharged: number; requestId: string };
}): TranscriptDoc | null {
  if (!body.data.video) return null;
  const vtt = body.data.transcript?.content ?? "";
  return {
    platform: "tiktok",
    url: body.data.video.url,
    videoId: body.data.video.id,
    plainText: vttToPlainText(vtt),
    lookupStatus: body.data.lookupStatus,
    creditsCharged: body.meta.creditsCharged,
    requestId: body.meta.requestId,
  };
}

function normalizeInstagram(body: {
  data: {
    lookupStatus: string;
    post: { url: string } | null;
    transcripts: { id: string; text: string | null }[];
  };
  meta: { creditsCharged: number; requestId: string };
}): TranscriptDoc | null {
  if (!body.data.post) return null;
  const plainText = body.data.transcripts
    .map((row) => row.text)
    .filter((t): t is string => Boolean(t))
    .join("\n\n");
  return {
    platform: "instagram",
    url: body.data.post.url,
    videoId: body.data.transcripts[0]?.id ?? "",
    plainText,
    lookupStatus: body.data.lookupStatus,
    creditsCharged: body.meta.creditsCharged,
    requestId: body.meta.requestId,
  };
}

One normalizer per platform, one schema for everything downstream.

Batch processing

Research and repurposing jobs usually start with a list of URLs: a CSV export, a search result, a creator's recent uploads. Fetch them with bounded concurrency so you do not hammer the API or run into your own rate limits:

import { SocialFetchClient } from "@socialfetch/sdk";

const client = new SocialFetchClient({
  apiKey: process.env.SOCIALFETCH_API_KEY!,
});

type Job = { platform: "youtube" | "tiktok" | "instagram"; url: string };

const jobs: Job[] = [
  { platform: "youtube", url: "https://www.youtube.com/watch?v=dQw4w9WgXcQ" },
  {
    platform: "tiktok",
    url: "https://www.tiktok.com/@mrbeast/video/7596844935442189598",
  },
  {
    platform: "instagram",
    url: "https://www.instagram.com/reel/DHsD6HGqJhp/",
  },
];

async function fetchTranscript(job: Job) {
  if (job.platform === "youtube") {
    return client.youtube.getVideoTranscript({ url: job.url });
  }
  if (job.platform === "tiktok") {
    return client.tiktok.getVideoTranscript({
      url: job.url,
      useAiFallback: true, // set false to skip uncaptioned videos
    });
  }
  return client.instagram.getPostTranscript({ url: job.url });
}

const CONCURRENCY = 5;
const results: TranscriptDoc[] = [];

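for (let i = 0; i < jobs.length; i += CONCURRENCY) {
  const batch = jobs.slice(i, i + CONCURRENCY);
  // Promise.all preserves input order, so settled[j] is the result for batch[j].
  const settled = await Promise.all(batch.map(fetchTranscript));

  for (const [j, result] of settled.entries()) {
    if (!result.ok) {
      console.error(result.error.code, result.error.requestId);
      continue;
    }
    // normalizeFromEnvelope is a placeholder you write yourself: route to
    // normalizeYouTube, normalizeTikTok, or normalizeInstagram by platform.
    const doc =
      result.value.data.lookupStatus === "found"
        ? normalizeFromEnvelope(batch[j].platform, result.value)
        : null;
    if (doc?.plainText) results.push(doc);
  }
}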

console.log(results.length, "transcripts ready");
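
Fixed-size batches wait for the slowest request in each group before starting the next one. A worker pool keeps a steady number of requests in flight instead. The helper below is a generic sketch with no Social Fetch dependency; mapWithConcurrency is a hypothetical name, and it assumes the worker resolves rather than throws (as a result-object SDK call would), since a rejection would abort the whole run.

```typescript
// Run `worker` over `items` with at most `limit` calls in flight.
// Results come back in the same order as the input.
async function mapWithConcurrency<T, R>(
  items: T[],
  limit: number,
  worker: (item: T, index: number) => Promise<R>,
): Promise<R[]> {
  const results: R[] = new Array(items.length);
  let next = 0;

  // Each runner pulls the next unclaimed index until the list is exhausted.
  async function run(): Promise<void> {
    while (next < items.length) {
      const index = next++;
      results[index] = await worker(items[index], index);
    }
  }

  const runnerCount = Math.max(0, Math.min(limit, items.length));
  const runners: Promise<void>[] = [];
  for (let i = 0; i < runnerCount; i++) runners.push(run());
  await Promise.all(runners);
  return results;
}
```

With the loop above, the call would look like mapWithConcurrency(jobs, 5, fetchTranscript).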

Practical batch tips:

  • Budget credits upfront. TikTok with AI fallback costs 11 credits per completed lookup. A thousand-video batch is not a rounding error — multiply before you run.
  • Persist requestId on every row. When a transcript looks wrong, support can trace the exact lookup.
  • Filter on empty text, not on HTTP status. lookupStatus: "not_found" still returns HTTP 200 and is charged, so handle it in application logic.
  • Discover URLs first. Pair transcript fetches with profile/video listing endpoints (TikTok profile videos, YouTube channel) so you are not hand-curating links.
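
The multiplication in the first tip is easy to script before a run. This sketch uses the prices quoted in this article (1 credit per base lookup, +10 when TikTok AI fallback runs); estimateCredits and uncaptionedShare are hypothetical names. It treats every lookup as charged, so read the result as an upper bound rather than a bill.

```typescript
type Platform = "youtube" | "tiktok" | "instagram";

const BASE_CREDITS = 1; // base lookup, any platform
const AI_FALLBACK_CREDITS = 10; // TikTok surcharge when fallback runs

// uncaptionedShare: your own guess at the fraction of TikToks with no caption
// track (0..1). Defaults to 1, the worst case where fallback runs every time.
function estimateCredits(
  counts: Record<Platform, number>,
  options: { useAiFallback: boolean; uncaptionedShare?: number },
): number {
  const lookups = counts.youtube + counts.tiktok + counts.instagram;
  const share = options.useAiFallback ? (options.uncaptionedShare ?? 1) : 0;
  const fallback = Math.ceil(counts.tiktok * share * AI_FALLBACK_CREDITS);
  return lookups * BASE_CREDITS + fallback;
}

// Worst case for a thousand TikToks with fallback on: 1000 * (1 + 10) credits.
console.log(
  estimateCredits(
    { youtube: 0, tiktok: 1000, instagram: 0 },
    { useAiFallback: true },
  ),
);
```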

Chunking for RAG and LLMs

Once you have plainText, the embedding step is standard — but video transcripts benefit from a few conventions:

  1. Chunk by sentence, not arbitrary character count. Caption segments often break mid-phrase; rejoin into sentences before splitting.
  2. Target 300–600 tokens per chunk with ~50-token overlap. Short-form video (TikTok, Reels) may fit in one chunk; long YouTube uploads need many.
  3. Attach metadata on every chunk: platform, url, videoId, language, startMs/endMs when you have segments.
  4. Store the raw timed version separately. Retrieval uses plain text; clip generation and citation UI need timestamps from YouTube segments or TikTok WebVTT.

function chunkForRag(
  doc: TranscriptDoc,
  maxChars = 2000,
  overlapChars = 200,
): { text: string; metadata: Record<string, string> }[] {
  const sentences = doc.plainText
    .replace(/\s+/g, " ")
    .split(/(?<=[.!?])\s+/)
    .filter(Boolean);

  const chunks: string[] = [];
  let current = "";

  for (const sentence of sentences) {
    if (current.length + sentence.length > maxChars && current) {
      chunks.push(current.trim());
      current = current.slice(-overlapChars) + " " + sentence;
    } else {
      current += (current ? " " : "") + sentence;
    }
  }
  if (current.trim()) chunks.push(current.trim());

  return chunks.map((text, index) => ({
    text,
    metadata: {
      platform: doc.platform,
      url: doc.url,
      videoId: doc.videoId,
      chunkIndex: String(index),
      ...(doc.language ? { language: doc.language } : {}),
    },
  }));
}
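
When timed segments are available, chunks can carry startMs and endMs so a retrieved passage links back to a moment in the video. The sketch below is an alternative to the character-based chunker above, not a replacement for it: chunkSegments is a hypothetical helper, it groups whole segments up to a character budget, and it applies no overlap.

```typescript
type Segment = { text: string; startMs: number; endMs: number };
type TimedChunk = { text: string; startMs: number; endMs: number };

// Group consecutive segments into chunks of at most maxChars characters.
// A single segment longer than maxChars becomes its own oversized chunk.
function chunkSegments(segments: Segment[], maxChars = 2000): TimedChunk[] {
  const chunks: TimedChunk[] = [];
  let current: TimedChunk | null = null;

  for (const seg of segments) {
    const text = seg.text.replace(/\s+/g, " ").trim();
    if (!text) continue;

    if (current && current.text.length + 1 + text.length <= maxChars) {
      // Still fits: extend the open chunk and move its end time forward.
      current.text += " " + text;
      current.endMs = seg.endMs;
    } else {
      if (current) chunks.push(current);
      current = { text, startMs: seg.startMs, endMs: seg.endMs };
    }
  }
  if (current) chunks.push(current);
  return chunks;
}
```

The same function accepts cues parsed from TikTok WebVTT, provided they are first converted to the text/startMs/endMs shape.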

For summarization (not retrieval), skip chunking — pass the full plainText in one prompt, or summarize per-chunk and merge. Agent workflows can call transcript endpoints via MCP while building; see Social Fetch with Cursor & Claude.

Billing and lookup status

Every response includes data.lookupStatus and meta.creditsCharged. Branch on status in code — HTTP 200 does not mean you got text.

| Status | Meaning | Charged? |
| --- | --- | --- |
| found | Lookup resolved; check transcript fields for actual content | Yes |
| not_found | Video/post not reachable or no transcript path (platform-dependent) | Yes, upstream ran |
| lookup_failed | Infrastructure could not complete the lookup | No |
| 503 | Temporary unavailability | No |

YouTube-specific edge case: found + transcript: null means the video exists but has no captions — you are charged, and you need another source for text.

TikTok-specific: useAiFallback adds 10 credits only on completed lookups where fallback ran. Pre-send validation errors are free.
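
The status rules above reduce to a small decision function. This is a sketch of one possible policy, with assumptions stated: nextAction is a hypothetical helper, it retries only the outcomes the table marks as not charged, and it treats any other non-200 response (bad key, validation error) as a hard failure rather than guessing.

```typescript
type NextAction = "use_text" | "skip" | "retry" | "fail";

function nextAction(
  httpStatus: number,
  lookupStatus: string | undefined,
  text: string | null | undefined,
): NextAction {
  // Not charged per the table, so a retry (ideally with backoff) costs nothing extra.
  if (httpStatus === 503 || lookupStatus === "lookup_failed") return "retry";
  // Assumption: other non-200 responses are not worth retrying as-is.
  if (httpStatus !== 200) return "fail";
  // Charged, and a repeat call would likely return the same answer.
  if (lookupStatus === "not_found") return "skip";
  if (lookupStatus === "found") {
    // "found" with no text: YouTube transcript: null or Instagram text: null.
    return text && text.trim().length > 0 ? "use_text" : "skip";
  }
  return "fail"; // unknown status value
}

console.log(nextAction(200, "found", null)); // resolved, but nothing to index
```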

Details: Credits & billing.

What you can build

  • Cross-platform repurposing — pull a YouTube long-form script, three TikTok hooks, and an Instagram Reel caption block in one nightly job.
  • Creator hook research — batch-transcribe top videos in a niche, extract the first sentence of each, compare what is actually getting watched.
  • Brand monitoring — transcript + comments on the same URL for full conversation context.
  • Knowledge bases — chunk, embed, and retrieve spoken content from webinars, tutorials, and competitor explainers.
  • Accessibility — generate on-video captions from WebVTT or segment data for reposts and internal archives.

Next steps: Playground · TikTok transcript deep-dive · Transcript API use case · API reference · Pricing