Do YouTube, TikTok, and Instagram use the same response shape?

The outer envelope is the same — `{ data, meta }` with `lookupStatus` and `creditsCharged`. The transcript payload differs by platform: YouTube returns timed `segments` plus `plainText`, TikTok returns WebVTT in `transcript.content`, and Instagram returns an array of plain-text rows in `transcripts`. Normalize once in your app if you need a single internal schema.

What if a video has no transcript?

Check `data.lookupStatus` and the transcript fields. YouTube can return `found` with `transcript: null` when the video exists but has no caption track. TikTok returns `not_found` when no captions exist unless you set `useAiFallback=true` (+10 credits). Instagram returns `text: null` on a transcript row when no speech was detected. Completed `not_found` lookups are charged because upstream ran.

Can I get auto-generated captions, not just manual ones?

Yes on YouTube and TikTok — the API returns whatever caption track YouTube or TikTok exposes, including auto-generated speech. You do not pick manual vs auto in the request; you get the best available track. On Instagram, speech is transcribed from the Reel audio when available.

How do I handle multiple languages?

YouTube and TikTok accept an optional two-letter `language` query parameter (ISO 639-1) to prefer a specific track when several exist. If the language is unavailable, you get what the platform has. Instagram has no language parameter — check the returned text.

What format should I store for search and RAG?

Store plain text for embeddings and retrieval. Keep timed segments or WebVTT separately if you need clip boundaries or on-video captions. Chunk by sentence or ~500-token windows with overlap; attach `video.url`, platform, and `language` as metadata on every chunk.

Are failed infrastructure lookups charged?

No. `lookup_failed` and `503 temporarily_unavailable` are not charged. Pre-send validation errors are free. A completed `not_found` lookup did run upstream and is billed.

How to Get YouTube, TikTok & Instagram Transcripts with One API (2026)

Video does not embed well. Search engines cannot index spoken words inside a Reel, and most LLM context windows are wasted if you paste a URL and hope the model watched the clip. Transcripts fix that: they turn audio into text you can quote, grep, chunk, and score.

The catch is that YouTube, TikTok, and Instagram each store captions differently — different formats, different fallbacks, different empty states. Social Fetch gives you three GET endpoints behind one auth header. Your job is picking the right one per URL and normalizing the output for whatever comes next (a spreadsheet, a vector index, a summarization job).

The short version

Three GET endpoints cover it: /v1/youtube/videos/transcript, /v1/tiktok/videos/transcript, and /v1/instagram/posts/transcript. Pass a public video URL, read data.lookupStatus, then extract the text. TikTok also accepts useAiFallback=true for videos with no caption track.

You'll need an API key and curl or the TypeScript SDK. New here? Start with the Quickstart or try lookups in the Playground.

What each platform actually gives you

Before you wire a pipeline, know what comes back — the shapes are not identical:

| Platform | Endpoint | Transcript shape | Timestamps | No-speech fallback |
|---|---|---|---|---|
| YouTube | `/v1/youtube/videos/transcript` | `segments[]` + `plainText` + `language` | Millisecond offsets per segment | None; `transcript` may be `null` |
| TikTok | `/v1/tiktok/videos/transcript` | WebVTT string in `transcript.content` | In the VTT cues | `useAiFallback=true` (+10 credits) |
| Instagram | `/v1/instagram/posts/transcript` | Plain `text` per row in `transcripts[]` | None in the response | None; `text: null` when no speech detected |

All three share data.lookupStatus, meta.creditsCharged, and meta.requestId. That is enough to build one batch worker; you just need a thin normalization layer on top (covered below).
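
Picking the endpoint per URL can be sketched as a small router. This is a minimal illustration, not part of the SDK: `detectPlatform` and `transcriptRequestUrl` are hypothetical helper names, and only the hostnames that appear in this article are recognized (short links and other URL variants are left out because their support is not confirmed here).

```typescript
type Platform = "youtube" | "tiktok" | "instagram";

// Endpoint paths as listed in the table above.
const ENDPOINTS: Record<Platform, string> = {
  youtube: "/v1/youtube/videos/transcript",
  tiktok: "/v1/tiktok/videos/transcript",
  instagram: "/v1/instagram/posts/transcript",
};

// Hypothetical helper: infer the platform from the URL's hostname.
function detectPlatform(videoUrl: string): Platform | null {
  let host: string;
  try {
    host = new URL(videoUrl).hostname.toLowerCase();
  } catch {
    return null; // not a parseable absolute URL
  }
  // Match the bare domain or any subdomain (www., m., ...).
  const matches = (domain: string) =>
    host === domain || host.endsWith("." + domain);

  if (matches("youtube.com")) return "youtube";
  if (matches("tiktok.com")) return "tiktok";
  if (matches("instagram.com")) return "instagram";
  return null;
}

// Hypothetical helper: build the full GET URL for a transcript lookup.
function transcriptRequestUrl(videoUrl: string): string | null {
  const platform = detectPlatform(videoUrl);
  if (!platform) return null;
  const api = new URL(ENDPOINTS[platform], "https://api.socialfetch.dev");
  api.searchParams.set("url", videoUrl); // URL-encodes the video link
  return api.toString();
}
```

Returning `null` for unknown hosts lets a batch worker reject bad rows before any request is sent, which keeps them in the free pre-send validation bucket.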

YouTube transcript

YouTube is the easy case. The UI has a transcript panel; the API returns structured segments you can use as-is or flatten.

Pass a public watch URL:

curl -sS \
  -H "x-api-key: $SOCIALFETCH_API_KEY" \
  -G "https://api.socialfetch.dev/v1/youtube/videos/transcript" \
  --data-urlencode "url=https://www.youtube.com/watch?v=dQw4w9WgXcQ"

Full parameters: YouTube transcript reference.

A successful lookup with captions looks like this:

{
  "data": {
    "lookupStatus": "found",
    "video": {
      "id": "dQw4w9WgXcQ",
      "url": "https://www.youtube.com/watch?v=dQw4w9WgXcQ"
    },
    "transcript": {
      "segments": [
        { "text": "We're no strangers to", "startMs": 18800, "endMs": 25960 }
      ],
      "plainText": "We're no strangers to\nlove. You know the rules...",
      "language": "English"
    }
  },
  "meta": {
    "requestId": "req_01example",
    "creditsCharged": 1,
    "version": "v1"
  }
}

Fields worth remembering:

transcript.plainText — ready for embeddings or summarization without parsing.
transcript.segments — use when you need clip boundaries ("quote the section at 2:14").
transcript.language — human-readable label from the lookup; pair with the language query param when you request a specific track.

YouTube can also return lookupStatus: "found" with transcript: null — the video resolved, but no caption track exists (common on music-only uploads or videos where the creator disabled captions). Treat that as "no text available," not as an error.
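
One way to keep "no captions" separate from "lookup did not resolve" is a small discriminated union. The sketch below assumes only the fields shown in the sample response; `readYouTubeText` and the `TextOutcome` type are illustrative names, not SDK exports.

```typescript
// Only the fields this sketch needs from the YouTube response.
type YouTubeBody = {
  data: {
    lookupStatus: string;
    transcript: { plainText: string } | null;
  };
};

type TextOutcome =
  | { kind: "text"; plainText: string }
  | { kind: "no_captions" } // found, but there is no caption track
  | { kind: "unavailable"; lookupStatus: string };

// Hypothetical helper: classify a YouTube transcript response.
function readYouTubeText(body: YouTubeBody): TextOutcome {
  const { lookupStatus, transcript } = body.data;
  if (lookupStatus !== "found") {
    return { kind: "unavailable", lookupStatus };
  }
  if (transcript === null || transcript.plainText.trim() === "") {
    return { kind: "no_captions" };
  }
  return { kind: "text", plainText: transcript.plainText };
}
```

Callers can then `switch` on `kind` and route `no_captions` rows to another text source instead of logging them as failures.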

| What | Cost |
|---|---|
| Base YouTube transcript lookup | 1 credit |

TikTok transcript

TikTok has no transcript export in the app. Auto-captions show on screen during playback, but there is nothing to copy — and a large share of videos have no captions at all. The API returns WebVTT when a caption track exists.

const videoUrl = "https://www.tiktok.com/@mrbeast/video/7596844935442189598";

const response = await fetch(
  `https://api.socialfetch.dev/v1/tiktok/videos/transcript?url=${encodeURIComponent(videoUrl)}`,
  {
    headers: {
      "x-api-key": process.env.SOCIALFETCH_API_KEY,
    },
  }
);

const body = await response.json();

console.log(response.status, body);

See the dedicated TikTok transcript guide for a full walkthrough (WebVTT parsing, screenshots, legal notes). Reference: TikTok transcript.

| What | Cost |
|---|---|
| Base TikTok transcript lookup | 1 credit |
| With `useAiFallback=true` | +10 credits (11 total on a completed lookup) |

When TikTok has no captions

If the creator never enabled captions, there is nothing to scrape. Enable AI fallback and the audio is transcribed instead:

const videoUrl = "https://www.tiktok.com/@mrbeast/video/7596844935442189598";

const response = await fetch(
  `https://api.socialfetch.dev/v1/tiktok/videos/transcript?url=${encodeURIComponent(videoUrl)}&useAiFallback=true`,
  {
    headers: {
      "x-api-key": process.env.SOCIALFETCH_API_KEY,
    },
  }
);

const body = await response.json();

console.log(response.status, body);

Use fallback when you need text from every video in a batch. Skip it when you only want existing caption tracks and would rather drop uncaptioned clips.

Reading the TikTok response

{
  "data": {
    "lookupStatus": "found",
    "video": {
      "id": "7596844935442189598",
      "url": "https://www.tiktok.com/@mrbeast/video/7596844935442189598"
    },
    "transcript": {
      "format": "webvtt",
      "content": "WEBVTT\n\n00:00:00.060 --> 00:00:03.100\nThis is the world's largest LED floor.\n\n00:00:03.101 --> 00:00:09.433\nAnd now it's the world's largest green screen."
    }
  },
  "meta": {
    "requestId": "req_01example",
    "creditsCharged": 1,
    "version": "v1"
  }
}

Strip WebVTT timing cues when you need flat text:

import { SocialFetchClient } from "@socialfetch/sdk";

const client = new SocialFetchClient({
  apiKey: process.env.SOCIALFETCH_API_KEY!,
});

const result = await client.tiktok.getVideoTranscript({
  url: "https://www.tiktok.com/@mrbeast/video/7596844935442189598",
});

if (!result.ok) {
  console.error(result.error.code, result.error.requestId);
  process.exit(1);
}

const vtt = result.value.data.transcript?.content ?? "";

const plainText = vtt
  .replace(/\r\n/g, "\n")
  .split("\n")
  .map((line) => line.trim())
  .filter(
    (line) =>
      line &&
      line !== "WEBVTT" &&
      !line.includes("-->") &&
      !/^\d+$/.test(line),
  )
  .join(" ");

console.log(plainText);
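
If you want to keep the timing instead of discarding it, the same WebVTT string can be parsed into segments shaped like YouTube's. This is a rough sketch that handles the simple cue layout shown in the sample response; `parseVtt` and `vttTimeToMs` are hypothetical helpers, and a full WebVTT parser would also need to handle regions, styles, and nested tags.

```typescript
type Segment = { text: string; startMs: number; endMs: number };

// Convert "HH:MM:SS.mmm" or "MM:SS.mmm" to milliseconds (NaN if malformed).
function vttTimeToMs(stamp: string): number {
  const match = /^(?:(\d+):)?(\d{1,2}):(\d{2})\.(\d{3})$/.exec(stamp.trim());
  if (!match) return NaN;
  const [, h = "0", m, s, ms] = match;
  return ((Number(h) * 60 + Number(m)) * 60 + Number(s)) * 1000 + Number(ms);
}

// Parse WebVTT cues into timed segments.
function parseVtt(vtt: string): Segment[] {
  const segments: Segment[] = [];
  // Cues are separated by one or more blank lines.
  const blocks = vtt.replace(/\r\n?/g, "\n").split(/\n{2,}/);

  for (const block of blocks) {
    const lines = block
      .split("\n")
      .map((line) => line.trim())
      .filter(Boolean);
    const timingIndex = lines.findIndex((line) => line.includes("-->"));
    if (timingIndex === -1) continue; // header, NOTE, or STYLE block

    const [startRaw, endRaw] = lines[timingIndex].split("-->");
    const startMs = vttTimeToMs(startRaw);
    // Cue settings such as "align:start" may follow the end timestamp.
    const endMs = vttTimeToMs(endRaw.trim().split(/\s+/)[0]);
    // Join multi-line cue text and strip inline tags like <c> or <v Name>.
    const text = lines
      .slice(timingIndex + 1)
      .join(" ")
      .replace(/<[^>]+>/g, "")
      .trim();

    if (text && !Number.isNaN(startMs) && !Number.isNaN(endMs)) {
      segments.push({ text, startMs, endMs });
    }
  }
  return segments;
}
```

Segments in this shape can be stored next to the flat text, so TikTok rows carry clip boundaries the same way YouTube rows do.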

Instagram Reels transcript

Instagram Reels and video posts expose speech as plain text — no WebVTT, no segment array. Carousel posts can return multiple transcripts[] rows (one per video item in the carousel).

curl -sS \
  -H "x-api-key: $SOCIALFETCH_API_KEY" \
  -G "https://api.socialfetch.dev/v1/instagram/posts/transcript" \
  --data-urlencode "url=https://www.instagram.com/reel/DHsD6HGqJhp/"

Reference: Instagram post transcript.

{
  "data": {
    "lookupStatus": "found",
    "post": {
      "url": "https://www.instagram.com/reel/DHsD6HGqJhp/"
    },
    "transcripts": [
      {
        "id": "3597267389859272809",
        "shortcode": "DHsD6HGqJhp",
        "text": "Let's fry up the perfect Banh Xeo. Beautiful. Everybody..."
      }
    ]
  },
  "meta": {
    "requestId": "req_01example",
    "creditsCharged": 1,
    "version": "v1"
  }
}

For carousels, loop data.transcripts and concatenate non-null text values. When text is null, the video resolved but no speech was detected (silent Reel, music-only, or very short clip).

| What | Cost |
|---|---|
| Base Instagram transcript lookup | 1 credit |

Instagram has no useAiFallback flag — you get whatever speech detection returns.

Captions vs auto-generated speech

These terms get mixed up constantly. Here is how they map to API behavior:

| Term | What it means on the platform | What you get from the API |
|---|---|---|
| Manual captions | Creator-uploaded or edited SRT/VTT | Returned like any other track; YouTube/TikTok do not label "manual" vs "auto" in the response |
| Auto-generated captions | Platform speech-to-text (YouTube "auto", TikTok on-screen captions) | Same endpoints; you receive the best available track |
| No captions | Creator disabled them, or TikTok never generated them | YouTube: `transcript: null`. TikTok: `not_found` unless AI fallback. Instagram: `text: null` |

You cannot request "manual only" or "auto only." Pass language when multiple tracks exist and inspect transcript.language on YouTube. For quality-sensitive workflows (legal review, quote verification), spot-check a sample against the source video — auto-generated tracks mishear proper nouns and punctuation routinely.

Language handling

YouTube and TikTok accept an optional two-letter language query parameter (ISO 639-1) to prefer a specific track:

# YouTube — prefer Spanish when available
curl -sS \
  -H "x-api-key: $SOCIALFETCH_API_KEY" \
  -G "https://api.socialfetch.dev/v1/youtube/videos/transcript" \
  --data-urlencode "url=https://www.youtube.com/watch?v=dQw4w9WgXcQ" \
  --data-urlencode "language=es"

TikTok language selection:

const videoUrl = "https://www.tiktok.com/@mrbeast/video/7596844935442189598";

const response = await fetch(
  `https://api.socialfetch.dev/v1/tiktok/videos/transcript?url=${encodeURIComponent(videoUrl)}&language=en`,
  {
    headers: {
      "x-api-key": process.env.SOCIALFETCH_API_KEY,
    },
  }
);

const body = await response.json();

console.log(response.status, body);

If the requested language is not available, the lookup returns whatever track exists — check the response body rather than assuming the parameter was honored. Instagram has no language parameter; the returned text is in whatever language was spoken.
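
For YouTube, checking whether the parameter was honored means comparing a two-letter code against a human-readable label. A possible sketch using the standard `Intl.DisplayNames` API is below. It assumes the label is an English display name such as "English" or "Spanish", as in the sample response; that is an assumption about the API's output, and `languageWasHonored` is a hypothetical helper.

```typescript
// Assumption: transcript.language is an English-language display name
// (for example "English"), possibly followed by extra qualifiers.
function languageWasHonored(
  requestedCode: string,
  returnedLabel: string | undefined,
): boolean {
  if (!returnedLabel) return false;

  let expected: string | undefined;
  try {
    // Maps an ISO 639-1 code such as "es" to its English name, "Spanish".
    expected = new Intl.DisplayNames(["en"], { type: "language" }).of(
      requestedCode,
    );
  } catch {
    return false; // structurally invalid language code
  }
  if (!expected) return false;

  // Prefix match so a label with a trailing qualifier still counts.
  return returnedLabel
    .trim()
    .toLowerCase()
    .startsWith(expected.toLowerCase());
}
```

When this returns `false`, treat the transcript as "some other language" and either run your own detector or drop the row, depending on the pipeline.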

For multilingual RAG indexes, store language (from YouTube's field or your own detector) on every document chunk so retrievers can filter by locale.

Normalize to one document shape

Production pipelines rarely want three different parsers in every downstream job. Map each platform response to an internal TranscriptDoc:

type TranscriptDoc = {
  platform: "youtube" | "tiktok" | "instagram";
  url: string;
  videoId: string;
  plainText: string;
  language?: string;
  segments?: { text: string; startMs: number; endMs: number }[];
  lookupStatus: string;
  creditsCharged: number;
  requestId: string;
};

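// Flatten WebVTT to plain text: drop the header, cue timings, and numeric cue IDs.
function vttToPlainText(vtt: string): string {
  return vtt
    .replace(/\r\n/g, "\n")
    .split("\n")
    .map((line) => line.trim())
    .filter(
      (line) =>
        line && // blank separators between cues
        !/^WEBVTT(\s|$)/.test(line) && // file header, with or without trailing text
        !line.includes("-->") && // cue timing lines
        !/^\d+$/.test(line), // numeric cue identifiers
    )
    .join(" ");
}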

function normalizeYouTube(body: {
  data: {
    lookupStatus: string;
    video: { id: string; url: string } | null;
    transcript: {
      plainText: string;
      language: string;
      segments: { text: string; startMs: number; endMs: number }[];
    } | null;
  };
  meta: { creditsCharged: number; requestId: string };
}): TranscriptDoc | null {
  if (!body.data.video) return null;
  return {
    platform: "youtube",
    url: body.data.video.url,
    videoId: body.data.video.id,
    plainText: body.data.transcript?.plainText ?? "",
    language: body.data.transcript?.language,
    segments: body.data.transcript?.segments,
    lookupStatus: body.data.lookupStatus,
    creditsCharged: body.meta.creditsCharged,
    requestId: body.meta.requestId,
  };
}

function normalizeTikTok(body: {
  data: {
    lookupStatus: string;
    video: { id: string; url: string } | null;
    transcript: { content: string } | null;
  };
  meta: { creditsCharged: number; requestId: string };
}): TranscriptDoc | null {
  if (!body.data.video) return null;
  const vtt = body.data.transcript?.content ?? "";
  return {
    platform: "tiktok",
    url: body.data.video.url,
    videoId: body.data.video.id,
    plainText: vttToPlainText(vtt),
    lookupStatus: body.data.lookupStatus,
    creditsCharged: body.meta.creditsCharged,
    requestId: body.meta.requestId,
  };
}

function normalizeInstagram(body: {
  data: {
    lookupStatus: string;
    post: { url: string } | null;
    transcripts: { id: string; text: string | null }[];
  };
  meta: { creditsCharged: number; requestId: string };
}): TranscriptDoc | null {
  if (!body.data.post) return null;
  const plainText = body.data.transcripts
    .map((row) => row.text)
    .filter((t): t is string => Boolean(t))
    .join("\n\n");
  return {
    platform: "instagram",
    url: body.data.post.url,
    videoId: body.data.transcripts[0]?.id ?? "",
    plainText,
    lookupStatus: body.data.lookupStatus,
    creditsCharged: body.meta.creditsCharged,
    requestId: body.meta.requestId,
  };
}

One normalizer per platform, one schema for everything downstream.

Batch processing

Research and repurposing jobs usually start with a list of URLs — a CSV export, a search result, a creator's recent uploads. Fetch them with bounded concurrency so you do not hammer the API or your own rate limits:

import { SocialFetchClient } from "@socialfetch/sdk";

const client = new SocialFetchClient({
  apiKey: process.env.SOCIALFETCH_API_KEY!,
});

type Job = { platform: "youtube" | "tiktok" | "instagram"; url: string };

const jobs: Job[] = [
  { platform: "youtube", url: "https://www.youtube.com/watch?v=dQw4w9WgXcQ" },
  {
    platform: "tiktok",
    url: "https://www.tiktok.com/@mrbeast/video/7596844935442189598",
  },
  {
    platform: "instagram",
    url: "https://www.instagram.com/reel/DHsD6HGqJhp/",
  },
];

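// Log an SDK error and return null so the batch loop can skip the job.
function logFailure(error: { code: string; requestId?: string }): null {
  console.error(error.code, error.requestId);
  return null;
}

// Fetch one transcript and map it to TranscriptDoc with the matching
// normalizer from the previous section. Assumes the SDK's result.value is
// the same { data, meta } envelope those normalizers accept.
async function fetchTranscript(job: Job): Promise<TranscriptDoc | null> {
  if (job.platform === "youtube") {
    const result = await client.youtube.getVideoTranscript({ url: job.url });
    return result.ok ? normalizeYouTube(result.value) : logFailure(result.error);
  }
  if (job.platform === "tiktok") {
    const result = await client.tiktok.getVideoTranscript({
      url: job.url,
      useAiFallback: true, // set false to skip uncaptioned videos
    });
    return result.ok ? normalizeTikTok(result.value) : logFailure(result.error);
  }
  const result = await client.instagram.getPostTranscript({ url: job.url });
  return result.ok ? normalizeInstagram(result.value) : logFailure(result.error);
}

const CONCURRENCY = 5;
const results: TranscriptDoc[] = [];

// Process the list in fixed-size batches of CONCURRENCY requests.
for (let i = 0; i < jobs.length; i += CONCURRENCY) {
  const batch = jobs.slice(i, i + CONCURRENCY);
  const docs = await Promise.all(batch.map(fetchTranscript));

  for (const doc of docs) {
    // Keep only resolved lookups that actually carry text.
    if (doc?.lookupStatus === "found" && doc.plainText) results.push(doc);
  }
}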

console.log(results.length, "transcripts ready");
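
Fixed-size batches wait for the slowest request in each group before starting the next. An alternative, sketched here as a generic helper rather than anything the SDK provides, keeps a constant number of requests in flight. `mapWithConcurrency` is a hypothetical name and the worker can be any async function.

```typescript
// Run `worker` over `items` with at most `limit` calls in flight.
// Results are returned in input order.
async function mapWithConcurrency<T, R>(
  items: T[],
  limit: number,
  worker: (item: T, index: number) => Promise<R>,
): Promise<R[]> {
  const results: R[] = new Array(items.length);
  let next = 0;

  // Each lane repeatedly claims the next unprocessed index until none remain.
  async function runLane(): Promise<void> {
    while (next < items.length) {
      const index = next++; // no race: JS runs this synchronously
      results[index] = await worker(items[index], index);
    }
  }

  const laneCount = Math.max(1, Math.min(limit, items.length));
  await Promise.all(Array.from({ length: laneCount }, runLane));
  return results;
}
```

With the job list above this would be called as `mapWithConcurrency(jobs, 5, fetchTranscript)`. If the worker can throw, catch inside it so one failure does not reject the whole run.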

Practical batch tips:

Budget credits upfront. A TikTok lookup where AI fallback runs costs 11 credits (1 base + 10 fallback). A thousand-video batch is not a rounding error: in the worst case that is 11,000 credits, so multiply before you run.
Persist requestId on every row. When a transcript looks wrong, support can trace the exact lookup.
Skip empty text, not failed HTTP. lookupStatus: "not_found" still returns HTTP 200 and is charged. Filter in application logic.
Discover URLs first. Pair transcript fetches with profile/video listing endpoints (TikTok profile videos, YouTube channel) so you are not hand-curating links.
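
The "multiply before you run" step can be sketched as a quick estimate. The constants come from the cost tables in this article (1 credit per base lookup, +10 when TikTok AI fallback runs); `estimateCredits` and `PlannedJob` are hypothetical names. Failed infrastructure lookups are free, so real spend can come in below the lower figure.

```typescript
type Platform = "youtube" | "tiktok" | "instagram";
type PlannedJob = { platform: Platform; useAiFallback?: boolean };

const BASE_CREDITS = 1; // base lookup cost on all three platforms
const AI_FALLBACK_CREDITS = 10; // TikTok only, added when fallback runs

// Estimate the credit range for a batch before sending any request.
function estimateCredits(jobs: PlannedJob[]): {
  ifNoFallbackRuns: number;
  worstCase: number;
} {
  let ifNoFallbackRuns = 0;
  let worstCase = 0;
  for (const job of jobs) {
    ifNoFallbackRuns += BASE_CREDITS;
    const fallbackPossible =
      job.platform === "tiktok" && job.useAiFallback === true;
    worstCase += BASE_CREDITS + (fallbackPossible ? AI_FALLBACK_CREDITS : 0);
  }
  return { ifNoFallbackRuns, worstCase };
}
```

Comparing `worstCase` against your remaining balance before the run is cheaper than discovering the shortfall halfway through a batch.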

Chunking for RAG and LLMs

Once you have plainText, the embedding step is standard — but video transcripts benefit from a few conventions:

Chunk by sentence, not arbitrary character count. Caption segments often break mid-phrase; rejoin into sentences before splitting.
Target 300–600 tokens per chunk with ~50-token overlap. Short-form video (TikTok, Reels) may fit in one chunk; long YouTube uploads need many.
Attach metadata on every chunk: platform, url, videoId, language, startMs/endMs when you have segments.
Store the raw timed version separately. Retrieval uses plain text; clip generation and citation UI need timestamps from YouTube segments or TikTok WebVTT.

function chunkForRag(
  doc: TranscriptDoc,
  maxChars = 2000,
  overlapChars = 200,
): { text: string; metadata: Record<string, string> }[] {
  const sentences = doc.plainText
    .replace(/\s+/g, " ")
    .split(/(?<=[.!?])\s+/)
    .filter(Boolean);

  const chunks: string[] = [];
  let current = "";

  for (const sentence of sentences) {
    if (current.length + sentence.length > maxChars && current) {
      chunks.push(current.trim());
      current = current.slice(-overlapChars) + " " + sentence;
    } else {
      current += (current ? " " : "") + sentence;
    }
  }
  if (current.trim()) chunks.push(current.trim());

  return chunks.map((text, index) => ({
    text,
    metadata: {
      platform: doc.platform,
      url: doc.url,
      videoId: doc.videoId,
      chunkIndex: String(index),
      ...(doc.language ? { language: doc.language } : {}),
    },
  }));
}
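
When timed segments are available, chunks can carry startMs and endMs so a retrieved passage links back to a point in the video. The sketch below is one simple way to do that: it groups consecutive segments up to a character budget and does not add overlap. `chunkSegments` and `TimedChunk` are hypothetical names.

```typescript
type Segment = { text: string; startMs: number; endMs: number };
type TimedChunk = { text: string; startMs: number; endMs: number };

// Group consecutive segments into chunks of at most `maxChars` characters,
// keeping the start of the first segment and the end of the last one.
function chunkSegments(segments: Segment[], maxChars = 2000): TimedChunk[] {
  const chunks: TimedChunk[] = [];
  let current: TimedChunk | null = null;

  for (const segment of segments) {
    const text = segment.text.replace(/\s+/g, " ").trim();
    if (!text) continue; // skip empty caption rows

    if (current && current.text.length + 1 + text.length <= maxChars) {
      // Still within budget: extend the open chunk.
      current.text += " " + text;
      current.endMs = segment.endMs;
    } else {
      // Budget exceeded (or first segment): close the chunk and start a new one.
      if (current) chunks.push(current);
      current = { text, startMs: segment.startMs, endMs: segment.endMs };
    }
  }
  if (current) chunks.push(current);
  return chunks;
}
```

Splitting on segment boundaries keeps timestamps exact at the cost of occasionally breaking mid-sentence, so it suits citation and clip use cases better than pure semantic retrieval.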

For summarization (not retrieval), skip chunking — pass the full plainText in one prompt, or summarize per-chunk and merge. Agent workflows can call transcript endpoints via MCP while building; see Social Fetch with Cursor & Claude.

Billing and lookup status

Every response includes data.lookupStatus and meta.creditsCharged. Branch on status in code — HTTP 200 does not mean you got text.

| Status | Meaning | Charged? |
|---|---|---|
| `found` | Lookup resolved; check transcript fields for actual content | Yes |
| `not_found` | Video/post not reachable or no transcript path (platform-dependent) | Yes (upstream ran) |
| `lookup_failed` | Infrastructure could not complete the lookup | No |
| HTTP `503 temporarily_unavailable` | Temporary unavailability | No |

YouTube-specific edge case: found + transcript: null means the video exists but has no captions — you are charged, and you need another source for text.

TikTok-specific: useAiFallback adds 10 credits only on completed lookups where fallback ran. Pre-send validation errors are free.
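
The billing rules above suggest a simple branching policy: retry the outcomes that were not charged, and do not blindly retry the ones that were. The sketch below is one possible policy, not documented API behavior; `nextStep` is a hypothetical helper, and it accepts the HTTP status and the lookup status separately because this article does not specify how a `lookup_failed` result is delivered.

```typescript
type NextStep = "use" | "skip" | "retry";

// Decide what a batch worker should do with one finished request.
function nextStep(
  httpStatus: number,
  lookupStatus: string | undefined,
  hasText: boolean,
): NextStep {
  // 503 and lookup_failed are not charged, so a later retry costs nothing extra.
  if (httpStatus === 503 || lookupStatus === "lookup_failed") return "retry";

  // not_found is a completed, billed lookup; repeating it would likely
  // be billed again for the same answer.
  if (lookupStatus === "not_found") return "skip";

  // found does not guarantee text (for example YouTube with transcript: null).
  if (lookupStatus === "found") return hasText ? "use" : "skip";

  // Anything unrecognized is skipped rather than retried in a loop.
  return "skip";
}
```

Pair the `retry` branch with a capped number of attempts and a delay between them so a longer outage does not turn into a tight loop.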

Details: Credits & billing.

What you can build

Cross-platform repurposing — pull a YouTube long-form script, three TikTok hooks, and an Instagram Reel caption block in one nightly job.
Creator hook research — batch-transcribe top videos in a niche, extract the first sentence of each, compare what is actually getting watched.
Brand monitoring — transcript + comments on the same URL for full conversation context.
Knowledge bases — chunk, embed, and retrieve spoken content from webinars, tutorials, and competitor explainers.
Accessibility — generate on-video captions from WebVTT or segment data for reposts and internal archives.

Next steps: Playground · TikTok transcript deep-dive · Transcript API use case · API reference · Pricing