Web Scraping API: Extract Markdown & HTML from Any URL for LLMs

We kept hearing the same thing from teams already using Social Fetch: the data pipeline never stays completely inside social platforms.

You pull a creator profile from TikTok and then need the Linktree in their bio. You monitor a competitor on X and then need the pricing page they just linked. Half the context in an AI research agent comes from blog posts and docs, not Instagram posts.

Every time, the answer was a second tool: a Puppeteer container nobody wanted to babysit, a generic web scraping API with different auth and billing, or a hand-rolled /utils/scrape.ts that broke every time a site changed its DOM.

Now it's just another Social Fetch call

Pass any public URL, get structured content back. Same x-api-key, same { data, meta } envelope, same one-credit pricing. No second vendor.

The problem with traditional web scraping

The social data was the easy part—Social Fetch already handled that. The painful part was extracting the context around social media:

A creator's Shopify store or personal blog
Competitor pricing pages, changelogs, and feature matrices
Documentation and support articles that an AI agent needs to ground its reasoning
Press releases and news articles linked from viral social posts

These are not edge cases. They are the other half of every RAG pipeline that touches creators, brands, or content. And until now, handling them meant context-switching to a different scraping API.

The best architecture is one where "I need the text from this URL too" does not require a new vendor evaluation or spinning up headless browsers.

How our web extraction API works

We shipped four new routes under /v1/web/*, using the credentials and predictable JSON envelope you already know:

Extract to Markdown — clean text with filter modes for LLM readability, raw DOM extraction, or BM25 relevance-ranked passages
Extract to HTML — sanitized HTML when your downstream tool expects markup
Ask a Web Page — pass a URL and a question, get a grounded answer back
Crawl Multiple URLs — batch up to five URLs in one synchronous request, with per-page results

All four routes cost one credit per page, the same as a TikTok profile lookup. No multipliers, no complex response-size tiers.

Here is what the code looks like using our SDK:

// Assumes `client` is an already-initialized Social Fetch SDK client (setup omitted here).
// Extract clean markdown for RAG pipelines
const page = await client.web.getMarkdown({
  url: "https://competitor.com/pricing",
  filter: "fit", // readability-optimized
});

if (page.ok) {
  // page.value.data.markdown.fit is ready for your LLM or vector DB
  console.log(page.value.data.markdown.fit);
}

// Ask a direct question of a web page
const answer = await client.web.ask({
  url: "https://competitor.com/pricing",
  q: "What is the cheapest plan?",
});

if (answer.ok) {
  console.log(answer.value.data.answer);
}

Markdown extraction filters

fit (default): Readability-optimized, stripped of nav bars and noise.
raw: Fuller DOM conversion to markdown.
bm25: Relevance-ranked against a query you provide, acting like keyword search over a single page.
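To make the bm25 mode concrete, here is a minimal sketch of Okapi BM25 scoring over the passages of a single page. It illustrates the ranking idea only and is not Social Fetch's implementation; the tokenizer, the k1 = 1.2 and b = 0.75 defaults, and the name rankPassages are all assumptions chosen for the example.

```typescript
// Illustrative Okapi BM25 over the passages of one page. Not the service's code.
const tokenize = (text: string): string[] =>
  text.toLowerCase().match(/[a-z0-9]+/g) ?? [];

interface RankedPassage {
  passage: string;
  score: number;
}

function rankPassages(
  passages: string[],
  query: string,
  k1 = 1.2, // term-frequency saturation (common default, assumed here)
  b = 0.75, // length normalization (common default, assumed here)
): RankedPassage[] {
  const docs = passages.map((passage) => ({ passage, tokens: tokenize(passage) }));
  const n = docs.length;
  const avgLen = docs.reduce((sum, d) => sum + d.tokens.length, 0) / Math.max(n, 1);
  const queryTerms = Array.from(new Set(tokenize(query)));

  // Document frequency: how many passages contain each query term.
  const df = new Map<string, number>();
  for (const term of queryTerms) {
    df.set(term, docs.filter((d) => d.tokens.includes(term)).length);
  }

  return docs
    .map(({ passage, tokens }) => {
      let score = 0;
      for (const term of queryTerms) {
        const tf = tokens.filter((t) => t === term).length;
        if (tf === 0) continue;
        const nq = df.get(term) ?? 0;
        // Non-negative IDF variant: ln(1 + (N - n + 0.5) / (n + 0.5)).
        const idf = Math.log(1 + (n - nq + 0.5) / (nq + 0.5));
        const norm = tf + k1 * (1 - b + (b * tokens.length) / avgLen);
        score += (idf * tf * (k1 + 1)) / norm;
      }
      return { passage, score };
    })
    .filter((r) => r.score > 0) // drop passages with no query term at all
    .sort((x, y) => y.score - x.score);
}
```

Because BM25 matches on shared terms, a query for "pricing" will not surface a passage that only says "cost"; phrase the query with words you expect on the page.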

Best use cases for LLM web scraping

For AI and RAG pipelines, the markdown endpoint with filter=fit strips navigation chrome and gives you content ready for chunking and embedding. If you only need passages relevant to a specific topic, filter=bm25 with a query returns only the sections that score against it, like keyword search over a single page with no index to build first.
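As a hedged illustration of the chunking step, the sketch below splits extracted markdown on headings and packs the sections into chunks under a character budget. The name chunkMarkdown and the 1,000-character default are assumptions for the example, not part of the SDK.

```typescript
// Split markdown on ATX headings, then pack sections into size-bounded chunks.
function chunkMarkdown(markdown: string, maxChars = 1000): string[] {
  // The lookahead keeps each heading attached to the text that follows it.
  const sections = markdown
    .split(/\n(?=#{1,6}\s)/)
    .map((s) => s.trim())
    .filter((s) => s.length > 0);

  const chunks: string[] = [];
  let current = "";
  for (const section of sections) {
    const candidate = current ? `${current}\n\n${section}` : section;
    if (candidate.length <= maxChars) {
      current = candidate;
    } else {
      if (current) chunks.push(current);
      // A section larger than the budget becomes its own chunk
      // instead of being cut mid-sentence.
      current = section;
    }
  }
  if (current) chunks.push(current);
  return chunks;
}
```

Splitting on headings keeps each chunk topically coherent, which tends to matter more for retrieval quality than hitting an exact size.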

Product teams use the crawl endpoint to grab a pricing page, a features page, and an about page in one call — one request, three credits, three pages of structured markdown you can diff weekly.
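Since a crawl request accepts at most five URLs, a longer list has to be split on the client side. Here is a minimal sketch of that batching, with a credit estimate that follows the one-credit-per-page rule. The helper names are hypothetical, and the crawl call itself is left out because its exact signature is not shown in this post.

```typescript
const MAX_URLS_PER_CRAWL = 5; // limit stated for the crawl route

// Deduplicate, then split into batches no larger than the crawl limit.
function toCrawlBatches(urls: string[], size = MAX_URLS_PER_CRAWL): string[][] {
  if (!Number.isInteger(size) || size < 1) {
    throw new RangeError("size must be a positive integer");
  }
  const unique = Array.from(new Set(urls));
  const batches: string[][] = [];
  for (let i = 0; i < unique.length; i += size) {
    batches.push(unique.slice(i, i + size));
  }
  return batches;
}

// One credit per page, so the estimate is the number of unique URLs.
function estimateCredits(urls: string[]): number {
  return new Set(urls).size;
}
```

Deduplicating before batching avoids paying twice for a URL that appears in more than one source list.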

The ask endpoint takes a URL and a natural-language question, reads the page, and returns a grounded answer. Teams use it for extracting structured facts ("what is the return policy?"), powering tooltips that explain linked content, or building quick comparison tools without writing a custom parser per site.

You can also chain it with social API calls: pull a creator profile, see a URL in their bio, and make one more request to grab the full text of whatever they are linking to. Same SDK client, no second integration.
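A rough sketch of that chaining step follows. The WebClient interface is a minimal structural type modeled on the getMarkdown call shown earlier, so the real SDK types may differ. The helpers extractUrls and fetchBioLinks are hypothetical, and the social profile call is omitted because its method name is not shown in this post.

```typescript
// Minimal structural type mirroring the getMarkdown call shown earlier.
// This is an assumption for the sketch, not the SDK's published typings.
interface MarkdownResult {
  ok: boolean;
  value?: { data: { markdown: { fit: string } } };
}
interface WebClient {
  web: {
    getMarkdown(args: { url: string; filter: "fit" }): Promise<MarkdownResult>;
  };
}

// Pull http(s) links out of free-form bio text, trimming trailing punctuation.
function extractUrls(bio: string): string[] {
  const matches = bio.match(/https?:\/\/[^\s<>"')\]]+/g) ?? [];
  const cleaned = matches.map((u) => u.replace(/[.,;:!?]+$/, ""));
  return Array.from(new Set(cleaned));
}

// Fetch readable markdown for every link in a bio; pages that fail are skipped.
async function fetchBioLinks(
  client: WebClient,
  bio: string,
): Promise<Map<string, string>> {
  const pages = new Map<string, string>();
  for (const url of extractUrls(bio)) {
    const page = await client.web.getMarkdown({ url, filter: "fit" });
    if (page.ok && page.value) {
      pages.set(url, page.value.data.markdown.fit);
    }
  }
  return pages;
}
```

Each linked page costs one credit, so a bio with three links adds three credits on top of the profile lookup.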

Honest limitations of the web scraper

We want to be transparent about what this API is not:

Public pages only. If a human needs a login, a session token, or a cookie to see it, our extraction API will not see it either.
Bot-protected sites. Pages heavily guarded by advanced WAFs or anti-bot challenges will return lookupStatus: "restricted" (HTTP 200), not a retryable error.
Live fetch, not cached. Expect seconds, not milliseconds. You get the page as it exists right now, not an archived snapshot from an hour ago.
Crawl is small and synchronous. Five URLs max per request. Built for "grab these specific linked pages," not "spider this entire root domain."
Ask is page-grounded. It reads what is publicly visible on the URL and answers from that. It is not a general-purpose chatbot.
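The restricted case is worth handling explicitly, because it arrives as HTTP 200 and a naive retry loop will never see it as a failure. Below is one way to classify outcomes, offered as a sketch. The post does not say where lookupStatus sits in the response body, so the caller passes it in. Treating 429 and 5xx as retryable is a general HTTP convention assumed here, not a documented Social Fetch rule.

```typescript
type Outcome = "ok" | "restricted" | "retry" | "fail";

// lookupStatus comes from the response body; where it sits in the envelope
// is an assumption to check against the API reference.
function classifyLookup(httpStatus: number, lookupStatus?: string): Outcome {
  if (httpStatus === 200) {
    // A restricted page is a successful response: retrying will not help.
    // Any other lookupStatus value is treated as "ok" in this sketch.
    return lookupStatus === "restricted" ? "restricted" : "ok";
  }
  // Assumption: 429 and 5xx are transient, following common HTTP practice.
  if (httpStatus === 429 || httpStatus >= 500) return "retry";
  return "fail";
}
```

In practice you would log or skip "restricted" pages, back off and retry on "retry", and surface "fail" to the caller.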

Start scraping URLs to markdown

If you already have a Social Fetch API key, these web endpoints are live now. No opt-in, no waitlist.

Open the playground under Web, paste a URL, and see the extracted markdown response. Wire it into your backend with the same client.web.* methods in the TypeScript SDK. Check Pricing to estimate volume — one credit per page, same as our social API endpoints.

Quickstart Guide

Auth, first request, and understanding the response envelope.

API Reference

Full parameters and response shapes for all four Web scraping routes.

Social Fetch is still 20+ social platforms in one integration. Web extraction simply applies the same API key and the same JSON discipline to the URLs outside those networks, because that's where the rest of the work gets done.