The Scrape Job That Never Timed Out

Early in Social Fetch, 37 LinkedIn scrapers ran for hours without failing — and the invoice arrived before the error did. A post-mortem on silent failure and what we built after.

Luke Askew

I told part of this story in Takanari's founder interview on AI Plaza. Here is the post-mortem — the week a timeout would have saved us four figures.

I have been writing scrapers for years. They break loudly. Selectors move, blocks tighten, auth flows change. You learn to expect that.

The incident that shaped Social Fetch was quieter. Early in development, before we had a polished public API and before every response carried a requestId, 37 in-house scraper jobs against LinkedIn company pages did not finish and did not fail. They ran for hours. CPU and memory climbed. Sessions stayed open. Nothing returned a terminal HTTP status, because from the worker's point of view the work was still in flight.

So I waited. Then I worked on something else. Scraping is slow sometimes; that felt normal.

It was not slow. It was stuck. And when I finally understood that, it was not because of logs or alerts. I found out from billing: a line item north of $4,000 for compute that produced no useful JSON for anyone. No error code to grep. Nothing to paste into a support ticket. Just success-shaped silence with an invoice attached.

That week hurt. It also taught me what a production social media scraper API has to guarantee before platform names or parsed JSON matter.

Thirty-seven jobs that would not die

LinkedIn is a hard platform. Strong blocks, flaky page shapes, constant churn. We were wiring up company-page routes — the unglamorous work before you have docs and a marketing site — and running real lookups against live pages to see if the parsers held.

Somewhere in that batch, 37 jobs never reached a done state. Not 37 failures. Thirty-seven processes that kept chewing resources because nothing told them to stop.

I only know the count because I traced back through job IDs after the bill landed. During the run itself, the dashboard looked boring. No spike in 5xx errors. No red banner. Just work that never quite finished, which is easy to misread as "LinkedIn being LinkedIn."

An outage is embarrassing but legible: users see errors, you roll back, you explain in Slack. This was worse — quiet spend with no failure signal.

Three states, one invoice

Afterward I kept thinking about a dumb taxonomy that turned out to be useful:

| State | What your app thinks | What your wallet feels |
| --- | --- | --- |
| Succeeded | Got JSON, shipped the feature | Expected cost |
| Failed loud | Error code, retry or stop | Often cheap or zero |
| Failed quiet | Still "running" | Keeps ticking |

Most engineering effort goes to the first two rows. The third row is where scraper projects bleed money while dashboards stay calm.

If your integration cannot tell "running" from "broken" inside a bounded time, you do not have a data pipeline — you have a meter in another room. That was us, early on.

What we changed

I will not pretend we reinvented distributed systems. We did the unsexy things you would expect after a four-figure surprise — then we shipped them as product surface, not internal trivia.

Hard timeouts, documented publicly. Every route class got a maximum duration. If a LinkedIn company lookup cannot finish in a bounded window, it fails with a typed error — not an afternoon of RAM usage. Timeout behavior is part of the contract now, not a hope.
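
To make the idea concrete, here is a minimal sketch of a hard timeout that ends in a typed error. It is not our production code: the route name, the 30-second budget, and the names LookupTimeout and run_with_deadline are all assumptions made up for illustration.

```python
import asyncio


class LookupTimeout(Exception):
    """Typed error raised when a lookup exceeds its route's time budget."""

    def __init__(self, route: str, limit_s: float):
        super().__init__(f"{route} exceeded {limit_s}s")
        self.route = route
        self.limit_s = limit_s


# Hypothetical per-route budgets in seconds; the real values are not given here.
MAX_DURATION_S = {"linkedin_company": 30.0}


async def run_with_deadline(route, job, limits=MAX_DURATION_S):
    """Await job() but never longer than the budget configured for `route`."""
    limit_s = limits[route]
    try:
        # wait_for cancels the underlying task when the timeout expires.
        return await asyncio.wait_for(job(), timeout=limit_s)
    except asyncio.TimeoutError:
        # Turn the generic timeout into an error callers can branch on.
        raise LookupTimeout(route, limit_s) from None
```

One caveat on the design: asyncio.wait_for only cancels cooperative coroutines. A worker that is blocked in a subprocess or a headless browser needs the same deadline enforced at the process level, for example by killing the child when the budget runs out.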

Terminal states only. A request is done when the client gets a response with meta.requestId and a known outcome: data, not_found, lookup_failed, or a documented HTTP error. "Still thinking" is not a production state.

The envelope before the payload. Same JSON shape on success and failure. Credits visible on metered routes. Error codes you can branch on. We wrote the design essay in The Envelope is the Product; the post-mortem version is simpler — if support cannot debug your call from one ID, the API is unfinished.
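
On the client side, "terminal states only" can be sketched as a function that maps every response onto exactly one outcome. Only meta.requestId and the outcome names come from the text above. The rest of the body layout (top-level "data" and "error" keys, an "error.code" field) is an assumption for illustration, not the documented envelope.

```python
from dataclasses import dataclass
from typing import Any, Optional


@dataclass(frozen=True)
class Result:
    request_id: str
    outcome: str  # "data", "not_found", "lookup_failed", or "http_error"
    data: Optional[Any] = None
    error_code: Optional[str] = None


def classify(status: int, body: dict) -> Result:
    """Map an HTTP status and parsed JSON body onto one terminal outcome."""
    request_id = (body.get("meta") or {}).get("requestId")
    if not request_id:
        # Without an ID nobody can debug the call, so refuse to guess.
        raise ValueError("response has no meta.requestId")
    # Assumed layout: failures carry {"error": {"code": ...}}.
    code = (body.get("error") or {}).get("code")
    if status >= 400:
        return Result(request_id, "http_error", error_code=code)
    if code in ("not_found", "lookup_failed"):
        return Result(request_id, code, error_code=code)
    return Result(request_id, "data", data=body.get("data"))
```

The point of the shape is that there is no branch that returns "pending": a caller either gets a Result it can log by request_id or an exception it cannot ignore.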

Thorough testing before parallel runs. Smoke-test with GET /v1/whoami (free, no credits) before you fan out real lookups. The Quickstart and the boring API post both walk through the pattern. I wish I had treated "one stuck job in staging" as a release blocker instead of a shrug.
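
The smoke-test-then-fan-out pattern can be sketched as a small gate. Both callables are hypothetical stand-ins: smoke_check would wrap a GET /v1/whoami call and return True when it succeeds, and lookup would perform one real request. Neither is shown here because the base URL and auth details are not part of this post.

```python
from concurrent.futures import ThreadPoolExecutor


def fan_out_after_smoke_test(smoke_check, lookup, targets, max_workers=4):
    """Run `lookup` over `targets` in parallel, but only if the smoke test passes."""
    if not smoke_check():
        # Fail before spending anything on the real batch.
        raise RuntimeError("smoke test failed; not starting the batch")
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # pool.map preserves the order of `targets` in the results.
        return list(pool.map(lookup, targets))
```

Keeping max_workers small on the first run is deliberate: if something is stuck, a handful of stuck jobs is a much cheaper lesson than a few dozen.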

Redundant passes when something smells wrong. Today, if a lookup looks unhealthy, we spin up another attempt at the same task. The first pass might have died in a ditch; the second often succeeds. That choreography is our problem, not the customer's.
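
A rough sketch of a second pass, under one simplifying assumption: "unhealthy" here just means an attempt exceeded its time budget. The real health checks are not described in this post, and the names LookupFailed and lookup_with_second_pass are invented for the example.

```python
import asyncio


class LookupFailed(Exception):
    """Raised when every attempt at a lookup stalled."""


async def lookup_with_second_pass(attempt, per_attempt_s=10.0, max_attempts=2):
    """Call attempt(n) under a deadline; start a fresh attempt if one stalls."""
    for n in range(1, max_attempts + 1):
        try:
            # The stalled attempt is cancelled before the next one starts.
            return await asyncio.wait_for(attempt(n), timeout=per_attempt_s)
        except asyncio.TimeoutError:
            continue
    # Still a terminal state: the caller gets an error, not silence.
    raise LookupFailed(f"{max_attempts} attempts stalled past {per_attempt_s}s each")
```

Note that the total time is still bounded at max_attempts times per_attempt_s, so a redundant pass never reintroduces the open-ended run it is meant to fix.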

None of this is novel computer science. It is what you build after you pay tuition on row three of the table.

The stupid assumption

The embarrassing part: I assumed "no errors" meant "fine."

I had scripts and workers. I had tests that passed on happy paths. I did not have a habit of asking, every time I kicked off a batch: what happens if this never returns? Timeouts felt like pessimism. They were insurance.

I also treated manual review as optional — glance at dashboards when I remembered, trust that loud failures would find me. Stuck jobs do not always show up as error spikes. CPU creep does not always move the HTTP graph. The invoice moved it.

The lesson was not "scraping is hard." I already knew that. The lesson was: silent failure is more expensive than loud failure, and your observability has to hunt for silence on purpose — not just count 500s.
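
Hunting for silence can be as simple as a periodic sweep over job records. The sketch below assumes each job stores a start time and a last-progress timestamp. The Job fields and both thresholds are hypothetical, chosen only to show the shape of the check.

```python
from dataclasses import dataclass


@dataclass
class Job:
    job_id: str
    state: str               # "running", "succeeded", or "failed"
    started_at: float        # seconds since epoch
    last_progress_at: float  # last time the job reported any progress


def find_silent_jobs(jobs, now, max_runtime_s=300.0, max_quiet_s=60.0):
    """Return IDs of jobs that claim to be running but look stuck."""
    stuck = []
    for job in jobs:
        if job.state != "running":
            continue  # terminal jobs cannot be silently stuck
        too_long = now - job.started_at > max_runtime_s
        too_quiet = now - job.last_progress_at > max_quiet_s
        if too_long or too_quiet:
            stuck.append(job.job_id)
    return stuck
```

A sweep like this would have flagged the 37 jobs by age alone, with no dependence on the HTTP error graph ever moving.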

We are more thorough now. That bill bought the boring stuff.

What customers actually ask

Fixing the plumbing changed the conversations we have.

I thought eval calls would open with "Do you support Pinterest?" or "How many rows per dollar?" They still do sometimes.

What closes deals — especially with high-volume teams running serious throughput — is duller: "What happens when it breaks, and how will I know?" Retry mechanics. When data is temporarily off versus wrong. What to log. When to retry 503 temporarily_unavailable versus email support@socialfetch.dev.
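
The retry-versus-escalate question can be sketched as one small decision function. Only the 503 temporarily_unavailable pairing comes from the text above. The attempt cap, the backoff base, and the jitter formula are assumptions for illustration, not a documented retry policy.

```python
import random

# The only status/code pair this sketch treats as safe to retry.
RETRYABLE = {(503, "temporarily_unavailable")}


def next_action(status, code, attempt, max_attempts=4, base_delay_s=0.5,
                rng=random.random):
    """Return ("retry", delay_s) or ("stop", None) for a failed request.

    `attempt` is 1 for the first try. "stop" means: do not retry, log the
    requestId, and contact support if the failure is unexpected.
    """
    if (status, code) in RETRYABLE and attempt < max_attempts:
        delay = base_delay_s * (2 ** (attempt - 1))  # exponential backoff
        # Jitter scales the delay to between 50% and 100% of its nominal value.
        return ("retry", delay * (0.5 + rng() / 2))
    return ("stop", None)
```

The cap matters as much as the backoff: an unbounded retry loop is just another way to build a job that never times out.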

One larger client walked through that in detail before they pointed production traffic at us. Same question as the invoice lesson, just in customer clothes: they are outsourcing failure modes. They want the boring ones handled.

We publish ~3.2s average latency on live fetches and 99.8% uptime over the last 90 days. Those numbers are table stakes. The product is what happens on the bad requests — and whether you can name a single requestId when something looks wrong.

We treat the support inbox like any other on-call surface: acknowledge quickly, fix fast, and close the loop with a short breakdown. For incidents that matter, we write them up. Transparency is part of reliability, from the solo developer on 100 free credits to the team burning through Scale packs.

Where to read more

For the "boring social data" philosophy, see Why Social Media APIs Should Be Boring. For integration and retries, Quickstart and Errors. For platform comparisons, Bright Data and Apify. For the founder story in someone else's words, read Takanari's AI Plaza interview.

Me, I am still making sure nothing runs for hours without telling anyone. Once was enough.