Inside the RSS Pipeline
A look at the cron job that fans out across 16 publisher feeds every 10 minutes, normalises the chaos of RSS, and serves your feed in milliseconds.
Bytes Team
Most news apps use one of two models: (1) scrape everything and hope the ranking algorithm sorts it out, or (2) have an LLM hallucinate summaries from a search index. Both have well-known failure modes — algorithmic feeds chase engagement, generated summaries fabricate quotes.
We took the boring path: read public RSS feeds, surface the publisher's own headlines, link straight to the source. No model in the serving path means no hallucinations and no per-request inference cost.
The pipeline
Every 10 minutes a Vercel Cron job fires /api/cron/sync-news. The route reads our 16-feed registry (lib/rss/feeds.ts) and fans out with Promise.allSettled so a slow publisher can't starve the others. Each feed call is wrapped in an AbortController with an 8-second ceiling, well below the 15-second outer sync timeout.
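The fan-out shape looks roughly like this — a minimal sketch, assuming a registry of `{ id, url }` entries; the names `FEEDS`, `fetchFeed`, and `syncAll` are illustrative, not the actual module exports:

```typescript
type Feed = { id: string; url: string };

// Illustrative registry shape; the real one lives in lib/rss/feeds.ts
const FEEDS: Feed[] = [
  { id: "bbc", url: "https://feeds.bbci.co.uk/news/rss.xml" },
  // ...15 more entries
];

const FEED_TIMEOUT_MS = 8_000; // per-feed ceiling, below the 15 s outer sync timeout

async function fetchFeed(feed: Feed): Promise<string> {
  const controller = new AbortController();
  const timer = setTimeout(() => controller.abort(), FEED_TIMEOUT_MS);
  try {
    const res = await fetch(feed.url, { signal: controller.signal });
    if (!res.ok) throw new Error(`${feed.id}: HTTP ${res.status}`);
    return await res.text();
  } finally {
    clearTimeout(timer); // always release the timer, even on abort/error
  }
}

async function syncAll(feeds: Feed[]) {
  // allSettled (not all): one slow or failing publisher can't starve the rest
  const results = await Promise.allSettled(feeds.map(fetchFeed));
  return results.map((r, i) => ({
    feed: feeds[i].id,
    ok: r.status === "fulfilled",
  }));
}
```

The key design choice is `Promise.allSettled` over `Promise.all`: a rejected feed produces an `ok: false` entry instead of aborting the whole sync.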
Every item flows through three layers before it reaches your feed:
- rss-parser with the Media RSS namespace registered, so we can pull the media:thumbnail/media:content images that BBC, NPR, and The Verge use instead of enclosure
- An allowlist sanitiser that decodes entities, strips HTML, and collapses whitespace — applied to titles and descriptions before they enter the schema
- A URL validator that blocks private IPs, localhost, embedded credentials, and URLs over 2048 characters
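The URL validator is the easiest layer to sketch. The private-IP ranges, credential check, and 2048-character limit come from the rules above; the function name and exact regexes are illustrative:

```typescript
// RFC 1918 / loopback / link-local hostnames we never want to fetch or link to
const PRIVATE_HOST = /^(localhost|127\.|10\.|192\.168\.|169\.254\.|0\.)/i;
const PRIVATE_172 = /^172\.(1[6-9]|2\d|3[01])\./; // 172.16.0.0/12

function isSafeArticleUrl(raw: string): boolean {
  if (raw.length > 2048) return false;
  let url: URL;
  try {
    url = new URL(raw);
  } catch {
    return false; // not parseable at all
  }
  if (url.protocol !== "http:" && url.protocol !== "https:") return false;
  if (url.username || url.password) return false; // embedded credentials
  if (PRIVATE_HOST.test(url.hostname) || PRIVATE_172.test(url.hostname)) {
    return false; // private IPs and localhost
  }
  return true;
}
```

A production validator would also resolve DNS before fetching (a public hostname can still point at a private address), but for links that are only ever rendered to the user, string-level checks like these cover the common cases.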
Dedup and storage
Two publishers often syndicate the same wire story. We collapse duplicates by hashing the source URL and using that hash as the article's _id, so an upsert is a no-op when the same link comes through twice.
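A sketch of the dedup key, assuming the hash is SHA-256 (the post doesn't name the algorithm) and a hypothetical `articles` collection:

```typescript
import { createHash } from "node:crypto";

// Deterministic _id from the source URL: the same link always maps to the
// same document, so a second upsert of a syndicated story is a no-op.
function articleId(sourceUrl: string): string {
  return createHash("sha256").update(sourceUrl).digest("hex");
}

// Illustrative upsert against MongoDB — $setOnInsert only writes fields when
// the document is first created, leaving existing articles untouched:
//
//   await articles.updateOne(
//     { _id: articleId(item.link) },
//     { $setOnInsert: { ...item, fetchedAt: new Date() } },
//     { upsert: true }
//   );
```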
Articles auto-expire after 24 hours via a TTL index on the articles collection — Mongo's background reaper drops them without an external cron job.
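The TTL index is one-time setup, not part of the request path. A sketch, assuming each article document stores a `fetchedAt: Date` (the field name is illustrative):

```typescript
// Config fragment: with expireAfterSeconds set, Mongo's background reaper
// deletes each document roughly 24 h after its fetchedAt timestamp
// (sweeps run about once a minute, so expiry is approximate).
async function ensureTtlIndex(db: { collection(name: string): any }) {
  await db
    .collection("articles")
    .createIndex({ fetchedAt: 1 }, { expireAfterSeconds: 24 * 60 * 60 });
}
```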
Serving
The cache layer is two-tier: hot results live in an in-memory Map (max 500 entries, Zod-validated on every read), and durable rows live in MongoDB. If both miss we kick a synchronous sync, but the cron usually beats the user to it.
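The hot tier can be sketched as a bounded Map with validation on every read. The post uses Zod for that validation; to keep this sketch dependency-free, a hand-rolled type guard stands in, and all names are illustrative:

```typescript
type Article = { _id: string; title: string; url: string };

// Stand-in for the Zod schema: reject anything that isn't a well-formed
// Article[] so a corrupted cache entry falls through to MongoDB instead
// of reaching the client.
function isArticleArray(v: unknown): v is Article[] {
  return (
    Array.isArray(v) &&
    v.every(
      (a) =>
        a !== null &&
        typeof a === "object" &&
        typeof (a as Article)._id === "string" &&
        typeof (a as Article).title === "string" &&
        typeof (a as Article).url === "string"
    )
  );
}

const MAX_ENTRIES = 500;
const memCache = new Map<string, unknown>();

function cacheSet(key: string, value: Article[]): void {
  if (!memCache.has(key) && memCache.size >= MAX_ENTRIES) {
    // Map preserves insertion order, so the first key is the oldest entry
    memCache.delete(memCache.keys().next().value as string);
  }
  memCache.set(key, value);
}

function cacheGet(key: string): Article[] | null {
  const raw = memCache.get(key);
  if (raw === undefined) return null;
  return isArticleArray(raw) ? raw : null; // validate on every read
}
```

A miss here (or a failed validation) falls back to MongoDB, and only a miss at both tiers triggers the synchronous sync.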
What's next
The roadmap is more publishers, not more cleverness — Reuters, Bloomberg, and Wired are on the shortlist. The fan-out architecture handles new feeds with one entry in the registry, so the marginal cost of another publisher is roughly zero.