System Design: Web Crawler — From BFS to Petabyte-Scale Crawling

Series System Design Interview Prep — #7 of 15

Google’s crawler visits trillions of pages and re-crawls much of the known web every few days. The infrastructure for that dwarfs most national data centres.

A web crawler sounds almost trivially simple: follow links, download pages, repeat. Every computer science student has written one. Yet at the scale of a search engine — 1 billion pages, 1 month, fault-tolerant, polite — nearly every naïve assumption breaks spectacularly. This article dismantles the problem layer by layer, from a 30-line Python prototype to a distributed architecture that could actually power a real search index.

The question: Design a web crawler for a search engine. Crawl 1 billion pages in a month. Handle duplicates, politeness, failures, and freshness.


1. The Problem

A web crawler visits URLs, downloads HTML, extracts new links, enqueues them, and repeats. The algorithm — breadth-first search on the web graph — fits in a whiteboard sketch. The scale does not.

  • 1B pages / month — the target
  • ~400 pages / second — 1B ÷ 30 days
  • 40 MB/s raw download — at 100 KB / page
  • ~100 TB storage / month — HTML + metadata

Every requirement the interviewer adds — deduplication, politeness, fault tolerance, freshness — forces a new architectural layer. Let’s build them one at a time.


2. Level 1 — Single-Threaded BFS

Start with the simplest possible implementation. It works for thousands of URLs, and it illustrates every bottleneck we’ll need to fix.

Python
from collections import deque
import requests
from bs4 import BeautifulSoup
from urllib.parse import urljoin, urlparse

def crawl(seed_urls, max_pages=1000):
    queue   = deque(seed_urls)
    visited = set()          # ← naive dedup: O(n) memory
    pages   = []

    while queue and len(pages) < max_pages:
        url = queue.popleft()
        if url in visited:
            continue
        visited.add(url)

        try:
            resp = requests.get(url, timeout=5)
            if resp.status_code != 200:
                continue
            pages.append((url, resp.text))
            soup = BeautifulSoup(resp.text, "html.parser")
            for tag in soup.find_all("a", href=True):
                link = urljoin(url, tag["href"])
                if link not in visited:
                    queue.append(link)
        except Exception:
            pass          # no retry, no logging

    return pages

Bottlenecks at scale:

  • Single thread → 1 page at a time. At 1s/page, 1B pages = 31 years.
  • visited set at 1B URLs × ~100 bytes = 100 GB RAM just for deduplication.
  • No politeness — will get IP-banned within minutes.
  • No content deduplication — mirrors get stored repeatedly.
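The single-thread bottleneck is the first one to fall: a pool of concurrent fetchers raises throughput roughly linearly with worker count, until network and politeness limits bite. A minimal sketch using `concurrent.futures` — the `fetch` callable is injected so the crawl logic stays testable, and all names here are illustrative:

```python
from concurrent.futures import ThreadPoolExecutor, as_completed

def crawl_concurrent(seed_urls, fetch, max_pages=1000, workers=64):
    # `fetch(url)` must return (body, outlinks); plug in a requests-based
    # fetcher in practice. Dedup is still a plain set — fixed in Level 2.
    visited  = set(seed_urls)
    frontier = list(seed_urls)
    pages    = []
    with ThreadPoolExecutor(max_workers=workers) as pool:
        while frontier and len(pages) < max_pages:
            # Fetch one batch of URLs in parallel
            batch, frontier = frontier[:workers], frontier[workers:]
            futures = {pool.submit(fetch, url): url for url in batch}
            for fut in as_completed(futures):
                url = futures[fut]
                try:
                    body, links = fut.result()
                except Exception:
                    continue              # still no retry — see Level 7
                pages.append((url, body))
                for link in links:
                    if link not in visited:
                        visited.add(link)
                        frontier.append(link)
    return pages

# Usage with a stub fetcher over a tiny in-memory "web":
graph = {"a": ["b", "c"], "b": ["c", "d"], "c": [], "d": []}
fake_fetch = lambda u: ("<html>" + u + "</html>", graph[u])
pages = crawl_concurrent(["a"], fake_fetch)
```

Injecting `fetch` also makes it easy to swap in the retry and politeness logic added in later levels without touching the crawl loop.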

3. Level 2 — URL Deduplication with Bloom Filters

A Bloom filter for 1B URLs uses ~1.2 GB at a 1% false-positive rate. Without it, deduplication alone would need 100 GB of RAM — per crawler node.

The naïve HashSet approach is unusable at scale. Enter the Bloom filter — a probabilistic data structure that trades a small false-positive rate for enormous memory savings. (Covered in depth in Series #1.)

Memory comparison for 1 billion URLs:

Memory model
# HashSet (Python dict) — exact dedup
1_000_000_000 urls × 100 bytes = 100 GB   ← not viable per node

# Bloom filter — probabilistic dedup (1% FPR)
m = -(n × ln p) / (ln 2)²
  = -(1e9 × ln 0.01) / 0.480
  ≈ 9.585 billion bits ≈ 1.2 GB   ← ~83× smaller
Trade-off: A Bloom filter can produce false positives (1% means ~10M URLs incorrectly marked as "seen"). Those pages get skipped. For a web crawler this is acceptable — skipping 1% of new URLs is far better than running out of memory.
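To make the trade-off concrete, here is a minimal Bloom filter sketch — illustrative only, not the tuned implementation a production crawler would use. The sizing follows the formula above; bit positions are derived from 4-byte slices of a single SHA-256 digest, which caps k at 8 (enough for 1% FPR, where k ≈ 7):

```python
import hashlib
import math

class BloomFilter:
    """Minimal Bloom filter sketch (names and layout are illustrative)."""

    def __init__(self, n_items, fp_rate):
        # Optimal sizing: m = -(n ln p) / (ln 2)^2,  k = (m / n) ln 2
        self.m = int(-n_items * math.log(fp_rate) / (math.log(2) ** 2))
        self.k = max(1, round((self.m / n_items) * math.log(2)))
        self.bits = bytearray((self.m + 7) // 8)

    def _positions(self, item):
        # Carve k bit positions out of one SHA-256 digest (caps k at 8)
        digest = hashlib.sha256(item.encode()).digest()
        for i in range(self.k):
            chunk = int.from_bytes(digest[i * 4:(i + 1) * 4], "big")
            yield chunk % self.m

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, item):
        return all(self.bits[pos // 8] & (1 << (pos % 8))
                   for pos in self._positions(item))

bf = BloomFilter(n_items=1000, fp_rate=0.01)
bf.add("https://example.com/")
seen = "https://example.com/" in bf      # always True once added
```

For n = 1000 and p = 0.01 this yields m = 9,585 bits and k = 7 — the same 9.6 bits/key ratio that gives 1.2 GB at one billion URLs.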

4. Level 3 — Politeness: robots.txt + Crawl Delay

The politeness constraint is not optional — Google was nearly banned from the early web for crawling too aggressively. Respecting robots.txt is both ethical and practical.

Hammering a server with 400 requests/second will get your crawler blocked and possibly legally challenged. Two mechanisms govern polite crawling:

robots.txt — A convention (RFC 9309) that websites use to declare which paths are off-limits. Crawlers must fetch and cache it before visiting any page on a domain.

robots.txt — example
# Disallow admin and private areas
User-agent: *
Disallow: /admin/
Disallow: /private/
Disallow: /search?
Crawl-delay: 2

# Allow Googlebot full access
User-agent: Googlebot
Disallow:
Python — robots.txt parser + per-domain rate limiter
import time, heapq
from urllib.robotparser import RobotFileParser
from urllib.parse import urlparse

# Cache: domain → (RobotFileParser | None, crawl_delay)
robots_cache = {}

def get_robots(domain):
    if domain not in robots_cache:
        rp = RobotFileParser()
        rp.set_url("https://" + domain + "/robots.txt")
        try:
            rp.read()             # network fetch — may raise on DNS/conn errors
        except OSError:
            rp = None             # unreachable robots.txt → assume no rules
        delay = (rp.crawl_delay("*") if rp else None) or 1.0   # default 1 req/sec
        robots_cache[domain] = (rp, delay)
    return robots_cache[domain]

# Priority queue of (next_allowed_time, url)
frontier = []
next_time = {}   # domain → next allowed fetch timestamp

def enqueue(url):
    domain = urlparse(url).netloc
    t      = max(time.time(), next_time.get(domain, 0))
    heapq.heappush(frontier, (t, url))

def fetch_next():
    t, url = heapq.heappop(frontier)
    now    = time.time()
    if t > now:
        time.sleep(t - now)       # honour the domain's crawl delay
    domain    = urlparse(url).netloc
    rp, delay = get_robots(domain)
    if rp and not rp.can_fetch("*", url):
        return None               # disallowed by robots.txt
    next_time[domain] = time.time() + delay
    return url

5. Level 4 — Content Deduplication (Near-Duplicate Detection)

URL deduplication ensures we don’t visit the same URL twice. But what about two different URLs that serve identical or nearly identical content? (Mirrors, pagination, print views, trailing-slash variants.)

Exact duplicates are easy — compute MD5 of the response body:

Python
import hashlib

content_hashes = set()

def is_duplicate(html_bytes):
    h = hashlib.md5(html_bytes).hexdigest()
    if h in content_hashes:
        return True
    content_hashes.add(h)
    return False

Near-duplicates require SimHash — a locality-sensitive hashing technique where similar documents produce hashes that differ in few bits.

SimHash was invented by
Moses Charikar in 2002.
Google uses it to detect
near-duplicate web pages
across hundreds of billions
of documents.

The intuition: split the document into shingles (word n-grams), hash each, weight them by term frequency, then collapse into a 64-bit fingerprint. Documents sharing 90%+ of their content typically produce fingerprints that differ in only a few bits (a common near-duplicate threshold is a Hamming distance of ≤ 3).

SimHash comparison — "Article about crawlers" vs "Article about scrapers" (92% overlap): the two 64-bit fingerprints differ in 3 bit positions. Hamming distance 3 → near-duplicate (threshold ≤ 3).
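A compact sketch of the technique — word-bigram shingles with unweighted votes here, whereas real systems weight by term frequency as described above; the shingling choice is an assumption for illustration:

```python
import hashlib
import re

def simhash(text, bits=64):
    """64-bit SimHash over word-bigram shingles (unweighted sketch)."""
    words = re.findall(r"\w+", text.lower())
    shingles = [" ".join(words[i:i + 2]) for i in range(len(words) - 1)] or words
    vec = [0] * bits                       # per-bit vote counter
    for sh in shingles:
        h = int.from_bytes(hashlib.md5(sh.encode()).digest()[:8], "big")
        for b in range(bits):
            vec[b] += 1 if (h >> b) & 1 else -1
    # Collapse votes: bit is 1 where the positive votes win
    return sum(1 << b for b in range(bits) if vec[b] > 0)

def hamming(a, b):
    return bin(a ^ b).count("1")

doc_a = "the quick brown fox jumps over the lazy dog near the river bank"
doc_b = "the quick brown fox jumps over the lazy cat near the river bank"
d = hamming(simhash(doc_a), simhash(doc_b))   # small for near-duplicates
```

Identical documents always land at distance 0; the more shingles two documents share, the more bit votes agree and the smaller the Hamming distance tends to be.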

6. Level 5 — Distributed Architecture

Single-node crawling is a dead end. At 400 pages/second we need 100+ fetcher workers. Here is the full distributed pipeline; each component brings its own role, failure modes, and scale characteristics.

Distributed Crawler — Component Map

URL Frontier → Fetcher Workers → HTML Parser → Link Extractor → URL Filter → Bloom Filter → URL Frontier (feedback loop)

Fetcher Workers → Content Store (S3)
DNS Resolver Cache ↔ Fetcher Workers
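One design point worth making explicit: the frontier is typically sharded so that every URL of a given domain routes to the same fetcher worker — that way per-domain politeness state (robots.txt cache, next-allowed-time) stays node-local and needs no cross-node coordination. A hedged sketch, with the shard count chosen arbitrarily:

```python
import hashlib
from urllib.parse import urlparse

NUM_SHARDS = 16   # assumed worker count — real systems use consistent hashing
                  # so shards can be added without reshuffling every domain

def shard_for(url, num_shards=NUM_SHARDS):
    # Hash the *domain*, not the full URL: every page of example.com
    # lands on one worker, keeping its politeness state node-local.
    domain = urlparse(url).netloc.lower()
    h = int.from_bytes(hashlib.md5(domain.encode()).digest()[:8], "big")
    return h % num_shards

# All URLs of one domain route to the same shard:
urls   = ["https://example.com/a", "https://example.com/b?page=2"]
shards = {shard_for(u) for u in urls}
```

The cost of this scheme is skew — one huge domain can hot-spot a shard — which is why the per-domain URL caps from Level 7 matter at the frontier as well.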

7. Level 6 — URL Frontier Priority & Freshness

Not all URLs are equal. A breaking news article should be re-crawled within minutes; a Wikipedia stub from 2009 can wait days. The frontier must support priority and freshness.

Priority is based on site importance (PageRank-like score) and content type:

Site type                      Priority   Re-crawl interval
Major news (CNN, BBC)          High       15 minutes
Wikipedia, reference           High       1–7 days
Active blogs                   Medium     1–7 days
Static corporate sites         Medium     30 days
Parked domains / low-quality   Low        90+ days

Freshness formula — used to schedule the next crawl time:

priority = site_importance × ( 1 / days_since_last_change )

Sites that rarely change (days_since_last_change → ∞) converge to priority ≈ 0 and sink to the bottom of the queue. Sites that update daily stay near the top.
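In code, the scoring and scheduling might look like the sketch below. The formula is the one above; the mapping from priority to a re-crawl interval, and the 15-minute/90-day clamps, are hypothetical choices for illustration:

```python
def freshness_priority(site_importance, days_since_last_change):
    # priority = site_importance × (1 / days_since_last_change)
    # floor the denominator to avoid division by zero for just-changed pages
    return site_importance / max(days_since_last_change, 0.01)

def next_crawl_time(last_crawl_ts, priority,
                    min_interval_s=900, max_interval_s=90 * 86_400):
    """Map priority to a re-crawl interval (hypothetical: inverse-linear,
    clamped between 15 minutes and 90 days)."""
    interval = 86_400 / max(priority, 1e-6)
    interval = min(max(interval, min_interval_s), max_interval_s)
    return last_crawl_ts + interval

# A constantly-updated news site vs a parked domain:
news   = freshness_priority(1.0, 0.01)   # changed within the last ~15 minutes
parked = freshness_priority(0.1, 365)    # unchanged for a year
```

Under these assumptions the news site is re-scheduled 15 minutes out (it hits the clamp), while the parked domain sinks to the 90-day ceiling — exactly the behaviour the table above describes.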

Sitemap hints: Many sites publish sitemap.xml with <changefreq> and <lastmod> tags. The crawler reads these to pre-compute freshness scores without needing to fetch every page blindly.


8. Level 7 — Handling Failures & Traps

Spider Traps

A poorly designed website can create infinite URL spaces: /calendar/2024/01/01, /calendar/2024/01/02, …, /calendar/∞. Or parameterised URLs that combine into millions of variants.

Spider trap signatures: URL depth > 10 path segments; same domain with > 100,000 queued URLs; monotonically increasing numeric parameter in URL path.

Defences:

Python — trap detection
from urllib.parse import urlparse

MAX_DEPTH  = 10
MAX_DOMAIN = 100_000   # URLs queued per domain
domain_count = {}

def is_trap(url):
    p      = urlparse(url)
    depth  = p.path.count("/")
    domain = p.netloc
    if depth > MAX_DEPTH:
        return True
    domain_count[domain] = domain_count.get(domain, 0) + 1
    if domain_count[domain] > MAX_DOMAIN:
        return True
    return False

Retry Logic with Exponential Backoff

Python
import time, requests

def fetch_with_retry(url, max_retries=3):
    for attempt in range(max_retries):
        try:
            resp = requests.get(url, timeout=10)
            if resp.status_code == 429:            # Too Many Requests
                retry_after = int(resp.headers.get("Retry-After", 60))
                time.sleep(retry_after)
                continue
            if resp.status_code in (500, 502, 503, 504):
                time.sleep(2 ** attempt)            # exponential backoff
                continue
            return resp
        except requests.Timeout:
            time.sleep(2 ** attempt)
    return None    # give up after max_retries

9. BFS Crawl Visualiser

Picture stepping through a mini web graph of 12 nodes: BFS discovers pages level by level, queues newly extracted links, and the Bloom filter catches already-visited URLs before they are re-enqueued.

10. Capacity Estimation

Metric                       Assumption                       Value
Total pages                  Target                           1,000,000,000
Avg page size                HTML only                        100 KB
Raw HTML storage             1B × 100 KB                      100 TB
Required throughput          1B ÷ 2,592,000 s                 ~400 pages/sec
Network bandwidth            400 × 100 KB                     ~40 MB/s
Min fetcher threads          400 pages/sec × 0.5 s avg        200 threads
Fetcher nodes                200 threads ÷ 100 threads/node   ~2 nodes (plus headroom for politeness stalls)
DNS lookups/sec (raw)        400 req/sec, distinct domains    ~400/sec
DNS lookups/sec (cached)     70% cache hit rate               ~120/sec
Bloom filter RAM (1B URLs)   1% FPR, 9.6 bits/key             1.2 GB
URL metadata DB              1B × 200 bytes                   ~200 GB
robots.txt cache             50K domains × 2 KB               ~100 MB
Rule of thumb: At search-engine scale, the bottleneck is almost never CPU — it's network I/O, DNS latency, and politeness constraints. Design the frontier and rate-limiter before worrying about parser throughput.
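The table's headline numbers can be sanity-checked in a few lines (the table rounds; exact values shown in comments):

```python
PAGES     = 1_000_000_000
PAGE_SIZE = 100 * 1024        # 100 KB
MONTH_S   = 30 * 24 * 3600    # 2,592,000 s

throughput = PAGES / MONTH_S                 # ≈ 386 pages/sec → "~400"
bandwidth  = throughput * PAGE_SIZE / 1e6    # ≈ 39.5 MB/s     → "~40"
storage_tb = PAGES * PAGE_SIZE / 1e12        # ≈ 102 TB        → "~100"
threads    = throughput * 0.5                # ≈ 193 at 0.5 s/fetch → "200"
bloom_gb   = PAGES * 9.6 / 8 / 1e9           # = 1.2 GB at 9.6 bits/key
```

Running the numbers yourself in an interview — even mentally, to one significant figure — is worth more than memorising the table.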

11. Scrapy vs. Custom Crawler

Feature                 Scrapy                            Custom (Distributed)
Setup time              Hours                             Weeks–months
Scale ceiling           ~1M pages/day (single node)       Billions of pages/day
Politeness built-in     Yes (AutoThrottle)                Must implement manually
robots.txt support      Yes                               Must implement manually
Custom URL priority     Limited                           Full control
Distributed frontier    No (use Scrapyd / Scrapy-Redis)   Native design
Fault tolerance         Basic (Scrapy-Redis helps)        Designed in (leases, WAL)
Best for                Targeted scraping, ≤ 10M pages    Search engine, ≥ 100M pages

12. Summary

The web crawler is a deceptively rich system design problem. Each layer of scale breaks the previous solution:

Level   Problem introduced        Solution
1       Speed                     Multi-threading / multi-node
2       URL dedup memory          Bloom filter
3       Server abuse              robots.txt + per-domain rate limiting
4       Content dedup             MD5 (exact) + SimHash (near-duplicate)
5       Single point of failure   Distributed frontier + stateless fetchers
6       Stale content             Priority queue + freshness scoring
7       Infinite traps            Depth limit + domain URL cap + retry backoff

In an interview, walk through these levels in order. Mention the Bloom filter’s memory trade-off explicitly (interviewers love it). Draw the distributed pipeline with the feedback loop. And always name-check politeness — it signals real-world awareness that separates senior engineers from textbook answers.