System Design: Web Crawler — From BFS to Petabyte-Scale Crawling

Series System Design Interview Prep — #7 of 15

Google’s crawler visits trillions of pages and re-crawls much of the known web every few days. The infrastructure for that dwarfs most national data centres.

A web crawler sounds almost trivially simple: follow links, download pages, repeat. Every computer science student has written one. Yet at the scale of a search engine — 1 billion pages, 1 month, fault-tolerant, polite — nearly every naïve assumption breaks spectacularly. This article dismantles the problem layer by layer, from a 30-line Python prototype to a distributed architecture that could actually power a real search index.

The question: Design a web crawler for a search engine. Crawl 1 billion pages in a month. Handle duplicates, politeness, failures, and freshness.


1. The Problem

A web crawler visits URLs, downloads HTML, extracts new links, enqueues them, and repeats. The algorithm — breadth-first search on the web graph — fits in a whiteboard sketch. The scale does not.

  • 1B pages / month — the target
  • ~400 pages / second — 1B ÷ 30 days
  • 40 MB/s raw download — at 100 KB / page
  • ~100 TB storage / month — HTML + metadata

Every requirement the interviewer adds — deduplication, politeness, fault tolerance, freshness — forces a new architectural layer. Let’s build them one at a time.


2. Level 1 — Single-Threaded BFS

Start with the simplest possible implementation. It works for thousands of URLs, and it illustrates every bottleneck we’ll need to fix.

Python
from collections import deque
import requests
from bs4 import BeautifulSoup
from urllib.parse import urljoin, urlparse

def crawl(seed_urls, max_pages=1000):
    queue   = deque(seed_urls)
    visited = set()          # ← naive dedup: O(n) memory
    pages   = []

    while queue and len(pages) < max_pages:
        url = queue.popleft()
        if url in visited:
            continue
        visited.add(url)

        try:
            resp = requests.get(url, timeout=5)
            if resp.status_code != 200:
                continue
            pages.append((url, resp.text))
            soup = BeautifulSoup(resp.text, "html.parser")
            for tag in soup.find_all("a", href=True):
                link = urljoin(url, tag["href"])
                if link not in visited:
                    queue.append(link)
        except Exception:
            pass          # no retry, no logging

    return pages

Bottlenecks at scale:

  • Single thread → 1 page at a time. At 1s/page, 1B pages = 31 years.
  • visited set at 1B URLs × ~100 bytes = 100 GB RAM just for deduplication.
  • No politeness — will get IP-banned within minutes.
  • No content deduplication — mirrors get stored repeatedly.
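The single-thread bottleneck is the first one to fall: a pool of concurrent fetchers raises throughput roughly linearly with worker count, until network and politeness limits bite. A minimal sketch using `concurrent.futures` — the `fetch` callable is injected so the crawl logic stays testable, and all names here are illustrative:

```python
from concurrent.futures import ThreadPoolExecutor, as_completed

def crawl_concurrent(seed_urls, fetch, max_pages=1000, workers=64):
    # `fetch(url)` must return (body, outlinks); plug in a requests-based
    # fetcher in practice. Dedup is still a plain set — fixed in Level 2.
    visited  = set(seed_urls)
    frontier = list(seed_urls)
    pages    = []
    with ThreadPoolExecutor(max_workers=workers) as pool:
        while frontier and len(pages) < max_pages:
            # Fetch one batch of URLs in parallel
            batch, frontier = frontier[:workers], frontier[workers:]
            futures = {pool.submit(fetch, url): url for url in batch}
            for fut in as_completed(futures):
                url = futures[fut]
                try:
                    body, links = fut.result()
                except Exception:
                    continue              # still no retry — see Level 7
                pages.append((url, body))
                for link in links:
                    if link not in visited:
                        visited.add(link)
                        frontier.append(link)
    return pages

# Usage with a stub fetcher over a tiny in-memory "web":
graph = {"a": ["b", "c"], "b": ["c", "d"], "c": [], "d": []}
fake_fetch = lambda u: ("<html>" + u + "</html>", graph[u])
pages = crawl_concurrent(["a"], fake_fetch)
```

Injecting `fetch` also makes it easy to swap in the retry and politeness logic added in later levels without touching the crawl loop.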

3. Level 2 — URL Deduplication with Bloom Filters

A Bloom filter for 1B URLs uses ~1.2 GB at a 1% false-positive rate. Without it, deduplication alone would need 100 GB of RAM — per crawler node.

The naïve HashSet approach is unusable at scale. Enter the Bloom filter — a probabilistic data structure that trades a small false-positive rate for enormous memory savings. (Covered in depth in Series #1.)

Memory comparison for 1 billion URLs:

Memory model
# HashSet (Python dict) — exact dedup
1_000_000_000 urls × 100 bytes = 100 GB   ← not viable per node

# Bloom filter — probabilistic dedup (1% FPR)
m = -(n × ln p) / (ln 2)²
  = -(1e9 × ln 0.01) / 0.480
  ≈ 9.585 billion bits ≈ 1.2 GB   ← ~83× smaller
Trade-off: A Bloom filter can produce false positives (1% means ~10M URLs incorrectly marked as "seen"). Those pages get skipped. For a web crawler this is acceptable — skipping 1% of new URLs is far better than running out of memory.
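To make the trade-off concrete, here is a minimal Bloom filter sketch — illustrative only, not the tuned implementation a production crawler would use. The sizing follows the formula above; bit positions are derived from 4-byte slices of a single SHA-256 digest, which caps k at 8 (enough for 1% FPR, where k ≈ 7):

```python
import hashlib
import math

class BloomFilter:
    """Minimal Bloom filter sketch (names and layout are illustrative)."""

    def __init__(self, n_items, fp_rate):
        # Optimal sizing: m = -(n ln p) / (ln 2)^2,  k = (m / n) ln 2
        self.m = int(-n_items * math.log(fp_rate) / (math.log(2) ** 2))
        self.k = max(1, round((self.m / n_items) * math.log(2)))
        self.bits = bytearray((self.m + 7) // 8)

    def _positions(self, item):
        # Carve k bit positions out of one SHA-256 digest (caps k at 8)
        digest = hashlib.sha256(item.encode()).digest()
        for i in range(self.k):
            chunk = int.from_bytes(digest[i * 4:(i + 1) * 4], "big")
            yield chunk % self.m

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, item):
        return all(self.bits[pos // 8] & (1 << (pos % 8))
                   for pos in self._positions(item))

bf = BloomFilter(n_items=1000, fp_rate=0.01)
bf.add("https://example.com/")
seen = "https://example.com/" in bf      # always True once added
```

For n = 1000 and p = 0.01 this yields m = 9,585 bits and k = 7 — the same 9.6 bits/key ratio that gives 1.2 GB at one billion URLs.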

4. Level 3 — Politeness: robots.txt + Crawl Delay

The politeness constraint is not optional — Google was nearly banned from the early web for crawling too aggressively. Respecting robots.txt is both ethical and practical.

Hammering a server with 400 requests/second will get your crawler blocked and possibly legally challenged. Two mechanisms govern polite crawling:

robots.txt — A convention (RFC 9309) that websites use to declare which paths are off-limits. Crawlers must fetch and cache it before visiting any page on a domain.

robots.txt — example
# Disallow admin and private areas
User-agent: *
Disallow: /admin/
Disallow: /private/
Disallow: /search?
Crawl-delay: 2

# Allow Googlebot full access
User-agent: Googlebot
Disallow:
Python — robots.txt parser + per-domain rate limiter
import time, heapq
from urllib.robotparser import RobotFileParser
from urllib.parse import urlparse

# Cache: domain → (RobotFileParser | None, crawl_delay)
robots_cache = {}

def get_robots(domain):
    if domain not in robots_cache:
        rp = RobotFileParser()
        rp.set_url("https://" + domain + "/robots.txt")
        try:
            rp.read()             # network fetch — may raise on DNS/conn errors
        except OSError:
            rp = None             # unreachable robots.txt → assume no rules
        delay = (rp.crawl_delay("*") if rp else None) or 1.0   # default 1 req/sec
        robots_cache[domain] = (rp, delay)
    return robots_cache[domain]

# Priority queue of (next_allowed_time, url)
frontier = []
next_time = {}   # domain → next allowed fetch timestamp

def enqueue(url):
    domain = urlparse(url).netloc
    t      = max(time.time(), next_time.get(domain, 0))
    heapq.heappush(frontier, (t, url))

def fetch_next():
    t, url = heapq.heappop(frontier)
    now    = time.time()
    if t > now:
        time.sleep(t - now)       # honour the domain's crawl delay
    domain    = urlparse(url).netloc
    rp, delay = get_robots(domain)
    if rp and not rp.can_fetch("*", url):
        return None               # disallowed by robots.txt
    next_time[domain] = time.time() + delay
    return url

5. Level 4 — Content Deduplication (Near-Duplicate Detection)

URL deduplication ensures we don’t visit the same URL twice. But what about two different URLs that serve identical or nearly identical content? (Mirrors, pagination, print views, trailing-slash variants.)

Exact duplicates are easy — compute MD5 of the response body:

Python
import hashlib

content_hashes = set()

def is_duplicate(html_bytes):
    h = hashlib.md5(html_bytes).hexdigest()
    if h in content_hashes:
        return True
    content_hashes.add(h)
    return False

Near-duplicates require SimHash — a locality-sensitive hashing technique where similar documents produce hashes that differ in few bits.

SimHash was invented by
Moses Charikar in 2002.
Google uses it to detect
near-duplicate web pages
across hundreds of billions
of documents.

The intuition: split the document into shingles (word n-grams), hash each, weight them by term frequency, then collapse into a 64-bit fingerprint. Documents sharing 90%+ of their content typically produce fingerprints that differ in only a few bits (a common near-duplicate threshold is a Hamming distance of ≤ 3).

SimHash comparison — "Article about crawlers" vs "Article about scrapers" (92% overlap): the two 64-bit fingerprints differ in 3 bit positions. Hamming distance 3 → near-duplicate (threshold ≤ 3).
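A compact sketch of the technique — word-bigram shingles with unweighted votes here, whereas real systems weight by term frequency as described above; the shingling choice is an assumption for illustration:

```python
import hashlib
import re

def simhash(text, bits=64):
    """64-bit SimHash over word-bigram shingles (unweighted sketch)."""
    words = re.findall(r"\w+", text.lower())
    shingles = [" ".join(words[i:i + 2]) for i in range(len(words) - 1)] or words
    vec = [0] * bits                       # per-bit vote counter
    for sh in shingles:
        h = int.from_bytes(hashlib.md5(sh.encode()).digest()[:8], "big")
        for b in range(bits):
            vec[b] += 1 if (h >> b) & 1 else -1
    # Collapse votes: bit is 1 where the positive votes win
    return sum(1 << b for b in range(bits) if vec[b] > 0)

def hamming(a, b):
    return bin(a ^ b).count("1")

doc_a = "the quick brown fox jumps over the lazy dog near the river bank"
doc_b = "the quick brown fox jumps over the lazy cat near the river bank"
d = hamming(simhash(doc_a), simhash(doc_b))   # small for near-duplicates
```

Identical documents always land at distance 0; the more shingles two documents share, the more bit votes agree and the smaller the Hamming distance tends to be.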

6. Level 5 — Distributed Architecture

Single-node crawling is a dead end. At 400 pages/second we need 100+ fetcher workers. Here is the full distributed pipeline; each component brings its own role, failure modes, and scale characteristics.

Distributed Crawler — Component Map

URL Frontier → Fetcher Workers → HTML Parser → Link Extractor → URL Filter → Bloom Filter → URL Frontier (feedback loop)

Fetcher Workers → Content Store (S3)
DNS Resolver Cache ↔ Fetcher Workers
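One design point worth making explicit: the frontier is typically sharded so that every URL of a given domain routes to the same fetcher worker — that way per-domain politeness state (robots.txt cache, next-allowed-time) stays node-local and needs no cross-node coordination. A hedged sketch, with the shard count chosen arbitrarily:

```python
import hashlib
from urllib.parse import urlparse

NUM_SHARDS = 16   # assumed worker count — real systems use consistent hashing
                  # so shards can be added without reshuffling every domain

def shard_for(url, num_shards=NUM_SHARDS):
    # Hash the *domain*, not the full URL: every page of example.com
    # lands on one worker, keeping its politeness state node-local.
    domain = urlparse(url).netloc.lower()
    h = int.from_bytes(hashlib.md5(domain.encode()).digest()[:8], "big")
    return h % num_shards

# All URLs of one domain route to the same shard:
urls   = ["https://example.com/a", "https://example.com/b?page=2"]
shards = {shard_for(u) for u in urls}
```

The cost of this scheme is skew — one huge domain can hot-spot a shard — which is why the per-domain URL caps from Level 7 matter at the frontier as well.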

7. Level 6 — URL Frontier Priority & Freshness

Not all URLs are equal. A breaking news article should be re-crawled within minutes; a Wikipedia stub from 2009 can wait days. The frontier must support priority and freshness.

Priority is based on site importance (PageRank-like score) and content type:

Site type                      Priority   Re-crawl interval
Major news (CNN, BBC)          High       15 minutes
Wikipedia, reference           High       1–7 days
Active blogs                   Medium     1–7 days
Static corporate sites         Medium     30 days
Parked domains / low-quality   Low        90+ days

Freshness formula — used to schedule the next crawl time:

priority = site_importance × ( 1 / days_since_last_change )

Sites that rarely change (days_since_last_change → ∞) converge to priority ≈ 0 and sink to the bottom of the queue. Sites that update daily stay near the top.
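In code, the scoring and scheduling might look like the sketch below. The formula is the one above; the mapping from priority to a re-crawl interval, and the 15-minute/90-day clamps, are hypothetical choices for illustration:

```python
def freshness_priority(site_importance, days_since_last_change):
    # priority = site_importance × (1 / days_since_last_change)
    # floor the denominator to avoid division by zero for just-changed pages
    return site_importance / max(days_since_last_change, 0.01)

def next_crawl_time(last_crawl_ts, priority,
                    min_interval_s=900, max_interval_s=90 * 86_400):
    """Map priority to a re-crawl interval (hypothetical: inverse-linear,
    clamped between 15 minutes and 90 days)."""
    interval = 86_400 / max(priority, 1e-6)
    interval = min(max(interval, min_interval_s), max_interval_s)
    return last_crawl_ts + interval

# A constantly-updated news site vs a parked domain:
news   = freshness_priority(1.0, 0.01)   # changed within the last ~15 minutes
parked = freshness_priority(0.1, 365)    # unchanged for a year
```

Under these assumptions the news site is re-scheduled 15 minutes out (it hits the clamp), while the parked domain sinks to the 90-day ceiling — exactly the behaviour the table above describes.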

Sitemap hints: Many sites publish sitemap.xml with <changefreq> and <lastmod> tags. The crawler reads these to pre-compute freshness scores without needing to fetch every page blindly.


8. Level 7 — Handling Failures & Traps

Spider Traps

A poorly designed website can create infinite URL spaces: /calendar/2024/01/01, /calendar/2024/01/02, …, /calendar/∞. Or parameterised URLs that combine into millions of variants.

Spider trap signatures: URL depth > 10 path segments; same domain with > 100,000 queued URLs; monotonically increasing numeric parameter in URL path.

Defences:

Python — trap detection
from urllib.parse import urlparse

MAX_DEPTH  = 10
MAX_DOMAIN = 100_000   # URLs queued per domain
domain_count = {}

def is_trap(url):
    p      = urlparse(url)
    depth  = p.path.count("/")
    domain = p.netloc
    if depth > MAX_DEPTH:
        return True
    domain_count[domain] = domain_count.get(domain, 0) + 1
    if domain_count[domain] > MAX_DOMAIN:
        return True
    return False

Retry Logic with Exponential Backoff

Python
import time, requests

def fetch_with_retry(url, max_retries=3):
    for attempt in range(max_retries):
        try:
            resp = requests.get(url, timeout=10)
            if resp.status_code == 429:            # Too Many Requests
                retry_after = int(resp.headers.get("Retry-After", 60))
                time.sleep(retry_after)
                continue
            if resp.status_code in (500, 502, 503, 504):
                time.sleep(2 ** attempt)            # exponential backoff
                continue
            return resp
        except requests.Timeout:
            time.sleep(2 ** attempt)
    return None    # give up after max_retries

9. BFS Crawl Visualiser

Picture stepping through a mini web graph of 12 nodes: BFS discovers pages level by level, queues newly extracted links, and the Bloom filter catches already-visited URLs before they are re-enqueued.

10. Capacity Estimation

Metric                       Assumption                       Value
Total pages                  Target                           1,000,000,000
Avg page size                HTML only                        100 KB
Raw HTML storage             1B × 100 KB                      100 TB
Required throughput          1B ÷ 2,592,000 s                 ~400 pages/sec
Network bandwidth            400 × 100 KB                     ~40 MB/s
Min fetcher threads          400 pages/sec × 0.5 s avg        200 threads
Fetcher nodes                200 threads ÷ 100 threads/node   ~2 nodes (plus headroom for politeness stalls)
DNS lookups/sec (raw)        400 req/sec, distinct domains    ~400/sec
DNS lookups/sec (cached)     70% cache hit rate               ~120/sec
Bloom filter RAM (1B URLs)   1% FPR, 9.6 bits/key             1.2 GB
URL metadata DB              1B × 200 bytes                   ~200 GB
robots.txt cache             50K domains × 2 KB               ~100 MB
Rule of thumb: At search-engine scale, the bottleneck is almost never CPU — it's network I/O, DNS latency, and politeness constraints. Design the frontier and rate-limiter before worrying about parser throughput.
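The table's headline numbers can be sanity-checked in a few lines (the table rounds; exact values shown in comments):

```python
PAGES     = 1_000_000_000
PAGE_SIZE = 100 * 1024        # 100 KB
MONTH_S   = 30 * 24 * 3600    # 2,592,000 s

throughput = PAGES / MONTH_S                 # ≈ 386 pages/sec → "~400"
bandwidth  = throughput * PAGE_SIZE / 1e6    # ≈ 39.5 MB/s     → "~40"
storage_tb = PAGES * PAGE_SIZE / 1e12        # ≈ 102 TB        → "~100"
threads    = throughput * 0.5                # ≈ 193 at 0.5 s/fetch → "200"
bloom_gb   = PAGES * 9.6 / 8 / 1e9           # = 1.2 GB at 9.6 bits/key
```

Running the numbers yourself in an interview — even mentally, to one significant figure — is worth more than memorising the table.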

11. Scrapy vs. Custom Crawler

Feature                 Scrapy                            Custom (Distributed)
Setup time              Hours                             Weeks–months
Scale ceiling           ~1M pages/day (single node)       Billions of pages/day
Politeness built-in     Yes (AutoThrottle)                Must implement manually
robots.txt support      Yes                               Must implement manually
Custom URL priority     Limited                           Full control
Distributed frontier    No (use Scrapyd / Scrapy-Redis)   Native design
Fault tolerance         Basic (Scrapy-Redis helps)        Designed in (leases, WAL)
Best for                Targeted scraping, ≤ 10M pages    Search engine, ≥ 100M pages

12. Summary

The web crawler is a deceptively rich system design problem. Each layer of scale breaks the previous solution:

Level   Problem introduced        Solution
1       Speed                     Multi-threading / multi-node
2       URL dedup memory          Bloom filter
3       Server abuse              robots.txt + per-domain rate limiting
4       Content dedup             MD5 (exact) + SimHash (near-duplicate)
5       Single point of failure   Distributed frontier + stateless fetchers
6       Stale content             Priority queue + freshness scoring
7       Infinite traps            Depth limit + domain URL cap + retry backoff

In an interview, walk through these levels in order. Mention the Bloom filter’s memory trade-off explicitly (interviewers love it). Draw the distributed pipeline with the feedback loop. And always name-check politeness — it signals real-world awareness that separates senior engineers from textbook answers.