System Design: Web Crawler — From BFS to Petabyte-Scale Crawling
Google’s crawler visits trillions of pages and continuously re-crawls the fast-changing parts of the web, some within minutes. The infrastructure behind that dwarfs most national data centres.
A web crawler sounds almost trivially simple: follow links, download pages, repeat. Every computer science student has written one. Yet at the scale of a search engine — 1 billion pages, 1 month, fault-tolerant, polite — nearly every naïve assumption breaks spectacularly. This article dismantles the problem layer by layer, from a 30-line Python prototype to a distributed architecture that could actually power a real search index.
The question: Design a web crawler for a search engine. Crawl 1 billion pages in a month. Handle duplicates, politeness, failures, and freshness.
1. The Problem
A web crawler visits URLs, downloads HTML, extracts new links, enqueues them, and repeats. The algorithm — breadth-first search on the web graph — fits in a whiteboard sketch. The scale does not.
Every requirement the interviewer adds — deduplication, politeness, fault tolerance, freshness — forces a new architectural layer. Let’s build them one at a time.
2. Level 1 — Single-Threaded BFS
Start with the simplest possible implementation. It works for thousands of URLs, and it illustrates every bottleneck we’ll need to fix.
```python
from collections import deque

import requests
from bs4 import BeautifulSoup
from urllib.parse import urljoin

def crawl(seed_urls, max_pages=1000):
    queue = deque(seed_urls)
    visited = set()  # ← naive dedup: O(n) memory
    pages = []
    while queue and len(pages) < max_pages:
        url = queue.popleft()
        if url in visited:
            continue
        visited.add(url)
        try:
            resp = requests.get(url, timeout=5)
            if resp.status_code != 200:
                continue
            pages.append((url, resp.text))
            soup = BeautifulSoup(resp.text, "html.parser")
            for tag in soup.find_all("a", href=True):
                link = urljoin(url, tag["href"])
                if link not in visited:
                    queue.append(link)
        except Exception:
            pass  # no retry, no logging
    return pages
```
Bottlenecks at scale:
- Single thread → 1 page at a time. At 1s/page, 1B pages = 31 years.
- `visited` set at 1B URLs × ~100 bytes = 100 GB of RAM just for deduplication.
- No politeness: the crawler will get IP-banned within minutes.
- No content deduplication — mirrors get stored repeatedly.
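The single-thread bottleneck is the first one to fall. Fetching is I/O-bound, so the conventional fix is a worker pool; here is a minimal sketch (the worker count of 64 is an illustrative assumption, not a tuned value):

```python
from concurrent.futures import ThreadPoolExecutor

import requests

def fetch_one(url):
    """Fetch a single page; return (url, html) or (url, None) on failure."""
    try:
        return url, requests.get(url, timeout=5).text
    except requests.RequestException:
        return url, None

def fetch_batch(urls, workers=64):
    # 64 in-flight requests turn ~1 s/page into ~64 pages/s per node,
    # because the threads spend almost all their time waiting on the network.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(fetch_one, urls))
```

This buys two orders of magnitude per node, but it does nothing about memory, politeness, or duplicates, which is where the remaining levels come in.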
3. Level 2 — URL Deduplication with Bloom Filters
A Bloom filter for 1B URLs uses ~1.2 GB at a 1% false-positive rate. Without it, deduplication alone would need 100 GB of RAM, per crawler node.
The naïve HashSet approach is unusable at scale. Enter the Bloom filter — a probabilistic data structure that trades a small false-positive rate for enormous memory savings. (Covered in depth in Series #1.)
Memory comparison for 1 billion URLs:
```
# HashSet (Python dict) — exact dedup
1_000_000_000 urls × 100 bytes = 100 GB        ← not viable per node

# Bloom filter — probabilistic dedup (1% FPR)
m = -(n × ln p) / (ln 2)²
  = -(1e9 × ln 0.01) / 0.480
  ≈ 9.585 billion bits ≈ 1.2 GB                ← 83× smaller
```
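For concreteness, here is a minimal Bloom filter sketch sized by the formula above. It uses the Kirsch–Mitzenmacher double-hashing trick to derive all k bit positions from a single MD5 digest; this is a teaching sketch, not a production implementation:

```python
import hashlib
import math

class BloomFilter:
    def __init__(self, n_items, fpr=0.01):
        # Optimal sizing: m = -n·ln(p) / (ln 2)²,  k = (m/n)·ln 2
        self.m = max(1, int(-n_items * math.log(fpr) / (math.log(2) ** 2)))
        self.k = max(1, round(self.m / n_items * math.log(2)))
        self.bits = bytearray((self.m + 7) // 8)

    def _positions(self, item):
        # Derive k indices from two halves of one MD5 digest.
        digest = hashlib.md5(item.encode()).digest()
        h1 = int.from_bytes(digest[:8], "big")
        h2 = int.from_bytes(digest[8:], "big") | 1
        return ((h1 + i * h2) % self.m for i in range(self.k))

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, item):
        return all(self.bits[pos // 8] & (1 << (pos % 8))
                   for pos in self._positions(item))
```

The trade-off is one-sided: there are never false negatives (a visited URL is always recognised), but ~1% of never-visited URLs are wrongly skipped, which a crawler can tolerate.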
4. Level 3 — Politeness: robots.txt + Crawl Delay
The politeness constraint is not optional: Google was nearly banned from the early web for crawling too aggressively. Respecting robots.txt is both ethical and practical.
Hammering a server with 400 requests/second will get your crawler blocked and possibly legally challenged. Two mechanisms govern polite crawling:
robots.txt — A convention (RFC 9309) that websites use to declare which paths are off-limits. Crawlers must fetch and cache it before visiting any page on a domain.
```
# Disallow admin and private areas
User-agent: *
Disallow: /admin/
Disallow: /private/
Disallow: /search?
Crawl-delay: 2

# Allow Googlebot full access
User-agent: Googlebot
Disallow:
```
```python
import time, heapq
from urllib.robotparser import RobotFileParser
from urllib.parse import urlparse

# Cache: domain → (RobotFileParser, crawl_delay)
robots_cache = {}

def get_robots(domain):
    if domain not in robots_cache:
        rp = RobotFileParser()
        rp.set_url("https://" + domain + "/robots.txt")
        rp.read()
        delay = rp.crawl_delay("*") or 1.0  # default 1 req/sec
        robots_cache[domain] = (rp, delay)
    return robots_cache[domain]

# Priority queue: (next_allowed_time, url)
frontier = []
next_time = {}  # domain → next allowed fetch timestamp

def enqueue(url):
    domain = urlparse(url).netloc
    t = max(time.time(), next_time.get(domain, 0))
    heapq.heappush(frontier, (t, url))

def fetch_next():
    t, url = heapq.heappop(frontier)
    now = time.time()
    if t > now:
        time.sleep(t - now)
    domain = urlparse(url).netloc
    rp, delay = get_robots(domain)
    if not rp.can_fetch("*", url):
        return None  # disallowed by robots.txt
    next_time[domain] = time.time() + delay
    return url
```
5. Level 4 — Content Deduplication (Near-Duplicate Detection)
URL deduplication ensures we don’t visit the same URL twice. But what about two different URLs that serve identical or nearly identical content? (Mirrors, pagination, print views, trailing-slash variants.)
Exact duplicates are easy — compute MD5 of the response body:
```python
import hashlib

content_hashes = set()

def is_duplicate(html_bytes):
    h = hashlib.md5(html_bytes).hexdigest()
    if h in content_hashes:
        return True
    content_hashes.add(h)
    return False
```
Near-duplicates require SimHash — a locality-sensitive hashing technique where similar documents produce hashes that differ in few bits.
SimHash was invented by Moses Charikar in 2002. Google uses it to detect near-duplicate web pages across hundreds of billions of documents.
The intuition: split the document into shingles (word n-grams), hash each, weight them by term frequency, then collapse into a 64-bit fingerprint. Documents sharing 90%+ of their content will typically have fingerprints differing in ≤ 3 of the 64 bits.
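That pipeline fits in a few lines. In this illustrative sketch, MD5 stands in for the production hash function and 3-word shingles are an arbitrary choice:

```python
import hashlib
from collections import Counter

def simhash(text, bits=64):
    # 1. Shingle the document into word 3-grams, weighted by frequency.
    words = text.lower().split()
    shingles = [" ".join(words[i:i + 3]) for i in range(len(words) - 2)] or [text]
    weights = Counter(shingles)

    # 2. Each shingle votes ±weight on every bit position of its hash.
    votes = [0] * bits
    for shingle, w in weights.items():
        h = int(hashlib.md5(shingle.encode()).hexdigest(), 16)
        for i in range(bits):
            votes[i] += w if (h >> i) & 1 else -w

    # 3. Collapse the votes into the final fingerprint.
    return sum(1 << i for i in range(bits) if votes[i] > 0)

def hamming(a, b):
    return bin(a ^ b).count("1")
```

Near-duplicate detection then reduces to a Hamming-distance check on fingerprints; at web scale this is done with permuted-table indexes rather than pairwise comparison.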
6. Level 5 — Distributed Architecture
Single-node crawling is a dead end. At 400 pages/second with ~0.5 s average fetch latency, we need hundreds of concurrent fetch workers spread across multiple nodes. The full distributed pipeline wires together a partitioned URL frontier, stateless fetcher workers backed by a shared DNS cache, parsers that extract links and feed them back into the frontier, the dedup layer, and content storage. Each component has its own role, failure modes, and scaling limits.
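One design detail worth making concrete: the frontier is usually partitioned by domain hash, so that all per-domain politeness state lives on exactly one node. A sketch, assuming a hypothetical 16-shard frontier:

```python
import hashlib
from urllib.parse import urlparse

NUM_FRONTIER_SHARDS = 16  # assumption for illustration

def frontier_shard(url):
    # Hash the domain, not the full URL: every URL of a domain lands on
    # the same shard, so its rate limiter and robots.txt cache never
    # need cross-node coordination.
    domain = urlparse(url).netloc
    h = int(hashlib.md5(domain.encode()).hexdigest(), 16)
    return h % NUM_FRONTIER_SHARDS
```

The same function routes newly extracted links from any parser back to the right frontier shard, closing the feedback loop.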
7. Level 6 — URL Frontier Priority & Freshness
Not all URLs are equal. A breaking news article should be re-crawled within minutes; a Wikipedia stub from 2009 can wait days. The frontier must support priority and freshness.
Priority is based on site importance (PageRank-like score) and content type:
| Site type | Priority | Re-crawl interval |
|---|---|---|
| Major news (CNN, BBC) | High | 15 minutes |
| Wikipedia, reference | High | 1–7 days |
| Active blogs | Medium | 1–7 days |
| Static corporate sites | Medium | 30 days |
| Parked domains / low-quality | Low | 90+ days |
Freshness scoring determines the next crawl time: sites that rarely change (days_since_last_change → ∞) see their priority converge to ≈ 0 and sink to the bottom of the queue, while sites that update daily stay near the top.
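One plausible scoring function with exactly that behaviour — a hypothetical sketch, not the canonical formula — combines a PageRank-like importance weight with an estimated change rate:

```python
def crawl_priority(importance, days_since_last_change):
    """Hypothetical freshness score.

    importance: PageRank-like site weight in [0, 1] (assumed input).
    The score decays toward 0 as the page goes longer without changing.
    """
    change_rate = 1.0 / max(days_since_last_change, 1.0)
    return importance * change_rate
```

A major news site (importance ≈ 1, changed today) scores near 1.0 and is re-queued immediately; a parked domain untouched for years scores near zero and drifts to the back of the frontier.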
Sitemap hints: Many sites publish sitemap.xml with <changefreq> and <lastmod> tags. The crawler reads these to pre-compute freshness scores without needing to fetch every page blindly.
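Reading those hints takes only a few lines with the standard library; the namespace URI below is the standard sitemap schema:

```python
import xml.etree.ElementTree as ET

NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}

def parse_sitemap(xml_text):
    """Extract (loc, lastmod, changefreq) scheduling hints from a sitemap."""
    root = ET.fromstring(xml_text)
    hints = []
    for url in root.findall("sm:url", NS):
        loc = url.findtext("sm:loc", namespaces=NS)
        lastmod = url.findtext("sm:lastmod", namespaces=NS)      # may be None
        changefreq = url.findtext("sm:changefreq", namespaces=NS)  # may be None
        hints.append((loc, lastmod, changefreq))
    return hints
```

A `<changefreq>daily</changefreq>` entry seeds the freshness score before the crawler has ever observed the page change, and `<lastmod>` lets it skip pages unchanged since the previous crawl.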
8. Level 7 — Handling Failures & Traps
Spider Traps
A poorly designed website can create infinite URL spaces: /calendar/2024/01/01, /calendar/2024/01/02, …, /calendar/∞. Or parameterised URLs that combine into millions of variants.
Defences:
```python
from urllib.parse import urlparse

MAX_DEPTH = 10
MAX_DOMAIN = 100_000  # URLs queued per domain

domain_count = {}

def is_trap(url):
    p = urlparse(url)
    depth = p.path.count("/")
    domain = p.netloc
    if depth > MAX_DEPTH:
        return True
    domain_count[domain] = domain_count.get(domain, 0) + 1
    if domain_count[domain] > MAX_DOMAIN:
        return True
    return False
```
Retry Logic with Exponential Backoff
```python
import time, requests

def fetch_with_retry(url, max_retries=3):
    for attempt in range(max_retries):
        try:
            resp = requests.get(url, timeout=10)
            if resp.status_code == 429:  # Too Many Requests
                retry_after = int(resp.headers.get("Retry-After", 60))
                time.sleep(retry_after)
                continue
            if resp.status_code in (500, 502, 503, 504):
                time.sleep(2 ** attempt)  # exponential backoff
                continue
            return resp
        except requests.Timeout:
            time.sleep(2 ** attempt)
    return None  # give up after max_retries
```
9. Interactive: BFS Crawl Visualiser
The accompanying interactive demo steps through a mini web graph of 12 nodes, showing how BFS discovers pages, queues new links, and how the Bloom filter catches already-visited URLs.
10. Capacity Estimation
| Metric | Assumption | Value |
|---|---|---|
| Total pages | Target | 1,000,000,000 |
| Avg page size | HTML only | 100 KB |
| Raw HTML storage | 1B × 100KB | 100 TB |
| Required throughput | 1B ÷ 2,592,000s | ~400 pages/sec |
| Network bandwidth | 400 × 100KB | ~40 MB/s |
| Min fetcher threads | 400 pages/sec × 0.5s avg | 200 threads |
| Fetcher nodes | 200 threads ÷ 100 threads/node, plus headroom | 2–4 nodes |
| DNS lookups/sec (raw) | 400 req/sec × distinct domains | ~400/sec |
| DNS lookups/sec (cached) | 70% cache hit rate | ~120/sec |
| Bloom filter RAM (1B URLs) | 1% FPR, 9.6 bits/key | 1.2 GB |
| URL metadata DB | 1B × 200 bytes | ~200 GB |
| robots.txt cache | 50K domains × 2KB | ~100 MB |
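The table's round numbers fall out of simple back-of-envelope arithmetic (30-day month; concurrency via Little's law, rate × latency):

```python
PAGES = 1_000_000_000
SECONDS = 30 * 24 * 3600          # 2,592,000 s in a 30-day month

throughput = PAGES / SECONDS      # ≈ 386 pages/s, rounded up to 400
bandwidth = 400 * 100 / 1000      # 100 KB/page → 40 MB/s
concurrency = 400 * 0.5           # Little's law: 400/s × 0.5 s latency
                                  # = 200 fetches in flight at all times
```

Interviewers expect this arithmetic done aloud; the rounding (386 → 400) deliberately builds in headroom for retries and slow servers.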
11. Scrapy vs. Custom Crawler
| Feature | Scrapy | Custom (Distributed) |
|---|---|---|
| Setup time | Hours | Weeks–months |
| Scale ceiling | ~1M pages/day (single node) | Billions of pages/day |
| Politeness built-in | Yes (AUTOTHROTTLE) | Must implement manually |
| robots.txt support | Yes | Must implement manually |
| Custom URL priority | Limited | Full control |
| Distributed frontier | No (use Scrapyd / Scrapy-Redis) | Native design |
| Fault tolerance | Basic (Scrapy-Redis helps) | Designed in (leases, WAL) |
| Best for | Targeted scraping, ≤ 10M pages | Search engine, ≥ 100M pages |
12. Summary
The web crawler is a deceptively rich system design problem. Each layer of scale breaks the previous solution:
| Level | Problem introduced | Solution |
|---|---|---|
| 1 | Speed | Multi-threading / multi-node |
| 2 | URL dedup memory | Bloom filter |
| 3 | Server abuse | robots.txt + per-domain rate limiting |
| 4 | Content dedup | MD5 (exact) + SimHash (near-duplicate) |
| 5 | Single point of failure | Distributed frontier + stateless fetchers |
| 6 | Stale content | Priority queue + freshness scoring |
| 7 | Infinite traps | Depth limit + domain URL cap + retry backoff |
In an interview, walk through these levels in order. Mention the Bloom filter’s memory trade-off explicitly (interviewers love it). Draw the distributed pipeline with the feedback loop. And always name-check politeness — it signals real-world awareness that separates senior engineers from textbook answers.