The Web Crawler
A web crawler fetches pages, extracts links, and visits them. The concurrency challenge: multiple workers fetching simultaneously need a thread-safe URL frontier, a visited set protected from race conditions, and rate limiting to avoid overwhelming servers.
the problem
You have a seed URL. Fetch the page, extract all the links, fetch those pages, extract their links, keep going. This is THE interview problem at companies like Anthropic and Databricks. It pulls in every concurrency concept from this series: thread pools, queues, locks, and rate limiting.
Start with the simplest version. Single-threaded, no concurrency:
import requests
from collections import deque

def crawl(seed_url, max_pages=100):
    visited = set()
    frontier = deque([seed_url])
    while frontier and len(visited) < max_pages:
        url = frontier.popleft()
        if url in visited:
            continue
        response = requests.get(url)
        visited.add(url)
        for link in extract_links(response.text):
            if link not in visited:
                frontier.append(link)
    return visited

This works. It’s also painfully slow. Each requests.get() blocks for 200-500ms waiting on the network. While we wait for one page to respond, we’re doing nothing. A hundred pages at 300ms each is 30 seconds of sitting idle. The CPU isn’t the bottleneck. The network is.
the concurrent version
The fix is obvious: fetch multiple pages at the same time. While one worker waits on a response, others can be fetching different URLs. This is exactly what thread pools are for.
import threading
import requests
from concurrent.futures import ThreadPoolExecutor
from queue import Queue

def concurrent_crawl(seed_url, max_pages=100, workers=5):
    visited = set()
    visited_lock = threading.Lock()
    frontier = Queue()
    frontier.put(seed_url)

    def worker():
        while True:
            url = frontier.get()
            if url is None:              # sentinel: crawling is done, exit cleanly
                frontier.task_done()
                return
            try:
                with visited_lock:
                    if url in visited or len(visited) >= max_pages:
                        continue
                    visited.add(url)
                response = requests.get(url, timeout=5)
                for link in extract_links(response.text):
                    with visited_lock:
                        if link not in visited:
                            frontier.put(link)
            finally:
                frontier.task_done()

    with ThreadPoolExecutor(max_workers=workers) as pool:
        for _ in range(workers):
            pool.submit(worker)
        frontier.join()                  # wait until every queued URL is processed
        for _ in range(workers):
            frontier.put(None)           # wake each worker so the pool can shut down
    return visited

Three things to notice here. First, Queue is thread-safe. Its put() and get() methods handle locking internally, so we don’t need to wrap them. Second, visited is a plain set, and set operations aren’t atomic. Two workers could both check if url in visited, both see False, and both fetch the same page. The lock prevents that. Third, frontier.join() blocks until every item that was put() into the queue has been followed by a task_done() call. That’s how we know crawling is finished. The None sentinels pushed after join() give each blocked worker something to wake up on; without them, the pool’s shutdown would hang forever on threads stuck in get().
With 5 workers, those 100 pages at 300ms each take roughly 6 seconds instead of 30. With 10 workers, roughly 3 seconds. The speedup is nearly linear for IO-bound work like this.
rate limiting
Hammering a server with 10 concurrent requests is rude. It can also get
your IP banned, trigger WAF rules, or cause a denial-of-service on
smaller sites. Polite crawlers respect robots.txt and limit their
request rate.
A simple rate limiter ensures a minimum delay between requests:
import time
class RateLimiter:
def __init__(self, max_per_second):
self.delay = 1.0 / max_per_second
self.lock = threading.Lock()
self.last_call = 0.0
def wait(self):
with self.lock:
now = time.monotonic()
wait_time = self.last_call + self.delay - now
if wait_time > 0:
time.sleep(wait_time)
self.last_call = time.monotonic()Drop a rate_limiter.wait() call before each requests.get() in the
worker. The lock ensures that only one thread checks and updates
last_call at a time. If two workers try to fetch simultaneously, one
sleeps until enough time has passed.
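Wiring it in is a small change to the worker from earlier, sketched here assuming a single limiter shared by all workers and created next to the frontier:

rate_limiter = RateLimiter(max_per_second=5)   # created once, alongside the frontier

# inside worker(), immediately before the fetch:
rate_limiter.wait()                            # block until the shared rate allows another request
response = requests.get(url, timeout=5)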
At 5 requests per second with 10 workers, most workers are sleeping most of the time. That’s fine. They’re waiting on the rate limiter instead of waiting on the network. The point is controlling the load on the target server, not maximizing throughput.
the interview version
This is THE Anthropic interview problem. Here’s how it typically unfolds. You’re asked to design a web crawler. Start single-threaded. The interviewer says “this is too slow.” You add concurrency with a thread pool. They ask about duplicate URLs. You add the visited set with a lock. They ask about overwhelming servers. You add rate limiting.
Common follow-ups after the base solution:
robots.txt compliance. Parse the robots.txt file for each domain
before crawling. Cache the parsed rules per domain. Respect Crawl-delay
directives. This means your rate limiter needs to be per-domain, not global (see the sketch after this list).
Depth limiting. Track how many hops each URL is from the seed. Stop
adding links beyond a maximum depth. Store (url, depth) tuples in the
frontier instead of bare URLs.
Handling redirects. A 301 gives you a new URL. Add it to the frontier, but mark the original as visited. Watch for redirect loops (A redirects to B, B redirects to A). Cap redirect chains at 5-10 hops.
Cycle detection. The link graph has cycles. That’s what the visited set handles. But if you’re doing depth-limited crawls, you might visit the same page at different depths. Decide whether “visited” means “ever seen” or “seen at this depth or shallower.”
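For the per-domain politeness follow-up, one common approach is to keep a separate RateLimiter per host, created lazily. Here is a sketch reusing the RateLimiter class from earlier (the one-request-per-second default is arbitrary; a real crawler would feed in each domain’s Crawl-delay, which Python’s urllib.robotparser exposes via crawl_delay()):

from urllib.parse import urlparse

class DomainRateLimiter:
    def __init__(self, max_per_second=1.0):
        self.max_per_second = max_per_second
        self.lock = threading.Lock()
        self.limiters = {}                 # domain -> RateLimiter

    def wait(self, url):
        domain = urlparse(url).netloc
        with self.lock:                    # guard only the dict, not the sleep
            if domain not in self.limiters:
                self.limiters[domain] = RateLimiter(self.max_per_second)
            limiter = self.limiters[domain]
        limiter.wait()                     # delays only requests to this domain

Workers call rate_limiter.wait(url) instead of rate_limiter.wait(), and requests to different domains no longer throttle each other.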
The key insight interviewers look for: you recognize the concurrency primitives and apply them correctly. Thread pool for parallelism. Queue for the frontier. Lock for shared state. Rate limiter for politeness.
scaling to a cluster
One machine can crawl maybe 100 pages per second. The web has billions of pages. You need multiple machines.
Consistent hashing distributes URLs across machines. Hash the URL’s domain, map it to a ring, and the machine owning that segment of the ring crawls it. Each machine is responsible for a range of domains.
When Machine A discovers a URL whose domain hashes to Machine B, it sends the URL to Machine B over the network. Each machine runs the single-machine crawler locally. The frontier and visited set are per-machine. No shared state between machines.
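A minimal version of the ring, assuming MD5 as the hash function and no virtual nodes (real deployments add virtual nodes and a membership protocol, but the routing logic is the same):

import bisect
import hashlib
from urllib.parse import urlparse

class HashRing:
    def __init__(self, machines):
        # Place each machine at a point on the ring determined by its hash.
        self.ring = sorted((self._hash(name), name) for name in machines)
        self.points = [point for point, _ in self.ring]

    @staticmethod
    def _hash(key):
        return int(hashlib.md5(key.encode()).hexdigest(), 16)

    def owner(self, url):
        # Hash the URL's domain, then walk clockwise to the next machine
        # on the ring, wrapping around at the end.
        domain = urlparse(url).netloc
        idx = bisect.bisect(self.points, self._hash(domain)) % len(self.ring)
        return self.ring[idx][1]

With ring = HashRing(["crawler-1", "crawler-2", "crawler-3"]), every URL on example.com routes to the same machine, and adding a fourth machine only reassigns the slice of domains it takes over.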
Challenges pile up quickly. Network partitions mean machines can’t communicate. You need to buffer URLs locally and retry. When a machine joins or leaves, the hash ring changes and URL ownership shifts. You need to transfer crawl state (visited sets, frontier queues) for the affected ranges.
DNS resolution becomes a bottleneck. Every new domain requires a DNS lookup (50-200ms). Cache aggressively and prefetch DNS for domains you’re about to crawl. At scale, you might run your own DNS resolver.
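The caching half of that is cheap to sketch with the standard library (a real crawler runs an asynchronous resolver and prefetches, as described above; this only shows the cache):

import functools
import socket

@functools.lru_cache(maxsize=100_000)
def resolve(domain):
    # Hits the network only on a cache miss; repeated URLs from the
    # same domain resolve from the cache instantly.
    infos = socket.getaddrinfo(domain, 80, proto=socket.IPPROTO_TCP)
    return infos[0][4][0]              # first resolved IP address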
handling failures
Networks fail. Servers fail. Your crawler needs to handle it gracefully.
Retry with exponential backoff. First retry after 1 second, then 2, then 4, then 8. Add jitter (random delay) so all workers don’t retry at the same instant. Cap at a maximum delay (60 seconds).
import random

def fetch_with_retry(url, max_retries=3):
    for attempt in range(max_retries):
        try:
            response = requests.get(url, timeout=5)
            response.raise_for_status()
            return response
        except requests.RequestException:
            if attempt == max_retries - 1:
                return None
            delay = (2 ** attempt) + random.uniform(0, 1)
            time.sleep(delay)

Circuit breaker pattern. If a domain fails 5 times in a row, stop trying it for a cooldown period (5 minutes). After the cooldown, try one request. If it succeeds, resume crawling. If it fails, extend the cooldown.
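A per-domain breaker following that description might look like this (a sketch: the class name and bookkeeping are mine, the thresholds match the numbers above, and it isn’t thread-safe as written):

import time

class CircuitBreaker:
    def __init__(self, failure_threshold=5, cooldown=300):
        self.failure_threshold = failure_threshold
        self.cooldown = cooldown           # seconds to back off once tripped
        self.failures = {}                 # domain -> consecutive failure count
        self.open_until = {}               # domain -> when we may probe again

    def allow(self, domain):
        # Closed: requests flow. Open: block until the cooldown expires,
        # after which a single probe request is allowed through.
        return time.monotonic() >= self.open_until.get(domain, 0.0)

    def record_success(self, domain):
        self.failures[domain] = 0
        self.open_until.pop(domain, None)

    def record_failure(self, domain):
        self.failures[domain] = self.failures.get(domain, 0) + 1
        if self.failures[domain] >= self.failure_threshold:
            # Trip the breaker; a failed probe pushes the cooldown out again.
            self.open_until[domain] = time.monotonic() + self.cooldown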
HTTP status handling. 404 means the page is gone. Mark it visited
and move on. 429 (Too Many Requests) means slow down. Respect the
Retry-After header. 503 (Service Unavailable) means try later. 301/302
means follow the redirect.
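Sketched as a small dispatch on the status code (classify_response is an illustrative helper, not from any library, and it assumes Retry-After carries a plain seconds value):

def classify_response(response):
    # Map an HTTP status to what the crawler should do with the URL next.
    status = response.status_code
    if status == 404:
        return "drop"                                  # gone: mark visited, move on
    if status == 429:
        # Too many requests: honor the server's Retry-After before re-queueing.
        return ("retry_after", int(response.headers.get("Retry-After", "30")))
    if status == 503:
        return "retry_later"                           # unavailable: try again later
    if status in (301, 302):
        # Visible when the fetch was made with allow_redirects=False.
        return ("follow", response.headers["Location"])
    return "ok"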
Dead letter queue. URLs that fail after all retries go into a separate queue for manual inspection or later retry. Don’t lose them silently. Don’t retry them forever either.
The simulation shows the basic mechanics: workers pulling URLs, fetching pages, discovering links. A toy version of something that, at scale, becomes one of the hardest distributed systems problems in computing. What follows is what real-world crawlers actually look like.
the Mercator architecture
In 1999, researchers at Compaq published the Mercator paper describing a scalable web crawler architecture. Google’s crawler uses similar principles. The key insight is that politeness, not bandwidth, is the binding constraint.
A naive crawler has one frontier queue. Workers grab the next URL regardless of domain. With 100 workers and bad luck, all 100 could be hitting the same server simultaneously. That’s a denial-of-service attack, not a crawler.
Mercator splits the frontier into two layers. The front queues handle priority. High-priority URLs (frequently updated news sites, important pages) go into separate queues from low-priority ones (deep archive pages, rarely changing content). A prioritizer assigns incoming URLs to the appropriate front queue.
The back queues handle politeness. There’s one back queue per domain (or per IP address). A routing function maps each URL from the front queues to the correct back queue based on its domain. Each back queue has a timestamp indicating when that domain can next be contacted. A worker selects the back queue whose timestamp has passed and whose queue is non-empty.
This architecture guarantees that at most one request is in flight to any given domain at any time. You can increase total throughput by adding workers, and each new worker fetches from a different domain. Parallelism increases without any single server bearing more load.
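Here is a stripped-down sketch of the back-queue mechanics, assuming a heap keyed on each domain’s next-allowed timestamp (Mercator’s real implementation uses a fixed pool of back queues and far more bookkeeping; this only shows the politeness logic, and it isn’t thread-safe as written):

import heapq
import time
from collections import defaultdict, deque
from urllib.parse import urlparse

class PolitenessScheduler:
    def __init__(self, per_domain_delay=1.0):
        self.delay = per_domain_delay
        self.back_queues = defaultdict(deque)   # back queue: domain -> pending URLs
        self.ready = []                         # heap of (next_allowed_time, domain)
        self.active = set()                     # domains queued or currently in flight

    def add(self, url):
        domain = urlparse(url).netloc
        self.back_queues[domain].append(url)
        if domain not in self.active:
            self.active.add(domain)
            heapq.heappush(self.ready, (time.monotonic(), domain))

    def checkout(self):
        # Select the domain whose politeness timestamp comes up first
        # (assumes the heap is non-empty; a real scheduler would block).
        next_time, domain = heapq.heappop(self.ready)
        wait = next_time - time.monotonic()
        if wait > 0:
            time.sleep(wait)
        return self.back_queues[domain].popleft(), domain

    def release(self, domain):
        # Called after the fetch finishes; the domain only becomes eligible
        # again after the delay, so at most one request is in flight per domain.
        if self.back_queues[domain]:
            heapq.heappush(self.ready, (time.monotonic() + self.delay, domain))
        else:
            self.active.discard(domain)

A worker calls checkout() to get a URL, fetches it, then calls release(domain); keeping the domain out of the ready heap between those two calls is what enforces the one-request-in-flight guarantee.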
DNS resolution becomes a real bottleneck at this scale. Every new domain requires a lookup, and standard resolvers can’t keep up with thousands of new domains per second. Mercator runs its own caching DNS resolver. It prefetches DNS records for domains in the back queues before their timestamps come up. The goal is that by the time a worker picks a URL, its IP address is already cached.
The politeness constraint fundamentally changes the scheduling problem. You can’t just fetch the next highest-priority URL. You have to fetch the next highest-priority URL from a domain you haven’t contacted recently. Priority and politeness are in constant tension, and the two-layer queue architecture is how you manage that tension without sacrificing either.
incremental crawling
A full crawl of the web takes weeks. By the time you finish, the pages you crawled first are already stale. Incremental crawling asks a different question: which pages should you re-crawl, and when?
The simplest strategy is a fixed interval. Re-crawl everything every 7 days. This wastes resources on pages that haven’t changed and misses updates on pages that change hourly.
Adaptive strategies track change frequency per page. If a page changed the last 3 times you checked it, check it more often. If it hasn’t changed in 6 months, check it less. You can model this with a Poisson process: estimate the rate parameter from observed changes and schedule the next crawl based on the expected time until the next change.
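As a rough sketch of that estimate (the +1/+2 smoothing is an arbitrary choice so that pages with no observed changes still get re-crawled eventually):

def next_crawl_delay(changes_observed, checks_made, check_interval_hours):
    # Estimated change rate in changes per hour, with light smoothing.
    rate = (changes_observed + 1) / ((checks_made + 2) * check_interval_hours)
    # For a Poisson process, the expected time until the next change is 1 / rate.
    return 1.0 / rate                  # hours until the next scheduled crawl

With daily checks, a page that changed on 3 of the last 3 visits comes back in about 30 hours, while a page unchanged across 10 visits gets pushed out to roughly 12 days.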
Priority-based strategies crawl important pages more often regardless of change frequency. A news site’s front page gets crawled every few minutes. A personal blog’s archive page gets crawled monthly. Importance can come from PageRank, traffic data, or domain authority.
The freshness-coverage tradeoff is unavoidable. Every request spent re-crawling an unchanged page is a request not spent discovering new pages. Every request spent exploring new pages is a request not spent keeping existing data fresh. There’s no configuration that maximizes both.
This connects to a broader distributed systems theme. Freshness is a form of consistency: how closely does your index match the current state of the web? Coverage is a form of availability: how much of the web can your index answer queries about? You’re trading one for the other, just like every distributed system trades consistency against availability. The crawler’s scheduling algorithm is, at its core, a policy decision about where on that spectrum you want to sit.