
The Web Crawler

[interactive demo: crawler inspector. Click Step or Play to watch the frontier, workers, rate limiter, and page graph update.]

the problem

You have a seed URL. Fetch the page, extract all the links, fetch those pages, extract their links, keep going. This is THE interview problem at companies like Anthropic and Databricks. It pulls in every concurrency concept from this series: thread pools, queues, locks, and rate limiting.

Start with the simplest version. Single-threaded, no concurrency:

import requests
from collections import deque

def crawl(seed_url, max_pages=100):
    visited = set()               # pages we've already fetched
    frontier = deque([seed_url])  # pages waiting to be fetched (FIFO = breadth-first)

    while frontier and len(visited) < max_pages:
        url = frontier.popleft()
        if url in visited:
            continue

        response = requests.get(url)
        visited.add(url)

        # extract_links pulls hrefs out of the HTML (sketched further down)
        for link in extract_links(response.text):
            if link not in visited:
                frontier.append(link)

    return visited

This works. It’s also painfully slow. Each requests.get() blocks for 200-500ms waiting on the network. While we wait for one page to respond, we’re doing nothing. A hundred pages at 300ms each is 30 seconds of sitting idle. The CPU isn’t the bottleneck. The network is.
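
One helper is doing quiet work here: extract_links. The code above just assumes it exists. A minimal sketch with the standard library's html.parser might look like the following; the optional base_url argument for resolving relative links is my addition, since the loop above only passes the raw HTML.

from html.parser import HTMLParser
from urllib.parse import urljoin

class LinkExtractor(HTMLParser):
    # Collect href values from <a> tags as the parser walks the document.
    def __init__(self, base_url=""):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(urljoin(self.base_url, value))

def extract_links(html, base_url=""):
    parser = LinkExtractor(base_url)
    parser.feed(html)
    return parser.links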

the concurrent version

The fix is obvious: fetch multiple pages at the same time. While one worker waits on a response, others can be fetching different URLs. This is exactly what thread pools are for.

import threading
from concurrent.futures import ThreadPoolExecutor
from queue import Queue

def concurrent_crawl(seed_url, max_pages=100, workers=5):
    visited = set()
    visited_lock = threading.Lock()
    frontier = Queue()
    frontier.put(seed_url)

    def worker():
        while True:
            url = frontier.get()
            if url is None:              # shutdown sentinel
                frontier.task_done()
                return
            try:
                with visited_lock:
                    if url in visited or len(visited) >= max_pages:
                        continue
                    visited.add(url)

                response = requests.get(url, timeout=5)
                for link in extract_links(response.text):
                    with visited_lock:
                        if link not in visited:
                            frontier.put(link)
            except requests.RequestException:
                pass                     # a failed fetch shouldn't kill the worker
            finally:
                frontier.task_done()

    with ThreadPoolExecutor(max_workers=workers) as pool:
        for _ in range(workers):
            pool.submit(worker)
        frontier.join()                  # every queued URL has been processed
        for _ in range(workers):
            frontier.put(None)           # wake each worker so the pool can shut down

    return visited

Three things to notice here. First, Queue is thread-safe. Its put() and get() methods handle locking internally, so we don’t need to wrap them. Second, visited is a plain set, and the check-then-add sequence on it isn’t atomic. Two workers could both check if url in visited, both see False, and both fetch the same page. The lock prevents that. Third, frontier.join() blocks until every item that was put() into the queue has been followed by a task_done() call. That’s how we know crawling is finished. The None sentinels pushed in afterwards are the shutdown signal: without them, each worker would sit blocked in frontier.get() forever and the pool would never shut down.
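
The join()/task_done() contract is the least obvious of the three, so here is a tiny standalone demo, separate from the crawler. The names are throwaway; the point is just the counting.

from queue import Queue
import threading

q = Queue()
for i in range(3):
    q.put(i)                    # each put() bumps the unfinished-task count

def drain():
    while True:
        item = q.get()
        # ... pretend to process item ...
        q.task_done()           # one task_done() per completed get()

threading.Thread(target=drain, daemon=True).start()
q.join()                        # unblocks only when the count hits zero
print("every queued item has been acknowledged")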

With 5 workers, those 100 pages at 300ms each take roughly 6 seconds instead of 30. With 10 workers, roughly 3 seconds. The speedup is nearly linear for IO-bound work like this.

rate limiting

Hammering a server with 10 concurrent requests is rude. It can also get your IP banned, trigger WAF rules, or cause a denial-of-service on smaller sites. Polite crawlers respect robots.txt and limit their request rate.
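
The robots.txt half is mostly standard library. A rough sketch, with allowed_by_robots and the per-host cache being names invented here for illustration:

from urllib.parse import urlparse
from urllib.robotparser import RobotFileParser

robots_cache = {}   # one parsed robots.txt per host; a real crawler would
                    # guard this dict with a lock, like the visited set

def allowed_by_robots(url, user_agent="my-crawler"):
    parts = urlparse(url)
    host = f"{parts.scheme}://{parts.netloc}"
    if host not in robots_cache:
        parser = RobotFileParser(f"{host}/robots.txt")
        try:
            parser.read()
        except OSError:
            parser = None        # robots.txt unreachable: err on the side of fetching
        robots_cache[host] = parser
    parser = robots_cache[host]
    return parser is None or parser.can_fetch(user_agent, url)

Call it right before the fetch in the worker and skip any URL it rejects.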

A simple rate limiter ensures a minimum delay between requests:

import time

class RateLimiter:
    def __init__(self, max_per_second):
        self.delay = 1.0 / max_per_second   # minimum gap between requests
        self.lock = threading.Lock()
        self.last_call = 0.0

    def wait(self):
        # Holding the lock while sleeping is deliberate: other callers queue
        # up behind it, so requests go out one at a time, spaced by delay.
        with self.lock:
            now = time.monotonic()
            wait_time = self.last_call + self.delay - now
            if wait_time > 0:
                time.sleep(wait_time)
            self.last_call = time.monotonic()

Drop a rate_limiter.wait() call before each requests.get() in the worker. The lock ensures that only one thread checks and updates last_call at a time. If two workers try to fetch simultaneously, one sleeps until enough time has passed.
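
A quick standalone check that it behaves as described, not part of the crawler itself: fire ten threads at one limiter and the timestamps come out spaced by roughly the configured delay.

import threading
import time

limiter = RateLimiter(max_per_second=5)

def hit(n):
    limiter.wait()
    print(f"request {n} at {time.monotonic():.2f}")

threads = [threading.Thread(target=hit, args=(i,)) for i in range(10)]
for t in threads:
    t.start()
for t in threads:
    t.join()
# the prints land about 0.2s apart even though all ten threads started at once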

At 5 requests per second with 10 workers, most workers are sleeping most of the time. That’s fine. They’re waiting on the rate limiter instead of waiting on the network. The point is controlling the load on the target server, not maximizing throughput.