Web Crawler Multithreaded - Problem

Imagine you're building a web crawler that needs to explore all pages within a specific website domain as quickly as possible. You have a starting URL and need to discover all linked pages that belong to the same hostname, but there's a catch - single-threaded crawling is too slow!

Given a startUrl and an HtmlParser interface, implement a multi-threaded web crawler that:

  • ๐ŸŒ Starts crawling from startUrl
  • ๐Ÿ“„ Uses HtmlParser.getUrls(url) to extract all URLs from each page
  • ๐Ÿšซ Never crawls the same URL twice (avoid infinite loops)
  • ๐Ÿ  Only explores URLs with the same hostname as the starting URL
  • โšก Utilizes multiple threads for concurrent crawling

Hostname Rules: URLs http://leetcode.com/problems and http://leetcode.com/contest share the same hostname (leetcode.com), but http://example.org/test and http://example.com/abc have different hostnames.
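Because the constraints below guarantee every URL has the http://hostname/path shape, the hostname comparison reduces to string slicing. A minimal Java helper, offered as a sketch rather than a required API:

static String getHostname(String url) {
    int start = url.indexOf("//") + 2;    // skip the "http://" scheme
    int end = url.indexOf('/', start);    // first '/' after the host, if any
    return end == -1 ? url.substring(start) : url.substring(start, end);
}

// getHostname("http://leetcode.com/problems") -> "leetcode.com"
// getHostname("http://example.org/test")      -> "example.org"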

The HtmlParser interface is:

interface HtmlParser {
    // Returns all URLs found on the given webpage
    // This is a blocking HTTP request (takes ~15ms max)
    public List<String> getUrls(String url);
}

Challenge: Single-threaded solutions will exceed the time limit. Can your multi-threaded approach crawl faster by processing multiple pages simultaneously?
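One common shape for such a crawler, shown here as a minimal Java sketch: submit every fetch to a fixed thread pool, deduplicate with a concurrent set, and keep draining completed futures until none remain in flight. The crawl signature follows the usual form for this problem; the pool size of 8 and the future-draining loop are implementation choices, not part of the statement.

import java.util.*;
import java.util.concurrent.*;

class Solution {
    public List<String> crawl(String startUrl, HtmlParser htmlParser) {
        String host = getHostname(startUrl);
        Set<String> visited = ConcurrentHashMap.newKeySet();
        visited.add(startUrl);

        ExecutorService pool = Executors.newFixedThreadPool(8);  // pool size is a tuning choice
        Deque<Future<List<String>>> inFlight = new ArrayDeque<>();
        inFlight.add(pool.submit(() -> htmlParser.getUrls(startUrl)));

        // Drain completed fetches; each one may spawn new fetches, which
        // run in parallel on the pool while we wait on the oldest future.
        while (!inFlight.isEmpty()) {
            try {
                for (String next : inFlight.poll().get()) {
                    // add() on a concurrent set is atomic: it returns false
                    // if another thread has already claimed this URL.
                    if (getHostname(next).equals(host) && visited.add(next)) {
                        inFlight.add(pool.submit(() -> htmlParser.getUrls(next)));
                    }
                }
            } catch (InterruptedException | ExecutionException e) {
                throw new RuntimeException(e);
            }
        }
        pool.shutdown();
        return new ArrayList<>(visited);
    }

    // Same helper as sketched above, repeated so this block stands alone.
    private String getHostname(String url) {
        int start = url.indexOf("//") + 2;
        int end = url.indexOf('/', start);
        return end == -1 ? url.substring(start) : url.substring(start, end);
    }
}

The single call to visited.add(next) doubles as the synchronization point: it is atomic, so no URL can be claimed by two fetches, which is exactly the "never crawls the same URL twice" rule.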

Input & Output

example_1.py - Basic Tree Structure
$ Input: startUrl = "http://news.yahoo.com/news/topics/" urls = ["http://news.yahoo.com/news/topics/", "http://news.yahoo.com/news/topics/1", "http://news.yahoo.com/news/topics/2"] edges = [[0,1],[0,2]] (edge [i,j] means URL i links to URL j, so URL 0 links to URLs 1 and 2)
› Output: ["http://news.yahoo.com/news/topics/", "http://news.yahoo.com/news/topics/1", "http://news.yahoo.com/news/topics/2"]
💡 Note: Starting from the root URL, we discover and crawl all linked pages within the same hostname (news.yahoo.com). The multi-threaded approach processes multiple URLs concurrently.
example_2.py - Complex Network
$ Input: startUrl = "http://news.yahoo.com/news/topics/" urls = ["http://news.yahoo.com/news/topics/", "http://news.yahoo.com/news/topics/1", "http://news.yahoo.com/news/topics/2", "http://news.google.com"] edges = [[0,1],[0,2],[1,3],[2,3]] (cross-links between pages, including one external domain)
› Output: ["http://news.yahoo.com/news/topics/", "http://news.yahoo.com/news/topics/1", "http://news.yahoo.com/news/topics/2"]
💡 Note: Even though URLs 1 and 2 both link to news.google.com, we only crawl URLs sharing the start URL's hostname (news.yahoo.com). External links are filtered out.
example_3.py - Circular References
$ Input: startUrl = "http://example.com/page1" urls = ["http://example.com/page1", "http://example.com/page2", "http://example.com/page3"] edges = [[0,1],[1,2],[2,0]] (circular reference: page1 -> page2 -> page3 -> page1)
› Output: ["http://example.com/page1", "http://example.com/page2", "http://example.com/page3"]
💡 Note: Despite the circular references, each URL is crawled exactly once due to the visited set. Multi-threading safely handles concurrent access to shared data structures.
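The examples describe each page graph with urls and edges arrays rather than a live site. A hypothetical harness (the MockHtmlParser name and constructor are ours, not part of the problem) can turn those arrays into the HtmlParser interface defined above, so a solution can be exercised locally:

import java.util.*;

class MockHtmlParser implements HtmlParser {
    private final Map<String, List<String>> links = new HashMap<>();

    MockHtmlParser(String[] urls, int[][] edges) {
        for (String u : urls) links.put(u, new ArrayList<>());
        for (int[] e : edges) links.get(urls[e[0]]).add(urls[e[1]]);  // edge [i,j]: URL i links to URL j
    }

    public List<String> getUrls(String url) {
        // The real parser performs a blocking HTTP fetch; this lookup is instant.
        return links.getOrDefault(url, Collections.emptyList());
    }
}

Feeding example 3's circular graph through this mock, a correct crawler returns all three pages exactly once.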

Constraints

  • 1 ≤ urls.length ≤ 1000
  • 1 ≤ urls[i].length ≤ 300
  • startUrl is one of the urls
  • All URLs follow the format: http://hostname/path
  • HtmlParser.getUrls(url) returns URLs within 15ms
  • Single-threaded solutions will exceed the time limit
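Rough arithmetic shows why the last constraint holds: with up to 1000 URLs at roughly 15 ms per blocking fetch, a sequential crawl can take 1000 × 15 ms = 15 s, while N threads fetching in parallel push the wall-clock time toward 15 s / N (under 2 s for N = 8), ignoring synchronization overhead.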

Visualization

[Figure: a thread-pool manager feeds a shared URL queue; four workers pull URLs, call the HTML parser, and synchronize on a shared visited set and a results list. Caption: 🚀 Concurrent Web Crawling: multiple threads process URLs simultaneously with safe coordination, for a significant speed improvement.]
Understanding the Visualization

  1. Initialize Resources: create the shared queue and visited set, then launch the worker threads.
  2. Parallel Processing: each thread takes URLs from the queue and processes them concurrently.
  3. Coordinate Updates: threads safely add newly discovered URLs to the queue and update shared state.
  4. Detect Completion: workers coordinate to detect when every URL has been processed (see the sketch below).

Key Takeaway

🎯 Key Insight: Multi-threading dramatically improves crawling performance by processing multiple URLs concurrently, while proper synchronization prevents race conditions and ensures correctness.
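Step 4 is the subtle part: an empty queue does not by itself mean the crawl is finished, because a worker may still be fetching a page whose links will refill the queue. One way (of several) to make completion explicit is a pending-work counter, sketched below with long-lived worker threads; the thread count of 4 and the 10 ms poll timeout are arbitrary choices, not requirements.

import java.util.*;
import java.util.concurrent.*;
import java.util.concurrent.atomic.AtomicInteger;

class WorkerPoolCrawler {
    public List<String> crawl(String startUrl, HtmlParser htmlParser) throws InterruptedException {
        String host = getHostname(startUrl);
        Set<String> visited = ConcurrentHashMap.newKeySet();
        BlockingQueue<String> queue = new LinkedBlockingQueue<>();
        AtomicInteger pending = new AtomicInteger(1);   // startUrl is already "in flight"
        visited.add(startUrl);
        queue.add(startUrl);

        Runnable worker = () -> {
            try {
                // pending counts URLs that are enqueued or being processed;
                // it can only reach zero once no more work can appear.
                while (pending.get() > 0) {
                    // Timed poll so idle workers periodically recheck the exit condition.
                    String url = queue.poll(10, TimeUnit.MILLISECONDS);
                    if (url == null) continue;
                    for (String next : htmlParser.getUrls(url)) {
                        if (getHostname(next).equals(host) && visited.add(next)) {
                            pending.incrementAndGet();   // count the child before enqueueing it
                            queue.add(next);
                        }
                    }
                    pending.decrementAndGet();           // this page is fully processed
                }
            } catch (InterruptedException ignored) { }
        };

        Thread[] workers = new Thread[4];
        for (int i = 0; i < workers.length; i++) (workers[i] = new Thread(worker)).start();
        for (Thread t : workers) t.join();
        return new ArrayList<>(visited);
    }

    // Same hostname helper as in the earlier sketches.
    private String getHostname(String url) {
        int start = url.indexOf("//") + 2;
        int end = url.indexOf('/', start);
        return end == -1 ? url.substring(start) : url.substring(start, end);
    }
}

Incrementing before enqueueing is what keeps the invariant: pending can never read zero while a reachable, unvisited URL still exists.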