Crawlware - Seravia's Deep Web Crawling System
Presented by
邹志乐
[email protected]
敬宓
[email protected]
Agenda
● What is Crawlware?
● Crawlware Architecture
● Job Model
● Payload Generation and Scheduling
● Rate Control
● Auto Deduplication
● Crawler testing with Sinatra
● Some problems we encountered & TODOs
What is Crawlware?
Crawlware is a distributed deep web crawling system that enables
scalable, site-friendly crawls of data that must be retrieved through
complex queries.
● Distributed: Executes across multiple machines
● Scalable: Scales up by adding extra machines and bandwidth
● Efficiency: Efficient use of various system resources
● Extensible: Extensible to new data formats, protocols, etc.
● Freshness: Able to capture data changes
● Continuous: Continuous crawling without administrator intervention
● Generic: Each crawling worker can crawl any given site
● Parallelization: Crawls all websites in parallel
● Anti-blocking: Precise rate control
A General Crawler Architecture
From "Introduction to Information Retrieval"
[Diagram: pages are fetched from the WWW (via DNS resolution), then parsed;
parsed content passes a content-seen check (doc fingerprints / dedup), then a
URL filter (robots templates), then duplicate URL elimination (URL set);
surviving URLs enter the URL Frontier, which feeds back into Fetch.]
Crawlware Architecture
High-Level Working Flow
Crawlware Architecture
Job Model
● XML syntax
● An assembly of reusable actions
● A context shared at runtime
● Job & Actions customized through properties

Actions include:
● HTTP Get, Post, Next Page
● Page Extractor, Link Extractor
● File Storage
● Assignment
● Code Snippet
Job Model Sample
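As an illustration of the model above, a hypothetical job definition might look like the following. All element, attribute, and property names here are illustrative assumptions, not Crawlware's actual schema:

```xml
<!-- Hypothetical job definition; element and attribute names are illustrative -->
<job name="example-site">
  <context>
    <property name="base_url" value="http://example.com/search"/>
  </context>
  <actions>
    <http-get url="${base_url}?q=${payload}"/>
    <page-extractor template="search_results"/>
    <link-extractor pattern="/detail/\d+"/>
    <next-page selector="a.next" max="10"/>
    <file-storage path="/data/example-site"/>
  </actions>
</job>
```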
Payload Generation & Scheduling
Key generators (one per payload type):
● KeyGeneratorIncrement, KeyGeneratorDecrement
● KeyGeneratorFile
● KeyGeneratorDate, KeyGeneratorDateRange, KeyGeneratorCustomRange
● KeyGeneratorDecorator, KeyGeneratorComposite

Working flow:
1) For each crawl job, a Key Generator pushes payloads into the Job DB.
2) The Scheduler loads payloads from the Job DB and reads frequency settings
from config files.
3) The Scheduler pushes payload blocks (1024 payloads each) onto the Redis
Queue.
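The generate-then-batch step can be sketched in plain Ruby. Class and method names below are illustrative assumptions, and plain arrays stand in for the Job DB and Redis Queue; this is a sketch of the flow, not Crawlware's actual implementation:

```ruby
# Generates payloads over a numeric ID range, in the spirit of
# KeyGeneratorIncrement (the class name here is illustrative).
class IncrementKeyGenerator
  def initialize(from:, to:)
    @range = from..to
  end

  def payloads
    @range.map { |id| { key: id.to_s } }
  end
end

BLOCK_SIZE = 1024  # payloads per block pushed onto the queue

# Splits a job's payloads into fixed-size blocks, as the Scheduler
# does before pushing them onto the Redis Queue.
def payload_blocks(payloads)
  payloads.each_slice(BLOCK_SIZE).to_a
end

gen = IncrementKeyGenerator.new(from: 1, to: 3000)
blocks = payload_blocks(gen.payloads)
# 3000 payloads -> three blocks of 1024, 1024, and 952
```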
Rate Control
● Site frequency configuration
● A given site's payload count within a payload block is determined by its
crawling frequency.
● The Scheduler controls the crawling rate of the entire system (N crawler
nodes/IPs).
● A Worker Controller controls the crawling rate of a single node/IP.

Working flow: each Worker Controller pulls payload blocks from the Redis
Queue into an in-memory queue, from which its local Workers pull individual
payloads.
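One common way to enforce such a per-node request rate is a token bucket. The sketch below illustrates the generic technique, not Crawlware's actual implementation; the clock is injectable so the behavior can be tested without sleeping:

```ruby
# Generic token-bucket rate limiter of the kind a Worker Controller
# could use to cap one node's request rate for a site (a sketch only).
class TokenBucket
  def initialize(rate:, burst:, clock: -> { Process.clock_gettime(Process::CLOCK_MONOTONIC) })
    @rate = rate.to_f       # tokens added per second
    @capacity = burst.to_f  # maximum burst size
    @tokens = @capacity
    @clock = clock
    @last = clock.call
  end

  # Returns true if a request may proceed now, false if it must wait.
  def allow?
    now = @clock.call
    @tokens = [@capacity, @tokens + (now - @last) * @rate].min
    @last = now
    return false if @tokens < 1
    @tokens -= 1
    true
  end
end

# Fake clock: 1 request/second with a burst of 2.
t = 0.0
bucket = TokenBucket.new(rate: 1, burst: 2, clock: -> { t })
bucket.allow?  # => true  (burst token 1)
bucket.allow?  # => true  (burst token 2)
bucket.allow?  # => false (bucket empty)
t += 1.0
bucket.allow?  # => true  (one token refilled after 1s)
```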
Auto Deduplication
Working flow:
1) The Crawl Job crawls a brief page and extracts links.
2) The links are deduplicated through the Dedup RPC: links not yet in the
Dedup DB are recorded there as unique.
3) The unique links are pushed into the Job DB through the Job DB RPC, to be
crawled later.
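The dedup step can be sketched with an in-memory Set standing in for the Dedup DB. In the real system this check sits behind an RPC service; the class and method names below are illustrative:

```ruby
require 'set'

# Link deduplication sketch: an in-memory Set stands in for the Dedup DB.
class LinkDeduper
  def initialize
    @seen = Set.new  # stand-in for the Dedup DB
  end

  # Returns only the links not seen before, recording them as seen.
  # These are the links that would be pushed into the Job DB.
  # Set#add? returns nil when the element was already present.
  def unique_links(links)
    links.select { |link| @seen.add?(link) }
  end
end

deduper = LinkDeduper.new
deduper.unique_links(%w[/detail/1 /detail/2 /detail/1])  # => ["/detail/1", "/detail/2"]
deduper.unique_links(%w[/detail/2 /detail/3])            # => ["/detail/3"]
```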
Crawler Testing with Sinatra
● What is Sinatra?
  ● A lightweight Ruby DSL for building web applications.
● Crawler testing
  ● Simulate various crawling actions via HTTP, such as Get, Post, NextPage.
  ● Simulate job profiles.
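A minimal sketch of such a mock site (the routes, record data, and pagination helper are all illustrative assumptions): a Sinatra app serving fake paginated search results, so a crawler's Get, Post, and NextPage actions can be exercised against localhost. The pagination helper is plain Ruby, so it can be tested without booting the server:

```ruby
require 'json'
begin
  require 'sinatra/base'  # the mock server needs the sinatra gem
rescue LoadError
  # Without sinatra the pagination helper below still runs standalone.
end

# Pagination logic for the fake search results (pure Ruby, testable alone).
module MockPaging
  PAGE_SIZE = 10

  def self.page_of(records, page)
    items = records[(page - 1) * PAGE_SIZE, PAGE_SIZE] || []
    next_page = page * PAGE_SIZE < records.size ? page + 1 : nil
    { items: items, next_page: next_page }
  end
end

if defined?(Sinatra::Base)
  # Mock site with 35 fake records, 10 per page, for testing a crawler
  # against http://localhost:4567 instead of a production site.
  class MockSite < Sinatra::Base
    RECORDS = (1..35).map { |i| { id: i, name: "record-#{i}" } }

    get '/search' do
      page = (params['page'] || 1).to_i
      content_type :json
      JSON.generate(MockPaging.page_of(RECORDS, page))
    end

    post '/login' do
      params['user'] == 'tester' ? 'ok' : halt(401)
    end
  end

  # MockSite.run! port: 4567   # start the server for a manual test run
end
```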
Encountered Problems & TODOs
● Changing site load/performance
  ● Monitoring
  ● Dynamic rate switch based on time zone
● Page Correctness
  ● Page tracker – continuous errors or identical pages
● Data Freshness
  ● Scheduled updates
  ● Crawl delta for ID or date range payloads
  ● Recrawl for keyword payloads
● JavaScript
Thank You
Please contact [email protected] for job opportunities.