
Crawlware - Seravia's Deep Web Crawling System

Presented by

邹志乐
[email protected]

敬宓
[email protected]
Agenda
● What is Crawlware?
● Crawlware Architecture
● Job Model
● Payload Generation and Scheduling
● Rate Control
● Auto Deduplication
● Crawler testing with Sinatra
● Some problems we encountered & TODOs
What is Crawlware?
Crawlware is a distributed deep web crawling system that enables
scalable and polite crawls of data that must be retrieved through
complex queries.
● Distributed: Executes across multiple machines
● Scalable: Scales up by adding extra machines and bandwidth
● Efficient: Makes efficient use of various system resources
● Extensible: Extensible to new data formats, protocols, etc.
● Freshness: Able to capture data changes
● Continuous: Crawls continuously without administrator intervention
● Generic: Each crawling worker can crawl any given site
● Parallel: Crawls all websites in parallel
● Anti-blocking: Precise rate control
A General Crawler Architecture
From "Introduction to Information Retrieval"

[Diagram] The classic crawler pipeline: the URL Frontier feeds the Fetch module, which resolves hosts via DNS and retrieves pages from the WWW. Fetched content is parsed; a "Content Seen?" check against document fingerprints (Doc FP's) discards duplicate content. Extracted links then pass through a URL Filter (using robots templates) and duplicate URL elimination (against the URL Set) before re-entering the URL Frontier.
Crawlware Architecture
High Level Working Flows

[Architecture diagrams not reproduced in this text version.]
Job Model
● XML syntax
● An assembly of reusable actions
● A context shared at runtime
● Jobs & actions customized through properties

Built-in actions:
● HTTP Get, Post, Next Page
● Page Extractor, Link Extractor
● File Storage
● Assignment
● Code Snippet
Job Model Sample
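The original slide's sample is not reproduced in this text version. The sketch below illustrates what a job definition in this model might look like, assembled from the actions listed above; all element names, property names, and URLs are hypothetical, not Crawlware's actual schema.

```xml
<job name="example-directory-crawl">
  <!-- Actions run in order; each reads and writes the shared runtime context -->
  <action type="HttpGet">
    <property name="url" value="http://example.com/search?q=${keyword}"/>
  </action>
  <action type="LinkExtractor">
    <property name="xpath" value="//div[@class='result']//a/@href"/>
  </action>
  <action type="NextPage">
    <property name="maxPages" value="10"/>
  </action>
  <action type="FileStorage">
    <property name="dir" value="/data/crawls/example"/>
  </action>
</job>
```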
Payload Generation & Scheduling

Key generator types:
● KeyGeneratorIncrement / KeyGeneratorDecrement
● KeyGeneratorFile
● KeyGeneratorDecorator
● KeyGeneratorDate / KeyGeneratorDateRange
● KeyGeneratorCustomRange
● KeyGeneratorComposite

[Diagram] Flow:
1) For each crawl job, a Key Generator (driven by config files) pushes payloads into the Job DB.
2) The Scheduler loads payloads from the Job DB and reads each job's frequency settings.
3) The Scheduler pushes payload blocks (1024 payloads each) onto the Redis Queue, interleaving payloads from the different crawl jobs.
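The generation-and-batching step can be sketched in a few lines of Ruby: a key generator yields keys (numeric IDs, dates, keywords) and the scheduler groups them into fixed-size payload blocks for the queue. Class and method names here are hypothetical, not Crawlware's actual API.

```ruby
# Sketch of payload generation: a key generator yields keys, and the
# scheduler batches them into 1024-payload blocks for the Redis queue.
BLOCK_SIZE = 1024

class KeyGeneratorIncrement
  def initialize(from, to)
    @range = (from..to)
  end

  # Yield each key in ascending order.
  def each_key(&block)
    @range.each(&block)
  end
end

# Group a generator's keys into payload blocks ready to be queued.
def payload_blocks(generator)
  blocks = []
  buffer = []
  generator.each_key do |key|
    buffer << key
    if buffer.size == BLOCK_SIZE
      blocks << buffer
      buffer = []
    end
  end
  blocks << buffer unless buffer.empty?
  blocks
end

blocks = payload_blocks(KeyGeneratorIncrement.new(1, 3000))
# 3000 keys -> three blocks of 1024, 1024, and 952 payloads
```

The other generator types listed above would plug into the same `each_key` interface; `KeyGeneratorComposite` and `KeyGeneratorDecorator` suggest generators can be combined and wrapped.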
Rate Control
● Site frequency configuration
● The number of a given site's payloads in a payload block is determined by its crawling frequency.
● The Scheduler controls the crawling rate of the entire system (N crawler nodes/IPs).
● The Worker Controller controls the crawling rate of a single node/IP.

[Diagram] The Scheduler pushes payload blocks onto the Redis Queue; each Worker Controller pulls payload blocks into its node's in-memory queue, from which individual Workers pull payloads.
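One way to read "a site's payload amount in a block is determined by its crawling frequency" is a proportional allocation: each site's share of a 1024-payload block tracks its configured rate. The sketch below shows that idea under this assumption; the function name and frequency units are hypothetical.

```ruby
# Sketch of frequency-proportional block composition: a site with 3x the
# configured crawl frequency (e.g. requests/minute) gets 3x the payload
# slots in each 1024-payload block.
BLOCK_SIZE = 1024

def block_quota(site_frequencies)
  total = site_frequencies.values.sum.to_f
  site_frequencies.transform_values do |freq|
    (BLOCK_SIZE * freq / total).floor
  end
end

quota = block_quota("site-a" => 60, "site-b" => 20, "site-c" => 20)
# site-a gets three times the payload slots of site-b or site-c
```

Because the scheduler fixes per-site shares globally while worker controllers pace each node, both system-wide and per-IP rates stay bounded.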
Auto Deduplication

[Diagram] A crawl job crawls a brief page and extracts links; the links are deduplicated through the Dedup RPC, which checks them against the Dedup DB, stores the unique ones there, and pushes the unique links into the Job DB via the Job DB RPC.
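The dedup step boils down to membership checks against a seen-set. In Crawlware that set lives in a dedup DB behind an RPC; in the dependency-free sketch below a plain Ruby `Set` stands in for it, and the class name is hypothetical.

```ruby
require "set"

# Sketch of the auto-dedup step: extracted links are filtered against a
# set of already-seen links, and only unseen links survive to be pushed
# into the job DB. A Set stands in for the dedup DB here.
class LinkDeduper
  def initialize
    @seen = Set.new
  end

  # Return only links not seen before, recording each as seen.
  # Set#add? returns nil when the element was already present.
  def dedup(links)
    links.select { |link| @seen.add?(link) }
  end
end

deduper = LinkDeduper.new
fresh = deduper.dedup(%w[/a /b /a /c])  # ["/a", "/b", "/c"]
again = deduper.dedup(%w[/a /d])        # ["/d"]
```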
Crawler Testing with Sinatra

● What is Sinatra?

Crawler Testing
● Simulate various crawling actions via HTTP, such as Get, Post, NextPage
● Simulate job profiles
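A real Sinatra fixture needs the sinatra gem and a running server; the dependency-free sketch below shows the same idea with a plain hash of route handlers standing in for Sinatra's `get "/list" do ... end` blocks, so a NextPage-style crawl can be exercised without the network. All routes and helper names are hypothetical.

```ruby
# Sketch of a fake site for crawler tests: each route is a lambda that
# returns a response body, mimicking what a Sinatra app would serve.
fake_site = {
  "/list?page=1" => ->(_params) { "<a href='/item/1'>1</a> <a href='/list?page=2'>next</a>" },
  "/list?page=2" => ->(_params) { "<a href='/item/2'>2</a>" } # last page: no next link
}

# Dispatch a path to its handler, defaulting to a 404 body.
def fetch(site, path)
  handler = site.fetch(path) { ->(_p) { "404" } }
  handler.call({})
end

# Simulate a NextPage action: follow "next" links until none remain.
def crawl_pages(site, start)
  pages = []
  path = start
  while path
    body = fetch(site, path)
    pages << body
    path = body[/href='([^']*list[^']*)'/, 1] # next list link, or nil
  end
  pages
end

pages = crawl_pages(fake_site, "/list?page=1")
# pages.size == 2: the crawl stops when no next link is found
```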
Encountered Problems & TODOs
● Changing site load/performance
● Monitoring
● Dynamic rate switching based on time zone

Page Correctness
● Page tracker – continuous errors or identical pages

Data Freshness
● Scheduled updates
● Crawl deltas for ID or date-range payloads
● Recrawl for keyword payloads

JavaScript
Thank You

Please contact [email protected] for job opportunities.
