The purpose of this document is to describe the flow of creating and submitting jobs that fetch web pages from a given seed of URLs. It covers:
1. NUTCH REST API
2. Workflow to process URLs
NUTCH REST API
Each resource below is listed with its method and path, request body format, a sample response, and a description.

Create Seed List
  Resource: POST /seed/create
  Request Body:
    {
      "id": "12345",
      "name": "nutch",
      "seedUrls": [
        {
          "id": 1,
          "seedList": null,
          "url": "http://nutch.apache.org/"
        }
      ]
    }
  Response: /tmp/1428399198700-0
  Description: Creates a seed file with the URLs to be processed. The response contains plain text with the path to the directory holding the generated seed file.

Submit Nutch Job for execution
  Resource: POST /job/create
  Request Body:
    {
      "args": {
        "seedDir": "/tmp/1428399198700-0"
      },
      "confId": "default",
      "crawlId": "sample-crawl-01",
      "type": "INJECT"
    }
  Response: sample-crawl-01-default-INJECT-16610159
  Description: Submits a job of the specified type for execution. The response contains plain text with the job id.
  Request Body definition:
    args    - map of job parameters; e.g. for the INJECT job type, "seedDir" is a required parameter that defines the location of the seed file.
    confId  - job execution configuration id. The default Nutch configuration is used in this example.
    crawlId - crawl id.
    type    - type of job to run. Possible values:
              1. INJECT
              2. GENERATE
              3. FETCH
              4. PARSE
              5. UPDATEDB
              6. INDEX
              7. READDB
              8. CLASS
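As an illustration only (not part of the original flow description), the two resources above can be called with the Python requests library; the server address http://localhost:8081 is an assumption and should be replaced with the address of the running Nutch REST server.

  import requests

  NUTCH = "http://localhost:8081"  # assumed address of the Nutch REST server

  # Create a seed list; the plain-text response is the path to the seed directory.
  seed = {
      "id": "12345",
      "name": "nutch",
      "seedUrls": [{"id": 1, "seedList": None, "url": "http://nutch.apache.org/"}],
  }
  seed_dir = requests.post(NUTCH + "/seed/create", json=seed).text.strip()

  # Submit an INJECT job that points at the generated seed directory;
  # the plain-text response is the job id.
  job = {
      "args": {"seedDir": seed_dir},
      "confId": "default",
      "crawlId": "sample-crawl-01",
      "type": "INJECT",
  }
  job_id = requests.post(NUTCH + "/job/create", json=job).text.strip()
  print(job_id)  # e.g. sample-crawl-01-default-INJECT-16610159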
Get Nutch Job Status
  Resource: GET /job/{jobId}
  Response: JSON with the job status (omitted due to its size)
  Description: Returns the job status with detailed information.
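A minimal sketch, for illustration, of polling this resource until the job reaches a terminal state. "FINISHED" is the state shown in the sample status JSON in step 4 of the workflow below; the failure states and the server address are assumptions.

  import time
  import requests

  NUTCH = "http://localhost:8081"  # assumed address of the Nutch REST server

  def wait_for_job(job_id, poll_seconds=5):
      # Poll GET /job/{jobId}; "FINISHED" is taken from the sample status JSON,
      # the failure states listed here are assumed terminal values.
      while True:
          state = requests.get(NUTCH + "/job/" + job_id).json().get("state")
          if state in ("FINISHED", "FAILED", "KILLED"):
              return state
          time.sleep(poll_seconds)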
Stop Running Nutch Job
  Resource: GET /job/{jobId}/stop?crawlId={crawlId}
  Response: (true|false)
  Description: Stops the job with the possibility to resume it later.

Kill Running Nutch Job
  Resource: GET /job/{jobId}/abort?crawlId={crawlId}
  Response: (true|false)
  Description: Kills the job.

Get Nutch Server Status
  Resource: GET /admin
  Response: JSON with the server status (omitted due to its size)
  Description: Returns the Nutch REST server status.
Get Named Configuration
  Resource: GET /config/{configId}
  Response:
    {
      ...
      "anchorIndexingFilter.deduplicate": "false",
      "crawl.gen.delay": "604800000",
      "db.fetch.interval.default": "2592000",
      "db.fetch.interval.max": "7776000",
      "db.fetch.retry.max": "3",
      "db.fetch.schedule.adaptive.dec_rate": "0.2",
      "db.fetch.schedule.adaptive.inc_rate": "0.4",
      "db.fetch.schedule.adaptive.max_interval": "31536000.0",
      "db.fetch.schedule.adaptive.min_interval": "60.0",
      "db.fetch.schedule.adaptive.sync_delta": "true",
      "db.fetch.schedule.adaptive.sync_delta_rate": "0.3",
      "db.fetch.schedule.class": "org.apache.nutch.crawl.DefaultFetchSchedule",
      ...
    }
  Description: Returns the named configuration with the specified parameters and values.
Save/Update Named Config
  Resource: POST /config/{configId}
  Request Body:
    {
      "configId": "generate-${ID}",
      "force": "false",
      "params": {
        "nutch.conf.uuid": "fd777fcc-48e9-4f3f-94a5-841b0bf1de96",
        "anchorIndexingFilter.deduplicate": "false",
        "crawl.gen.delay": "604800000",
        "db.fetch.interval.default": "2592000",
        "db.fetch.interval.max": "7776000",
        "db.fetch.retry.max": "3",
        "db.fetch.schedule.adaptive.dec_rate": "0.2",
        "db.fetch.schedule.adaptive.inc_rate": "0.4"
      }
    }
  Response: configId
  Description: To differentiate job configurations, supply a "nutch.conf.uuid" parameter with a uuid dedicated to the job. A batch id could be used as well.
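A minimal sketch (assuming the Python requests library and a local Nutch REST server) of saving a named configuration whose id doubles as the nutch.conf.uuid parameter, so that jobs submitted with this confId can be told apart:

  import uuid
  import requests

  NUTCH = "http://localhost:8081"  # assumed address of the Nutch REST server

  config_id = str(uuid.uuid4())
  config = {
      "configId": config_id,
      "force": "false",
      "params": {
          # The same uuid is used as the config id and as nutch.conf.uuid.
          "nutch.conf.uuid": config_id,
          "anchorIndexingFilter.deduplicate": "false",
          "crawl.gen.delay": "604800000",
      },
  }
  requests.post(NUTCH + "/config/" + config_id, json=config)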
Workflow to fetch and process URLs
In order to fetch and process URLs, the following sequence of steps (phases) should be performed:
1. Create Seed
2. Inject phase
3. Generate phase
4. Fetch phase
5. Parse phase
6. UpdateDB phase
7. Index phase
In order to reach the required link depth, the Generate, Fetch, Parse, UpdateDB and (optionally) Index phases (steps #5-#9 in the table below) should be repeated N times (rounds), where N is the depth. A sketch of the whole loop follows; the individual steps are detailed after it.
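The sketch below is an illustration only, written with the Python requests library; the server address, the chosen depth, the random batch-id suffix format, and the non-FINISHED job states are assumptions, while the endpoints, job types, and arguments follow the steps described in this document.

  import random
  import time
  import uuid
  import requests

  NUTCH = "http://localhost:8081"  # assumed address of the Nutch REST server
  CRAWL_ID = "sample-crawl-01"
  DEPTH = 3                        # number of generate/fetch/parse/updatedb rounds

  def submit_and_wait(job_type, args, conf_id):
      # Submit a job via POST /job/create and poll GET /job/{jobId} until it
      # reaches a terminal state. "FINISHED" is taken from the sample status in
      # step 4; the failure states are assumptions.
      body = {"args": args, "confId": conf_id, "crawlId": CRAWL_ID, "type": job_type}
      job_id = requests.post(NUTCH + "/job/create", json=body).text.strip()
      while True:
          state = requests.get(NUTCH + "/job/" + job_id).json().get("state")
          if state in ("FINISHED", "FAILED", "KILLED"):
              return state
          time.sleep(5)

  # Step 1: create the seed list.
  seed_dir = requests.post(NUTCH + "/seed/create", json={
      "id": "12345", "name": "nutch",
      "seedUrls": [{"id": 1, "seedList": None, "url": "http://nutch.apache.org/"}],
  }).text.strip()

  # Step 2: create a dedicated configuration identified by nutch.conf.uuid.
  conf_id = str(uuid.uuid4())
  requests.post(NUTCH + "/config/" + conf_id, json={
      "configId": conf_id, "force": "false",
      "params": {"nutch.conf.uuid": conf_id, "fetcher.timelimit.mins": 180},
  })

  # Steps 3-4: inject the seed list and wait for the job to finish.
  submit_and_wait("INJECT", {"seedDir": seed_dir}, conf_id)

  # Steps 5-8, one round per level of link depth.
  for _ in range(DEPTH):
      cur_time = int(time.time() * 1000)                     # current time in ms
      batch = "%d-%d" % (cur_time, random.randint(0, 9999))  # round id: time + random int
      submit_and_wait("GENERATE", {"normalize": False, "filter": True,
                                   "crawlId": CRAWL_ID, "curTime": cur_time,
                                   "batch": batch}, conf_id)
      submit_and_wait("FETCH", {"threads": 50, "crawlId": CRAWL_ID, "batch": batch}, conf_id)
      submit_and_wait("PARSE", {"crawlId": CRAWL_ID, "batch": batch}, conf_id)
      submit_and_wait("UPDATEDB", {"crawlId": CRAWL_ID, "batch": batch}, conf_id)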
Each step below lists the resource, request body, and a sample response.

Step 1. Create Seed List with URLs to be fetched and processed
  Resource: POST /seed/create
  Request Body:
    {
      "id": "12345",
      "name": "doandodge",
      "seedUrls": [
        {
          "id": 1,
          "seedList": null,
          "url": "http://nutch.apache.org/"
        }
      ]
    }
  Response: /tmp/1428399198700-0
Step 2. Create Configuration for all phases (modify the "default" config)
  Resource: POST /config/1c71cd51-b19c-4963-980e-6eb688c54b46
  Request Body:
    {
      "configId": "1c71cd51-b19c-4963-980e-6eb688c54b46",
      "force": "false",
      "params": {
        "nutch.conf.uuid": "1c71cd51-b19c-4963-980e-6eb688c54b46",
        "mapred.reduce.tasks.speculative.execution": false,
        "mapred.map.tasks.speculative.execution": false,
        "mapred.compress.map.output": true,
        "mapred.reduce.tasks": 2,
        "fetcher.timelimit.mins": 180,
        "mapred.skip.attempts.to.start.skipping": 2,
        "mapred.skip.map.max.skip.records": 1,
        ...
      }
    }
  Response: 1c71cd51-b19c-4963-980e-6eb688c54b46
Step 3. Inject Seed List
  Resource: POST /job/create
  Request Body:
    {
      "args": {
        "seedDir": "/tmp/1428399198700-0"
      },
      "confId": "1c71cd51-b19c-4963-980e-6eb688c54b46",
      "crawlId": "sample-crawl-01",
      "type": "INJECT"
    }
  Response: sample-crawl-01-default-INJECT-16610159
Step 4. Check Job Status
  Resource: GET /job/sample-crawl-01-default-INJECT-16610159
  Response:
    {
      "args": {
        "seedDir": "/tmp/1428399198700-0"
      },
      "confId": "default",
      "crawlId": "sample-crawl-01",
      "id": "sample-crawl-01-default-INJECT-16610159",
      "msg": "OK",
      "result": {
        "jobs": {
          "[sample-crawl-01]inject /tmp/1428399198700-0-job_local628542550_0001": {
            "counters": {
              "File Input Format Counters ": {
                "BYTES_READ": 117
              },
              "File Output Format Counters ": {
                "BYTES_WRITTEN": 0
              },
              "FileSystemCounters": {
                "FILE_BYTES_READ": 608229,
                "FILE_BYTES_WRITTEN": 692685
              },
              "Map-Reduce Framework": {
                "COMMITTED_HEAP_BYTES": 95944704,
                "CPU_MILLISECONDS": 0,
                "MAP_INPUT_RECORDS": 1,
                "MAP_OUTPUT_RECORDS": 1,
                "PHYSICAL_MEMORY_BYTES": 0,
                "SPILLED_RECORDS": 0,
                "SPLIT_RAW_BYTES": 118,
                "VIRTUAL_MEMORY_BYTES": 0
              },
              "injector": {
                "urls_injected": 1
              }
            },
            "jobID": {
              "id": 1,
              "jtIdentifier": "local628542550"
            },
            "jobName": "[sample-crawl-01]inject /tmp/1428399198700-0"
          }
        }
      },
      "state": "FINISHED",
      "type": "INJECT"
    }
Step 5. Run Generate Job
  Resource: POST /job/create
  Request Body:
    {
      "args": {
        "normalize": false,
        "filter": true,
        "crawlId": "sample-crawl-01",
        "curTime": 1428526896161,       // current time; generated anew on each round (after the inject phase on the first round)
        "batch": "1428526896161-4430"   // round id (time + random int)
      },
      "confId": "1c71cd51-b19c-4963-980e-6eb688c54b46",
      "crawlId": "sample-crawl-01",
      "type": "GENERATE"
    }
  Response: sample-crawl-01-default-INJECT-679791135
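For illustration only, a small sketch of generating the curTime and batch arguments before each round; the format of the random suffix is an assumption based on the sample value above. The same batch value is reused by the Fetch, Parse, and UpdateDB jobs of the round.

  import random
  import time

  cur_time = int(time.time() * 1000)                     # current time in milliseconds
  batch = "%d-%d" % (cur_time, random.randint(0, 9999))  # round id, e.g. "1428526896161-4430"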
Step 6. Run Fetch Job to fetch content
  Resource: POST /job/create
  Request Body:
    {
      "args": {
        "threads": 50,
        "crawlId": "sample-crawl-01",
        "batch": "1428526896161-4430"
      },
      "confId": "1c71cd51-b19c-4963-980e-6eb688c54b46",
      "crawlId": "sample-crawl-01",
      "type": "FETCH"
    }
  Response: sample-crawl-01-fetch-FETCH-326084837
Step 7. Run Parse Job to parse downloaded content
  Resource: POST /job/create
  Request Body:
    {
      "args": {
        "crawlId": "sample-crawl-01",
        "batch": "1428526896161-4430"
      },
      "confId": "1c71cd51-b19c-4963-980e-6eb688c54b46",
      "crawlId": "sample-crawl-01",
      "type": "PARSE"
    }
  Response: sample-crawl-01-parse-PARSE-1159653222
Step 8. Run DB Update Job
  Resource: POST /job/create
  Request Body:
    {
      "args": {
        "crawlId": "sample-crawl-01",
        "batch": "1428526896161-4430"
      },
      "confId": "1c71cd51-b19c-4963-980e-6eb688c54b46",
      "crawlId": "sample-crawl-01",
      "type": "UPDATEDB"
    }
  Response: sample-crawl-01-parse-UPDATEDB-610630639
Step 9. Run Index Job (optional step; additional indexer configuration needs to be applied)
  Resource: POST /job/create
  Request Body:
    {
      "args": {
        "crawlId": "sample-crawl-01",
        "batch": "1428496122-4430"
      },
      "confId": "1c71cd51-b19c-4963-980e-6eb688c54b46",
      "crawlId": "sample-crawl-01",
      "type": "INDEX"
    }
  Response: sample-crawl-01-parse-INDEX-B-610630639
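As a hedged illustration of the "additional indexer configuration" mentioned in step 9 (not taken from this document): one possible approach, assuming Nutch's Solr indexer plugin is enabled, is to re-save the named configuration from step 2 with the Solr endpoint via the solr.server.url property before submitting the INDEX job. The exact properties depend on which indexer plugin the Nutch installation uses.

  import requests

  NUTCH = "http://localhost:8081"                  # assumed Nutch REST server address
  conf_id = "1c71cd51-b19c-4963-980e-6eb688c54b46"

  # Re-save the named configuration with an (assumed) Solr endpoint;
  # "force" is assumed to allow overwriting an existing named configuration.
  requests.post(NUTCH + "/config/" + conf_id, json={
      "configId": conf_id,
      "force": "true",
      "params": {
          "nutch.conf.uuid": conf_id,
          "solr.server.url": "http://localhost:8983/solr/nutch",  # assumed Solr core URL
      },
  })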