Nutch API Documentation

The purpose of this document is to describe the flow of creating and submitting jobs to fetch web pages from a given seed list of URLs.

1. NUTCH REST API
2. Workflow to process URLs

NUTCH REST API


Create Seed List
POST /seed/create/
Request Body:
{
  "id": "12345",
  "name": "nutch",
  "seedUrls": [
    {
      "id": 1,
      "seedList": null,
      "url": "http://nutch.apache.org/"
    }
  ]
}
Response: /tmp/1428399198700-0
Description: Creates a seed file with the URLs to be processed. The response is plain text containing the path to the directory with the generated seed file.
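As an illustration, here is a minimal Python sketch of this call, assuming the third-party requests library and a Nutch REST server listening on localhost:8081 (adjust the base URL for your deployment):

# Minimal sketch: create a seed list via the Nutch REST API.
# Assumes the server base URL below; adjust for your deployment.
import requests

NUTCH = "http://localhost:8081"

seed_list = {
    "id": "12345",
    "name": "nutch",
    "seedUrls": [
        {"id": 1, "seedList": None, "url": "http://nutch.apache.org/"},
    ],
}

resp = requests.post(f"{NUTCH}/seed/create/", json=seed_list)
resp.raise_for_status()
seed_dir = resp.text.strip()  # plain-text path, e.g. /tmp/1428399198700-0
print("Seed directory:", seed_dir)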

Submit Nutch Job for execution
POST /job/create
Request Body:
{
  "args": {
    "seedDir": "/tmp/1428399198700-0"
  },
  "confId": "default",
  "crawlId": "sample-crawl-01",
  "type": "INJECT"
}
Response: sample-crawl-01-default-INJECT-16610159
Description: Submits a job of the specified type for execution. The response is plain text containing the job id.

Request Body definition:
args     Map of job parameters; e.g. for the INJECT job type, "seedDir" is a required parameter that defines the location of the seed file.
confId   Job execution configuration. The default Nutch configuration is used in this example.
crawlId  Crawl id.
type     Type of job to run. Possible values:
         1. INJECT
         2. GENERATE
         3. FETCH
         4. PARSE
         5. UPDATEDB
         6. INDEX
         7. READDB
         8. CLASS
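Continuing the sketch above, submitting an INJECT job with the returned seed directory might look like this (NUTCH and seed_dir come from the previous snippet):

# Sketch: submit an INJECT job for the seed directory created above.
job_request = {
    "args": {"seedDir": seed_dir},
    "confId": "default",
    "crawlId": "sample-crawl-01",
    "type": "INJECT",
}

resp = requests.post(f"{NUTCH}/job/create", json=job_request)
resp.raise_for_status()
job_id = resp.text.strip()  # plain-text job id, e.g. sample-crawl-01-default-INJECT-16610159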

Get Nutch Job Status
GET /job/{jobId}
Response: JSON with the job status (omitted due to its size)
Description: Gets the job status with detailed information.
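Since job submission is asynchronous, a client typically polls this endpoint until the job reaches a terminal state. A sketch of such a helper (the FINISHED value appears in the sample status in step 4 below; treating FAILED and KILLED as terminal states is an assumption):

# Sketch: poll the job status endpoint until a terminal state.
# "FINISHED" is shown in the sample response in step 4 below;
# "FAILED"/"KILLED" as terminal states are an assumption.
import time

def wait_for_job(job_id, poll_seconds=5):
    while True:
        status = requests.get(f"{NUTCH}/job/{job_id}").json()
        if status.get("state") in ("FINISHED", "FAILED", "KILLED"):
            return status
        time.sleep(poll_seconds)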

Stop Running Nutch Job
GET /job/{jobId}/stop?crawlId={crawlId}
Response: (true|false)
Description: Stops the job with the possibility to resume it later.

Kill Running Nutch Job
GET /job/{jobId}/abort?crawlId={crawlId}
Response: (true|false)
Description: Kills the job.
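Both endpoints are plain GETs and return true or false; sketched with the same assumed client:

# Sketch: stop a job (resumable) or abort it outright.
stopped = requests.get(f"{NUTCH}/job/{job_id}/stop",
                       params={"crawlId": "sample-crawl-01"}).text
aborted = requests.get(f"{NUTCH}/job/{job_id}/abort",
                       params={"crawlId": "sample-crawl-01"}).text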

Get Nutch Server Status
GET /admin
Response: JSON with the server status (omitted due to its size)
Description: Returns the Nutch REST server status.

Get Named Configuration
GET /config/{configId}
Response:
{
  ...
  "anchorIndexingFilter.deduplicate": "false",
  "crawl.gen.delay": "604800000",
  "db.fetch.interval.default": "2592000",
  "db.fetch.interval.max": "7776000",
  "db.fetch.retry.max": "3",
  "db.fetch.schedule.adaptive.dec_rate": "0.2",
  "db.fetch.schedule.adaptive.inc_rate": "0.4",
  "db.fetch.schedule.adaptive.max_interval": "31536000.0",
  "db.fetch.schedule.adaptive.min_interval": "60.0",
  "db.fetch.schedule.adaptive.sync_delta": "true",
  "db.fetch.schedule.adaptive.sync_delta_rate": "0.3",
  "db.fetch.schedule.class": "org.apache.nutch.crawl.DefaultFetchSchedule",
  ...
}
Description: Returns the named configuration with its parameters and values.
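For example, reading a single parameter from the default configuration (a sketch under the same assumptions as above):

# Sketch: fetch a named configuration and read one parameter.
params = requests.get(f"{NUTCH}/config/default").json()
print(params.get("db.fetch.schedule.class"))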

Save/Update Named Config
POST /config/{configId}
Request Body:
{
  "configId": "generate-${ID}",
  "force": "false",
  "params": {
    "nutch.conf.uuid": "fd777fcc-48e9-4f3f-94a5-841b0bf1de96",
    "anchorIndexingFilter.deduplicate": "false",
    "crawl.gen.delay": "604800000",
    "db.fetch.interval.default": "2592000",
    "db.fetch.interval.max": "7776000",
    "db.fetch.retry.max": "3",
    "db.fetch.schedule.adaptive.dec_rate": "0.2",
    "db.fetch.schedule.adaptive.inc_rate": "0.4"
  }
}
Response: configId
Description: To differentiate job configurations, supply a "nutch.conf.uuid" parameter with a UUID dedicated to the job. A batch id could be used.
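A sketch of creating such a per-job configuration, generating a fresh UUID so that the config id and nutch.conf.uuid match (the parameter values are illustrative):

# Sketch: register a named configuration whose id doubles as its
# nutch.conf.uuid so that jobs run with it can be told apart.
import uuid

conf_id = str(uuid.uuid4())
config = {
    "configId": conf_id,
    "force": "false",
    "params": {
        "nutch.conf.uuid": conf_id,
        "fetcher.timelimit.mins": 180,  # illustrative value
    },
}
requests.post(f"{NUTCH}/config/{conf_id}", json=config).raise_for_status()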
Workflow to fetch and process URLs

In order to fetch and process URLs, the following sequence of steps/phases should be performed:
1. Create Seed
2. Inject phase
3. Generate phase
4. Fetch phase
5. Parse phase
6. UpdateDB phase
7. Index phase
To reach the required link depth, steps #5-#9 in the table below (Generate through Index) should be repeated N times (rounds), where N is the depth; a sketch of this loop follows step 9.

Step 1. Create Seed List with URLs to be fetched and processed
POST /seed/create
Request Body:
{
  "id": "12345",
  "name": "doandodge",
  "seedUrls": [
    {
      "id": 1,
      "seedList": null,
      "url": "http://nutch.apache.org/"
    }
  ]
}
Response: /tmp/1428399198700-0

Step 2. Create Configuration for all phases (modify the "default" config)
POST /config/1c71cd51-b19c-4963-980e-6eb688c54b46
Request Body:
{
  "configId": "1c71cd51-b19c-4963-980e-6eb688c54b46",
  "force": "false",
  "params": {
    "nutch.conf.uuid": "1c71cd51-b19c-4963-980e-6eb688c54b46",
    "mapred.reduce.tasks.speculative.execution": false,
    "mapred.map.tasks.speculative.execution": false,
    "mapred.compress.map.output": true,
    "mapred.reduce.tasks": 2,
    "fetcher.timelimit.mins": 180,
    "mapred.skip.attempts.to.start.skipping": 2,
    "mapred.skip.map.max.skip.records": 1,
    ...
  }
}
Response: 1c71cd51-b19c-4963-980e-6eb688c54b46

Step 3. Inject Seed List
POST /job/create
Request Body:
{
  "args": {
    "seedDir": "/tmp/1428399198700-0"
  },
  "confId": "1c71cd51-b19c-4963-980e-6eb688c54b46",
  "crawlId": "sample-crawl-01",
  "type": "INJECT"
}
Response: sample-crawl-01-default-INJECT-16610159

Step 4. Check Job Status
GET /job/sample-crawl-01-default-INJECT-16610159
Response:
{
  "args": {
    "seedDir": "/tmp/1428399198700-0"
  },
  "confId": "default",
  "crawlId": "sample-crawl-01",
  "id": "sample-crawl-01-default-INJECT-16610159",
  "msg": "OK",
  "result": {
    "jobs": {
      "[sample-crawl-01]inject /tmp/1428399198700-0-job_local628542550_0001": {
        "counters": {
          "File Input Format Counters ": {
            "BYTES_READ": 117
          },
          "File Output Format Counters ": {
            "BYTES_WRITTEN": 0
          },
          "FileSystemCounters": {
            "FILE_BYTES_READ": 608229,
            "FILE_BYTES_WRITTEN": 692685
          },
          "Map-Reduce Framework": {
            "COMMITTED_HEAP_BYTES": 95944704,
            "CPU_MILLISECONDS": 0,
            "MAP_INPUT_RECORDS": 1,
            "MAP_OUTPUT_RECORDS": 1,
            "PHYSICAL_MEMORY_BYTES": 0,
            "SPILLED_RECORDS": 0,
            "SPLIT_RAW_BYTES": 118,
            "VIRTUAL_MEMORY_BYTES": 0
          },
          "injector": {
            "urls_injected": 1
          }
        },
        "jobID": {
          "id": 1,
          "jtIdentifier": "local628542550"
        },
        "jobName": "[sample-crawl-01]inject /tmp/1428399198700-0"
      }
    }
  },
  "state": "FINISHED",
  "type": "INJECT"
}
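With the wait_for_job helper sketched in the API table above, a client would block here until the inject job reports FINISHED before generating the first batch:

# Sketch: wait for the inject job before starting round 1.
status = wait_for_job(job_id)
assert status["state"] == "FINISHED", status.get("msg")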

Step 5. Run Generate Job
POST /job/create
Request Body:
{
  "args": {
    "normalize": false,
    "filter": true,
    "crawlId": "sample-crawl-01",
    "curTime": 1428526896161,      // current time; should be regenerated on each round, and after the inject phase on the first round
    "batch": "1428526896161-4430"  // round id (time + random int)
  },
  "confId": "1c71cd51-b19c-4963-980e-6eb688c54b46",
  "crawlId": "sample-crawl-01",
  "type": "GENERATE"
}
Response: sample-crawl-01-default-INJECT-679791135

Step 6. Run Fetch Job to fetch content
POST /job/create
Request Body:
{
  "args": {
    "threads": 50,
    "crawlId": "sample-crawl-01",
    "batch": "1428526896161-4430"
  },
  "confId": "1c71cd51-b19c-4963-980e-6eb688c54b46",
  "crawlId": "sample-crawl-01",
  "type": "FETCH"
}
Response: sample-crawl-01-fetch-FETCH-326084837

Step 7. Run Parse Job to parse downloaded content
POST /job/create
Request Body:
{
  "args": {
    "crawlId": "sample-crawl-01",
    "batch": "1428526896161-4430"
  },
  "confId": "1c71cd51-b19c-4963-980e-6eb688c54b46",
  "crawlId": "sample-crawl-01",
  "type": "PARSE"
}
Response: sample-crawl-01-parse-PARSE-1159653222

Step 8. Run DB Update Job
POST /job/create
Request Body:
{
  "args": {
    "crawlId": "sample-crawl-01",
    "batch": "1428526896161-4430"
  },
  "confId": "1c71cd51-b19c-4963-980e-6eb688c54b46",
  "crawlId": "sample-crawl-01",
  "type": "UPDATEDB"
}
Response: sample-crawl-01-parse-UPDATEDB-610630639

Step 9. Run Index Job (optional step; additional indexer configuration needs to be applied)
POST /job/create
Request Body:
{
  "args": {
    "crawlId": "sample-crawl-01",
    "batch": "1428496122-4430"
  },
  "confId": "1c71cd51-b19c-4963-980e-6eb688c54b46",
  "crawlId": "sample-crawl-01",
  "type": "INDEX"
}
Response: sample-crawl-01-parse-INDEX-B-610630639
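Putting steps 5-9 together, the round loop mentioned earlier might be sketched as follows (run_job, the batch-id convention from step 5, and the helpers above are assumptions layered on the documented endpoints):

# Sketch: N rounds of GENERATE -> FETCH -> PARSE -> UPDATEDB -> INDEX,
# giving a crawl of depth N. Batch ids follow the time+randomInt
# convention shown in step 5.
import random
import time

def run_job(job_type, args, conf_id, crawl_id="sample-crawl-01"):
    resp = requests.post(f"{NUTCH}/job/create", json={
        "args": args,
        "confId": conf_id,
        "crawlId": crawl_id,
        "type": job_type,
    })
    resp.raise_for_status()
    return wait_for_job(resp.text.strip())

depth = 3  # number of rounds, i.e. the desired link depth
for _ in range(depth):
    cur_time = int(time.time() * 1000)
    batch = f"{cur_time}-{random.randint(0, 9999)}"
    common = {"crawlId": "sample-crawl-01", "batch": batch}
    run_job("GENERATE", {**common, "normalize": False, "filter": True,
                         "curTime": cur_time}, conf_id)
    run_job("FETCH", {**common, "threads": 50}, conf_id)
    run_job("PARSE", dict(common), conf_id)
    run_job("UPDATEDB", dict(common), conf_id)
    run_job("INDEX", dict(common), conf_id)  # optional; needs indexer config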
