A BFS web crawler for HTML datasets that stores pages and link relationships in SQLite and exposes a REST API for querying popular pages and incoming links.
## Prerequisites

- Node.js 18+
- npm

## Installation

```
npm install
```

## Crawling

Crawl a dataset by name. The crawler uses BFS with 10 concurrent requests.

```
node crawl.js <datasetName>
```

Available datasets:
| Dataset | Pages | Seed URL |
|---|---|---|
| tinyfruits | 10 | https://people.scs.carleton.ca/~avamckenney/tinyfruits/N-0.html |
| fruits100 | 100 | https://people.scs.carleton.ca/~avamckenney/fruits100/N-0.html |
| fruitsA | ~1000 | https://people.scs.carleton.ca/~avamckenney/fruitsA/N-0.html |
| fruitgraph | 1000 | https://people.scs.carleton.ca/~avamckenney/fruitgraph/N-0.html |
Example:

```
node crawl.js tinyfruits
node crawl.js fruits100
node crawl.js fruitsA
```

Crawl data is stored in `data/crawler.db`. Re-crawling a dataset clears its previous data first.
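The BFS-with-concurrency approach can be sketched as follows. This is a simplified in-memory model, not the actual `crawl.js` (which fetches pages over HTTP, parses their HTML, and writes results to SQLite); `fetchLinks` is a hypothetical stand-in that resolves a page's outgoing links.

```javascript
// BFS over a link graph, processing up to `concurrency` pages in parallel.
async function bfsCrawl(seed, fetchLinks, concurrency = 10) {
  const seen = new Set([seed]);
  let frontier = [seed];
  const order = [];

  while (frontier.length > 0) {
    const next = [];
    // Work through the frontier in batches of `concurrency` parallel fetches.
    for (let i = 0; i < frontier.length; i += concurrency) {
      const batch = frontier.slice(i, i + concurrency);
      const results = await Promise.all(batch.map(fetchLinks));
      for (let j = 0; j < batch.length; j++) {
        order.push(batch[j]);
        for (const link of results[j]) {
          if (!seen.has(link)) {
            seen.add(link);
            next.push(link);
          }
        }
      }
    }
    frontier = next;
  }
  return order; // pages in BFS order
}

// Tiny in-memory "dataset" for demonstration.
const graph = {
  'N-0': ['N-1', 'N-2'],
  'N-1': ['N-3'],
  'N-2': ['N-3'],
  'N-3': [],
};
const fakeFetch = async (url) => graph[url] ?? [];

bfsCrawl('N-0', fakeFetch).then((order) => console.log(order));
// BFS order: N-0, then N-1 and N-2, then N-3 (visited once despite two inbound links)
```

The `seen` set is updated when a URL is first discovered, not when it is fetched, so a page linked from several places is still fetched only once.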
## Running the server

```
SERVER_NAME=yourServerName node src/index.js
```

The server runs on port 3000.
## API

### Server name

Returns the server name. Used by the grading server's INFO test.

```json
{ "name": "yourServerName" }
```

### Popular pages

Returns the top 10 pages by incoming link count for a given dataset.
```json
{
  "result": [
    { "url": "http://yourHost:3000/pages/1", "origUrl": "https://..." },
    { "url": "http://yourHost:3000/pages/2", "origUrl": "https://..." }
  ]
}
```

### Page details

Returns details for a specific crawled page, including its incoming links.
```json
{
  "webUrl": "https://...",
  "incomingLinks": ["https://...", "https://..."]
}
```

## Deployment

- Create an instance using the `COMP4601A-W26.2026-01-08` snapshot
- Add the `ping-ssh-egress` and `web3000` security groups
- Assign a floating IP
- SSH into the instance, clone this repo, and run:

```
npm install
node crawl.js tinyfruits
node crawl.js fruits100
node crawl.js fruitsA
SERVER_NAME=yourServerName node src/index.js
```

Replace `yourServerName` with the name you receive after registering (see the registration step below).
## Registering with the grading server

Register your server with the grading server:

```
curl -X POST http://134.117.26.91:3000/servers \
  -H "Content-Type: application/json" \
  -d '{
    "serverAddress": "http://yourFloatingIP:3000",
    "members": ["studentId1", "studentId2"],
    "key": "yourServerKey"
  }'
```

You will receive a server name in the response. Use this as your `SERVER_NAME`.
Then tell the grading server which tests your server is running:

```
curl -X PUT http://134.117.26.91:3000/servers/yourServerName \
  -H "Content-Type: application/json" \
  -d '{
    "key": "yourServerKey",
    "running": { "INFO": true, "L3": true }
  }'
```

Visit http://134.117.26.91:3000/results to see your test status.
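As background, the ranking behind the popular-pages endpoint described earlier amounts to counting incoming links per page and taking the top 10. A minimal in-memory sketch of that computation (illustrative only — the real server would query its SQLite link table instead):

```javascript
// Rank pages by incoming-link count.
// `links` is an array of [fromUrl, toUrl] pairs standing in for the stored link table.
function topPages(links, limit = 10) {
  const incoming = new Map();
  for (const [, to] of links) {
    incoming.set(to, (incoming.get(to) ?? 0) + 1);
  }
  return [...incoming.entries()]
    .sort((a, b) => b[1] - a[1]) // most incoming links first
    .slice(0, limit)
    .map(([url, count]) => ({ url, count }));
}

const links = [
  ['A', 'B'], ['A', 'C'], ['B', 'C'], ['D', 'C'], ['C', 'B'],
];
console.log(topPages(links, 2));
// C has 3 incoming links, B has 2
```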