This repository was archived by the owner on Nov 15, 2022. It is now read-only.

mlalma/NanoCrawler

NanoCrawler is a lightweight web crawler library written in Java
and based on crawler4j 3.5 by Yasser Ganjisaffar (see
https://code.google.com/p/crawler4j/).

The code has been refactored, some small bugs have been fixed,
and new features have been added, such as support for URL link
prioritization. Web pages are now parsed using the Validator.nu
HTML parser (http://about.validator.nu/htmlparser/).
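
To illustrate what URL link prioritization means in a crawler, here is
a minimal sketch of a prioritized URL frontier built on
java.util.PriorityQueue. It is not NanoCrawler's actual API: the class
name PrioritizedFrontier, its methods, and the convention that lower
numbers are crawled first are all assumptions made for this example.

```java
import java.util.Comparator;
import java.util.PriorityQueue;

// Hypothetical sketch of a prioritized crawl frontier; not NanoCrawler's API.
final class PrioritizedFrontier {

    // A queued URL with its priority and an insertion counter for tie-breaking.
    private static final class Entry {
        final String url;
        final int priority; // assumption: lower value = crawled earlier
        final long seq;     // insertion order, keeps equal priorities FIFO

        Entry(String url, int priority, long seq) {
            this.url = url;
            this.priority = priority;
            this.seq = seq;
        }
    }

    // Order by priority first, then by insertion order.
    private final PriorityQueue<Entry> queue = new PriorityQueue<>(
            Comparator.comparingInt((Entry e) -> e.priority)
                      .thenComparingLong(e -> e.seq));

    private long counter = 0;

    // Schedule a URL with the given priority.
    void add(String url, int priority) {
        queue.add(new Entry(url, priority, counter++));
    }

    // Return the next URL to crawl, or null if the frontier is empty.
    String next() {
        Entry e = queue.poll();
        return e == null ? null : e.url;
    }

    boolean isEmpty() {
        return queue.isEmpty();
    }
}
```

A crawler loop would call next() until isEmpty() returns true, adding
newly discovered links with whatever priority the application assigns
to them.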

NanoCrawler-Simple-TestApp is a simple test application, based
on the corresponding crawler4j example app, that shows NanoCrawler
in action.

NanoCrawler-RSSFeedFinder-App is a slightly more complex example
that searches for RSS feeds on a given domain. To run the application,
you need to compile and include the following CFTA libraries
(available at https://github.com/mlalma/CFTA):
- yarfraw (RSS feed parser)
- CFTA-Lib-RSSFeedFetcher (API for RSS feed parsing)
- CFTA-Lib-WebFetcher (Required by RSS feed parser libs)
- CFTA-Lib-FrontEnd-DataStructures and CFTA-Lib-Util (Utility methods) 
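
As a rough illustration of what searching for RSS feeds on a crawled
page can involve, the sketch below pulls feed URLs out of
<link type="application/rss+xml" href="..."> tags in an HTML string.
It is not the code of NanoCrawler-RSSFeedFinder-App and does not use
the CFTA libraries: the class FeedLinkFinder and its method are
hypothetical, and the regular expressions are a simplification of real
HTML parsing.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical sketch of RSS feed autodiscovery; not the RSSFeedFinder app's code.
final class FeedLinkFinder {

    // Matches a whole <link ...> tag (simplified; assumes no '>' inside attributes).
    private static final Pattern LINK_TAG =
            Pattern.compile("<link\\b[^>]*>", Pattern.CASE_INSENSITIVE);

    // Matches a type attribute that declares an RSS feed.
    private static final Pattern RSS_TYPE =
            Pattern.compile("type\\s*=\\s*[\"']application/rss\\+xml[\"']",
                    Pattern.CASE_INSENSITIVE);

    // Captures the quoted value of the href attribute.
    private static final Pattern HREF =
            Pattern.compile("href\\s*=\\s*[\"']([^\"']+)[\"']",
                    Pattern.CASE_INSENSITIVE);

    // Returns the href values of all <link> tags that declare an RSS feed.
    static List<String> findFeedLinks(String html) {
        List<String> feeds = new ArrayList<>();
        Matcher tags = LINK_TAG.matcher(html);
        while (tags.find()) {
            String tag = tags.group();
            if (!RSS_TYPE.matcher(tag).find()) {
                continue; // not an RSS link, e.g. a stylesheet
            }
            Matcher href = HREF.matcher(tag);
            if (href.find()) {
                feeds.add(href.group(1));
            }
        }
        return feeds;
    }
}
```

The returned values may be relative URLs, so a caller would still need
to resolve them against the page address before fetching the feed.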
