This repository was archived by the owner on Nov 15, 2022. It is now read-only.
mlalma/NanoCrawler
NanoCrawler is a lightweight web crawler library written in Java and based on crawler4j 3.5 by Yasser Ganjisaffar (see https://code.google.com/p/crawler4j/). The code has been refactored, some small bugs have been fixed, and some new features have been added, such as support for URL link prioritization. Web pages are now parsed using the Validator.nu HTML parser (http://about.validator.nu/htmlparser/).

NanoCrawler-Simple-TestApp is a simple test application, based on the corresponding crawler4j example app, that shows NanoCrawler in action.

NanoCrawler-RSSFeedFinder-App is a slightly more complex example that searches for RSS feeds on a given domain. To run this application, you need to compile and include the following CFTA libraries (available at https://github.com/mlalma/CFTA):

- yarfraw (RSS feed parser)
- CFTA-Lib-RSSFeedFetcher (API for RSS feed parsing)
- CFTA-Lib-WebFetcher (required by the RSS feed parser libs)
- CFTA-Lib-FrontEnd-DataStructures and CFTA-Lib-Util (utility methods)
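
As a rough illustration of what the URL link prioritization mentioned above can look like, here is a minimal sketch of a priority-ordered URL queue. This is not NanoCrawler's actual API: the class name `UrlFrontier`, its methods, and the convention that lower numbers are crawled first are all assumptions made for this example.

```java
import java.util.Comparator;
import java.util.PriorityQueue;

/**
 * Hypothetical sketch of a prioritized URL queue.
 * Names and conventions here are illustrative, not taken from NanoCrawler.
 */
final class UrlFrontier {

    /** A URL paired with its priority and insertion order. */
    private static final class Entry {
        final String url;
        final int priority; // assumption: lower value means "crawl sooner"
        final long seq;     // insertion counter, used to break ties

        Entry(String url, int priority, long seq) {
            this.url = url;
            this.priority = priority;
            this.seq = seq;
        }
    }

    // Order by priority first, then by insertion order so equal-priority
    // URLs come out in the order they were added.
    private final PriorityQueue<Entry> queue = new PriorityQueue<>(
            Comparator.comparingInt((Entry e) -> e.priority)
                      .thenComparingLong(e -> e.seq));

    private long counter = 0;

    /** Schedules a URL with the given priority. */
    void add(String url, int priority) {
        queue.add(new Entry(url, priority, counter++));
    }

    /** Returns the next URL to crawl, or null if nothing is scheduled. */
    String next() {
        Entry e = queue.poll();
        return e == null ? null : e.url;
    }

    boolean isEmpty() {
        return queue.isEmpty();
    }
}
```

In a setup like this, a crawler looking for RSS feeds could give links that look like feed URLs a lower number than ordinary page links, so they are fetched before the rest of the site.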