CrawlerFlow

A web-crawler orchestration framework that lets you create datasets from multiple web sources using YAML configurations.

Features

  • [*] Write spiders as YAML configs.
  • [*] Create extractors to scrape data using YAML configs (HTML, API, RSS).
  • [*] Define multiple extractors per spider.
  • [*] Use standard extractors to scrape common page data such as tables, paragraphs, meta tags, and JSON-LD.
  • Traverse between multiple websites.
  • Write Python extractors for advanced extraction strategies (see the sketch below).
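
The Python extractor interface is not documented in this README, so the following is only an illustrative sketch of the general shape an advanced extraction strategy could take: a plain callable that turns raw HTML into a dict of fields. The function name and the use of lxml are assumptions, not part of crawlerflow's API.

# Illustrative only: crawlerflow's Python extractor interface is not shown
# in this README; this sketch just demonstrates one plausible shape --
# a callable that maps raw HTML to a dict of extracted fields.
from lxml import html  # assumption: lxml is available

def blog_detail_extractor(raw_html):
    """Hypothetical extractor: pull the page title and paragraph texts."""
    tree = html.fromstring(raw_html)
    return {
        "title": tree.findtext(".//title"),
        "paragraphs": [p.text_content().strip() for p in tree.findall(".//p")],
    }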

Installation

pip install git+https://github.com/invana/crawlerflow#egg=crawlerflow

Usage

Scraping with CrawlerFlow

from crawlerflow.runner import Crawlerflow
from crawlerflow.utils import yaml_to_json

# Load the crawl requests, spider, and extractor definitions from YAML.
crawl_requests = yaml_to_json(open("example-configs/crawlerflow/requests/github-detail-urls.yml"))
spider_config = yaml_to_json(open("example-configs/crawlerflow/spiders/default-spider.yml"))
github_default_extractor = yaml_to_json(open("example-configs/crawlerflow/extractors/github-blog-detail.yml"))

# Register the spider with its config and default extractor, then run.
flow = Crawlerflow()
flow.add_spider_with_config(crawl_requests, spider_config, default_extractor=github_default_extractor)
flow.start()
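
yaml_to_json above is crawlerflow's own helper, shown being passed an open file object. Its exact behavior is not documented in this README; a minimal stand-in, assuming PyYAML, would simply parse a YAML file into a JSON-compatible Python dict:

# A minimal sketch of a yaml_to_json-style helper, assuming PyYAML;
# the real crawlerflow.utils.yaml_to_json may differ in signature and behavior.
import yaml

def yaml_file_to_dict(path):
    # safe_load parses YAML into plain dicts/lists/scalars (JSON-compatible).
    with open(path) as f:
        return yaml.safe_load(f)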

Scraping with WebCrawler

from crawlerflow.runner import WebCrawler
from crawlerflow.utils import yaml_to_json

# One YAML config per spider; API and HTML spiders can be mixed freely.
scraper_config_files = [
    "example-configs/webcrawler/APISpiders/api-publicapis-org.yml",
    "example-configs/webcrawler/HTMLSpiders/github-blog-list.yml",
    "example-configs/webcrawler/HTMLSpiders/github-blog-detail.yml"
]

crawler = WebCrawler()

# Register each spider from its config, then start the crawl.
for scraper_config_file in scraper_config_files:
    scraper_config = yaml_to_json(open(scraper_config_file))
    crawler.add_spider_with_config(scraper_config)
crawler.start()
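
Rather than listing config files by hand, you could also collect every YAML config under a directory. A small convenience sketch using pathlib, assuming the example-configs/ layout shown above:

from pathlib import Path

# Convenience sketch: gather every .yml spider config under
# example-configs/webcrawler/ instead of enumerating them by hand.
scraper_config_files = [str(p) for p in sorted(Path("example-configs/webcrawler").rglob("*.yml"))]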

Refer to the example-configs/ folder for more example configs.

Available Extractors

  • [*] HTMLExtractor
  • [*] MetaTagExtractor
  • [*] JSONLDExtractor
  • [*] TableContentExtractor
  • [*] IconsExtractor
