A Powerful Spider (Web Crawler) System in Python. TRY IT NOW!
- Write scripts in Python with a powerful API
- Python 2 & 3 support
- Powerful WebUI with script editor, task monitor, project manager and result viewer
- JavaScript pages supported
- MySQL, MongoDB, SQLite and PostgreSQL as database backends
- Task priority, retry, periodical crawl, recrawl by age and more (see the sketch below)
- Distributed architecture
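
Most of these behaviours are controlled through keyword arguments to self.crawl. The handler below is a minimal sketch, not part of the official sample: the handler name, URL and option values are illustrative assumptions, and fetch_type='js' assumes a JavaScript fetcher (e.g. PhantomJS) is available.

```python
from pyspider.libs.base_handler import *


class CrawlOptionsHandler(BaseHandler):
    # Project-wide defaults merged into every self.crawl() call.
    crawl_config = {
        'retries': 3,   # retry a failed fetch up to 3 times
    }

    @every(minutes=60)  # schedule on_start once an hour
    def on_start(self):
        self.crawl(
            'http://example.com/',
            callback=self.index_page,
            priority=2,                 # higher-priority tasks are scheduled first
            age=10 * 24 * 60 * 60,      # treat the page as fresh for 10 days
            auto_recrawl=True,          # re-enqueue the task when the age expires
            fetch_type='js',            # render the page with the JavaScript fetcher
        )

    def index_page(self, response):
        return {"url": response.url, "title": response.doc('title').text()}
```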
Documentation: http://docs.pyspider.org/
Tutorial: http://docs.pyspider.org/en/latest/tutorial/
Sample Code:

```python
from pyspider.libs.base_handler import *


class Handler(BaseHandler):
    crawl_config = {
    }

    # Run on_start once a day.
    @every(minutes=24 * 60)
    def on_start(self):
        self.crawl('http://scrapy.org/', callback=self.index_page)

    # Consider an index page fresh for 10 days before recrawling it.
    @config(age=10 * 24 * 60 * 60)
    def index_page(self, response):
        # Follow every absolute link found on the page.
        for each in response.doc('a[href^="http"]').items():
            self.crawl(each.attr.href, callback=self.detail_page)

    def detail_page(self, response):
        # The returned dict is captured as the result for this page.
        return {
            "url": response.url,
            "title": response.doc('title').text(),
        }
```
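
Each dict returned from detail_page is handed to pyspider's result worker and shows up in the WebUI's result viewer (and in the configured result database). For custom post-processing, BaseHandler exposes an on_result hook; the snippet below is a hedged sketch of overriding it, not code from this README:

```python
from pyspider.libs.base_handler import *


class Handler(BaseHandler):
    # ... crawl callbacks as in the sample above ...

    def on_result(self, result):
        # Called with the return value of every callback;
        # result is None when a callback returned nothing.
        if not result:
            return
        # Custom post-processing could go here (e.g. pushing the result to an
        # external store) before keeping the default behaviour of forwarding
        # it to the result worker:
        super(Handler, self).on_result(result)
```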
Installation:

- pip install pyspider
- run the command pyspider and visit http://localhost:5000/
Quickstart: http://docs.pyspider.org/en/latest/Quickstart/
Contribute:

- Use it
- Open issues, send pull requests
- Join the user group
TODO:

- local mode, load script from file
- works as a framework (all components running in one process, no threads)
- Redis
- shell mode like scrapy shell
- a visual scraping interface like portia
Licensed under the Apache License, Version 2.0