A simple, easy-to-use scraper for pulling data from the WordPress JSON API
- Supports storing crawled documents as MongoDB documents or JSON files
- Automatic retry on errors
- Python 3.7+
    pip install -r requirements.txt

Just run crawl.py with the site's URL supplied:

    python3 crawl.py https://your.website.here

This will crawl the site using DefaultCrawlSession, which attempts to crawl all posts, categories, and tags from the site.
The crawled JSON files will be stored in the directory ./data/<domain-name>.
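
To inspect the output after a crawl, the sketch below maps a site URL to its data directory and lists the JSON files in it. It is not part of this project: the helper names `output_dir_for` and `list_crawled_files` are made up for illustration, and it assumes that `<domain-name>` is the host name of the URL (without port). The real layout inside the directory may differ, so check what crawl.py actually writes.

```python
from pathlib import Path
from urllib.parse import urlparse


def output_dir_for(site_url: str, base: str = "data") -> Path:
    """Return the directory where crawl output for site_url is expected.

    Assumption: <domain-name> is the URL's host name, without port.
    """
    host = urlparse(site_url).hostname
    if host is None:
        raise ValueError(f"not an absolute URL: {site_url!r}")
    return Path(base) / host


def list_crawled_files(site_url: str, base: str = "data"):
    """Return a sorted list of all .json files under the site's directory."""
    root = output_dir_for(site_url, base)
    if not root.is_dir():
        # Nothing has been crawled for this site yet.
        return []
    return sorted(root.rglob("*.json"))


if __name__ == "__main__":
    for path in list_crawled_files("https://your.website.here"):
        print(path)
```

Run it from the repository root so that the relative `data` directory resolves to the same place crawl.py writes to.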
Most of the time, the default session is enough for sites where:
- no sign-in is required
- the JSON API paths are not blocked
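
Before starting a crawl, it can help to check whether a site meets both conditions. The sketch below is not part of this project; it uses only the standard library, and the function names `posts_endpoint` and `json_api_is_open` are hypothetical. It assumes the site serves the WordPress REST API at the standard `/wp-json/wp/v2/` path and treats any error or non-list response as "not open".

```python
import json
from urllib.error import URLError
from urllib.parse import urljoin
from urllib.request import Request, urlopen


def posts_endpoint(site_url: str) -> str:
    """Build the WP REST API v2 posts URL for a site, asking for one post."""
    return urljoin(site_url.rstrip("/") + "/", "wp-json/wp/v2/posts?per_page=1")


def json_api_is_open(site_url: str, timeout: float = 10.0) -> bool:
    """Return True if the posts endpoint answers with a JSON list, no sign-in."""
    try:
        request = Request(
            posts_endpoint(site_url),
            headers={"User-Agent": "wp-api-check"},  # arbitrary illustrative value
        )
        with urlopen(request, timeout=timeout) as response:
            payload = json.loads(response.read().decode("utf-8"))
    except (URLError, ValueError, OSError):
        # Covers HTTP errors (401, 403, 404), network failures, bad URLs,
        # and bodies that are not valid JSON.
        return False
    # A public posts collection is returned as a JSON array.
    return isinstance(payload, list)


if __name__ == "__main__":
    print(json_api_is_open("https://your.website.here"))
```

A `False` result does not say why the check failed. A blocked path, a login requirement, and a network problem all look the same here, so inspect the response by hand when you need the cause.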
For advanced usage and customization, look at wpscraper/session.py for the actual crawling procedures, then write your own CrawlSession.
- Rewrite/Refactor
- MongoDB Connector
- Async session
- Authentication Module
- Cloudflare circumvention
- Configurable retry policies
- Full WPv2 API resources support