Simple phpBB forum thread web scraper written in Python.
Designed for command-line usage. Outputs data in CSV format to stdout.
This is an experiment-driven project. The code aims to follow PEP 8 but is not fully idiomatic. The current implementation is ad hoc and tailored to one particular scenario; however, extending it to cover additional behavior and features should be straightforward.
The fields scraped for each post in the thread are, in order: post ID, post name, post date, and post body.
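As a rough illustration of that output shape, the sketch below writes posts as CSV rows in the field order listed above, using only the standard library. The column labels (`id`, `name`, `date`, `body`), the presence of a header row, the `write_posts` helper, and the sample posts are all assumptions made for this example; they are not taken from the bbscraper source.

```python
import csv
import io
import sys

# Column order follows the field list above; the labels themselves are assumed.
FIELDS = ["id", "name", "date", "body"]


def write_posts(posts, out=sys.stdout):
    """Write an iterable of post dicts as CSV rows to a file-like object."""
    writer = csv.DictWriter(out, fieldnames=FIELDS, lineterminator="\n")
    writer.writeheader()  # assumed: the real tool may not emit a header row
    for post in posts:
        writer.writerow(post)


# Hypothetical sample data, not scraped from a real forum.
sample_posts = [
    {"id": "101", "name": "alice", "date": "2015-03-01", "body": "Hello, world"},
    {"id": "102", "name": "bob", "date": "2015-03-02", "body": 'He said "hi"'},
]

# Write into an in-memory buffer so the result can be inspected.
buffer = io.StringIO()
write_posts(sample_posts, buffer)
csv_text = buffer.getvalue()
```

Calling `write_posts(sample_posts)` without a second argument prints the same rows to stdout. The `csv` module quotes fields that contain commas or double quotes, which matters for free-form post bodies.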
Uses urllib3 for HTTP networking and BeautifulSoup for HTML parsing.
This package is not available via pip.
You must download or clone this repository in order to use it.
- Python 3 or later
- pip (optional)
Clone this repository:
git clone https://github.com/h2non/bbscraper.git && cd bbscraper
Install dependencies via pip:
sudo pip install -r requirements.txt
Or alternatively using setup.py:
python setup.py install
usage: __main__.py [-h] -u URL [-f FORMAT] [-l LIMIT]
Scrape all thread posts of a phpBB based forum
optional arguments:
-h, --help show this help message and exit
-u URL, --url URL Full URL to forum thread
-f FORMAT, --format FORMAT
Output format (default to CSV)
Report any issues to https://github.com/h2non/bbscraper/issues
Scrape the website and save the data in forum.csv:
python bbscraper -u "http://www.oldclassiccar.co.uk/forum/phpbb/phpBB2/viewtopic.php?t=12591" > forum.csv
Run tests:
make test
MIT - Tomas Aparicio