Top 23 Python web-scraping Projects

Scrapy

1 190 58,941 9.5 Python

Scrapy, a fast high-level web crawling & scraping framework for Python.

Project mention: How I Block All 26M of Your Curl Requests | news.ycombinator.com | 2025-10-02

From what I have seen, it is hard to tell what "serious scrapers" use. They use many things; some use this, some do not. This is what I have learned from reading the webscraping subreddit. Nobody says things like that out loud.
There are many tools; see the links below.
Personally, I think running Selenium can be a bottleneck: it does not play nice, processes sometimes break, the system sometimes even requires a restart because things get blocked, it can be a memory hog, and so on. That is my experience.
To be able to scale, I think you have to have your own implementation. Serious scrapers complain that people using Selenium or its derivatives are noobs who will come back asking why page X does not work with their scraping mechanisms.
https://github.com/lexiforest/curl_cffi
https://github.com/encode/httpx
https://github.com/scrapy/scrapy
https://github.com/apify/crawlee
changedetection.io

2 199 28,455 9.7 Python

Best and simplest tool for website change detection, web page monitoring, and website change alerts. Perfect for tracking content changes, price drops, restock alerts, and website defacement monitoring—all for free or enjoy our SaaS plan!

Project mention: rostral.io VS changedetection.io - a user suggested alternative | libhunt.com/r/rostral.io | 2025-08-05

Rostral.io is an open-source tool for document monitoring with a focus on semantic analysis. Unlike traditional change detectors that track text differences, it processes PDFs, HTML, and JSON using local LLMs (like Deepseek) to identify meaningful modifications in legal texts, contracts, or regulations. The system uses YAML templates to define monitoring rules and can integrate with analysis tools. Designed for researchers and analysts who need to track substantive changes rather than just surface-level edits. Self-hosted for data-sensitive workflows.
Scrapegraph-ai

3 12 21,751 9.2 Python

Python scraper based on AI

Project mention: ScrapeGraphAI Release Week | news.ycombinator.com | 2025-07-07
Douyin_TikTok_Download_API

4 3 14,850 7.2 Python

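🚀 "Douyin_TikTok_Download_API" is an out-of-the-box, high-performance asynchronous data scraping tool for Douyin, Kuaishou, TikTok, and Bilibili, supporting API calls as well as online batch parsing and downloading.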
SeleniumBase

5 22 11,864 9.8 Python

Python APIs for web automation, testing, and bypassing bot-detection with ease.

Project mention: Scraping German Rental Price Data – Part I: Whole Lotta Captchas | news.ycombinator.com | 2025-07-29

Not yet! But it's on my list to try out next after giving SeleniumBase[1] a chance.
[1] https://github.com/seleniumbase/SeleniumBase
Scrapling

6 6 8,110 9.8 Python

🕷️ An undetectable, powerful, flexible, high-performance Python library to make Web Scraping Easy and Effortless as it should be!

Project mention: Agentic Coding Slot Machines – Did We Just Summon a Genie Addiction? – Part 1 | news.ycombinator.com | 2025-07-03
crawlee-python

7 15 7,145 9.8 Python

Crawlee—A web scraping and browser automation library for Python to build reliable crawlers. Extract data for AI, LLMs, RAG, or GPTs. Download HTML, PDF, JPG, PNG, and other files from websites. Works with BeautifulSoup, Playwright, and raw HTTP. Both headful and headless mode. With proxy rotation.

Project mention: Launching Crawlee for Python v1.0 to simplify building web scrapers and crawlers | news.ycombinator.com | 2025-09-30
autoscraper

8 9 7,022 4.9 Python

A Smart, Automatic, Fast and Lightweight Web Scraper for Python
trafilatura

9 15 4,898 6.8 Python

Python & Command-line tool to gather text and metadata on the Web: Crawling, scraping, extraction, output as CSV, JSON, HTML, MD, TXT, XML

Project mention: Trafilatura: A tool and library to gather text and metadata on the Web | news.ycombinator.com | 2025-05-28
curl_cffi

10 12 4,432 9.0 Python

Python binding for the curl-impersonate fork via cffi. An HTTP client that can impersonate browser TLS/JA3/HTTP2 fingerprints.

Project mention: How I Block All 26M of Your Curl Requests | news.ycombinator.com | 2025-10-02

Skill_Seekers

11 2 3,693 8.4 Python

Convert documentation websites, GitHub repositories, and PDFs into Claude AI skills with automatic conflict detection

Project mention: Turn Docs, Code and PDFs into Claude AI Skills in Minutes | news.ycombinator.com | 2025-11-07
snoop

12 7 3,496 9.0 Python

Snoop is a reconnaissance tool based on open-source data (OSINT world)
Grab

13 0 2,428 9.2 Python

Web Scraping Framework
agentql

14 4 1,005 7.7 Python

AgentQL is a suite of tools for connecting your AI to the web. Featuring a query language and Playwright integrations for interacting with elements and extracting data quickly, precisely, and at scale. Includes REST API, Python and JavaScript SDKs, browser debugger.

Project mention: AgentQL MCP Server: Structured Web Data for Claude, Cursor, Windsurf, and more | dev.to | 2025-03-12

Get the social links from agentql.com
web-scraping

15 43 831 0.0 Python

Detailed web scraping tutorials for dummies with financial data crawlers on Reddit WallStreetBets, CME (both options and futures), US Treasury, CFTC, LME, MacroTrends, SHFE and alternative data crawlers on Tomtom, BBC, Wall Street Journal, Al Jazeera, Reuters, Financial Times, Bloomberg, CNN, Fortune, The Economist
scrapfly-scrapers

16 5 750 9.5 Python

Scalable Python web scraping scripts for 40+ popular domains

Project mention: A Comprehensive Guide to TikTok API | dev.to | 2025-03-20

At Scrapfly, we are dedicated to providing developers with all the resources they need to reach their scraping goals. Check out our comprehensive guide on scraping TikTok as well as our example TikTok scraper using Scrapfly's APIs on GitHub.
google-search-results-python

17 4 709 5.0 Python

Google Search Results via SERP API pip Python Package
scrapy-fake-useragent

18 3 692 2.3 Python

Random User-Agent middleware based on fake-useragent
wayback-machine-scraper

19 6 453 0.0 Python

A command-line utility and Scrapy middleware for scraping time series data from Archive.org's Wayback Machine.
letterboxd_recommendations

20 3 356 8.4 Python

Scraping publicly-accessible Letterboxd data and creating a movie recommendation model with it that can generate recommendations when provided with a Letterboxd username
twitter-scraper-selenium

21 2 337 4.7 Python

Python package to scrape Twitter's front end easily
scrapper

22 1 286 6.5 Python

Web scraper with a simple REST API living in Docker and using a Headless browser and Readability.js for parsing.
facebook_page_scraper

23 1 262 6.8 Python

Scrapes the front end of Facebook pages with no limitations and provides a feature to turn data into structured JSON or CSV

NOTE: The open source projects on this list are ordered by number of GitHub stars. The number of mentions indicates repo mentions in the last 12 months or since we started tracking (Dec 2020).

Python web-scraping related posts

How I Block All 26M of Your Curl Requests

6 projects | news.ycombinator.com | 2 Oct 2025
I Connected 3 MCP Servers to Claude & Built a No-Code Research Agent That Actually Cites Sources

5 projects | dev.to | 25 Sep 2025
Agentic Coding Slot Machines – Did We Just Summon a Genie Addiction? – Part 1

3 projects | news.ycombinator.com | 3 Jul 2025
Trafilatura: A tool and library to gather text and metadata on the Web

1 project | news.ycombinator.com | 28 May 2025
This Week In Python

5 projects | dev.to | 16 May 2025
Creating self-healing spiders with Scrapling in Python without AI (Web Scraping)

2 projects | dev.to | 4 May 2025
Scrapling v0.2.99 – Easy, effortless Web Scraping With Python as it should be

1 project | news.ycombinator.com | 30 Apr 2025

Index

What are some of the best open-source web-scraping projects in Python? This list will help you:

#	Project	Stars
1	Scrapy	58,941
2	changedetection.io	28,455
3	Scrapegraph-ai	21,751
4	Douyin_TikTok_Download_API	14,850
5	SeleniumBase	11,864
6	Scrapling	8,110
7	crawlee-python	7,145
8	autoscraper	7,022
9	trafilatura	4,898
10	curl_cffi	4,432
11	Skill_Seekers	3,693
12	snoop	3,496
13	Grab	2,428
14	agentql	1,005
15	web-scraping	831
16	scrapfly-scrapers	750
17	google-search-results-python	709
18	scrapy-fake-useragent	692
19	wayback-machine-scraper	453
20	letterboxd_recommendations	356
21	twitter-scraper-selenium	337
22	scrapper	286
23	facebook_page_scraper	262
