Top 23 Python web-scraping Projects
-
From what I have seen, it is hard to tell what "serious scrapers" use. They use many things; some use this, some do not. This is what I have learned from reading the webscraping subreddit. Nobody says these things out loud.
There are many tools; see the links below.
Personally, I think running Selenium can be a bottleneck: it does not play nicely, processes sometimes break, the system occasionally needs a restart because something is blocked, and it can be a memory hog. That is my experience.
To be able to scale, I think you need your own implementation. Serious scrapers complain that people using Selenium or its derivatives are noobs who come back asking why page X does not work with their scraping setup.
https://github.com/lexiforest/curl_cffi
https://github.com/encode/httpx
https://github.com/scrapy/scrapy
https://github.com/apify/crawlee
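As a rough illustration of the "own implementation" approach the comment describes, here is a minimal sketch that fetches a page over plain HTTP and pulls out links without starting a browser. It uses only the Python standard library; the names `LinkExtractor`, `extract_links`, and `fetch_html`, and the `example-scraper/0.1` User-Agent string, are hypothetical and not taken from any of the linked projects. In practice one of the linked clients (httpx, curl_cffi) could stand in for `urllib`.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin
from urllib.request import Request, urlopen


class LinkExtractor(HTMLParser):
    """Collects absolute URLs from <a href="..."> tags (hypothetical helper)."""

    def __init__(self, base_url: str) -> None:
        super().__init__()
        self.base_url = base_url
        self.links: list[str] = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        for name, value in attrs:
            if name == "href" and value:
                # Resolve relative links against the page URL.
                self.links.append(urljoin(self.base_url, value))


def extract_links(html: str, base_url: str) -> list[str]:
    """Parse an HTML string and return the links it contains, in order."""
    parser = LinkExtractor(base_url)
    parser.feed(html)
    return parser.links


def fetch_html(url: str, timeout: float = 10.0) -> str:
    """Download a page with a plain HTTP request (no JavaScript execution)."""
    request = Request(url, headers={"User-Agent": "example-scraper/0.1"})
    with urlopen(request, timeout=timeout) as response:
        charset = response.headers.get_content_charset() or "utf-8"
        return response.read().decode(charset, errors="replace")
```

A call such as `extract_links(fetch_html(url), url)` would tie the two together. This only works for pages whose content is present in the initial HTML; pages rendered by JavaScript are the case where browser automation is usually still needed.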
-
changedetection.io
Best and simplest tool for website change detection, web page monitoring, and website change alerts. Perfect for tracking content changes, price drops, restock alerts, and website defacement monitoring—all for free or enjoy our SaaS plan!
Project mention: rostral.io VS changedetection.io - a user suggested alternative | libhunt.com/r/rostral.io | 2025-08-05
Rostral.io is an open-source tool for document monitoring with a focus on semantic analysis. Unlike traditional change detectors that track text differences, it processes PDFs, HTML, and JSON using local LLMs (like Deepseek) to identify meaningful modifications in legal texts, contracts, or regulations. The system uses YAML templates to define monitoring rules and can integrate with analysis tools. Designed for researchers and analysts who need to track substantive changes rather than just surface-level edits. Self-hosted for data-sensitive workflows.
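For readers unfamiliar with what "tracking text differences" looks like, the sketch below shows one simple way to do it: fingerprint a normalized snapshot of the page text and, when the fingerprint changes, list the added and removed lines. This is an illustration only, not how changedetection.io or Rostral.io are implemented; the function names `normalize`, `fingerprint`, and `changed_lines` are hypothetical.

```python
import difflib
import hashlib


def normalize(text: str) -> list[str]:
    """Strip surrounding whitespace from each line and drop empty lines."""
    return [line.strip() for line in text.splitlines() if line.strip()]


def fingerprint(text: str) -> str:
    """Hash the normalized text so two snapshots can be compared cheaply."""
    joined = "\n".join(normalize(text))
    return hashlib.sha256(joined.encode("utf-8")).hexdigest()


def changed_lines(old: str, new: str) -> list[str]:
    """Return removed lines prefixed with '- ' and added lines with '+ '."""
    diff = difflib.ndiff(normalize(old), normalize(new))
    # ndiff also emits unchanged ("  ") and hint ("? ") lines; skip those.
    return [line for line in diff if line.startswith(("+ ", "- "))]


if __name__ == "__main__":
    before = "Price: 10\nIn stock\n"
    after = "Price: 9\nIn stock\n"
    if fingerprint(before) != fingerprint(after):
        for line in changed_lines(before, after):
            print(line)
```

Normalizing before hashing keeps whitespace-only edits from triggering an alert, which is the kind of surface-level noise the mention above contrasts with substantive changes.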
-
-
Douyin_TikTok_Download_API
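🚀 "Douyin_TikTok_Download_API" is an out-of-the-box, high-performance asynchronous data scraping tool for Douyin, Kuaishou, TikTok, and Bilibili. It supports API calls as well as online batch parsing and downloading.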
-
Project mention: Scraping German Rental Price Data – Part I: Whole Lotta Captchas | news.ycombinator.com | 2025-07-29
Not yet! But it's on my list to try out next after giving SeleniumBase[1] a chance.
[1] https://github.com/seleniumbase/SeleniumBase
-
Scrapling
🕷️ An undetectable, powerful, flexible, high-performance Python library to make Web Scraping Easy and Effortless as it should be!
Project mention: Agentic Coding Slot Machines – Did We Just Summon a Genie Addiction? – Part 1 | news.ycombinator.com | 2025-07-03
-
crawlee-python
Crawlee—A web scraping and browser automation library for Python to build reliable crawlers. Extract data for AI, LLMs, RAG, or GPTs. Download HTML, PDF, JPG, PNG, and other files from websites. Works with BeautifulSoup, Playwright, and raw HTTP. Both headful and headless mode. With proxy rotation.
Project mention: Launching Crawlee for Python v1.0 to simplify building web scrapers and crawlers | news.ycombinator.com | 2025-09-30
-
-
trafilatura
Python & Command-line tool to gather text and metadata on the Web: Crawling, scraping, extraction, output as CSV, JSON, HTML, MD, TXT, XML
Project mention: Trafilatura: A tool and library to gather text and metadata on the Web | news.ycombinator.com | 2025-05-28
-
curl_cffi
Python binding for the curl-impersonate fork via cffi. An HTTP client that can impersonate browser TLS/JA3/HTTP2 fingerprints.
-
Skill_Seekers
Convert documentation websites, GitHub repositories, and PDFs into Claude AI skills with automatic conflict detection
Project mention: Turn Docs, Code and PDFs into Claude AI Skills in Minutes | news.ycombinator.com | 2025-11-07
-
-
-
agentql
AgentQL is a suite of tools for connecting your AI to the web. Featuring a query language and Playwright integrations for interacting with elements and extracting data quickly, precisely, and at scale. Includes REST API, Python and JavaScript SDKs, browser debugger.
Project mention: AgentQL MCP Server: Structured Web Data for Claude, Cursor, Windsurf, and more | dev.to | 2025-03-12
Get the social links from agentql.com
-
web-scraping
Detailed web scraping tutorials for dummies with financial data crawlers on Reddit WallStreetBets, CME (both options and futures), US Treasury, CFTC, LME, MacroTrends, SHFE and alternative data crawlers on Tomtom, BBC, Wall Street Journal, Al Jazeera, Reuters, Financial Times, Bloomberg, CNN, Fortune, The Economist
-
At Scrapfly, we are dedicated to providing developers with all the resources they need to reach their scraping goals. Check out our comprehensive guide on scraping TikTok as well as our example TikTok scraper using Scrapfly's APIs on GitHub.
-
-
-
wayback-machine-scraper
A command-line utility and Scrapy middleware for scraping time series data from Archive.org's Wayback Machine.
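To give a sense of how time series data can be pulled from the Wayback Machine, the sketch below builds a query for Archive.org's public CDX API and turns its JSON response (a list of rows whose first row is the header) into dictionaries. It is not the project's own code or interface; the helper names `build_cdx_query`, `parse_cdx_rows`, and `snapshot_url` are hypothetical, and the sample payload in the usage block is made up for illustration.

```python
import json
from typing import Optional
from urllib.parse import urlencode

CDX_ENDPOINT = "https://web.archive.org/cdx/search/cdx"


def build_cdx_query(url: str, start: Optional[str] = None,
                    end: Optional[str] = None) -> str:
    """Build a CDX API URL listing captures of `url` as JSON.

    `start` and `end` are timestamp prefixes such as "2020" or "20200131".
    """
    params = {"url": url, "output": "json"}
    if start:
        params["from"] = start
    if end:
        params["to"] = end
    return f"{CDX_ENDPOINT}?{urlencode(params)}"


def parse_cdx_rows(payload: str) -> list[dict]:
    """Convert the JSON response (header row + records) into dicts."""
    rows = json.loads(payload)
    if not rows:
        return []
    header, *records = rows
    return [dict(zip(header, record)) for record in records]


def snapshot_url(record: dict) -> str:
    """Return the archived-page URL for one capture record."""
    return f"https://web.archive.org/web/{record['timestamp']}/{record['original']}"


if __name__ == "__main__":
    # Made-up sample response in the shape the CDX API returns.
    sample = json.dumps([
        ["urlkey", "timestamp", "original", "mimetype", "statuscode", "digest", "length"],
        ["com,example)/", "20200101000000", "http://example.com/", "text/html", "200", "ABC", "1234"],
    ])
    for capture in parse_cdx_rows(sample):
        print(snapshot_url(capture))
```

Each capture's timestamp gives the point on the time axis, and fetching `snapshot_url(capture)` would give the page as it looked then; the actual download step is left out here to keep the sketch offline.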
-
letterboxd_recommendations
Scraping publicly-accessible Letterboxd data and creating a movie recommendation model with it that can generate recommendations when provided with a Letterboxd username
-
-
scrapper
Web scraper with a simple REST API living in Docker and using a Headless browser and Readability.js for parsing.
-
facebook_page_scraper
Scrapes the front end of Facebook pages with no limitations and provides a feature to turn the data into structured JSON or CSV.
Python web-scraping discussion
Python web-scraping related posts
-
How I Block All 26M of Your Curl Requests
-
I Connected 3 MCP Servers to Claude & Built a No-Code Research Agent That Actually Cites Sources
-
Agentic Coding Slot Machines – Did We Just Summon a Genie Addiction? – Part 1
-
Trafilatura: A tool and library to gather text and metadata on the Web
-
This Week In Python
-
Creating self-healing spiders with Scrapling in Python without AI (Web Scraping)
-
Scrapling v0.2.99 – Easy, effortless Web Scraping With Python as it should be
-
A note from our sponsor - Stream
getstream.io | 15 Nov 2025
Index
What are some of the best open-source web-scraping projects in Python? This list will help you:
| # | Project | Stars |
|---|---|---|
| 1 | Scrapy | 58,941 |
| 2 | changedetection.io | 28,455 |
| 3 | Scrapegraph-ai | 21,751 |
| 4 | Douyin_TikTok_Download_API | 14,850 |
| 5 | SeleniumBase | 11,864 |
| 6 | Scrapling | 8,110 |
| 7 | crawlee-python | 7,145 |
| 8 | autoscraper | 7,022 |
| 9 | trafilatura | 4,898 |
| 10 | curl_cffi | 4,432 |
| 11 | Skill_Seekers | 3,693 |
| 12 | snoop | 3,496 |
| 13 | Grab | 2,428 |
| 14 | agentql | 1,005 |
| 15 | web-scraping | 831 |
| 16 | scrapfly-scrapers | 750 |
| 17 | google-search-results-python | 709 |
| 18 | scrapy-fake-useragent | 692 |
| 19 | wayback-machine-scraper | 453 |
| 20 | letterboxd_recommendations | 356 |
| 21 | twitter-scraper-selenium | 337 |
| 22 | scrapper | 286 |
| 23 | facebook_page_scraper | 262 |