Credit goes to www.libhunt.com

Python web-scraping

Open-source Python projects categorized as web-scraping

Top 23 Python web-scraping Projects

web-scraping
  1. Scrapy

    Scrapy, a fast high-level web crawling & scraping framework for Python.

    Project mention: How I Block All 26M of Your Curl Requests | news.ycombinator.com | 2025-10-02

    From what I have seen, it is hard to tell what "serious scrapers" use. They use many things: some use this, some do not. This is what I have learned from reading the webscraping subreddit. Nobody says things like that out loud.

    There are many tools; see the links below.

    Personally, I think running Selenium can be a bottleneck: it does not play nice, processes sometimes break, the system sometimes even needs a restart because things get blocked, it can be a memory hog, and so on. That is my experience.

    To be able to scale, I think you have to have your own implementation. Serious scrapers dismiss people who use Selenium or its derivatives as noobs who will come back asking why page X does not work with their scraping mechanisms.

    https://github.com/lexiforest/curl_cffi

    https://github.com/encode/httpx

    https://github.com/scrapy/scrapy

    https://github.com/apify/crawlee

  3. changedetection.io

    Best and simplest tool for website change detection, web page monitoring, and website change alerts. Perfect for tracking content changes, price drops, restock alerts, and website defacement monitoring—all for free or enjoy our SaaS plan!

    Project mention: rostral.io VS changedetection.io - a user suggested alternative | libhunt.com/r/rostral.io | 2025-08-05

    Rostral.io is an open-source tool for document monitoring with a focus on semantic analysis. Unlike traditional change detectors that track text differences, it processes PDFs, HTML, and JSON using local LLMs (like Deepseek) to identify meaningful modifications in legal texts, contracts, or regulations. The system uses YAML templates to define monitoring rules and can integrate with analysis tools. Designed for researchers and analysts who need to track substantive changes rather than just surface-level edits. Self-hosted for data-sensitive workflows.

  4. Scrapegraph-ai

    Python scraper based on AI

    Project mention: ScrapeGraphAI Release Week | news.ycombinator.com | 2025-07-07
  5. Douyin_TikTok_Download_API

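    🚀 "Douyin_TikTok_Download_API" is an out-of-the-box, high-performance asynchronous data scraping tool for Douyin, Kuaishou, TikTok, and Bilibili. It supports API calls as well as online batch parsing and downloading.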

  6. SeleniumBase

    Python APIs for web automation, testing, and bypassing bot-detection with ease.

    Project mention: Scraping German Rental Price Data – Part I: Whole Lotta Captchas | news.ycombinator.com | 2025-07-29

    Not yet! But it's on my list to try out next after giving SeleniumBase[1] a chance.

    [1] https://github.com/seleniumbase/SeleniumBase

  7. Scrapling

    🕷️ An undetectable, powerful, flexible, high-performance Python library to make Web Scraping Easy and Effortless as it should be!

    Project mention: Agentic Coding Slot Machines – Did We Just Summon a Genie Addiction? – Part 1 | news.ycombinator.com | 2025-07-03
  8. crawlee-python

    Crawlee—A web scraping and browser automation library for Python to build reliable crawlers. Extract data for AI, LLMs, RAG, or GPTs. Download HTML, PDF, JPG, PNG, and other files from websites. Works with BeautifulSoup, Playwright, and raw HTTP. Both headful and headless mode. With proxy rotation.

    Project mention: Launching Crawlee for Python v1.0 to simplify building web scrapers and crawlers | news.ycombinator.com | 2025-09-30
  10. autoscraper

    A Smart, Automatic, Fast and Lightweight Web Scraper for Python

  11. trafilatura

    Python & Command-line tool to gather text and metadata on the Web: Crawling, scraping, extraction, output as CSV, JSON, HTML, MD, TXT, XML

    Project mention: Trafilatura: A tool and library to gather text and metadata on the Web | news.ycombinator.com | 2025-05-28
  12. curl_cffi

    Python binding for the curl-impersonate fork via cffi. An HTTP client that can impersonate browser TLS/JA3/HTTP2 fingerprints.

    Project mention: How I Block All 26M of Your Curl Requests | news.ycombinator.com | 2025-10-02


  13. Skill_Seekers

    Convert documentation websites, GitHub repositories, and PDFs into Claude AI skills with automatic conflict detection

    Project mention: Turn Docs, Code and PDFs into Claude AI Skills in Minutes | news.ycombinator.com | 2025-11-07
  14. snoop

    Snoop: a reconnaissance tool based on open data (OSINT world)

  15. Grab

    Web Scraping Framework

  16. agentql

    AgentQL is a suite of tools for connecting your AI to the web. Featuring a query language and Playwright integrations for interacting with elements and extracting data quickly, precisely, and at scale. Includes REST API, Python and JavaScript SDKs, browser debugger.

    Project mention: AgentQL MCP Server: Structured Web Data for Claude, Cursor, Windsurf, and more | dev.to | 2025-03-12

    Get the social links from agentql.com

  17. web-scraping

    Detailed web scraping tutorials for dummies with financial data crawlers on Reddit WallStreetBets, CME (both options and futures), US Treasury, CFTC, LME, MacroTrends, SHFE and alternative data crawlers on Tomtom, BBC, Wall Street Journal, Al Jazeera, Reuters, Financial Times, Bloomberg, CNN, Fortune, The Economist

  18. scrapfly-scrapers

    Scalable Python web scraping scripts for 40+ popular domains

    Project mention: A Comprehensive Guide to TikTok API | dev.to | 2025-03-20

    At Scrapfly, we are dedicated to providing developers with all the resources they need to reach their scraping goals. Check out our comprehensive guide on scraping TikTok as well as our example TikTok scraper using Scrapfly's APIs on GitHub.

  19. google-search-results-python

    Google Search Results via SERP API pip Python Package

  20. scrapy-fake-useragent

    Random User-Agent middleware based on fake-useragent

  21. wayback-machine-scraper

    A command-line utility and Scrapy middleware for scraping time series data from Archive.org's Wayback Machine.

  22. letterboxd_recommendations

    Scraping publicly-accessible Letterboxd data and creating a movie recommendation model with it that can generate recommendations when provided with a Letterboxd username

  23. twitter-scraper-selenium

    A Python package for easily scraping Twitter's front end

  24. scrapper

    Web scraper with a simple REST API living in Docker and using a Headless browser and Readability.js for parsing.

  25. facebook_page_scraper

    Scrapes the front end of Facebook pages with no limitations and provides a feature to turn the data into structured JSON or CSV

NOTE: The open-source projects on this list are ordered by number of GitHub stars. The number of mentions indicates repo mentions in the last 12 months or since we started tracking (Dec 2020).


Python web-scraping related posts

  • How I Block All 26M of Your Curl Requests

    6 projects | news.ycombinator.com | 2 Oct 2025
  • I Connected 3 MCP Servers to Claude & Built a No-Code Research Agent That Actually Cites Sources

    5 projects | dev.to | 25 Sep 2025
  • Agentic Coding Slot Machines – Did We Just Summon a Genie Addiction? – Part 1

    3 projects | news.ycombinator.com | 3 Jul 2025
  • Trafilatura: A tool and library to gather text and metadata on the Web

    1 project | news.ycombinator.com | 28 May 2025
  • This Week In Python

    5 projects | dev.to | 16 May 2025
  • Creating self-healing spiders with Scrapling in Python without AI (Web Scraping)

    2 projects | dev.to | 4 May 2025
  • Scrapling v0.2.99 – Easy, effortless Web Scraping With Python as it should be

    1 project | news.ycombinator.com | 30 Apr 2025

Index

What are some of the best open-source web-scraping projects in Python? This list will help you:

#    Project                        Stars
1    Scrapy                         58,941
2    changedetection.io             28,455
3    Scrapegraph-ai                 21,751
4    Douyin_TikTok_Download_API     14,850
5    SeleniumBase                   11,864
6    Scrapling                       8,110
7    crawlee-python                  7,145
8    autoscraper                     7,022
9    trafilatura                     4,898
10   curl_cffi                       4,432
11   Skill_Seekers                   3,693
12   snoop                           3,496
13   Grab                            2,428
14   agentql                         1,005
15   web-scraping                      831
16   scrapfly-scrapers                 750
17   google-search-results-python      709
18   scrapy-fake-useragent             692
19   wayback-machine-scraper           453
20   letterboxd_recommendations        356
21   twitter-scraper-selenium          337
22   scrapper                          286
23   facebook_page_scraper             262

