Top 23 Python Data Science Projects

scikit-learn

1 93 63,971 9.9 Python

scikit-learn: machine learning in Python

Project mention: Open Source Journey | dev.to | 2025-11-01

Start Simple, Build Confidence. Project: Scikit-learn. After the intense first experience with BEHAVIOR-1K, I needed something more approachable. I went straight to Scikit-learn's good first issue label and found a task that seemed manageable: changing relative imports to absolute imports in Cython files. From this
Keras

2 89 63,551 9.8 Python

Deep Learning for humans

Project mention: PyTorch vs TensorFlow 2025: Which one wins after 72 hours? | dev.to | 2025-08-29

Keras 3 multi-backend
Pandas

3 430 47,068 9.9 Python

Flexible and powerful data analysis / manipulation library for Python, providing labeled data structures similar to R data.frame objects, statistical functions, and much more

Project mention: Node.js vs Python: Real Benchmarks, Performance Insights, and Scalability Analysis | dev.to | 2025-10-04

data analytics stacks (Pandas)
Airflow

4 200 43,200 10.0 Python

Apache Airflow - A platform to programmatically author, schedule, and monitor workflows

Project mention: What is Argo Workflows? | dev.to | 2025-11-10

Apache Airflow - Apache's Airflow project is a popular workflow system that supports DAG-based tasks and precise scheduling. It's an extensible Python project that supports several different providers and job executors, including Kubernetes.
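
To make "DAG-based tasks" concrete, here is a minimal sketch of dependency-ordered execution using only the Python standard library. It does not use Airflow's API; the task names and the `run_order` helper are hypothetical and chosen purely for illustration.

```python
from graphlib import TopologicalSorter

# Hypothetical pipeline: each task maps to the set of tasks it depends on.
# These names are illustrative and are not taken from Airflow.
dependencies = {
    "extract": set(),
    "transform": {"extract"},
    "load": {"transform"},
    "report": {"load"},
}


def run_order(deps):
    """Return an order in which every task comes after its dependencies."""
    # static_order() raises graphlib.CycleError if the graph is not a DAG.
    return list(TopologicalSorter(deps).static_order())


order = run_order(dependencies)
print(order)
```

A workflow system such as Airflow adds scheduling, retries, and executors on top of this kind of ordering; the sketch only shows why the tasks must form a directed acyclic graph.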
streamlit

5 321 42,140 9.9 Python

Streamlit — A faster way to build and share data apps.

Project mention: How to Build a RAG Solution with Llama Index, ChromaDB, and Ollama | dev.to | 2025-11-04

With a few lines of Python, you can build a basic retrieval-augmented generation (RAG) solution, but it doesn’t stop here. You can extend this project to search for multiple web pages, load large documents, add a simple web UI using either Streamlit or Anvil, or even experiment with different models in Ollama.
gradio

6 138 40,497 9.8 Python

Build and share delightful machine learning apps, all in Python. 🌟 Star to support our work!

Project mention: The Ultimate Guide to Building Stunning AI Apps For Beginners - Gradio | dev.to | 2025-11-14

Why Gradio is the New Superpower for Every AI Learner in 2025
Ray

7 49 39,825 10.0 Python

Ray is an AI compute engine. Ray consists of a core distributed runtime and a set of AI Libraries for accelerating ML workloads.

Project mention: PyTorch Monarch | news.ycombinator.com | 2025-10-23

Not currently, but it is being worked on https://github.com/ray-project/ray/issues/53976.
spaCy

8 116 32,785 7.4 Python

💫 Industrial-strength Natural Language Processing (NLP) in Python

Project mention: Strengthening Open-Source Integrity: My First Contribution to spaCy | dev.to | 2025-10-28

🔗 Pull Request: #13877 — Remove spaCy Quickstart from Universe/Courses due to spam redirect
pytorch-lightning

9 9 30,432 9.9 Python

Pretrain, finetune ANY AI model of ANY size on 1 or 10,000+ GPUs with zero code changes.
ML-From-Scratch

10 5 28,738 0.0 Python

Machine Learning From Scratch. Bare bones NumPy implementations of machine learning models and algorithms with a focus on accessibility. Aims to cover everything from linear regression to deep learning.

Project mention: Open Source? Open Mind! | dev.to | 2025-09-05

So, which open source project did you choose? https://github.com/eriklindernoren/ML-From-Scratch Okay, I know I said a lot about moving out of my comfort zone, but I am still a bit scared and worried. So, I've decided to start from an area that I am familiar with. First of all, this open source project is called "ML-From-Scratch". It's a learning resource that demystifies machine learning by showing the fundamental code for a wide range of models. For the past few months, I have been studying machine learning with data science. This project reveals what is actually going on behind the libraries and algorithms so people can understand the core functionality of machine learning (ML). This choice feels right for my "coder to developer" journey. As I contribute to the project, I will be exposed to deeper knowledge of machine learning.
data-science-ipython-notebooks

11 1 28,539 0.0 Python

Data science Python notebooks: Deep learning (TensorFlow, Theano, Caffe, Keras), scikit-learn, Kaggle, big data (Spark, Hadoop MapReduce, HDFS), matplotlib, pandas, NumPy, SciPy, Python essentials, AWS, and various command lines.
d2l-en

12 6 26,601 2.9 Python

Interactive deep learning book with multi-framework code, math, and discussions. Adopted at 500 universities from 70 countries including Stanford, MIT, Harvard, and Cambridge.
dash

13 57 24,233 9.7 Python

Data Apps & Dashboards for Python. No JavaScript Required.

Project mention: Other Visualization Tools: Dashboards & Reports | dev.to | 2025-09-14

Cloud Deployment: Dash apps can be deployed to Dash Enterprise or Heroku.
best-of-ml-python

14 18 22,786 8.1 Python

🏆 A ranked list of awesome machine learning Python libraries. Updated weekly.

Project mention: A ranked list of machine learning Python libraries. Updated weekly | news.ycombinator.com | 2025-01-31
pandas-ai

15 21 22,534 9.3 Python

Chat with your database or your datalake (SQL, CSV, parquet). PandasAI makes data analysis conversational using LLMs and RAG.

Project mention: Pandas AI | news.ycombinator.com | 2025-07-18
matplotlib

16 39 21,982 9.9 Python

matplotlib: plotting with Python

Project mention: How to Get Started with Scikit-Learn: A Beginner-Friendly Guide to Machine Learning in Python | dev.to | 2025-04-24

As is the case with most Python libraries, it is open-source and free-to-use, making it easily accessible by anyone willing to learn machine learning, and it is built upon other open-source libraries within Python, like SciPy for advanced scientific operations, NumPy for efficient numerical computations, Matplotlib for data visualization, and Cython for increased efficiency and speed, similar to that of C/C++.
recommenders

17 7 21,100 9.4 Python

Best Practices on Recommendation Systems
Prefect

18 20 20,783 9.9 Python

The easiest way to build, run, and monitor data pipelines at scale.

Project mention: Show HN: Flow – A Dynamic Task Engine for AI Agents Without DAG | news.ycombinator.com | 2024-12-02

- https://github.com/PrefectHQ/prefect
marimo

19 43 17,232 10.0 Python

A reactive notebook for Python — run reproducible experiments, query with SQL, execute as a script, deploy as an app, and version with git. Stored as pure Python. All in a modern, AI-native editor.

Project mention: We're open-sourcing the successor of Jupyter notebook | news.ycombinator.com | 2025-11-04

The successor to Jupyter notebook is Marimo, https://marimo.io/ because they are pure code, not code in json. First class everywhere.
ipython

20 36 16,601 9.8 Python

Official repository for IPython itself. Other repos in the IPython organization contain things like the website, documentation builds, etc.

Project mention: Reloading Classes in Python | news.ycombinator.com | 2025-08-29

Pickling + unpickling the object is a neat trick to update objects to point to the new methods, but it's even more straightforward to just patch `obj.__class__ = reloaded_module.NewClass`. This is what ipython's autoreload extension used to do, though nowadays it's had some improvements over this approach: https://github.com/ipython/ipython/pull/14500
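
The `__class__` patching trick mentioned in the comment can be sketched as follows. This is a simplified illustration, not IPython's autoreload implementation: `Greeter` and `NewGreeter` are hypothetical names, and `NewGreeter` stands in for the class object you would get back after reloading a module.

```python
class Greeter:
    """Original class definition, as first imported."""

    def __init__(self, name):
        self.name = name

    def greet(self):
        return f"hello {self.name}"


obj = Greeter("world")


class NewGreeter:
    """Hypothetical stand-in for the class as redefined after a reload."""

    def greet(self):
        return f"hello again, {self.name}"


# Re-point the existing instance at the new class. Instance attributes
# such as `name` live on the object and are kept; methods are now looked
# up on NewGreeter instead of Greeter.
obj.__class__ = NewGreeter

print(obj.greet())
```

This only works when the old and new classes have compatible object layouts (for example, two ordinary classes without differing `__slots__`), which is one reason a real reloader needs more care than this sketch.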
gensim

21 18 16,267 7.9 Python

Topic Modelling for Humans
dvc

22 121 15,089 9.2 Python

🦉 Data Versioning and ML Experiments

Project mention: Ask HN: What is the simplest data orchestration tool you've worked with? | news.ycombinator.com | 2025-03-21
dagster

23 57 14,423 10.0 Python

An orchestration platform for the development, production, and observation of data assets.

Project mention: Fixing Type Hints for Callable Objects with Custom Signatures in Dagster | dev.to | 2025-10-28

Finding the Issue: While browsing through their GitHub issues, I found Issue #32574: "Callable object custom signatures are resolved incorrectly." At first glance, I thought "Oh cool, this looks easy." But then I read the details and realized this was actually pretty interesting.

NOTE: The open source projects on this list are ordered by number of GitHub stars. The number of mentions indicates repo mentions in the last 12 months or since we started tracking (Dec 2020).

Python Data Science discussion

Python Data Science related posts

TabPFN-2.5 – SOTA foundation model for tabular data
2 projects | news.ycombinator.com | 6 Nov 2025

We're open-sourcing the successor of Jupyter notebook
4 projects | news.ycombinator.com | 4 Nov 2025

Fixing Type Hints for Callable Objects with Custom Signatures in Dagster
1 project | dev.to | 28 Oct 2025

Strengthening Open-Source Integrity: My First Contribution to spaCy
1 project | dev.to | 28 Oct 2025

LLMZ25-2 Review: Construyendo Interfaces LLM con Streamlit (Building LLM Interfaces with Streamlit)
2 projects | dev.to | 25 Oct 2025

LLMZ25-1 Review: Streamlit La Herramienta Perfecta para Interfaces de Proyectos LLM (Streamlit, the Perfect Tool for LLM Project Interfaces)
1 project | dev.to | 25 Oct 2025

Installing FFCV and Fastxtend on Windows with Micromamba and MSVC
1 project | dev.to | 24 Oct 2025

Index

What are some of the best open-source Data Science projects in Python? This list will help you:

#	Project	Stars
1	scikit-learn	63,971
2	Keras	63,551
3	Pandas	47,068
4	Airflow	43,200
5	streamlit	42,140
6	gradio	40,497
7	Ray	39,825
8	spaCy	32,785
9	pytorch-lightning	30,432
10	ML-From-Scratch	28,738
11	data-science-ipython-notebooks	28,539
12	d2l-en	26,601
13	dash	24,233
14	best-of-ml-python	22,786
15	pandas-ai	22,534
16	matplotlib	21,982
17	recommenders	21,100
18	Prefect	20,783
19	marimo	17,232
20	ipython	16,601
21	gensim	16,267
22	dvc	15,089
23	dagster	14,423

Did you know that Python is the 2nd most popular programming language based on number of references?