Top 23 Python Data Science Projects
-
Start Simple, Build Confidence. Project: Scikit-learn. After the intense first experience with BEHAVIOR-1K, I needed something more approachable. I went straight to Scikit-learn's "good first issue" label and found a task that seemed manageable: changing relative imports to absolute imports in Cython files. From this...
-
Keras 3 multi-backend
-
Pandas
Flexible and powerful data analysis / manipulation library for Python, providing labeled data structures similar to R data.frame objects, statistical functions, and much more
Project mention: Node.js vs Python: Real Benchmarks, Performance Insights, and Scalability Analysis | dev.to | 2025-10-04
data analytics stacks (Pandas)
-
Apache Airflow - Apache's Airflow project is a popular workflow system that supports DAG-based tasks and precise scheduling. It's an extensible Python project that supports several different providers and job executors, including Kubernetes.
-
Project mention: How to Build a RAG Solution with Llama Index, ChromaDB, and Ollama | dev.to | 2025-11-04
With a few lines of Python, you can build a basic retrieval-augmented generation (RAG) solution, but it doesn’t stop here. You can extend this project to search for multiple web pages, load large documents, add a simple web UI using either Streamlit or Anvil, or even experiment with different models in Ollama.
-
Project mention: The Ultimate Guide to Building Stunning AI Apps For Beginners - Gradio | dev.to | 2025-11-14
Why Gradio is the New Superpower for Every AI Learner in 2025
-
Ray
Ray is an AI compute engine. Ray consists of a core distributed runtime and a set of AI Libraries for accelerating ML workloads.
Not currently, but it is being worked on: https://github.com/ray-project/ray/issues/53976
-
Project mention: Strengthening Open-Source Integrity: My First Contribution to spaCy | dev.to | 2025-10-28
🔗 Pull Request: #13877 — Remove spaCy Quickstart from Universe/Courses due to spam redirect
-
pytorch-lightning
Pretrain, finetune ANY AI model of ANY size on 1 or 10,000+ GPUs with zero code changes.
-
ML-From-Scratch
Machine Learning From Scratch. Bare bones NumPy implementations of machine learning models and algorithms with a focus on accessibility. Aims to cover everything from linear regression to deep learning.
So, which open source project did you choose? https://github.com/eriklindernoren/ML-From-Scratch Okay, I know I said many things about moving out of my comfort zone, but I am still a bit scared and worried. So I've decided to start from an area that I am familiar with. First of all, this open source project is called "ML-From-Scratch". It's a learning resource that demystifies machine learning by showing the fundamental code for a wide range of models. For the past few months, I have been studying machine learning alongside data science. This project reveals what is actually going on behind the libraries and algorithms so people can understand the core functionality of machine learning (ML). This choice feels right for my "coder to developer" journey. As I contribute to the project, I will be exposed to deeper knowledge of machine learning.
-
data-science-ipython-notebooks
Data science Python notebooks: Deep learning (TensorFlow, Theano, Caffe, Keras), scikit-learn, Kaggle, big data (Spark, Hadoop MapReduce, HDFS), matplotlib, pandas, NumPy, SciPy, Python essentials, AWS, and various command lines.
-
d2l-en
Interactive deep learning book with multi-framework code, math, and discussions. Adopted at 500 universities from 70 countries including Stanford, MIT, Harvard, and Cambridge.
-
Cloud Deployment: Dash apps can be deployed to Dash Enterprise or Heroku.
-
Project mention: A ranked list of machine learning Python libraries. Updated weekly | news.ycombinator.com | 2025-01-31
-
pandas-ai
Chat with your database or your datalake (SQL, CSV, parquet). PandasAI makes data analysis conversational using LLMs and RAG.
-
Project mention: How to Get Started with Scikit-Learn: A Beginner-Friendly Guide to Machine Learning in Python | dev.to | 2025-04-24
As is the case with most Python libraries, it is open-source and free-to-use, making it easily accessible by anyone willing to learn machine learning, and it is built upon other open-source libraries within Python, like SciPy for advanced scientific operations, NumPy for efficient numerical computations, Matplotlib for data visualization, and Cython for increased efficiency and speed, similar to that of C/C++.
-
-
Project mention: Show HN: Flow – A Dynamic Task Engine for AI Agents Without DAG | news.ycombinator.com | 2024-12-02
- https://github.com/PrefectHQ/prefect
-
marimo
A reactive notebook for Python — run reproducible experiments, query with SQL, execute as a script, deploy as an app, and version with git. Stored as pure Python. All in a modern, AI-native editor.
Project mention: We're open-sourcing the successor of Jupyter notebook | news.ycombinator.com | 2025-11-04
The successor to Jupyter notebook is Marimo, https://marimo.io/ because they are pure code, not code in json. First class everywhere.
-
ipython
Official repository for IPython itself. Other repos in the IPython organization contain things like the website, documentation builds, etc.
Pickling + unpickling the object is a neat trick to update objects to point to the new methods, but it's even more straightforward to just patch `obj.__class__ = reloaded_module.NewClass`. This is what ipython's autoreload extension used to do, though nowadays it's had some improvements over this approach: https://github.com/ipython/ipython/pull/14500
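The `__class__` patching trick mentioned above can be sketched in a few lines. This is only an illustration of the general technique, not IPython's actual autoreload code: the `load_module` helper, the `greeter_demo` module name, and the `Greeter` class are all hypothetical, and a module built from a source string stands in for a real `importlib.reload` call.

```python
import types


def load_module(name, source):
    """Build a throwaway module from source text.

    Hypothetical stand-in for importing a module and later reloading it
    after its file changed on disk.
    """
    module = types.ModuleType(name)
    exec(source, module.__dict__)
    return module


# "Version 1" of the module, as it looked when the object was created.
OLD_SOURCE = '''
class Greeter:
    def __init__(self, name):
        self.name = name

    def greet(self):
        return "hello " + self.name
'''

# "Version 2" of the module, after the developer edited greet().
NEW_SOURCE = '''
class Greeter:
    def __init__(self, name):
        self.name = name

    def greet(self):
        return "HELLO, " + self.name + "!"
'''

old_module = load_module("greeter_demo", OLD_SOURCE)
obj = old_module.Greeter("ada")
before = obj.greet()  # uses the old method

# Simulate a reload: a new class object now exists, but obj still
# points at the old class until it is patched.
reloaded_module = load_module("greeter_demo", NEW_SOURCE)
obj.__class__ = reloaded_module.Greeter
after = obj.greet()  # same instance and state, new method

print(before)
print(after)
```

The instance keeps its attributes (`obj.name` is untouched) because only the class pointer changes. Python allows this assignment only when the old and new classes have compatible object layouts, which holds for ordinary classes like these but can fail for classes that use `__slots__` or extension types.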
-
-
Project mention: Ask HN: What is the simplest data orchestration tool you've worked with? | news.ycombinator.com | 2025-03-21
-
Project mention: Fixing Type Hints for Callable Objects with Custom Signatures in Dagster | dev.to | 2025-10-28
Finding the Issue: While browsing through their GitHub issues, I found Issue #32574: "Callable object custom signatures are resolved incorrectly." At first glance, I thought "Oh cool, this looks easy." But then I read the details and realized this was actually pretty interesting.
-
Python Data Science related posts
-
TabPFN-2.5 – SOTA foundation model for tabular data
-
We're open-sourcing the successor of Jupyter notebook
-
Fixing Type Hints for Callable Objects with Custom Signatures in Dagster
-
Strengthening Open-Source Integrity: My First Contribution to spaCy
-
LLMZ25-2 Review: Building LLM Interfaces with Streamlit
-
LLMZ25-1 Review: Streamlit, the Perfect Tool for LLM Project Interfaces
-
Installing FFCV and Fastxtend on Windows with Micromamba and MSVC
-
A note from our sponsor - InfluxDB
www.influxdata.com | 15 Nov 2025
Index
What are some of the best open-source Data Science projects in Python? This list will help you:
| # | Project | Stars |
|---|---|---|
| 1 | scikit-learn | 63,971 |
| 2 | Keras | 63,551 |
| 3 | Pandas | 47,068 |
| 4 | Airflow | 43,200 |
| 5 | streamlit | 42,140 |
| 6 | gradio | 40,497 |
| 7 | Ray | 39,825 |
| 8 | spaCy | 32,785 |
| 9 | pytorch-lightning | 30,432 |
| 10 | ML-From-Scratch | 28,738 |
| 11 | data-science-ipython-notebooks | 28,539 |
| 12 | d2l-en | 26,601 |
| 13 | dash | 24,233 |
| 14 | best-of-ml-python | 22,786 |
| 15 | pandas-ai | 22,534 |
| 16 | matplotlib | 21,982 |
| 17 | recommenders | 21,100 |
| 18 | Prefect | 20,783 |
| 19 | marimo | 17,232 |
| 20 | ipython | 16,601 |
| 21 | gensim | 16,267 |
| 22 | dvc | 15,089 |
| 23 | dagster | 14,423 |