Top 23 Python Natural Language Processing Projects

Each entry lists, in order: rank, mentions, stars, activity score, language.

transformers

1 222 152,508 10.0 Python

🤗 Transformers: the model-definition framework for state-of-the-art machine learning models in text, vision, audio, and multimodal models, for both inference and training.

Project mention: Using “ibm-granite/granite-speech-3.3–8b” 🪨 for ASR | dev.to | 2025-11-02

python3.12 -m venv new_venv_312
source new_venv_312/bin/activate
pip install --upgrade pip
pip install https://github.com/huggingface/transformers/archive/main.zip torchaudio peft soundfile torchcodec
### and also
pip install librosa
funNLP

2 0 77,063 2.2 Python

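Chinese and English sensitive words, language detection, Chinese and international mobile/phone number location and carrier lookup, gender inference from names, mobile number extraction, ID card number extraction, email extraction, Chinese and Japanese name databases, Chinese abbreviation database, character decomposition dictionary, word sentiment values, stop words, reactionary word list, violence and terrorism word list, Traditional/Simplified Chinese conversion, English approximation of Chinese pronunciation, Wang Feng lyrics generator, occupation name lexicon, synonym lexicon, antonym lexicon, negation word lexicon, car brand lexicon, car parts lexicon, segmentation of continuous English text, various Chinese word vectors, company name list, classical poetry database, IT lexicon, finance lexicon, idiom (chengyu) lexicon, place name lexicon, historical figures lexicon, poetry lexicon, medical lexicon, food and drink lexicon, legal lexicon, automotive lexicon, animal lexicon, Chinese chat corpus, Chinese rumor data, Baidu Chinese Q&A dataset, collection of sentence similarity matching algorithms, BERT resources, text generation & summarization tools, cocoNLP information extraction tool, regex matching for Chinese domestic phone numbers, Tsinghua University XLORE: Chinese-English cross-lingual encyclopedic knowledge graph, Tsinghua University AI technology report series, natural language generation, the "NLU is too hard" series, automatic couplet data and bot, username blacklist, criminal charge and legal terms with classification model, WeChat official account corpus, cs224n deep learning NLP course, Chinese handwritten character recognition, Chinese NLP corpora/datasets, variable naming tool, word segmentation corpus + code, English task-oriented dialogue dataset, ASR speech dataset + deep-learning-based Chinese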
crewAI

3 15 40,292 9.8 Python

Framework for orchestrating role-playing, autonomous AI agents. By fostering collaborative intelligence, CrewAI empowers agents to work together seamlessly, tackling complex tasks.

Project mention: My Top Open-Source AI Tools for Building Smarter in 2025 | dev.to | 2025-08-14

GitHub - crewAIInc/crewAI
HanLP

4 3 35,840 6.8 Python

Natural Language Processing for the next decade. Tokenization, Part-of-Speech Tagging, Named Entity Recognition, Syntactic & Semantic Dependency Parsing, Document Classification
Jieba

5 8 34,551 0.0 Python

Jieba Chinese word segmentation

Project mention: Show HN: Mandarin Word Segmenter with Translation | news.ycombinator.com | 2025-02-04

Thanks for the kind words!
I'm using Jieba[0] because it hits a nice balance of fast and accurate. But I'm initializing it with a custom dictionary (~800k entries), and have added several layers of heuristic post-segmentation. For example, Jieba tends to split up chengyu into two words, but I've decided they should be displayed as a single word, since chengyu are typically a single entry in dictionaries.
[0] https://github.com/fxsjy/jieba
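
The post-segmentation step described in the comment above can be sketched roughly as follows. This is only an illustration, not the commenter's actual code: the function name `merge_known_phrases`, the `max_span` parameter, the sample tokens, and the one-entry chengyu set are all hypothetical, and the token list stands in for whatever a segmenter such as Jieba would return.

```python
def merge_known_phrases(tokens, phrases, max_span=4):
    """Greedily merge adjacent tokens whose concatenation is a known phrase."""
    merged = []
    i = 0
    while i < len(tokens):
        # Try the longest span first so longer phrases win over shorter ones.
        for span in range(min(max_span, len(tokens) - i), 1, -1):
            candidate = "".join(tokens[i:i + span])
            if candidate in phrases:
                merged.append(candidate)
                i += span
                break
        else:
            # No multi-token phrase starts here; keep the token unchanged.
            merged.append(tokens[i])
            i += 1
    return merged


# Hypothetical segmenter output in which a four-character chengyu was split in two.
tokens = ["他", "一心", "一意", "地", "工作"]
chengyu = {"一心一意"}  # stand-in for a real chengyu dictionary
result = merge_known_phrases(tokens, chengyu)
print(result)
```

Keeping the merge as a separate pass over the token list means the segmenter itself stays untouched, so further heuristics can be layered on as additional passes.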
spaCy

6 116 32,785 7.4 Python

💫 Industrial-strength Natural Language Processing (NLP) in Python

Project mention: Strengthening Open-Source Integrity: My First Contribution to spaCy | dev.to | 2025-10-28

🔗 Pull Request: #13877 — Remove spaCy Quickstart from Universe/Courses due to spam redirect
d2l-en

7 6 26,601 2.9 Python

Interactive deep learning book with multi-framework code, math, and discussions. Adopted at 500 universities from 70 countries including Stanford, MIT, Harvard, and Cambridge.
Resume-Matcher

8 14 23,901 9.3 Python

Improve your resumes with Resume Matcher. Get insights, keyword suggestions and tune your resumes to job descriptions.

Project mention: Ask HN: Someone has committed 20K+ LoC to a PR, exhausting my CI & AI workflows | news.ycombinator.com | 2025-08-26

I'm maintaining an OSS project, and someone raised a PR a few days earlier, and since then, 20K+ LoC has been added to the PR. There are two new accounts, but they lack details on how to contact them, only providing usernames.
PR: https://github.com/srbhr/Resume-Matcher/pull/497
Accounts:
NLP-progress

9 17 22,963 6.5 Python

Repository to track the progress in Natural Language Processing (NLP), including the datasets and the current state-of-the-art for the most common NLP tasks.
datasets

10 18 20,844 9.4 Python

🤗 The largest hub of ready-to-use datasets for AI models with fast, easy-to-use and efficient data manipulation tools

Project mention: Training with Big Data on Any Cloud | dev.to | 2025-06-20

Hugging Face Datasets -- the library that lets you download and manage datasets from the Hugging Face Hub, as well as being a convenient vendor-neutral interface for your own datasets.
rasa

11 19 20,840 4.6 Python

💬 Open source machine learning framework to automate text- and voice-based conversations: NLU, dialogue management, connect to Slack, Facebook, and more - Create chatbots and voice assistants

Project mention: Eliza Reanimated Published in IEEE Annals of the History of Computing | news.ycombinator.com | 2025-06-20

Right before LLMs broke into the scene we had a few techniques I was aware of:
* Personality Forge uses a rules-based scripting approach [0]. This is basically ELIZA extended to take advantage of modern processing power.
* Rasa [1] used traditional NLP/NLU techniques and small-model ML to match intents and parse user requests. This is the same kind of tooling that Google/Alexa historically used, just without the voice layer and with more effort to keep the context in mind.
Rasa is actually open source [2], so you can poke around the internals to see how it's implemented. It doesn't look like it's changed architecture substantially since the pre-LLM days. Rhasspy [3] (also open source) uses similar techniques but in the voice assistant space rather than as a full chatbot.
[0] https://www.personalityforge.com/developers/how-to-build-cha...
[1] https://web.archive.org/web/20200801000000*/https://rasa.com... (old link because Rasa's marketing today is ambiguous about whether they're adding LLMs now).
[2] https://github.com/RasaHQ/rasa
[3] https://rhasspy.readthedocs.io/en/latest/
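
As a rough illustration of the rules-based, ELIZA-style approach the comment describes, the sketch below matches an utterance against an ordered list of patterns and fills a response template. It is not taken from Personality Forge, Rasa, or Rhasspy; the rules, the `respond` function, and the fallback text are all hypothetical.

```python
import re

# Hypothetical rules: (pattern, response template). The first matching rule wins.
RULES = [
    (re.compile(r"\bi need (.+)", re.IGNORECASE), "Why do you need {0}?"),
    (re.compile(r"\bi am (.+)", re.IGNORECASE), "How long have you been {0}?"),
    (re.compile(r"\b(hello|hi)\b", re.IGNORECASE), "Hello. What would you like to talk about?"),
]
FALLBACK = "Please tell me more."


def respond(utterance):
    """Return the template of the first rule that matches, filled with its captures."""
    # Drop surrounding whitespace and trailing punctuation before matching.
    text = utterance.strip().rstrip(".!?")
    for pattern, template in RULES:
        match = pattern.search(text)
        if match:
            return template.format(*match.groups())
    # No rule matched; fall back to a generic prompt.
    return FALLBACK


print(respond("I need a vacation."))
print(respond("The weather is nice"))
```

An intent-based system of the kind attributed to Rasa would swap the regular expressions for a trained classifier, but the overall loop of mapping an utterance to a response stays similar.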
Ciphey

12 27 20,165 0.0 Python

⚡ Automatically decrypt encryptions without knowing the key or cipher, decode encodings, and crack hashes ⚡
Qwen

13 8 19,710 6.0 Python

The official repo of Qwen (通义千问) chat & pretrained large language model proposed by Alibaba Cloud.

Project mention: Running Qwen, Nearly as Powerful as DeepSeek, on a MacBook Pro | dev.to | 2025-02-05

Qwen (Qwen GitHub Repository) has been gaining attention recently as a powerful open-source large language model (LLM). I decided to give it a spin on my MacBook Pro using Ollama, a platform designed for running local LLMs. While Qwen2.5-Max boasts the highest performance, my setup could only handle the smaller Qwen2.5 (32B) model. Here's what I found!
DocsGPT

14 41 17,365 9.9 Python

Private AI platform for agents, assistants and enterprise search. Built-in Agent Builder, Deep research, Document analysis, Multi-model support, and API connectivity for agents.

Project mention: 15 AI tools that almost replace a full dev team but please don’t fire us yet | dev.to | 2025-05-03

DocsGPT: Lets users query your docs using GPT.
gensim

15 18 16,267 7.9 Python

Topic Modelling for Humans
camel

16 16 14,781 9.9 Python

🐫 CAMEL: The first and the best multi-agent framework. Finding the Scaling Law of Agents. https://www.camel-ai.org

Project mention: Revisiting Minsky's Society of Mind in 2025 | news.ycombinator.com | 2025-06-18

It seems like you might be confusing "research programs" with things like "branding" and superficial terminology. Here, enjoy this thing clearly building on SoM ideas and edited earlier this week: https://github.com/camel-ai/camel/blob/master/camel/societie...
NLTK

17 70 14,382 9.2 Python

NLTK Source

Project mention: What is the Most Effective AI Tool for App Development Today? | dev.to | 2025-08-17

At the core of many AI-powered applications are foundational models—large language models (LLMs) and APIs that provide the intelligence for features like natural language processing, image recognition, and decision-making. These tools serve as the brain of the app, processing inputs and generating outputs that feel intuitive and human-like.
flair

18 10 14,324 8.7 Python

A very simple framework for state-of-the-art Natural Language Processing (NLP)

Project mention: WhisperNER: Unified Open Named Entity and Speech Recognition | news.ycombinator.com | 2024-11-21

only the last string is a LOC named entity. Of course you can change definitions from the standard ones if you like, but then you should be careful not to compare with tools that use the original standard definition of NER such as flairNLP [1].
[1] https://github.com/flairNLP/flair?tab=readme-ov-file
MOSS

19 4 12,049 4.7 Python

An open-source tool-augmented conversational language model from Fudan University
LLMSurvey

20 3 11,956 7.3 Python

The official GitHub page for the survey paper "A Survey of Large Language Models".
ludwig

21 3 11,616 0.0 Python

Low-code framework for building custom LLMs, neural networks, and other AI models
doccano

22 13 10,381 3.6 Python

Open source annotation tool for machine learning practitioners.
autogluon

23 11 9,564 9.5 Python

Fast and Accurate ML in 3 Lines of Code

Project mention: Gluon: a GPU programming language based on the same compiler stack as Triton | news.ycombinator.com | 2025-09-17

Amazon (+ Microsoft) already released a language for ML called gluon 8 years ago: https://aws.amazon.com/blogs/aws/introducing-gluon-a-new-lib...
autogluon is popular as well: https://github.com/autogluon/autogluon

NOTE: The open source projects on this list are ordered by number of GitHub stars. The number of mentions indicates repo mentions in the last 12 months or since we started tracking (Dec 2020).

Python Natural Language Processing related posts

Learning to Model the World with Language
1 project | news.ycombinator.com | 6 Nov 2025

Using “ibm-granite/granite-speech-3.3–8b” 🪨 for ASR
1 project | dev.to | 2 Nov 2025

Strengthening Open-Source Integrity: My First Contribution to spaCy
1 project | dev.to | 28 Oct 2025

5 Ways to Detect AI Agent Hallucinations
1 project | dev.to | 26 Oct 2025

Updating ASR examples in Hugging Face Transformers Hub datasets, clearer args, smoother Windows setup
1 project | dev.to | 30 Sep 2025

A Simple Guide to Keyword Clustering with spaCy
1 project | dev.to | 15 Sep 2025

Wikipedia survives while the rest of the internet breaks
1 project | news.ycombinator.com | 4 Sep 2025

Index

What are some of the best open-source Natural Language Processing projects in Python? This list will help you:

#	Project	Stars
1	transformers	152,508
2	funNLP	77,063
3	crewAI	40,292
4	HanLP	35,840
5	Jieba	34,551
6	spaCy	32,785
7	d2l-en	26,601
8	Resume-Matcher	23,901
9	NLP-progress	22,963
10	datasets	20,844
11	rasa	20,840
12	Ciphey	20,165
13	Qwen	19,710
14	DocsGPT	17,365
15	gensim	16,267
16	camel	14,781
17	NLTK	14,382
18	flair	14,324
19	MOSS	12,049
20	LLMSurvey	11,956
21	ludwig	11,616
22	doccano	10,381
23	autogluon	9,564

Did you know that Python is the 2nd most popular programming language based on number of references?