Top 23 Python Natural Language Processing Projects
-
transformers
🤗 Transformers: the model-definition framework for state-of-the-art machine learning models in text, vision, audio, and multimodal models, for both inference and training.
python3.12 -m venv new_venv_312
source new_venv_312/bin/activate
pip install --upgrade pip
pip install https://github.com/huggingface/transformers/archive/main.zip torchaudio peft soundfile torchcodec
### and also
pip install librosa
-
funNLP
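Chinese and English sensitive-word lists, language detection, domestic and international mobile/landline number location and carrier lookup, gender inference from names, mobile number extraction, ID card number extraction, email extraction, Chinese and Japanese name databases, Chinese abbreviation database, character-decomposition dictionary, word sentiment values, stop words, reactionary word list, violence and terrorism word list, Traditional/Simplified Chinese conversion, English approximation of Chinese pronunciation, Wang Feng lyrics generator, job title lexicon, synonym lexicon, antonym lexicon, negation word lexicon, car brand lexicon, car parts lexicon, segmentation of run-together English text, assorted Chinese word vectors, company name collection, classical poetry corpus, IT lexicon, finance lexicon, idiom (chengyu) lexicon, place name lexicon, historical figures lexicon, poetry lexicon, medical lexicon, food and drink lexicon, legal lexicon, automotive lexicon, animal lexicon, Chinese chat corpus, Chinese rumor data, Baidu Chinese Q&A dataset, collection of sentence-similarity matching algorithms, BERT resources, text generation and summarization tools, cocoNLP information extraction tool, regex matching for domestic (Chinese) phone numbers, Tsinghua University XLORE: Chinese-English cross-lingual encyclopedic knowledge graph, Tsinghua University AI technology report series, natural language generation, "NLU is too hard" series, automatic couplet data and bot, username blacklist, criminal charge and legal terminology with classification models, WeChat official account corpus, cs224n deep learning for NLP course, Chinese handwritten character recognition, Chinese NLP corpora/datasets, variable naming tool, word segmentation corpus + code, task-oriented dialogue datasets in English, ASR speech datasets + deep-learning-based Chinese ...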
-
crewAI
Framework for orchestrating role-playing, autonomous AI agents. By fostering collaborative intelligence, CrewAI empowers agents to work together seamlessly, tackling complex tasks.
GitHub - crewAIInc/crewAI
-
HanLP
Natural Language Processing for the next decade. Tokenization, Part-of-Speech Tagging, Named Entity Recognition, Syntactic & Semantic Dependency Parsing, Document Classification
-
Jieba
Project mention: Show HN: Mandarin Word Segmenter with Translation | news.ycombinator.com | 2025-02-04
Thanks for the kind words!
I'm using Jieba[0] because it hits a nice balance of fast and accurate. But I'm initializing it with a custom dictionary (~800k entries), and have added several layers of heuristic post-segmentation. For example, Jieba tends to split up chengyu into two words, but I've decided they should be displayed as a single word, since chengyu are typically a single entry in dictionaries.
[0] https://github.com/fxsjy/jieba
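The post-segmentation step described in the quote above (re-joining chengyu that the segmenter split in two) could look roughly like the sketch below. This is not the poster's code and does not call Jieba: the function name `merge_split_entries`, the `max_parts` limit, and the sample tokens are all hypothetical, and the token list merely stands in for whatever a segmenter returns.

```python
def merge_split_entries(tokens, dictionary, max_parts=3):
    """Greedily merge adjacent tokens whose concatenation is a dictionary entry."""
    merged = []
    i = 0
    while i < len(tokens):
        # Try the longest window first so longer entries win over shorter ones.
        for size in range(min(max_parts, len(tokens) - i), 1, -1):
            candidate = "".join(tokens[i:i + size])
            if candidate in dictionary:
                merged.append(candidate)
                i += size
                break
        else:
            # No multi-token entry starts here; keep the token unchanged.
            merged.append(tokens[i])
            i += 1
    return merged


# Hypothetical segmenter output in which a four-character chengyu was split in two.
tokens = ["他", "一心", "一意", "地", "工作"]
chengyu = {"一心一意"}
print(merge_split_entries(tokens, chengyu))
```

Keeping the merge as a separate pass over the token list means the segmenter itself stays untouched, which fits the "layers of heuristic post-segmentation" the poster mentions.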
-
spaCy
Project mention: Strengthening Open-Source Integrity: My First Contribution to spaCy | dev.to | 2025-10-28
🔗 Pull Request: #13877 — Remove spaCy Quickstart from Universe/Courses due to spam redirect
-
d2l-en
Interactive deep learning book with multi-framework code, math, and discussions. Adopted at 500 universities from 70 countries including Stanford, MIT, Harvard, and Cambridge.
-
Resume-Matcher
Improve your resumes with Resume Matcher. Get insights, keyword suggestions and tune your resumes to job descriptions.
Project mention: Ask HN: Someone has committed 20K+ LoC to a PR, exhausting my CI & AI workflows | news.ycombinator.com | 2025-08-26
I'm maintaining an OSS project. Someone raised a PR a few days ago, and since then 20K+ LoC have been added to it. There are two new accounts, but they lack details on how to contact them, only providing usernames.
PR: https://github.com/srbhr/Resume-Matcher/pull/497
Accounts:
-
NLP-progress
Repository to track the progress in Natural Language Processing (NLP), including the datasets and the current state-of-the-art for the most common NLP tasks.
-
datasets
🤗 The largest hub of ready-to-use datasets for AI models with fast, easy-to-use and efficient data manipulation tools
Hugging Face Datasets -- the library that lets you download and manage datasets from the Hugging Face Hub, as well as being a convenient vendor-neutral interface for your own datasets.
-
rasa
💬 Open source machine learning framework to automate text- and voice-based conversations: NLU, dialogue management, connect to Slack, Facebook, and more - Create chatbots and voice assistants
Project mention: Eliza Reanimated Published in IEEE Annals of the History of Computing | news.ycombinator.com | 2025-06-20
Right before LLMs broke into the scene we had a few techniques I was aware of:
* Personality Forge uses a rules-based scripting approach [0]. This is basically ELIZA extended to take advantage of modern processing power.
* Rasa [1] used traditional NLP/NLU techniques and small-model ML to match intents and parse user requests. This is the same kind of tooling that Google/Alexa historically used, just without the voice layer and with more effort to keep the context in mind.
Rasa is actually open source [2], so you can poke around the internals to see how it's implemented. It doesn't look like it's changed architecture substantially since the pre-LLM days. Rhasspy [3] (also open source) uses similar techniques but in the voice assistant space rather than as a full chatbot.
[0] https://www.personalityforge.com/developers/how-to-build-cha...
[1] https://web.archive.org/web/20200801000000*/https://rasa.com... (old link because Rasa's marketing today is ambiguous about whether they're adding LLMs now).
[2] https://github.com/RasaHQ/rasa
[3] https://rhasspy.readthedocs.io/en/latest/
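As a rough illustration of what "matching intents" without an LLM can mean, here is a minimal sketch that scores an utterance against example phrases by token overlap. It is not Rasa's implementation and uses none of its APIs; the intent names, example utterances, threshold, and the `match_intent` helper are all hypothetical, and Jaccard similarity stands in for the small trained classifiers the comment refers to.

```python
import re

# Hypothetical training examples: intent name -> sample utterances.
INTENT_EXAMPLES = {
    "greet": ["hello there", "hi", "good morning"],
    "check_weather": ["what is the weather today", "will it rain tomorrow"],
    "book_table": ["book a table for two", "reserve a table tonight"],
}


def tokenize(text):
    """Lowercase the text and return its set of word tokens."""
    return set(re.findall(r"[a-z']+", text.lower()))


def jaccard(a, b):
    """Jaccard similarity between two token sets (0.0 when both are empty)."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


def match_intent(utterance, threshold=0.2):
    """Return (intent, score) for the closest example, or ("fallback", score)."""
    tokens = tokenize(utterance)
    best_intent, best_score = None, 0.0
    for intent, examples in INTENT_EXAMPLES.items():
        for example in examples:
            score = jaccard(tokens, tokenize(example))
            if score > best_score:
                best_intent, best_score = intent, score
    if best_score < threshold:
        # Nothing is close enough; let the dialogue layer ask for clarification.
        return "fallback", best_score
    return best_intent, best_score


print(match_intent("will it rain today"))
print(match_intent("xyzzy"))
```

The explicit fallback branch reflects a common design choice in intent-based assistants: a low-confidence match is routed to a clarification step rather than to the nearest intent.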
-
Ciphey
⚡ Automatically decrypt encryptions without knowing the key or cipher, decode encodings, and crack hashes ⚡
-
Qwen
The official repo of Qwen (通义千问) chat & pretrained large language model proposed by Alibaba Cloud.
Project mention: Running Qwen, Nearly as Powerful as DeepSeek, on a MacBook Pro | dev.to | 2025-02-05
Qwen (Qwen GitHub Repository) has been gaining attention recently as a powerful open-source large language model (LLM). I decided to give it a spin on my MacBook Pro using Ollama, a platform designed for running local LLMs. While Qwen2.5-Max boasts the highest performance, my setup could only handle the smaller Qwen2.5 (32B) model. Here's what I found!
-
DocsGPT
Private AI platform for agents, assistants and enterprise search. Built-in Agent Builder, Deep research, Document analysis, Multi-model support, and API connectivity for agents.
Project mention: 15 AI tools that almost replace a full dev team but please don’t fire us yet | dev.to | 2025-05-03
DocsGPT: Lets users query your docs using GPT.
-
-
camel
🐫 CAMEL: The first and the best multi-agent framework. Finding the Scaling Law of Agents. https://www.camel-ai.org
It seems like you might be confusing "research programs" with things like "branding" and superficial terminology. Here, enjoy this thing clearly building on SoM and edited earlier this week: ideas https://github.com/camel-ai/camel/blob/master/camel/societie...
-
Project mention: What is the Most Effective AI Tool for App Development Today? | dev.to | 2025-08-17
At the core of many AI-powered applications are foundational models—large language models (LLMs) and APIs that provide the intelligence for features like natural language processing, image recognition, and decision-making. These tools serve as the brain of the app, processing inputs and generating outputs that feel intuitive and human-like.
-
flair
Project mention: WhisperNER: Unified Open Named Entity and Speech Recognition | news.ycombinator.com | 2024-11-21
only the last string is a LOC named entity. Of course you can change definitions from the standard ones if you like, but then you should be careful not to compare with tools that use the original standard definition of NER such as flairNLP [1].
[1] https://github.com/flairNLP/flair?tab=readme-ov-file
-
-
-
-
-
autogluon
Project mention: Gluon: a GPU programming language based on the same compiler stack as Triton | news.ycombinator.com | 2025-09-17
Amazon (+ Microsoft) already released a language for ML called gluon 8 years ago: https://aws.amazon.com/blogs/aws/introducing-gluon-a-new-lib...
autogluon is popular as well: https://github.com/autogluon/autogluon
Python Natural Language Processing related posts
-
Learning to Model the World with Language
-
Using “ibm-granite/granite-speech-3.3-8b” 🪨 for ASR
-
Strengthening Open-Source Integrity: My First Contribution to spaCy
-
5 Ways to Detect AI Agent Hallucinations
-
Updating ASR examples in Hugging Face Transformers Hub datasets, clearer args, smoother Windows setup
-
A Simple Guide to Keyword Clustering with spaCy
-
Wikipedia survives while the rest of the internet breaks
-
A note from our sponsor - InfluxDB
www.influxdata.com | 15 Nov 2025
Index
What are some of the best open-source Natural Language Processing projects in Python? This list will help you:
| # | Project | Stars |
|---|---|---|
| 1 | transformers | 152,508 |
| 2 | funNLP | 77,063 |
| 3 | crewAI | 40,292 |
| 4 | HanLP | 35,840 |
| 5 | Jieba | 34,551 |
| 6 | spaCy | 32,785 |
| 7 | d2l-en | 26,601 |
| 8 | Resume-Matcher | 23,901 |
| 9 | NLP-progress | 22,963 |
| 10 | datasets | 20,844 |
| 11 | rasa | 20,840 |
| 12 | Ciphey | 20,165 |
| 13 | Qwen | 19,710 |
| 14 | DocsGPT | 17,365 |
| 15 | gensim | 16,267 |
| 16 | camel | 14,781 |
| 17 | NLTK | 14,382 |
| 18 | flair | 14,324 |
| 19 | MOSS | 12,049 |
| 20 | LLMSurvey | 11,956 |
| 21 | ludwig | 11,616 |
| 22 | doccano | 10,381 |
| 23 | autogluon | 9,564 |