An NLP-powered system for programmatically analyzing, indexing, and tagging the Talmud Bavli page by page.
This project uses modern natural language processing techniques to analyze and tag the content of the Talmud Bavli (Babylonian Talmud).
It leverages both the original Hebrew/Aramaic text and English translations available through the Sefaria API.
See also the overview on my blog: "Mapping the Talmud: Scalable Natural Language Processing (NLP) for Named Entities, Topics, and Tags in the Talmudic Corpus" (May 4, 2025).
Includes extensive curated gazetteers:
- https://github.com/EzraBrand/talmud-nlp-indexer/blob/main/data/bible_names_gazetteer.txt
- https://github.com/EzraBrand/talmud-nlp-indexer/blob/main/data/bible_nations_gazetteer.txt
- https://github.com/EzraBrand/talmud-nlp-indexer/blob/main/data/bible_places_gazetteer.txt
- https://github.com/EzraBrand/talmud-nlp-indexer/blob/main/data/talmud_concepts_gazetteer.txt
- https://github.com/EzraBrand/talmud-nlp-indexer/blob/main/data/talmud_toponyms_gazetteer.txt
Current status:

- Tractate Berakhot (first 10 pages: 2a-7a)
- Developing core NLP pipelines for Talmudic text (`processor.py`)
- Fetching data from the Sefaria API (`api.py`)
- Creating a basic tagging system (`tagging.py`)
- Orchestrating the process (`main.py`); a minimal sketch of the flow follows this list
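The flow below is a minimal sketch of how these four modules might chain together. The function names (`fetch_page`, `process_page`, `generate_tags`) are illustrative assumptions, not the actual API of this repository:

```python
# Hypothetical orchestration flow; actual names in api.py, processor.py,
# and tagging.py may differ.
import json
from pathlib import Path

from api import fetch_page          # assumed: returns Hebrew + English text
from processor import process_page  # assumed: NER, noun phrases, embeddings
from tagging import generate_tags   # assumed: gazetteer + topic-model tags

def run(tractate: str, pages: list[str]) -> None:
    Path("data").mkdir(exist_ok=True)
    for page in pages:
        ref = f"{tractate}.{page}"                 # e.g. "Berakhot.2a"
        raw = fetch_page(ref)                      # fetch via the Sefaria API
        analysis = process_page(raw)               # bilingual NLP analysis
        analysis["tags"] = generate_tags(analysis)
        out = Path("data") / f"{ref}.json"         # results stored under data/
        out.write_text(json.dumps(analysis, ensure_ascii=False, indent=2),
                       encoding="utf-8")

if __name__ == "__main__":
    run("Berakhot", ["2a", "2b", "3a"])
```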
Key features:

- Automated fetching of Talmudic text via the Sefaria API (see the fetch sketch after this list)
- Bilingual processing (Hebrew using AlephBERT via `transformers`, English using spaCy); see the processing sketch below
- Text analysis including named entity recognition, noun phrase extraction (English), and embeddings (Hebrew)
- Basic topic modeling (using `scikit-learn`) and keyword/entity-based tag generation, including matching against the expanded Talmudic and Biblical gazetteers (names, places, concepts) and integration of topic modeling results. The tagging logic checks the name gazetteers before assigning place tags, to reduce misclassification (see the tagging sketch below).
- Storage of processed results as JSON files in the `data/` directory
- Generation of human-readable Markdown summaries in the `data/` directory, including tags, italicized words, and annotated text with prioritized gazetteer tags and exclusion of less relevant spaCy labels (e.g., ORG, WORK_OF_ART)
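As a concrete illustration of the fetching step, here is a minimal sketch against Sefaria's public texts endpoint. The `fetch_page` helper and the returned dictionary shape are assumptions for illustration; the `he`/`text` field names follow Sefaria's public texts API at the time of writing:

```python
import requests

def fetch_page(ref: str) -> dict:
    """Fetch one page (e.g. 'Berakhot.2a') from the Sefaria texts API.

    Per Sefaria's public texts endpoint, 'he' holds the Hebrew/Aramaic
    segments and 'text' holds the English translation.
    """
    url = f"https://www.sefaria.org/api/texts/{ref}"
    resp = requests.get(url, params={"context": 0}, timeout=30)
    resp.raise_for_status()
    data = resp.json()
    return {
        "ref": ref,
        "hebrew": data.get("he", []),
        "english": data.get("text", []),
    }
```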
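The bilingual processing step could look roughly like this: AlephBERT (the `onlplab/alephbert-base` checkpoint on the Hugging Face hub) for Hebrew embeddings, and spaCy's small English model for NER and noun phrases. The exact models and pooling strategy used by `processor.py` are assumptions here:

```python
import spacy
import torch
from transformers import AutoModel, AutoTokenizer

# AlephBERT checkpoint for Hebrew/Aramaic embeddings (assumed; processor.py
# may use a different model or pooling strategy).
tokenizer = AutoTokenizer.from_pretrained("onlplab/alephbert-base")
model = AutoModel.from_pretrained("onlplab/alephbert-base")

# spaCy English pipeline (requires: python -m spacy download en_core_web_sm)
nlp_en = spacy.load("en_core_web_sm")

def embed_hebrew(text: str) -> torch.Tensor:
    """Mean-pooled AlephBERT embedding for one Hebrew/Aramaic segment."""
    inputs = tokenizer(text, return_tensors="pt", truncation=True, max_length=512)
    with torch.no_grad():
        hidden = model(**inputs).last_hidden_state   # (1, seq_len, hidden)
    return hidden.mean(dim=1).squeeze(0)             # (hidden,)

def analyze_english(text: str) -> dict:
    """Named entities and noun phrases from the English translation."""
    doc = nlp_en(text)
    return {
        "entities": [(ent.text, ent.label_) for ent in doc.ents],
        "noun_phrases": [chunk.text for chunk in doc.noun_chunks],
    }
```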
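The "names before places" rule mentioned above could be implemented as below. The gazetteer file format (one entry per line) and the tag labels are assumptions for illustration:

```python
from pathlib import Path

def load_gazetteer(path: str) -> set[str]:
    """Assumes one gazetteer entry per line; blank lines are ignored."""
    lines = Path(path).read_text(encoding="utf-8").splitlines()
    return {line.strip() for line in lines if line.strip()}

names = load_gazetteer("data/bible_names_gazetteer.txt")
places = load_gazetteer("data/bible_places_gazetteer.txt")

def tag_term(term: str) -> str | None:
    # Check the name gazetteer first: a biblical name that is also a
    # toponym would otherwise be mis-tagged as a place.
    if term in names:
        return "person"
    if term in places:
        return "place"
    return None  # fall through to topic-model / keyword tags
```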
Setup:

- Clone the repository.
- Install dependencies:

  ```bash
  pip install -r requirements.txt
  ```
Execute the main script to process the default range (Berakhot 2a-7a):

```bash
python main.py
```

Results will be saved as JSON and Markdown files in the `data/` directory.
This project uses `pytest` for unit testing.

- Ensure development dependencies are installed (they are included in `requirements.txt`):

  ```bash
  pip install -r requirements.txt
  ```

- Run the tests from the root directory:

  ```bash
  python -m pytest
  ```
Tests for API interaction, text processing, and tagging logic are located in the `tests/` directory. Mocks are used to isolate components and avoid external dependencies (such as network calls or loading large models) during testing.
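For illustration, a test of the fetch step might stub the network layer like this. The module and function names mirror the hypothetical fetch sketch above, not necessarily the real tests:

```python
# tests/test_api.py -- illustrative sketch only
from unittest.mock import MagicMock, patch

from api import fetch_page  # assumed helper; see the fetch sketch above

@patch("api.requests.get")
def test_fetch_page_parses_sefaria_response(mock_get):
    # Stub the HTTP response so the test makes no network call.
    fake = MagicMock()
    fake.json.return_value = {"he": ["hebrew segment"], "text": ["english segment"]}
    fake.raise_for_status.return_value = None
    mock_get.return_value = fake

    result = fetch_page("Berakhot.2a")

    assert result["hebrew"] == ["hebrew segment"]
    assert result["english"] == ["english segment"]
```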