Worlde is an NLP-based text analysis tool that extracts and counts nouns, verbs, and adjectives from sentences, and can generate documents with each part of speech highlighted.
- 📊 POS Word Counting: Extract and count nouns, verbs, and adjectives from sentences
- 🎨 Highlighted Documents: Generate DOCX files with color-coded POS highlighting
- 🔄 Multiple spaCy Models: Choose from different model sizes for accuracy vs. speed
- 📝 Structured Excel Output: Alphabetically organized word counts by POS category
- 🎯 Lemmatization: Automatic word normalization (plural→singular, conjugations→infinitive)
- 🚫 Stop Word Filtering: Removes common words for cleaner results
- 💪 Type-Safe: Full type annotations with mypy validation
- 🎁 Functional Error Handling: Clean Result pattern for robust error management
The task is to divide all words (lexemes) into three main groups: nouns, verbs, and adjectives.
For nouns, singular and plural forms are considered one lexeme. For adjectives, all degrees of comparison are combined into one lexeme. For verbs, all tense, person, and aspect forms are counted as one lexeme.
The number of occurrences of each lexeme within its group must also be counted.
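
The grouping rule above can be sketched with the standard library alone. This is a minimal illustration, not the project's actual code: the `TAGGED` list, the `GROUPS` tuple, and the `count_lexemes` function are hypothetical names, and the tokens are hard-coded with lemmas and coarse POS tags of the kind a spaCy pipeline would produce, so the sketch runs without a model installed.

```python
from collections import Counter

# Hypothetical pre-tagged tokens: (lemma, coarse POS tag, is_stop_word).
# Hard-coded here for illustration; a tagger would normally supply them.
TAGGED = [
    ("cat", "NOUN", False),   # from "cats"
    ("cat", "NOUN", False),   # from "cat"
    ("run", "VERB", False),   # from "ran"
    ("run", "VERB", False),   # from "running"
    ("big", "ADJ", False),    # from "bigger"
    ("big", "ADJ", False),    # from "biggest"
    ("be", "AUX", True),      # from "is": not one of the three groups
    ("the", "DET", True),     # stop word, ignored
]

GROUPS = ("NOUN", "VERB", "ADJ")


def count_lexemes(tagged):
    """Return one Counter of lemma frequencies per POS group."""
    counts = {pos: Counter() for pos in GROUPS}
    for lemma, pos, is_stop in tagged:
        # Only the three target groups are counted; stop words are skipped.
        if pos in counts and not is_stop:
            counts[pos][lemma.lower()] += 1
    return counts


counts = count_lexemes(TAGGED)
```

Because counting is keyed on the lemma rather than the surface form, "cats" and "cat" land in the same entry, which is exactly the "one lexeme" requirement.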
Worlde supports multiple spaCy English models. Choose based on your needs:
| Model | Size | Speed | Accuracy | Use Case |
|---|---|---|---|---|
| spacy-sm | 12 MB | ⚡⚡⚡ Fast | Good | Quick analysis |
| spacy-md | 31 MB | ⚡⚡ Medium | Better | Balanced accuracy/speed |
| spacy-lg | 382 MB | ⚡ Slower | Best | High accuracy needed |
| spacy-trf | 436 MB | 🐌 Slowest | Excellent | Maximum accuracy |
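
One way a short model prefix such as `lg` could be turned into a loadable model name is a plain lookup table. The sketch below is an assumption about how such a mapping might look, not Worlde's actual implementation: `SPACY_MODELS` and `resolve_model` are hypothetical names, and the values are the standard spaCy English pipeline package names.

```python
# Hypothetical mapping from CLI prefix to spaCy English pipeline package.
SPACY_MODELS = {
    "sm": "en_core_web_sm",
    "md": "en_core_web_md",
    "lg": "en_core_web_lg",
    "trf": "en_core_web_trf",
}


def resolve_model(prefix: str) -> str:
    """Return the package name for a prefix, or raise on unknown input."""
    try:
        return SPACY_MODELS[prefix]
    except KeyError:
        known = ", ".join(sorted(SPACY_MODELS))
        raise ValueError(f"unknown model prefix {prefix!r}; expected one of: {known}") from None


model_name = resolve_model("lg")
```

Failing early on an unknown prefix gives a clearer message than letting the model loader fail later on a missing package.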
- Python 3.13+
- uv package manager
- Clone the repository.
- Install dependencies using uv and specify the spaCy model (from the table above):
uv sync --extra <spacy_model>
To switch to a different model, run the same command again with the new model specified.
Worlde provides two main commands: count for word counting and highlight for document highlighting.
Counts and exports POS-tagged words to Excel.
Basic usage:
uv run python -m app.main count data.xlsx
With options:
# Specify sheet name
uv run python -m app.main count data.xlsx --sheet-name sentences
# Use a different spaCy model
uv run python -m app.main count data.xlsx --model-prefix lg
# Custom output sheet name
uv run python -m app.main count data.xlsx --output-sheet analysis
Output: Creates a formatted Excel sheet with alphabetically organized word counts.
Example output:
| | Nouns | | Verbs | | Adjectives | |
|---|-------------|--------|-----------|--------|-------------|--------|
| A | adjective | 1 | | | | |
| n | | | analyze | 1 | | |
| C | count | 1 | | | | |
| D | document | 1 | | | | |
| E | | | | | english | 1 |
| x | | | explore | 1 | | |
| F | file | 1 | | | | |
| G | | | generate | 1 | | |
| l | glance | 1 | | | | |
| H | | | highlight | 1 | | |
| L | language | 1 | | | | |
| N | noun | 1 | | | | |
| P | pattern | 1 | | | | |
| e | | | | | perfect | 1 |
| R | research | 1 | | | | |
| S | sentence | 1 | | | | |
| p | speech | 1 | | | | |
| T | teaching | 1 | | | | |
| V | verb | 1 | | | | |
Data Structure:
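- Column 1: Letter navigation (the uppercase first letter on the first row for each initial; the lowercase second letter of the word on later rows that share that initial, as in the example above)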
- Columns 2-3: Noun word and count
- Columns 4-5: Verb word and count
- Columns 6-7: Adjective word and count
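
The letter-navigation column can be reproduced with a short function. The rule used here is inferred from the example output above rather than taken from the project's source, and `navigation_labels` is a hypothetical name: a word gets its uppercase first letter when it starts a new initial, otherwise its lowercase second letter.

```python
def navigation_labels(words):
    """Return one navigation label per word, in alphabetical order."""
    labels = []
    previous_initial = None
    for word in sorted(words):
        initial = word[0].upper()
        if initial != previous_initial:
            # First word for this initial: show the uppercase first letter.
            labels.append(initial)
        else:
            # Same initial as the previous row: show the second letter.
            labels.append(word[1].lower() if len(word) > 1 else "")
        previous_initial = initial
    return labels


labels = navigation_labels(["generate", "glance", "highlight", "adjective", "analyze"])
```

With these five words the sorted order is adjective, analyze, generate, glance, highlight, so the labels come out as `A`, `n`, `G`, `l`, `H`, matching the corresponding rows of the example table.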
Generate a DOCX file with POS-highlighted text.
Basic usage:
uv run python -m app.main highlight data.xlsx output.docx
With options:
# Specify sheet and model
uv run python -m app.main highlight data.xlsx highlighted.docx --sheet-name sentences --model-prefix md
Output: DOCX file with:
- One paragraph per sentence
- Color-coded background highlighting:
  - 🔵 Turquoise: Nouns
  - 🟡 Yellow: Verbs
  - 🟣 Pink: Adjectives
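
The color scheme above amounts to a lookup from POS tag to highlight color, applied token by token. The following sketch shows that planning step only and does not write a DOCX file; `HIGHLIGHT_BY_POS` and `plan_runs` are hypothetical names, and the color strings are chosen to match the Turquoise, Yellow, and Pink highlight names listed above.

```python
# Hypothetical mapping from coarse POS tag to highlight color name.
HIGHLIGHT_BY_POS = {
    "NOUN": "TURQUOISE",
    "VERB": "YELLOW",
    "ADJ": "PINK",
}


def plan_runs(tagged_tokens):
    """Map (text, pos) pairs to (text, color) pairs; color is None if unhighlighted."""
    return [(text, HIGHLIGHT_BY_POS.get(pos)) for text, pos in tagged_tokens]


runs = plan_runs([
    ("The", "DET"),
    ("quick", "ADJ"),
    ("fox", "NOUN"),
    ("jumps", "VERB"),
])
```

Each resulting pair could then become one run in a paragraph, with the highlight applied only when the color is not `None`.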
worlde/
├── app/
│   ├── main.py          # CLI entry point with typer commands
│   ├── reader.py        # Excel file reading with Result pattern
│   ├── pos_counter.py   # POS word counting and lemmatization
│   ├── tokenizer.py     # Sentence tokenization with POS tagging
│   ├── docx_writer.py   # DOCX generation with highlighting
│   └── writer.py        # Excel output formatting
├── docs/
│   └── pitch.png        # Project banner
├── pyproject.toml       # Project configuration and dependencies
└── README.md
Run linters:
# Check code style
uv run ruff check app/
# Type checking
uv run mypy app/
# Auto-format code
uv run ruff format app/
- Result Pattern: All modules return Result[T, str] for clean error handling
- DataFrame-Based: Uses pandas DataFrames for efficient data manipulation
- Type-Safe: Full type annotations verified by mypy
- Functional: Minimal side effects, pure functions where possible
- Modular: Clear separation of concerns (reading, processing, writing)
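
For readers unfamiliar with the Result pattern, here is a minimal sketch of what a `Result[T, str]` return type can look like. It is an illustration under assumptions, not the project's own definition: the `Ok` and `Err` classes and the `parse_count` and `describe` functions are hypothetical, and the project may define or import its Result type differently.

```python
from dataclasses import dataclass
from typing import Generic, TypeVar, Union

T = TypeVar("T")
E = TypeVar("E")


@dataclass(frozen=True)
class Ok(Generic[T]):
    """Successful outcome carrying a value."""
    value: T


@dataclass(frozen=True)
class Err(Generic[E]):
    """Failed outcome carrying an error description."""
    error: E


# A Result is either an Ok with a value or an Err with an error.
Result = Union[Ok[T], Err[E]]


def parse_count(raw: str) -> Result[int, str]:
    """Parse an integer, returning an Err instead of raising."""
    try:
        return Ok(int(raw))
    except ValueError:
        return Err(f"not an integer: {raw!r}")


def describe(result: Result[int, str]) -> str:
    """Handle both outcomes explicitly."""
    if isinstance(result, Ok):
        return f"count = {result.value}"
    return f"error: {result.error}"


good = describe(parse_count("3"))
bad = describe(parse_count("three"))
```

Returning errors as values keeps failure paths visible in the type signature, so a type checker such as mypy can flag callers that forget to handle the `Err` case.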
MIT License
Contributions are welcome! Please ensure all code passes linting and type checks before submitting.
uv run ruff check app/ && uv run mypy app/