Convert documents into structured JSON effortlessly.
A Python library for extracting text from various document formats and structuring it hierarchically into JSON.
- Extract text from PDFs, DOCX, TXT, RTF, ODT, MD, and images.
- OCR support for scanned documents and images.
- Flexible configuration using regex patterns and field mapping.
- Nested hierarchical structure output in JSON.
- Explicit leaf-level control using `is_leaf=True`.
- Built-in validations to catch config mistakes (regex, hierarchy, field conflicts).
- Comprehensive pytest suite with coverage reporting.
pip install doc23

To enable OCR:

sudo apt install tesseract-ocr
pip install pytesseract

from doc23 import extract_text

# Extract text from any supported document
text = extract_text("document.pdf", scan_or_image="auto")
print(text)

from doc23 import Doc23, Config, LevelConfig

config = Config(
    root_name="art_of_war",
    sections_field="chapters",
    description_field="description",
    levels={
        "chapter": LevelConfig(
            pattern=r"^CHAPTER\s+([IVXLCDM]+)\n(.+)$",
            name="chapter",
            title_field="title",
            description_field="description",
            sections_field="paragraphs"
        ),
        "paragraph": LevelConfig(
            pattern=r"^(\d+)\.\s+(.+)$",
            name="paragraph",
            title_field="number",
            description_field="text",
            is_leaf=True
        )
    }
)

with open("art_of_war.txt") as f:
    text = f.read()

doc = Doc23(text, config)
structure = doc.prune()
print(structure["chapters"][0]["title"])  # → I

The resulting structure:

{
  "description": "",
  "chapters": [
    {
      "type": "chapter",
      "title": "I",
      "description": "Laying Plans",
      "paragraphs": [
        {
          "type": "paragraph",
          "number": "1",
          "text": "Sun Tzu said: The art of war is of vital importance to the State."
        }
      ]
    }
  ]
}

Use `Config` and `LevelConfig` to define how your document is parsed:

| Field | Purpose |
|---|---|
| `pattern` | Regex to match each level |
| `title_field` | Field to assign the first regex group |
| `description_field` | (Optional) Field for second group |
| `sections_field` | (Optional) Where sublevels go |
| `paragraph_field` | (Optional) Where text/nodes go if leaf |
| `is_leaf` | (Optional) Forces this level to be terminal |

| Fields Defined | Required Groups in Regex |
|---|---|
| `title_field` only | ≥1 |
| `title_field` + `description_field` | ≥2 |
| `title_field` + `paragraph_field` | ≥1 (second group optional) |

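
The group requirements in the table above can be checked before a config is built. The sketch below uses only the standard `re` module. `required_groups` and `check_pattern` are hypothetical helpers written for illustration, not part of doc23's API, and they mirror the table rather than the library's internal validation.

```python
import re
from typing import Optional


def required_groups(title_field: str, description_field: Optional[str] = None) -> int:
    """Minimum number of capture groups implied by the configured fields."""
    # Assumption taken from the table: title + description needs two groups,
    # every other combination needs at least one.
    if title_field and description_field:
        return 2
    return 1


def check_pattern(pattern: str, title_field: str,
                  description_field: Optional[str] = None) -> None:
    """Raise ValueError if the pattern has fewer capture groups than needed."""
    available = re.compile(pattern).groups  # number of capturing groups
    needed = required_groups(title_field, description_field)
    if available < needed:
        raise ValueError(
            f"pattern {pattern!r} has {available} capture group(s), "
            f"needs at least {needed}"
        )


# Two groups for two fields: passes silently.
check_pattern(r"^(\d+)\.\s+(.+)$", "number", "text")
```

Running a check like this while drafting patterns surfaces a missing group early, before the text is parsed.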
doc23 consists of several key components:
Doc23 (core.py)
├── Extractors (extractors/)
│   ├── PDFExtractor
│   ├── DocxExtractor
│   ├── TextExtractor
│   └── ...
├── Config (config_tree.py)
│   └── LevelConfig
└── Gardener (gardener.py)

- Doc23: Main entry point, handles file detection and orchestration
- Extractors: Convert various document types to plain text
- Config: Defines how to structure the document hierarchy
- Gardener: Parses text and builds the JSON structure
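
To make the Gardener's role concrete, here is a minimal sketch of regex-driven hierarchical parsing using only the standard library. It is not doc23's implementation: `parse_outline` is a hypothetical function, the two patterns are borrowed from the test examples in this README, and the rule that unmatched lines become paragraphs of the current leaf node is an assumption made purely for illustration.

```python
import re

# Patterns borrowed from the README's test examples; field names are illustrative.
BOOK = re.compile(r"^BOOK\s+(.+)$")
ARTICLE = re.compile(r"^ARTICLE\s+(\d+)\.\s*(.*)$")


def parse_outline(text: str) -> dict:
    """Build a nested dict: root -> books -> articles -> paragraphs."""
    root = {"description": "", "sections": []}
    book = None
    article = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if (m := BOOK.match(line)):
            # A new top-level node closes any open article.
            book = {"type": "book", "title": m.group(1),
                    "description": "", "sections": []}
            root["sections"].append(book)
            article = None
        elif (m := ARTICLE.match(line)) and book is not None:
            article = {"type": "article", "title": m.group(1),
                       "content": m.group(2), "paragraphs": []}
            book["sections"].append(article)
        elif article is not None:
            article["paragraphs"].append(line)  # free text under a leaf node
        elif book is not None:
            book["description"] = (book["description"] + " " + line).strip()
        else:
            root["description"] = (root["description"] + " " + line).strip()
    return root


sample = """BOOK First Book
This is a description
ARTICLE 1. First article
This is article content
More content"""

tree = parse_outline(sample)
print(tree["sections"][0]["sections"][0]["paragraphs"])
# ['This is article content', 'More content']
```

The sketch hard-codes two levels to stay short; a config-driven parser would loop over the configured levels instead.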
The library validates your config when creating Doc23:
- Ensures all parents exist.
- Detects circular relationships.
- Checks field name reuse.
- Verifies group counts match pattern.
If any issue is found, a ValueError will be raised immediately.
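
The first two checks can be illustrated with a small standalone sketch. It assumes each level names its parent in a plain dict (`parents`, a hypothetical stand-in for the `parent` argument of `LevelConfig`) and raises `ValueError` as described above. doc23's own validation may be organized differently.

```python
from typing import Dict, Optional


def validate_parents(parents: Dict[str, Optional[str]]) -> None:
    """Check that every parent exists and that no level is its own ancestor."""
    # Check 1: every referenced parent must be a defined level.
    for level, parent in parents.items():
        if parent is not None and parent not in parents:
            raise ValueError(
                f"level {level!r} refers to unknown parent {parent!r}"
            )
    # Check 2: walking up from any level must reach the root, not loop.
    for start in parents:
        seen = {start}
        current = parents[start]
        while current is not None:
            if current in seen:
                raise ValueError(
                    f"circular relationship involving {current!r}"
                )
            seen.add(current)
            current = parents[current]


# A valid hierarchy: no exception is raised.
validate_parents({"book": None, "article": "book"})
```

Failing fast at construction time means a broken hierarchy is reported before any document text is processed.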
The library includes a comprehensive test suite covering various scenarios:
def test_gardener_initialization():
    config = Config(
        root_name="document",
        sections_field="sections",
        description_field="description",
        levels={
            "book": LevelConfig(
                pattern=r"^BOOK\s+(.+)$",
                name="book",
                title_field="title",
                description_field="description",
                sections_field="sections"
            ),
            "article": LevelConfig(
                pattern=r"^ARTICLE\s+(\d+)\.\s*(.*)$",
                name="article",
                title_field="title",
                description_field="content",
                paragraph_field="paragraphs",
                parent="book"
            )
        }
    )
    gardener = Gardener(config)
    assert gardener.leaf == "article"


def test_prune_basic_structure():
    config = Config(
        root_name="document",
        sections_field="sections",
        description_field="description",
        levels={
            "book": LevelConfig(
                pattern=r"^BOOK\s+(.+)$",
                name="book",
                title_field="title",
                description_field="description",
                sections_field="sections"
            ),
            "article": LevelConfig(
                pattern=r"^ARTICLE\s+(\d+)\.\s*(.*)$",
                name="article",
                title_field="title",
                description_field="content",
                paragraph_field="paragraphs",
                parent="book"
            )
        }
    )
    gardener = Gardener(config)
    text = """BOOK First Book
This is a description
ARTICLE 1. First article
This is article content
More content"""
    result = gardener.prune(text)
    assert result["sections"][0]["title"] == "First Book"
    assert result["sections"][0]["sections"][0]["paragraphs"] == ["This is article content", "More content"]


def test_prune_empty_document():
    config = Config(
        root_name="document",
        sections_field="sections",
        description_field="description",
        levels={}
    )
    gardener = Gardener(config)
    result = gardener.prune("")
    assert result["sections"] == []


def test_prune_with_free_text():
    config = Config(
        root_name="document",
        sections_field="sections",
        description_field="description",
        levels={
            "title": LevelConfig(
                pattern=r"^TITLE\s+(.+)$",
                name="title",
                title_field="title",
                description_field="description",
                sections_field="sections"
            )
        }
    )
    gardener = Gardener(config)
    text = """This is free text at the top level
TITLE First Title
Title description"""
    result = gardener.prune(text)
    assert result["description"] == "This is free text at the top level"

Run tests with:

python -m pytest tests/

If OCR fails, make sure Tesseract is installed and accessible in your PATH.
Different document formats may require specific libraries. Check your dependencies:
- PDF: pdfplumber, pdf2image
- DOCX: docx2txt
- ODT: odf
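
A quick way to see which of these are available is to probe for them from Python. The sketch below uses only the standard library. The module names follow the list above, plus `pytesseract` from the installation step; `check_environment` is a hypothetical helper, and whether doc23 needs every one of these for your formats is an assumption to verify against its own requirements.

```python
import importlib.util
import shutil
from typing import Dict, Iterable

# Import names taken from the dependency list in this README.
OPTIONAL_MODULES = ("pdfplumber", "pdf2image", "docx2txt", "odf", "pytesseract")


def check_environment(modules: Iterable[str] = OPTIONAL_MODULES) -> Dict[str, bool]:
    """Report which modules can be found and whether tesseract is on PATH."""
    report = {
        name: importlib.util.find_spec(name) is not None for name in modules
    }
    # The OCR engine is an external binary, not a Python package.
    report["tesseract (binary)"] = shutil.which("tesseract") is not None
    return report


for name, ok in check_environment().items():
    print(f"{name:20} {'found' if ok else 'MISSING'}")
```

`find_spec` only locates a module without importing it, so the check stays fast and has no side effects.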
Test your patterns with tools like regex101.com and ensure you have the correct number of capture groups.
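
Patterns can also be tried locally before they go into a config. This sketch runs the two patterns from the Art of War example against a short sample with the standard `re` module. It assumes patterns are applied with `re.MULTILINE`, so that `^` and `$` anchor at line boundaries, which may not match how doc23 compiles them. `preview_matches` is a hypothetical helper.

```python
import re
from typing import List, Tuple


def preview_matches(pattern: str, text: str) -> List[Tuple[str, ...]]:
    """Return the captured groups of every match in the text."""
    return [m.groups() for m in re.finditer(pattern, text, flags=re.MULTILINE)]


sample = (
    "CHAPTER I\n"
    "Laying Plans\n"
    "1. Sun Tzu said: The art of war is of vital importance to the State.\n"
)

# Each tuple shows what would land in the first and second field.
print(preview_matches(r"^CHAPTER\s+([IVXLCDM]+)\n(.+)$", sample))
print(preview_matches(r"^(\d+)\.\s+(.+)$", sample))
```

An empty list means the pattern never matched, and a tuple shorter than expected means a capture group is missing.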
- Python 3.8+
- Tested on Linux, macOS, and Windows
Contributions are welcome! Please follow these steps:
- Fork the repository
- Create a feature branch (`git checkout -b feature/amazing-feature`)
- Commit your changes (`git commit -m 'Add some amazing feature'`)
- Push to the branch (`git push origin feature/amazing-feature`)
- Open a Pull Request
License: MIT
For advanced patterns, dynamic configs, exception handling and OCR examples, see: