A Python tool for extracting table of contents from EPUB files with hierarchical structure support.
- Support for multiple extraction methods (NCX, epub_meta, OPF)
- Automatic selection of the best method
- Hierarchical TOC structure preservation
- Russian and English language support
- JSON output format
- Detailed logging
- EPUB file analysis reports
Install with pip:
pip install epub_toc
Or install from source:
git clone https://github.com/almazilaletdinov/epub_toc.git
cd epub_toc
pip install -e .
Verify the installation:
python -c "import epub_toc; print(epub_toc.__version__)"
All dependencies are installed automatically by pip:
- epub_meta>=0.0.7
- lxml>=4.9.3
- beautifulsoup4>=4.12.2
- ebooklib>=0.18
- tika>=2.6
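If you want to confirm which of these packages are present in the current environment, a small check like the following may help. It is only a sketch: it assumes Python 3.8+ (for `importlib.metadata`), the `installed_versions` helper is hypothetical, and it reports installed versions without comparing them against the minimums listed above.

```python
from importlib.metadata import PackageNotFoundError, version

# Distribution names taken from the dependency list above
DEPENDENCIES = ["epub_meta", "lxml", "beautifulsoup4", "ebooklib", "tika"]


def installed_versions(names=DEPENDENCIES):
    """Map each distribution name to its installed version, or None if missing."""
    found = {}
    for name in names:
        try:
            found[name] = version(name)
        except PackageNotFoundError:
            found[name] = None
    return found


for name, ver in installed_versions().items():
    print(f"{name}: {ver or 'not installed'}")
```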
For development, additional dependencies can be installed with:
pip install -e ".[dev]"
Usage from Python:
from epub_toc import EPUBTOCParser
# Create parser
parser = EPUBTOCParser('path/to/book.epub')
# Extract TOC
toc = parser.extract_toc()
# Print to console
parser.print_toc()
# Save to JSON
parser.save_toc_to_json('output.json')
Usage from the command line:
epub-toc path/to/book.epub
To analyze all EPUB files in the tests/data/epub_samples directory:
python tests/integration/test_epub_analysis.py
Analysis results are saved in the reports/ directory:
- epub_analysis_YYYYMMDD_HHMMSS.json - detailed report in JSON format
- epub_analysis_YYYYMMDD_HHMMSS.txt - brief report in text format
- toc/*.json - extracted TOCs for each EPUB file
Report structure:
- JSON report contains:
  - Overall statistics for all files
  - Extraction methods success rate
  - Detailed results for each file
  - Links to extracted TOC files
- Text report includes:
  - Brief statistics
  - Information about each file
  - Paths to extracted TOCs
- TOC files:
  - Saved in the toc/ subdirectory
  - Named book_name_toc.json
  - Contain the complete TOC in JSON format
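Because report names carry a timestamp, the most recent report can be located by parsing the file name. The sketch below assumes the `epub_analysis_YYYYMMDD_HHMMSS.json` naming described above; the `latest_report` helper is hypothetical and not part of the package.

```python
import re
from datetime import datetime
from pathlib import Path

# Matches names like epub_analysis_20240101_120000.json
REPORT_NAME = re.compile(r"^epub_analysis_(\d{8}_\d{6})\.json$")


def latest_report(reports_dir):
    """Return the Path of the newest JSON report, or None if none is found."""
    candidates = []
    for path in Path(reports_dir).glob("epub_analysis_*.json"):
        match = REPORT_NAME.match(path.name)
        if not match:
            continue  # name does not follow the timestamp pattern
        try:
            stamp = datetime.strptime(match.group(1), "%Y%m%d_%H%M%S")
        except ValueError:
            continue  # digits present but not a valid date/time
        candidates.append((stamp, path))
    if not candidates:
        return None
    return max(candidates, key=lambda item: item[0])[1]
```

For example, `latest_report("reports")` would return the newest report in the reports/ directory, or `None` if the analysis has not been run yet.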
TOC is saved in JSON format with the following structure:
{
"metadata": {
"title": "Book Title",
"authors": ["Author 1", "Author 2"],
"publisher": "Publisher Name",
"publication_date": "2024-01-01",
"language": "en",
"description": "Book description",
"cover_image_path": "path/to/cover.jpg",
"isbn": "978-3-16-148410-0",
"rights": "Copyright information",
"series": "Series Name",
"series_index": 1,
"identifiers": {
"isbn13": "978-3-16-148410-0",
"uuid": "550e8400-e29b-41d4-a716-446655440000"
},
"subjects": ["Fiction", "Adventure"],
"file_size": 1234567,
"file_name": "book.epub"
},
"toc": [
{
"title": "Chapter 1",
"href": "chapter1.html",
"level": 0,
"children": [
{
"title": "Section 1.1",
"href": "chapter1.html#section1",
"level": 1,
"children": []
}
]
}
]
}
All metadata fields are optional and will be omitted if not available in the EPUB file.
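As an illustration of how a consumer might read this format, the sketch below walks the nested `children` lists recursively and prints an indented outline. It relies only on the field names shown in the structure above; the `walk_toc` helper and the inline sample are made up for this example.

```python
import json


def walk_toc(entries, depth=0):
    """Yield (depth, title, href) for every entry, parents before children."""
    for entry in entries:
        yield depth, entry["title"], entry.get("href", "")
        yield from walk_toc(entry.get("children", []), depth + 1)


# Inline sample that mirrors the documented structure
sample = json.loads("""
{
  "metadata": {"title": "Book Title"},
  "toc": [
    {"title": "Chapter 1", "href": "chapter1.html", "level": 0,
     "children": [
       {"title": "Section 1.1", "href": "chapter1.html#section1",
        "level": 1, "children": []}
     ]}
  ]
}
""")

# Metadata fields are optional, so read them defensively with .get()
print(sample.get("metadata", {}).get("title", "(untitled)"))
for depth, title, href in walk_toc(sample["toc"]):
    print("  " * depth + f"{title} -> {href}")
```

To use it on a real file, replace the inline sample with `json.load()` on a file written by `save_toc_to_json`.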
The module includes comprehensive test suites.
To run all tests with detailed reporting:
./run_tests.sh
This will run:
- Installation tests (pip install/uninstall)
- Integration tests (EPUB parsing, TOC extraction)
- Unit tests (core functionality)
The test suite includes:
- Installation verification
- Package dependencies
- Russian books (NCX method)
- English books (epub_meta method)
- Files with different TOC structures
- Files of different sizes (from 400KB to 8MB)
Test results and coverage reports are generated in:
- coverage.xml - Code coverage report
- reports/ - Test execution reports
- tests/data/epub_toc_json/ - Generated TOC files
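To pull a single number out of the coverage report, something like the following could work. It is a sketch that assumes coverage.xml is the Cobertura-style file written by coverage.py, where the root `coverage` element carries a `line-rate` attribute; the `line_coverage_percent` helper and the sample XML are hypothetical.

```python
import xml.etree.ElementTree as ET


def line_coverage_percent(xml_text):
    """Return overall line coverage as a percentage from a Cobertura-style report."""
    root = ET.fromstring(xml_text)
    # line-rate is a fraction between 0 and 1 on the root element
    return float(root.attrib["line-rate"]) * 100


# Minimal stand-in for the contents of coverage.xml (assumed layout)
sample_xml = '<coverage line-rate="0.75" branch-rate="0"><packages/></coverage>'
print(f"{line_coverage_percent(sample_xml):.1f}% of lines covered")
```

For the real report, read the file first, for example with `Path("coverage.xml").read_text()` from `pathlib`, and pass the text to the function.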
Requirements:
- Python 3.7+
- epub_meta>=0.0.7
- lxml>=4.9.3
- beautifulsoup4>=4.12.2
We welcome contributions! If you'd like to help:
- Fork the repository
- Create a branch for your changes
- Make changes and add tests
- Ensure all tests pass
- Create a Pull Request
See CONTRIBUTING.md for details.
If you discover a security vulnerability, please DO NOT create a public issue. Instead, send a report following the instructions in SECURITY.md.
This project is licensed under the MIT License. See the LICENSE file for details.
Planned improvements:
- Additional EPUB format support
- Improved complex hierarchical structure handling
- Integration with popular e-readers
- Web service API
- Additional language support