Python package that spits out text from your document files!

fsecada01/TextSpitter

TEXTSPITTER.GIT

Transforming documents into insights, effortlessly and efficiently.

Built with the tools and technologies:

TOML, Pytest, Python, GitHub Actions, uv


Overview

TextSpitter is a Python package that extracts text from document files. It gives developers one consistent way to handle plain text, CSV, DOCX, and PDF formats.

Why TextSpitter?

This project streamlines the way developers interact with documents, ensuring a robust and efficient development experience. The core features include:

  • 📦 Robust Dependency Management: Ensures a stable development environment with essential libraries for seamless functionality.
  • 📄 File Extraction Capabilities: Standardizes handling of text, CSV, DOCX, and PDF files for smooth integration.
  • 🛠️ Enhanced Logging: Utilizes loguru for sophisticated error tracking, improving debugging and maintenance.
  • 🚀 Automated Publishing: Streamlines the release process with GitHub Actions for continuous delivery.
  • 🖥️ Code Quality Tools: Integrates black and ruff for consistent code formatting and linting.

Features

Component Details
⚙️ Architecture
  • Modular design for text processing
  • Utilizes a pipeline approach for data flow
🔩 Code Quality
  • Adheres to PEP 8 style guidelines
  • Includes type hints for better readability
📄 Documentation
  • Basic README file present
  • Inline comments for complex functions
🔌 Integrations
  • CI/CD with GitHub Actions
  • Package management via pip
🧩 Modularity
  • Core functionalities separated into modules
  • Reusable components for text manipulation
🧪 Testing
  • Unit tests using pytest
  • Mocking capabilities with pytest-mock
⚡️ Performance
  • Efficient handling of large text files
  • Optimized algorithms for text parsing
🛡️ Security
  • Input validation to prevent injection attacks
  • Dependencies regularly updated for security patches
📦 Dependencies
  • Core libraries: pymupdf, pypdf, lxml, python-docx, loguru
  • Development tools: pytest, black, ruff
🚀 Scalability
  • Designed to handle increasing text data volumes
  • Supports multi-threading for concurrent processing

---

## Project Structure

```sh
└── TextSpitter.git/
    ├── .github
    │   └── workflows
    ├── _config.yml
    ├── core_requirements.in
    ├── core_requirements.txt
    ├── dev_requirements.in
    ├── dev_requirements.txt
    ├── LICENSE
    ├── pyproject.toml
    ├── readme-ai.md
    ├── README.md
    ├── requirements.txt
    ├── setup_py.backup
    ├── TextSpitter
    │   ├── __init__.py
    │   ├── core.py
    │   ├── logger.py
    │   └── main.py
    └── uv.lock
```

Project Index

TEXTSPITTER.GIT/

⦿ __root__

| File Name | Summary |
| --- | --- |
| core_requirements.in | Declares the core dependencies: loguru for logging, PyMuPDF and pypdf for PDF handling, python-docx for Word documents, and pytest for testing. |
| core_requirements.txt | Lists the core dependencies with their versions so that logging, document processing, and testing packages install consistently across environments. |
| dev_requirements.in | Declares the development dependencies. It references the core requirements and adds black for code formatting and ruff for linting. |
| dev_requirements.txt | Generated list of development packages and their versions, giving contributors a consistent environment and reducing setup issues. |
| LICENSE | MIT License. Permits free use, modification, and distribution of the software and limits liability for the authors. |
| pyproject.toml | Project metadata, dependencies, and development-tool settings, including the linting, formatting, and packaging configuration for TextSpitter. |
| requirements.txt | Lists the required libraries and their versions, including lxml, pymupdf, pypdf2, and python-docx, which support document processing. |
| _config.yml | Configures the Jekyll site for the project to use the Cayman theme. |

⦿ TextSpitter

| File Name | Summary |
| --- | --- |
| core.py | Defines FileExtractor, which extracts content from text, CSV, DOCX, and PDF files. It provides methods to read and decode file contents and handles different input types. |
| logger.py | Logging module built on loguru. It replaces basic print statements with organized error capture to support debugging and maintenance. |
| main.py | Defines WordLoader, which loads files through FileExtractor and chooses the extraction method from the file extension and MIME type. |

⦿ .github/workflows

| File Name | Summary |
| --- | --- |
| python-publish.yml | GitHub Actions workflow that builds the Python package and uploads it to a package registry when a release is created. |
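
The main.py summary describes choosing an extraction method from the file extension and MIME type. The sketch below illustrates that general dispatch pattern with the standard library only. It is not TextSpitter's implementation: `read_plain_text`, `READERS_BY_SUFFIX`, `READERS_BY_MIME_PREFIX`, `pick_reader`, and `load_text` are hypothetical names chosen for illustration, and DOCX or PDF readers are left out.

```python
import mimetypes
from pathlib import Path
from typing import Callable, Dict, Optional

Reader = Callable[[Path], str]


def read_plain_text(path: Path) -> str:
    """Read a text-like file, replacing bytes that are not valid UTF-8."""
    return path.read_bytes().decode("utf-8", errors="replace")


# Hypothetical registry keyed on the lowercase file suffix.
# Readers for .docx and .pdf would be registered here in the same way.
READERS_BY_SUFFIX: Dict[str, Reader] = {
    ".txt": read_plain_text,
    ".csv": read_plain_text,
}

# Fallback registry keyed on a MIME type prefix, used when the suffix is unknown.
READERS_BY_MIME_PREFIX: Dict[str, Reader] = {
    "text/": read_plain_text,
}


def pick_reader(path: Path) -> Optional[Reader]:
    """Return a reader for the path, trying the suffix first and then the MIME type."""
    suffix = path.suffix.lower()
    if suffix in READERS_BY_SUFFIX:
        return READERS_BY_SUFFIX[suffix]
    mime_type, _ = mimetypes.guess_type(path.name)
    if mime_type:
        for prefix, reader in READERS_BY_MIME_PREFIX.items():
            if mime_type.startswith(prefix):
                return reader
    return None


def load_text(path: Path) -> str:
    """Extract text from the file, or raise if no reader matches."""
    reader = pick_reader(path)
    if reader is None:
        raise ValueError(f"Unsupported file type: {path.name}")
    return reader(path)
```

Checking the suffix before the MIME type keeps the common cases explicit, while the MIME fallback covers text formats that were never registered by suffix.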

Getting Started

Prerequisites

This project requires the following dependencies:

  • Programming Language: Python
  • Package Manager: pip or uv

Installation

Build TextSpitter from source and install the dependencies:

  1. Clone the repository:

    git clone https://github.com/fsecada01/TextSpitter.git
  2. Navigate to the project directory:

    cd TextSpitter
  3. Install the dependencies:

    pip install -r core_requirements.txt -r dev_requirements.txt

    Using uv:

    uv sync --all-extras --dev
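
After installing, a quick check that the main libraries can be imported may save debugging time later. This is a minimal sketch, not part of the project. The import names are assumptions based on the packages named in the requirement summaries: `fitz` for PyMuPDF, `docx` for python-docx, plus `lxml` and `loguru`. The helper name `missing_modules` is hypothetical.

```python
from importlib.util import find_spec
from typing import Iterable, List

# Assumed top-level import names for the documented dependencies.
EXPECTED_MODULES = ["loguru", "docx", "lxml", "fitz"]


def missing_modules(modules: Iterable[str]) -> List[str]:
    """Return the top-level module names that cannot be found in this environment."""
    return [name for name in modules if find_spec(name) is None]


if __name__ == "__main__":
    missing = missing_modules(EXPECTED_MODULES)
    if missing:
        print("Missing modules:", ", ".join(missing))
    else:
        print("All expected modules are importable.")
```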

Usage

Run the project with:

Using pip:

python {entrypoint}

Using uv:

uv run python {entrypoint}

Testing

TextSpitter uses the pytest test framework. Run the test suite with:

Using pip:

pytest

Using uv:

uv run pytest tests/
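
For contributors adding tests, the sketch below shows the shape of a pytest test for a text-extraction function. It uses pytest's built-in `tmp_path` fixture. `read_text_file` is a hypothetical stand-in for whichever extraction call is under test, not a function from the package.

```python
from pathlib import Path


def read_text_file(path: Path) -> str:
    """Hypothetical stand-in for the extraction call under test."""
    return path.read_text(encoding="utf-8")


def test_read_text_file_round_trip(tmp_path: Path) -> None:
    # tmp_path is a per-test temporary directory supplied by pytest.
    sample = tmp_path / "sample.txt"
    sample.write_text("hello from TextSpitter", encoding="utf-8")
    assert read_text_file(sample) == "hello from TextSpitter"
```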

Roadmap

  • Spruce up documentation
  • Add stream functionality for S3-based file reading
  • Expand functionality to other file types (e.g., code files, improved CSV handling)
  • TBD
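
The stream item on the roadmap is not implemented yet. As one possible direction, the sketch below decodes text from any binary stream that offers a `read(size)` method, which is the interface `io.BytesIO` provides. It assumes an S3 response body could be read the same way. The function name `text_from_stream` and its parameters are hypothetical.

```python
import codecs
import io
from typing import BinaryIO, List


def text_from_stream(
    stream: BinaryIO, encoding: str = "utf-8", chunk_size: int = 64 * 1024
) -> str:
    """Decode a binary stream chunk by chunk without splitting multi-byte characters."""
    # An incremental decoder buffers incomplete byte sequences between chunks.
    decoder = codecs.getincrementaldecoder(encoding)(errors="replace")
    parts: List[str] = []
    while True:
        chunk = stream.read(chunk_size)
        if not chunk:
            break
        parts.append(decoder.decode(chunk))
    # Flush anything still buffered at the end of the stream.
    parts.append(decoder.decode(b"", final=True))
    return "".join(parts)


if __name__ == "__main__":
    # io.BytesIO stands in for a remote object body here.
    print(text_from_stream(io.BytesIO("streamed text".encode("utf-8"))))
```

Reading in chunks keeps memory use bounded by the chunk size plus the decoded output, which matters more for large remote files than for local ones.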

Contributing

Contributing Guidelines
  1. Fork the Repository: Start by forking the project repository to your GitHub account.
  2. Clone Locally: Clone the forked repository to your local machine using a git client.
    git clone https://github.com/<your-username>/TextSpitter.git
  3. Create a New Branch: Always work on a new branch, giving it a descriptive name.
    git checkout -b new-feature-x
  4. Make Your Changes: Develop and test your changes locally.
  5. Commit Your Changes: Commit with a clear message describing your updates.
    git commit -m 'Implemented new feature x.'
  6. Push to GitHub: Push the changes to your forked repository.
    git push origin new-feature-x
  7. Submit a Pull Request: Create a PR against the original project repository. Clearly describe the changes and their motivations.
  8. Review: Once your PR is reviewed and approved, it will be merged into the main branch. Congratulations on your contribution!


License

TextSpitter is released under the MIT License. For more details, refer to the LICENSE file.



