- Table of Contents
- Overview
- Features
- Project Structure
- Getting Started
- Roadmap
- Contributing
- License
- Acknowledgments
## Overview

TextSpitter is a developer tool that extracts text from documents in several formats: plain text, CSV, DOCX, and PDF.

Why TextSpitter?

This project gives developers a single, consistent way to read content out of common document formats. The core features include:
- Robust Dependency Management: Ensures a stable development environment with essential libraries for seamless functionality.
- File Extraction Capabilities: Standardizes handling of text, CSV, DOCX, and PDF files for smooth integration.
- Enhanced Logging: Uses loguru for error tracking, improving debugging and maintenance.
- Automated Publishing: Streamlines the release process with GitHub Actions for continuous delivery.
- Code Quality Tools: Integrates black and ruff for consistent code formatting and linting.
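
## Features

| Component | Details |
|---|---|
| Architecture | `FileExtractor` (`core.py`) extracts file contents; `WordLoader` (`main.py`) selects the extraction method by file extension and MIME type. |
| Code Quality | Formatting with black, linting with ruff. |
| Documentation | Jekyll site using the Cayman theme (`_config.yml`). |
| Integrations | GitHub Actions workflow (`python-publish.yml`) publishes the package when a release is created. |
| Modularity | Separate modules for extraction (`core.py`), logging (`logger.py`), and loading (`main.py`). |
| Testing | pytest. |
| Performance | |
| Security | |
| Dependencies | loguru, PyMuPDF, pypdf, python-docx; installed with pip or uv. |
| Scalability | |
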
---
## Project Structure
```sh
└── TextSpitter.git/
    ├── .github
    │   └── workflows
    ├── _config.yml
    ├── core_requirements.in
    ├── core_requirements.txt
    ├── dev_requirements.in
    ├── dev_requirements.txt
    ├── LICENSE
    ├── pyproject.toml
    ├── readme-ai.md
    ├── README.md
    ├── requirements.txt
    ├── setup_py.backup
    ├── TextSpitter
    │   ├── __init__.py
    │   ├── core.py
    │   ├── logger.py
    │   └── main.py
    └── uv.lock
```

### Root directory

| File Name | Summary |
|---|---|
| core_requirements.in | Declares the core dependencies: loguru for logging, PyMuPDF and pypdf for PDF manipulation, python-docx for Word document handling, and pytest for testing. |
| core_requirements.txt | Pins versions of the core dependencies so that logging, document processing, and testing run in a consistent environment. |
| dev_requirements.in | Declares the development dependencies. It references the core requirements and adds black for code formatting and ruff for linting. |
| dev_requirements.txt | Pins versions of the development dependencies. The file is generated automatically, which keeps contributor environments consistent and reduces setup issues. |
| LICENSE | MIT License. Permits free use, modification, and distribution of the software, provided the copyright and license notice are retained, and limits liability for the authors. |
| pyproject.toml | Configures linting, formatting, and packaging for TextSpitter, and specifies project metadata, dependencies, and development tools. |
| requirements.txt | Lists the required libraries and their versions, including lxml, pymupdf, pypdf2, and python-docx. |
| _config.yml | Configures the project's Jekyll site to use the Cayman theme. |

### TextSpitter

| File Name | Summary |
|---|---|
| core.py | `FileExtractor` extracts and processes content from text, CSV, DOCX, and PDF files. It standardizes file handling by providing methods to read and decode file contents across different input types. |
| logger.py | Replaces basic print statements with error capture through the loguru library, which keeps error tracking organized and makes debugging and maintenance easier. |
| main.py | `WordLoader` loads and processes files through `FileExtractor`, choosing the extraction method from the file extension and MIME type. |

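
The selection step described for `main.py` (choosing an extraction method from the file extension, then the MIME type) can be illustrated with the standard library alone. This is a minimal sketch under stated assumptions, not TextSpitter's actual API: the names `load`, `read_bytes`, `extract_text`, `extract_csv`, and both lookup tables are hypothetical, and DOCX and PDF are left out because they need third-party libraries.

```python
import csv
import io
import mimetypes
from pathlib import Path
from typing import Callable, Dict, Union

# Hypothetical input type: a filesystem path or raw bytes already in memory.
Source = Union[str, Path, bytes]


def read_bytes(source: Source) -> bytes:
    """Return raw bytes whether the caller passed bytes or a path."""
    if isinstance(source, bytes):
        return source
    return Path(source).read_bytes()


def extract_text(data: bytes) -> str:
    """Decode as UTF-8, falling back to Latin-1 (which accepts any byte)."""
    try:
        return data.decode("utf-8")
    except UnicodeDecodeError:
        return data.decode("latin-1")


def extract_csv(data: bytes) -> str:
    """Flatten CSV rows into tab-separated lines of text."""
    rows = csv.reader(io.StringIO(extract_text(data), newline=""))
    return "\n".join("\t".join(row) for row in rows)


# Assumed lookup tables for this sketch: extension first, MIME type second.
EXTRACTORS: Dict[str, Callable[[bytes], str]] = {
    ".txt": extract_text,
    ".csv": extract_csv,
}
MIME_EXTRACTORS: Dict[str, Callable[[bytes], str]] = {
    "text/plain": extract_text,
    "text/csv": extract_csv,
}


def load(source: Source, filename: str) -> str:
    """Pick an extractor by extension, then by guessed MIME type."""
    extractor = EXTRACTORS.get(Path(filename).suffix.lower())
    if extractor is None:
        mime, _ = mimetypes.guess_type(filename)
        extractor = MIME_EXTRACTORS.get(mime or "")
    if extractor is None:
        raise ValueError(f"Unsupported file type: {filename}")
    return extractor(read_bytes(source))
```

Calling `load(b"a,b\n1,2\n", "data.csv")` returns the two rows as tab-separated lines. Keeping the lookup tables separate from `load` means a new format only needs a new entry, not a change to the dispatch logic.
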
### .github/workflows

| File Name | Summary |
|---|---|
| python-publish.yml | GitHub Actions workflow that builds the Python package and uploads it to a package registry when a release is created, supporting continuous delivery. |

## Getting Started

This project requires the following dependencies:
- Programming Language: Python
- Package Manager: pip or uv

Build TextSpitter from source and install dependencies:

1. Clone the repository: `git clone https://github.com/fsecada01/TextSpitter.git`
2. Navigate to the project directory: `cd TextSpitter`
3. Install the dependencies:
   - Using pip: `pip install -r core_requirements.txt -r dev_requirements.txt`
   - Using uv: `uv sync --all-extras --dev`

Run the project with:
- Using pip: `python {entrypoint}`
- Using uv: `uv run python {entrypoint}`

TextSpitter uses the pytest test framework. Run the test suite with:
- Using pip: `pytest`
- Using uv: `uv run pytest tests/`

## Roadmap

- Spruce up documentation
- Add stream functionality for S3-based file reading
- Expand functionality to other file types (e.g., code files, improved CSV handling)
- TBD

## Contributing

- Join the Discussions: Share your insights, provide feedback, or ask questions.
- Report Issues: Submit bugs found or log feature requests for the `TextSpitter` project.
- Submit Pull Requests: Review open PRs, and submit your own PRs.

### Contributing Guidelines

1. Fork the Repository: Start by forking the project repository to your GitHub account.
2. Clone Locally: Clone the forked repository to your local machine using a git client. Replace `fsecada01` in the URL with your own GitHub username to clone your fork: `git clone https://github.com/fsecada01/TextSpitter.git`
3. Create a New Branch: Always work on a new branch, giving it a descriptive name: `git checkout -b new-feature-x`
4. Make Your Changes: Develop and test your changes locally.
5. Commit Your Changes: Commit with a clear message describing your updates: `git commit -m 'Implemented new feature x.'`
6. Push to GitHub: Push the changes to your forked repository: `git push origin new-feature-x`
7. Submit a Pull Request: Create a PR against the original project repository. Clearly describe the changes and their motivations.
8. Review: Once your PR is reviewed and approved, it will be merged into the main branch. Congratulations on your contribution!

## License

TextSpitter is released under the MIT License. For more details, refer to the LICENSE file.

## Acknowledgments

- Credit contributors, inspiration, references, etc.