DocDump

A package to extract text from common document types

DocDump makes it easy to extract raw text and document metadata from commonly used document types: Word, PDF, PowerPoint, Excel, and plain-text (.txt) files. It acts as a wrapper around several existing packages: PyPDF2, openpyxl, python-docx, and python-pptx.

DocDump extracts all text from a document as a single string and does not preserve the document's structure, such as headings, tables, or page breaks. This makes it a useful first step in a natural language processing or search pipeline.

DocDump does not perform any preprocessing or normalisation of the extracted text.
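Because the extracted text is returned as-is, any cleaning has to happen downstream. The following is a minimal sketch of one possible normalisation step, using only the Python standard library. The function name `normalise_text` and the particular choices (NFKC folding, whitespace collapsing, lowercasing) are illustrative assumptions and are not part of DocDump.

```python
import re
import unicodedata


def normalise_text(raw: str) -> str:
    """Apply light normalisation to text extracted from a document.

    Hypothetical helper: not part of DocDump.
    """
    # Fold compatibility characters (ligatures, full-width forms) into
    # their plain equivalents.
    text = unicodedata.normalize("NFKC", raw)
    # Collapse every run of whitespace (spaces, tabs, newlines) into one space.
    text = re.sub(r"\s+", " ", text)
    # Trim the ends and lowercase for case-insensitive matching.
    return text.strip().lower()
```

The string held in `document.text` could be passed straight to a function like this before tokenising or indexing. Whether lowercasing is appropriate depends on the downstream task.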

Getting Started

DocDump requires Python 3.7 or later.

Installation

pip install docdump

Usage

from docdump import doc_reader

document = doc_reader("sampleFile.docx")

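text_dump = document.text        # all extracted text as a single string
metadata = document.metadata     # document metadata
filetype = document.filetype     # the document's file type
absolute_path = document.path    # absolute path to the file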
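In a pipeline, the same call is usually applied to a whole folder of documents. The sketch below shows one way this might look. The helper `dump_folder` and the `SUPPORTED_SUFFIXES` set are hypothetical: the suffixes are inferred from the formats named in this README, and the exact set DocDump accepts may differ. The reader is passed in as a parameter, so the helper relies only on the returned object having a `.text` attribute.

```python
from pathlib import Path
from typing import Callable, Dict

# Assumed file extensions, inferred from the formats listed in this README.
SUPPORTED_SUFFIXES = {".docx", ".pdf", ".pptx", ".xlsx", ".txt"}


def dump_folder(folder: str, reader: Callable) -> Dict[str, str]:
    """Map each supported file under `folder` to its extracted text.

    `reader` is any callable that takes a file path and returns an
    object with a `.text` attribute, such as DocDump's `doc_reader`.
    """
    texts: Dict[str, str] = {}
    # Walk the folder recursively in a stable (sorted) order.
    for path in sorted(Path(folder).rglob("*")):
        if path.is_file() and path.suffix.lower() in SUPPORTED_SUFFIXES:
            document = reader(str(path))
            texts[str(path)] = document.text
    return texts
```

With DocDump installed, this would be called as `dump_folder("my_documents", doc_reader)`, where `my_documents` is a placeholder folder name. How `doc_reader` behaves on unreadable or unsupported files is not described here, so a production version would likely wrap the call in error handling.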

License

Distributed under the MIT License. See LICENCE.txt for more information.

Contact

Grant Holtes - [email protected]

Project Link: https://github.com/Gholtes/docdump
