DocDump makes it easy to extract raw text and document metadata from a
range of commonly used document types: Word, PDF, PowerPoint, Excel, and plain text (txt). DocDump acts as
a wrapper around a number of existing packages: PyPDF2, openpyxl, python-docx, and python-pptx.
DocDump extracts all text as a single string and does not preserve text structure. This makes it a useful first step in a natural language processing or search pipeline, where flat text is the expected input.
DocDump does not perform any preprocessing or normalisation of the extracted text.
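Because normalisation is left to the caller, a pipeline will usually add its own clean-up step after extraction. The following is a minimal sketch of one possible approach using only the standard library. The function names `normalise_text` and `tokenise` are hypothetical and not part of DocDump, and the sample string stands in for the text that DocDump would return.

```python
import re
import unicodedata


def normalise_text(raw: str) -> str:
    """Apply simple, lossy clean-up to a raw text dump (hypothetical helper)."""
    # NFKC folds compatibility characters such as ligatures and
    # non-breaking spaces into their plain equivalents.
    text = unicodedata.normalize("NFKC", raw)
    # Collapse every run of whitespace (spaces, tabs, newlines) to one space.
    text = re.sub(r"\s+", " ", text)
    return text.strip().lower()


def tokenise(text: str) -> list:
    """Split normalised text into word tokens (hypothetical helper)."""
    return re.findall(r"\w+", text)


# Stand-in for extracted text; in practice this would come from DocDump.
sample_dump = "  Hello\u00a0 World\n\t\ufb01le "
clean = normalise_text(sample_dump)
tokens = tokenise(clean)
```

Lowercasing and whitespace collapsing suit keyword search, but they discard information that some NLP models rely on, so the steps should be chosen per pipeline.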
DocDump requires Python 3.7 or later.
pip install docdump
from docdump import doc_reader
document = doc_reader("sampleFile.docx")
text_dump = document.text
metadata = document.metadata
filetype = document.filetype
absolute_path = document.path
Distributed under the MIT License. See LICENSE for more information.
Grant Holtes - [email protected]
Project Link: https://github.com/Gholtes/docdump