A package to extract text from common document types.
Project description
DocDump
A package to extract text from common document types
DocDump aims to allow for raw text data and document metadata to be easily extracted from a
range of commonly used document types, such as Word, PDF, PowerPoint, Excel, txt. DocDump acts as
a wrapper for a number of existing packages: PyPDF2
, openpyxl
, python-docx
, python-pptx
.
DocDump extracts all text as a single string, and does not preserve text structure. This makes it a useful tool in a natural language processing or search pipeline.
DocDump does not perform any preprocessing or normalisation of the extracted text.
Getting Started
DocDump requires Python 3.7+
Installation
pip install docdump
Usage
from docdump import doc_reader
document = doc_reader("sampleFile.docx")
text_dump = document.text
metadata = document.metadata
filetype = document.filetype
absolute_path = document.path
License
Distributed under the MIT License. See LICENSE
for more information.
Contact
Grant Holtes - gwholtes@gmail.com
Project Link: https://github.com/Gholtes/docdump
Project details
Download files
Download the file for your platform. If you're not sure which to choose, learn more about installing packages.