A package to extract text from common document types.
DocDump
DocDump aims to make raw text and document metadata easy to extract from a range of
commonly used document types, such as Word, PDF, PowerPoint, Excel and plain text (txt).
It acts as a wrapper around a number of existing packages: PyPDF2, openpyxl, python-docx and python-pptx.
DocDump extracts all text as a single string, and does not preserve text structure. This makes it a useful tool in a natural language processing or search pipeline.
DocDump does not perform any preprocessing or normalisation of the extracted text.
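Because the extracted text is returned as-is, any cleaning has to happen downstream. The following is a minimal sketch of one possible normalisation step, using only the Python standard library. The function name `normalise_text` and the particular choices (NFKC normalisation, whitespace collapsing, case folding) are assumptions made for illustration and are not part of DocDump.

```python
import re
import unicodedata


def normalise_text(text: str) -> str:
    """Hypothetical helper: tidy a raw text dump before indexing or tokenising."""
    # Fold compatibility characters (ligatures, non-breaking spaces) to plain forms
    text = unicodedata.normalize("NFKC", text)
    # Collapse every run of whitespace, including newlines and tabs, to one space
    text = re.sub(r"\s+", " ", text).strip()
    # Case-fold so that later comparisons are case-insensitive
    return text.casefold()
```

Whether case folding is appropriate depends on the pipeline; a search index may want it, while a named-entity model may not.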
Getting Started
DocDump requires Python 3.7+
Installation
pip install docdump
Usage
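from docdump import doc_reader
# Load the document from a file path
document = doc_reader("sampleFile.docx")
# All extracted text, as a single string
text_dump = document.text
# Document metadata
metadata = document.metadata
# Type of the file that was read
filetype = document.filetype
# Absolute path to the file
absolute_path = document.path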
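In a search or NLP pipeline the same call is usually repeated over many files. The sketch below shows one way that might look. The helper `dump_folder`, its parameters, and the list of file suffixes are all hypothetical; the README does not state which extensions DocDump accepts or how it behaves on unsupported files. The reader is passed in as an argument, so the helper only relies on the returned object having a `.text` attribute, as shown in the usage above.

```python
from pathlib import Path
from typing import Any, Callable, Dict, Iterable

# Assumed suffixes for the document types named in the description (not confirmed)
DEFAULT_SUFFIXES = (".docx", ".pdf", ".pptx", ".xlsx", ".txt")


def dump_folder(
    folder: str,
    reader: Callable[[str], Any],
    suffixes: Iterable[str] = DEFAULT_SUFFIXES,
) -> Dict[str, str]:
    """Hypothetical helper: map each matching file name in `folder` to its text."""
    wanted = {s.lower() for s in suffixes}
    texts: Dict[str, str] = {}
    # Only files directly inside the folder are visited; subfolders are skipped
    for path in sorted(Path(folder).iterdir()):
        if path.is_file() and path.suffix.lower() in wanted:
            document = reader(str(path))
            texts[path.name] = document.text
    return texts
```

With DocDump installed, this would be called as `dump_folder("reports", doc_reader)` after `from docdump import doc_reader`, where `"reports"` is a placeholder folder name.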
License
Distributed under the MIT License. See LICENSE for more information.
Contact
Grant Holtes - gwholtes@gmail.com
Project Link: https://github.com/Gholtes/docdump
Source Distribution
File details
Details for the file docdump-1.0.4.tar.gz.
File metadata
- Download URL: docdump-1.0.4.tar.gz
- Upload date:
- Size: 4.3 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/3.2.0 pkginfo/1.6.1 requests/2.25.0 setuptools/49.2.1 requests-toolbelt/0.9.1 tqdm/4.54.0 CPython/3.9.0
File hashes
Algorithm | Hash digest
---|---
SHA256 | 0ffcc00b718dde249cac1418b898ddac567deedc0b947c87b2edcc5a283e1887
MD5 | 5879fb684227803465372a9580851dfd
BLAKE2b-256 | b1d7a715cef05f1d5b60d2cc5e94aaf8c317468c04731d2e1701110d2d7b4f48