A package to extract text from common document types.

DocDump

DocDump extracts raw text and document metadata from a range of commonly used document types, such as Word, PDF, PowerPoint, Excel, and plain text (txt). It acts as a wrapper for several existing packages: PyPDF2, openpyxl, python-docx, and python-pptx.

DocDump extracts all text from a document as a single string and does not preserve the document's structure. This makes it a useful first step in a natural language processing or search pipeline.

DocDump does not perform any preprocessing or normalisation of the extracted text.
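
Because normalisation is left to the caller, a pipeline will usually add its own step after extraction. The following is a minimal sketch of one such step, using only the Python standard library. The function name normalise_text and the choice of steps (Unicode NFKC, whitespace collapsing, lowercasing) are illustrative assumptions, not part of DocDump.

```python
import re
import unicodedata


def normalise_text(raw: str) -> str:
    """Hypothetical helper: normalise a string extracted from a document.

    Applies Unicode NFKC normalisation, collapses every run of whitespace
    (including newlines and tabs) to a single space, and lowercases the result.
    """
    text = unicodedata.normalize("NFKC", raw)
    text = re.sub(r"\s+", " ", text).strip()
    return text.lower()


print(normalise_text("Quarterly\n\nReport\t2020"))  # prints "quarterly report 2020"
```

Which steps are appropriate depends on the downstream task. Lowercasing, for example, may hurt named-entity recognition, so treat this as a starting point.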

Getting Started

DocDump requires Python 3.7 or later.

Installation

pip install docdump

Usage

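from docdump import doc_reader

# Read a document from disk
document = doc_reader("sampleFile.docx")

text_dump = document.text        # all extracted text as a single string
metadata = document.metadata     # document metadata
filetype = document.filetype     # file type of the document
absolute_path = document.path    # absolute path to the file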
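
For a search or NLP pipeline, the same call is typically applied to a whole folder. The sketch below shows one possible way to do that. The helper dump_folder and the extension list are hypothetical and not part of DocDump; the extensions are inferred from the formats named in the description. How doc_reader behaves on unreadable or unsupported files is not documented here, so the sketch catches any exception and skips the file.

```python
from pathlib import Path
from typing import Callable, Dict, Iterable

# Assumed extensions, inferred from the formats the description lists.
ASSUMED_EXTENSIONS = (".docx", ".pdf", ".pptx", ".xlsx", ".txt")


def dump_folder(
    folder: str,
    reader: Callable,
    extensions: Iterable[str] = ASSUMED_EXTENSIONS,
) -> Dict[str, str]:
    """Hypothetical helper: map each matching file under `folder` to its text.

    `reader` is any callable that takes a path string and returns an object
    with a `.text` attribute, such as docdump's doc_reader.
    """
    wanted = {ext.lower() for ext in extensions}
    texts: Dict[str, str] = {}
    for path in sorted(Path(folder).rglob("*")):
        if not path.is_file() or path.suffix.lower() not in wanted:
            continue
        try:
            document = reader(str(path))
        except Exception as exc:  # the reader's failure modes are an unknown here
            print(f"skipped {path}: {exc}")
            continue
        texts[str(path)] = document.text
    return texts
```

With DocDump installed, this could be called as dump_folder("reports", doc_reader) after importing doc_reader as shown above. Passing the reader in as an argument keeps the helper independent of any one extraction library.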

License

Distributed under the MIT License. See LICENSE for more information.

Contact

Grant Holtes - gwholtes@gmail.com

Project Link: https://github.com/Gholtes/docdump
