Any Document Extractor

A Python library for extracting text content from any document format.

Features

  • Supports multiple document formats (PPTX, DOCX, PDF, XLSX, and more)
  • Returns clean extracted text

Installation

pip install any-document-extractor
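
To confirm that the install succeeded, one option is to ask the standard library for the installed version. This is a minimal sketch: `installed_version` is a hypothetical helper name, and the distribution name is taken from the `pip install` command above.

```python
from importlib.metadata import PackageNotFoundError, version
from typing import Optional


def installed_version(dist_name: str) -> Optional[str]:
    """Return the installed version of a distribution, or None if it is absent."""
    try:
        return version(dist_name)
    except PackageNotFoundError:
        return None


if __name__ == "__main__":
    found = installed_version("any-document-extractor")
    if found is None:
        print("any-document-extractor is not installed")
    else:
        print(f"any-document-extractor {found} is installed")
```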

Usage

Basic usage example:

from anydocumentextractor import DocumentExtractor


def main(fp: str):
    """Extract and return the text content of the document at fp."""
    extractor = DocumentExtractor(fp)
    return extractor.extract()


if __name__ == '__main__':
    fp = 'text.docx'  # Can be any supported document
    content = main(fp)
    print(content)
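
When several files need processing, the same two calls can be wrapped in a loop that keeps going if one file fails. The sketch below is illustrative only: `extract_many` and its `make_extractor` parameter are hypothetical names, and because the exceptions the library raises are not documented here, it catches `Exception` broadly as an assumption.

```python
from typing import Any, Callable, Dict, Iterable, Optional


def extract_many(
    paths: Iterable[str],
    make_extractor: Optional[Callable[[str], Any]] = None,
) -> Dict[str, Optional[str]]:
    """Map each path to its extracted text, or to None if extraction failed."""
    if make_extractor is None:
        # Imported lazily so the helper can also be exercised with a stand-in class.
        from anydocumentextractor import DocumentExtractor
        make_extractor = DocumentExtractor

    results: Dict[str, Optional[str]] = {}
    for path in paths:
        try:
            results[path] = make_extractor(path).extract()
        except Exception as exc:  # assumption: the library's exception types are unknown
            print(f"Could not extract {path}: {exc}")
            results[path] = None
    return results
```

Calling `extract_many(['text.docx', 'slides.pptx'])` would use `DocumentExtractor` for each file; passing a different callable as `make_extractor` makes the loop easy to test without real documents.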

Supported Formats

  • Microsoft Office: PPTX, DOCX, XLSX
  • OpenDocument: ODT, ODP
  • PDF documents
  • Plain text files
  • And more...
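
A caller may want to check a file's extension against the formats named above before attempting extraction. The following is a sketch under assumptions: `LISTED_EXTENSIONS` and `is_listed_format` are hypothetical names, `.txt` is assumed to stand for "plain text files", and the check is not part of the library itself.

```python
from pathlib import Path

# Extensions named in the list above; ".txt" for plain text is an assumption.
LISTED_EXTENSIONS = frozenset(
    {".pptx", ".docx", ".xlsx", ".odt", ".odp", ".pdf", ".txt"}
)


def is_listed_format(path: str) -> bool:
    """Return True if the file's extension (case-insensitive) is in the listed set."""
    return Path(path).suffix.lower() in LISTED_EXTENSIONS
```

A `False` result only means the extension is not among those listed explicitly; since the list ends with "And more...", the library may still handle the file.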

Build and publish to PyPI

rm -rf dist/ build/ *.egg-info/
python setup.py sdist bdist_wheel
twine upload dist/*
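
On systems without `rm`, the cleanup step can be approximated in Python. This is a sketch, not part of the project: `clean_build_artifacts` is a hypothetical helper that removes the same three kinds of directories as the first command above.

```python
import shutil
from pathlib import Path
from typing import List


def clean_build_artifacts(root: str = ".") -> List[str]:
    """Delete dist/, build/ and *.egg-info/ directories under root.

    Returns the sorted names of the directories that were removed.
    """
    base = Path(root)
    targets = [base / "dist", base / "build", *base.glob("*.egg-info")]
    removed = []
    for target in targets:
        if target.is_dir():  # skip anything that is missing or not a directory
            shutil.rmtree(target)
            removed.append(target.name)
    return sorted(removed)


if __name__ == "__main__":
    print("Removed:", clean_build_artifacts())
```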

License

MIT License - Free for commercial and personal use.
