doctext

Extract text from all kinds of documents. Delegates the heavy lifting to other libraries and tools such as Apache Tika, Tesseract, and many more.

Usage

#!/usr/bin/env python
from doctext.Processor import Processor
import logging

# set logging level to INFO to see what's going on
logging.basicConfig(level=logging.INFO)

p = Processor()
print(p.run('/Users/me/some.pptx'))

# specify the language for tesseract
p = Processor()
print(p.run('/Users/me/some-german.png', tesseract_lang='deu'))

# or with Whisper (see https://openai.com/pricing)
p = Processor(openai_api_key='your-openai-api-key')
print(p.run('/Users/me/some.m4a'))
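
When processing many files at once, it can help to keep one failing document from aborting the whole batch. The following is a minimal sketch, not part of doctext: `extract_all` is a hypothetical helper that takes any callable with the shape of `Processor().run` (path in, text out). How `run` itself reports failures is not documented here, so the sketch simply assumes it may raise an exception.

```python
import logging
from typing import Callable, Dict, Iterable, Optional


def extract_all(
    paths: Iterable[str],
    run: Callable[[str], Optional[str]],
) -> Dict[str, Optional[str]]:
    """Call run(path) for every path and collect the results.

    Hypothetical helper: a path maps to None when run() raised.
    """
    results: Dict[str, Optional[str]] = {}
    for path in paths:
        try:
            results[path] = run(path)
        except Exception as exc:  # one bad file should not stop the batch
            logging.warning("could not extract %s: %s", path, exc)
            results[path] = None
    return results


# Intended use with doctext (commented out so the sketch runs without it):
# from doctext.Processor import Processor
# texts = extract_all(['/Users/me/some.pptx', '/Users/me/some.m4a'], Processor().run)
```

Passing the callable in, rather than constructing a `Processor` inside the helper, keeps the sketch independent of how the processor is configured (for example with or without an OpenAI API key).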

Introduction

Why yet another library for extracting text from documents? Because textract seems to be more or less abandoned and requires outdated versions of its dependencies. It also does not support all the file formats I need. Apache Tika is great, but surprisingly it did not support some of those formats either. So I decided to write a wrapper around a wrapper.

Installation

pip install doctext
brew install ffmpeg imagemagick poppler libheif dcraw
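
After installing, you may want to confirm that the external tools are actually on your PATH. The sketch below uses only the standard library. The executable names listed for each Homebrew package are assumptions about what those packages typically install (for example `pdftotext` for poppler); which binaries doctext really invokes is not stated here, so adjust the table as needed. `missing_tools` is a hypothetical helper, not part of doctext.

```python
import shutil

# Assumed executable names per package; any one of them counts as "installed".
TOOLS = {
    "ffmpeg": ["ffmpeg"],
    "imagemagick": ["magick", "convert"],
    "poppler": ["pdftotext"],
    "libheif": ["heif-convert"],
    "dcraw": ["dcraw"],
}


def missing_tools(tools=TOOLS, which=shutil.which):
    """Return the sorted package names for which no executable is found on PATH."""
    return sorted(
        package
        for package, executables in tools.items()
        if not any(which(name) for name in executables)
    )


if __name__ == "__main__":
    missing = missing_tools()
    if missing:
        print("Missing tools:", ", ".join(missing))
    else:
        print("All external tools found.")
```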
