doctext

Extract text from all kinds of documents. doctext delegates the heavy lifting to other libraries and tools such as Apache Tika, Tesseract, and many more.

Usage

#!/usr/bin/env python
from doctext.Processor import Processor

# extract text from a document with the default settings
p = Processor()
print(p.run('/Users/me/some.pptx'))

# specify the language for tesseract (here 'deu', German) when reading an image
p = Processor()
print(p.run('/Users/me/some-german.png', tesseract_lang='deu'))

# or transcribe audio with Whisper; this needs an OpenAI API key (see https://openai.com/pricing)
p = Processor(openai_api_key='your-openai-api-key')
print(p.run('/Users/me/some.m4a'))
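
To run the same call over several files, a small helper like the following may be handy. This is only a sketch: `extract_all` is a hypothetical name that is not part of doctext, and since the description above does not say how `run` reports failures, the `try/except` is a defensive assumption rather than documented behavior.

```python
from typing import Callable, Dict, Iterable, Optional


def extract_all(
    paths: Iterable[str],
    extract: Callable[[str], str],
) -> Dict[str, Optional[str]]:
    """Call `extract` on each path; map the path to its text, or to None on failure."""
    results: Dict[str, Optional[str]] = {}
    for path in paths:
        key = str(path)
        try:
            results[key] = extract(key)
        except Exception as exc:
            # One unreadable file should not abort the whole batch.
            print(f"skipped {key}: {exc}")
            results[key] = None
    return results


# Possible use with doctext (requires the package and its external tools):
#   from doctext.Processor import Processor
#   texts = extract_all(['/Users/me/some.pptx', '/Users/me/some.m4a'], Processor().run)
```

Passing the extractor in as a callable keeps the helper independent of doctext, so it can be tried out with any function that takes a path and returns text.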

Introduction

Why yet another library for extracting text from documents? Because textract seems to be more or less abandoned and requires outdated versions of its dependencies. It also does not support all the file formats I need. Apache Tika is great, but surprisingly it did not support some of those formats either. So I decided to write a wrapper around a wrapper.

Installation

pip install doctext
brew install ffmpeg imagemagick poppler libheif dcraw
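
After installing, it can help to check that the external tools are actually on the PATH. The sketch below uses only the standard library. The executable names in `ASSUMED_TOOLS` are my assumption about what the Homebrew packages above usually install (for example `magick` for ImageMagick and `pdftotext` for poppler); they are not taken from doctext, and the names doctext really calls may differ.

```python
import shutil
from typing import Iterable, List

# Assumed executable names for the packages installed above (not confirmed by doctext).
ASSUMED_TOOLS = ["ffmpeg", "magick", "pdftotext", "heif-convert", "dcraw"]


def missing_tools(names: Iterable[str]) -> List[str]:
    """Return the names that cannot be found on the current PATH."""
    return [name for name in names if shutil.which(name) is None]


# An empty list means every assumed tool was found.
print(missing_tools(ASSUMED_TOOLS))
```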
