doctext

Extract text from all kinds of documents. doctext delegates the heavy lifting to other libraries and tools such as Apache Tika, Tesseract and many more.

Usage

#!/usr/bin/env python
from doctext.Processor import Processor

p = Processor()
print(p.run('/Users/me/some.pptx'))

# specify the language for tesseract
p = Processor()
print(p.run('/Users/me/some-german.png', tesseract_lang='deu'))

# or transcribe audio with Whisper via the OpenAI API (paid, see https://openai.com/pricing)
p = Processor(openai_api_key='your-openai-api-key')
print(p.run('/Users/me/some.m4a'))
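
When many files need processing, a single unsupported or corrupt file should not abort the whole run. The following is a minimal sketch of such a batch loop, not part of doctext itself: `extract_many` is a hypothetical helper, and it only assumes that the extractor is a callable taking a path string, as `Processor.run` does in the examples above.

```python
from pathlib import Path
from typing import Any, Callable, Dict, Iterable


def extract_many(
    paths: Iterable[Path],
    extract: Callable[[str], Any],
) -> Dict[str, Any]:
    """Call `extract` on every path and map path -> result (None on failure).

    `extract_many` is a hypothetical helper, not a doctext function.
    """
    results: Dict[str, Any] = {}
    for path in paths:
        try:
            results[str(path)] = extract(str(path))
        except Exception as exc:
            # One unreadable file should not stop the remaining files.
            print(f"skipped {path}: {exc}")
            results[str(path)] = None
    return results


# Usage with doctext (shown as comments because it needs doctext installed):
#   from doctext.Processor import Processor
#   files = [f for f in Path('/Users/me/documents').iterdir() if f.is_file()]
#   texts = extract_many(files, Processor().run)
```

Passing the extractor in as a callable keeps the loop independent of doctext, so it can be tried out with a stub function before running it on real documents.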

Introduction

Why yet another library for extracting text from documents? Because textract seems to be more or less abandoned and requires outdated versions of some dependencies. It also does not support all the file formats I need. Apache Tika is great, but surprisingly it did not support some of those formats either. So I decided to write a wrapper around a wrapper.

Installation

pip install doctext
# external tools, installed here with Homebrew
brew install ffmpeg imagemagick poppler libheif dcraw
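
Before running doctext it can help to check that the external tools are actually on the PATH. The sketch below does this with the standard library's `shutil.which`. The executable names in `ASSUMED_BINARIES` are assumptions about what each Homebrew package typically provides; they are not taken from doctext, which may call different commands.

```python
import shutil
from typing import Callable, Dict, List, Optional

# Assumed executable names per package; not confirmed by the doctext source.
ASSUMED_BINARIES: Dict[str, List[str]] = {
    "ffmpeg": ["ffmpeg"],
    "imagemagick": ["magick", "convert"],
    "poppler": ["pdftotext"],
    "libheif": ["heif-convert"],
    "dcraw": ["dcraw"],
}


def missing_packages(
    binaries: Dict[str, List[str]] = ASSUMED_BINARIES,
    which: Callable[[str], Optional[str]] = shutil.which,
) -> List[str]:
    """Return the packages for which no candidate executable is on PATH."""
    return [
        package
        for package, candidates in binaries.items()
        if not any(which(name) for name in candidates)
    ]


# Report which of the assumed tools could not be found.
print("missing:", missing_packages() or "none")
```

A package is reported as missing only if none of its candidate executables is found, which allows for tools such as ImageMagick that have shipped under more than one command name.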
