doctext

Extract text from all kinds of documents. Delegates the heavy lifting to other libraries and tools such as Apache Tika, Tesseract, and many more.

Usage

#!/usr/bin/env python
from doctext.Processor import Processor
import logging

# set logging level to INFO to see what's going on
logging.basicConfig(level=logging.INFO)

p = Processor()
print(p.run('/Users/me/some.pptx'))

# specify the language for tesseract
p = Processor()
print(p.run('/Users/me/some-german.png', tesseract_lang='deu'))

# or transcribe audio with Whisper via the OpenAI API (see https://openai.com/pricing)
p = Processor(openai_api_key='your-openai-api-key')
print(p.run('/Users/me/some.m4a'))
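
For more than a handful of files, the calls above can be wrapped in a small batch helper. The following is a minimal sketch and not part of doctext: `extract_all` is a hypothetical helper that accepts any callable mapping a file path to text (for example `Processor().run`). It assumes that a failed extraction raises an exception, which the doctext documentation does not confirm.

```python
import logging
from typing import Callable, Dict, Iterable, Optional

logger = logging.getLogger(__name__)


def extract_all(
    paths: Iterable[str],
    extract: Callable[[str], str],
) -> Dict[str, Optional[str]]:
    """Run `extract` on each path and map path -> text, or None on failure.

    `extract` is any callable that takes a file path and returns text.
    Assumption: a failed extraction raises an exception.
    """
    results: Dict[str, Optional[str]] = {}
    for path in paths:
        key = str(path)
        try:
            results[key] = extract(key)
        except Exception as exc:
            # Keep going: one unreadable file should not stop the batch.
            logger.warning("could not extract %s: %s", key, exc)
            results[key] = None
    return results


# Usage with doctext (requires `pip install doctext`):
#   from doctext.Processor import Processor
#   texts = extract_all(["/Users/me/some.pptx", "/Users/me/some.m4a"], Processor().run)
```

Passing the extractor in as a callable keeps the helper independent of doctext, so it can be tried out with a stub function before any external tools are installed.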

Introduction

Why yet another library for extracting text from documents? Because textract seems to be more or less abandoned and requires outdated versions of its dependencies. It also does not support all the file formats I need. Apache Tika is great but, surprisingly, did not support some of the file formats I needed either. So I decided to write a wrapper around a wrapper.

Installation

pip install doctext
brew install ffmpeg imagemagick poppler libheif dcraw
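
The Homebrew packages above provide command-line tools, so it can help to check that they are on the PATH before processing files. This is a minimal sketch using only the standard library. `missing_tools` is a hypothetical helper, and the executable names in the usage comment are assumptions about which binaries matter, not a list taken from doctext itself.

```python
import shutil
from typing import Iterable, List


def missing_tools(names: Iterable[str]) -> List[str]:
    """Return the executables from `names` that are not found on the PATH."""
    return [name for name in names if shutil.which(name) is None]


# Assumed executable names (not confirmed by doctext), for example:
#   missing = missing_tools(["ffmpeg", "pdftotext", "dcraw"])
#   if missing:
#       print("not found on PATH:", ", ".join(missing))
```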

