doctext

Extract text from all kinds of documents. doctext delegates the heavy lifting to other libraries and tools such as Apache Tika, Tesseract, and many more.

Usage

#!/usr/bin/env python
from doctext.Processor import Processor
import logging

# set logging level to INFO to see what's going on
logging.basicConfig(level=logging.INFO)

p = Processor()
print(p.run('/Users/me/some.pptx'))

# specify the language for Tesseract OCR (here: German)
print(p.run('/Users/me/some-german.png', tesseract_lang='deu'))

# or transcribe audio with Whisper via the OpenAI API (see https://openai.com/pricing)
p = Processor(openai_api_key='your-openai-api-key')
print(p.run('/Users/me/some.m4a'))
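
When the extractor runs over many files, one unreadable document should not abort the whole batch. The sketch below is a hypothetical helper and not part of doctext: `extract_all` and its `extract` parameter are names made up for illustration. It assumes only that the extraction function takes a path and returns text, as `p.run` does in the usage above, and that failures surface as exceptions, which the project description does not confirm.

```python
import logging
from typing import Callable, Dict, Iterable, Optional

logger = logging.getLogger(__name__)


def extract_all(
    paths: Iterable[str],
    extract: Callable[[str], str],
) -> Dict[str, Optional[str]]:
    """Run `extract` on every path; map each path to its text, or None on failure."""
    results: Dict[str, Optional[str]] = {}
    for path in paths:
        try:
            results[path] = extract(path)
        except Exception as exc:  # assumption: a failed extraction raises
            logger.warning("could not extract %s: %s", path, exc)
            results[path] = None
    return results
```

With doctext installed, passing `Processor().run` as `extract` should plug the library into this loop; any callable with the same shape works as well, which keeps the helper easy to test.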

Introduction

Why yet another library for extracting text from documents? Because textract seems to be more or less abandoned and requires outdated versions of its dependencies. It also does not support all the file formats I need. Apache Tika is great, but surprisingly it does not support some of those formats either. So I decided to write a wrapper around a wrapper.

Installation

pip install doctext
brew install ffmpeg imagemagick poppler libheif dcraw
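
The Homebrew packages above provide command-line tools that presumably need to be on the `PATH` at run time. A quick check can be sketched with the standard library's `shutil.which`. The executable names below are assumptions about what each package installs (for example `pdftotext` from poppler and `heif-convert` from libheif); they are not taken from the doctext documentation, and `missing_tools` is a hypothetical helper, so adjust both as needed.

```python
import shutil
from typing import Iterable, List

# Assumed executable names for the packages installed above; adjust as needed.
ASSUMED_TOOLS = ["ffmpeg", "magick", "pdftotext", "heif-convert", "dcraw"]


def missing_tools(tools: Iterable[str]) -> List[str]:
    """Return the tools that cannot be found on the current PATH."""
    return [tool for tool in tools if shutil.which(tool) is None]
```

Calling `missing_tools(ASSUMED_TOOLS)` returns an empty list when every assumed executable is found, and otherwise lists the names that still need to be installed.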
