doctext

Extract text from all kinds of documents. Delegates the heavy lifting to other libraries and tools such as Apache Tika, Tesseract, and many more.

Usage

#!/usr/bin/env python
from doctext.Processor import Processor

p = Processor()
print(p.run('/Users/me/some.pptx'))

# specify the language for tesseract
p = Processor()
print(p.run('/Users/me/some-german.png', tesseract_lang='deu'))

# or with Whisper (see https://openai.com/pricing)
p = Processor(openai_api_key='your-openai-api-key')
print(p.run('/Users/me/some.m4a'))
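
To process several files in one go, the calls above can be wrapped in a small loop. The following is only a sketch: `extract_many` is a hypothetical helper, not part of doctext, and it assumes that a failed extraction may raise an exception, which this page does not document. It takes the extraction function as a parameter, so `Processor().run` can be passed in directly.

```python
from typing import Callable, Dict, Iterable, Optional


def extract_many(
    paths: Iterable[str],
    extract: Callable[[str], Optional[str]],
) -> Dict[str, Optional[str]]:
    """Run `extract` on every path and map each path to its text.

    Hypothetical helper, not part of doctext. Files whose extraction
    raises are reported and mapped to None instead of aborting the batch.
    """
    results: Dict[str, Optional[str]] = {}
    for path in paths:
        key = str(path)
        try:
            results[key] = extract(key)
        except Exception as exc:  # assumption: a failed extraction may raise
            print(f"skipping {key}: {exc}")
            results[key] = None
    return results


# Intended use with doctext (needs the package and its external tools):
#   from doctext.Processor import Processor
#   texts = extract_many(['/Users/me/some.pptx'], Processor().run)
```

Passing the extractor as an argument keeps the helper testable without doctext installed and lets a single `Processor` instance be reused for the whole batch.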

Introduction

Why yet another library for extracting text from documents? textract seems to be more or less abandoned, requires outdated versions of some dependencies, and does not support all the file formats I need. Apache Tika is great but, surprisingly, did not support some of those formats either. So I decided to write a wrapper around a wrapper.
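
The "wrapper around a wrapper" idea can be illustrated by dispatching on the file extension and falling back to a generic extractor. This is a minimal sketch of the general pattern, not doctext's actual implementation: `make_dispatcher` and the three placeholder backends are hypothetical names that only return marker strings. In a real wrapper they would call tools such as Tesseract, Whisper, or Apache Tika.

```python
from pathlib import Path
from typing import Callable, Dict

Backend = Callable[[str], str]


def make_dispatcher(backends: Dict[str, Backend], fallback: Backend) -> Backend:
    """Return a function that picks a backend by lowercase file suffix."""

    def extract(path: str) -> str:
        suffix = Path(path).suffix.lower()
        # Unknown suffixes go to the fallback backend.
        return backends.get(suffix, fallback)(path)

    return extract


# Placeholder backends (hypothetical): they only label which tool would run.
def ocr_backend(path: str) -> str:
    return f"<OCR text of {path}>"


def audio_backend(path: str) -> str:
    return f"<transcript of {path}>"


def generic_backend(path: str) -> str:
    return f"<generic text of {path}>"


extract = make_dispatcher(
    {".png": ocr_backend, ".m4a": audio_backend},
    fallback=generic_backend,
)
print(extract("/Users/me/some-german.png"))
```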

Installation

pip install doctext
brew install ffmpeg imagemagick poppler libheif dcraw
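
The `brew` line targets Homebrew; on other systems the same tools have to come from the platform's package manager. A quick way to see what is still missing is to look the executables up on `PATH`. The sketch below makes an assumption: the executable names in `ASSUMED_TOOLS` are typical for these packages but can differ between versions and platforms (for example, older ImageMagick releases ship `convert` instead of `magick`). `missing_tools` is a hypothetical helper, not part of doctext.

```python
import shutil
from typing import Iterable, List

# Assumed executable names for ffmpeg, imagemagick, poppler, libheif and
# dcraw; adjust them to match what your platform actually installs.
ASSUMED_TOOLS = ["ffmpeg", "magick", "pdftotext", "heif-convert", "dcraw"]


def missing_tools(names: Iterable[str] = ASSUMED_TOOLS) -> List[str]:
    """Return the names that cannot be found on PATH."""
    return [name for name in names if shutil.which(name) is None]


missing = missing_tools()
if missing:
    print("not found on PATH:", ", ".join(missing))
else:
    print("all assumed tools found")
```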
