doctext

Extract text from all kinds of documents. doctext delegates the heavy lifting to other libraries and tools such as Apache Tika, Tesseract, and many more.

Usage

#!/usr/bin/env python
from doctext.Processor import Processor

p = Processor()
print(p.run('/Users/me/some.pptx'))

# or transcribe audio with Whisper via the OpenAI API (paid, see https://openai.com/pricing)
p = Processor(openai_api_key='your-openai-api-key')
print(p.run('/Users/me/some.m4a'))
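
To process several files in one go, a small helper can wrap the call shown above. This is a minimal sketch, not part of doctext: `extract_many` is a hypothetical name, and it assumes that `Processor.run` takes a path and returns the extracted text as a string (the usage above only shows the result being printed). The exceptions doctext may raise are not documented here, so the sketch catches any exception and records an empty string for that file.

```python
from typing import Callable, Dict, Iterable


def extract_many(paths: Iterable[str], extract: Callable[[str], str]) -> Dict[str, str]:
    """Run `extract` on each path and collect the results by path.

    A failing file is reported and stored as an empty string,
    so one unsupported document does not abort the whole batch.
    """
    results: Dict[str, str] = {}
    for path in paths:
        key = str(path)
        try:
            # `or ""` covers the case where the extractor returns None
            results[key] = extract(key) or ""
        except Exception as exc:  # broad on purpose: error types are an unknown here
            print(f"skipped {key}: {exc}")
            results[key] = ""
    return results


# Intended use with doctext (assumes the package and its tools are installed):
#   from doctext.Processor import Processor
#   texts = extract_many(['/Users/me/some.pptx', '/Users/me/some.m4a'], Processor().run)
```

Passing the extractor in as a callable keeps the helper independent of doctext itself, so it can be tried out with any function that maps a path to text.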

Introduction

Why yet another library for extracting text from documents? Because textract seems to be more or less abandoned and requires outdated versions of its dependencies. It also does not support all the file formats I need. Apache Tika is great, but surprisingly it did not support some of those formats either. So I decided to write a wrapper around a wrapper.

Installation

# Python package
pip install doctext
# external tools, here installed with Homebrew on macOS
brew install ffmpeg imagemagick poppler libheif dcraw
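
After installing, it can help to check that the external tools are actually on the PATH. The following is a rough sketch under stated assumptions: the executable names in `ASSUMED_TOOLS` are the commands these Homebrew packages typically provide (`magick` for imagemagick, `pdftotext` for poppler), not a list taken from doctext, and `missing_tools` is a hypothetical helper. Which commands doctext really invokes is not stated in this description.

```python
import shutil
from typing import Callable, Iterable, List, Optional

# Assumption: typical executable names for the packages installed above.
# libheif is left out because its command-line tool name varies by version.
ASSUMED_TOOLS = ("ffmpeg", "magick", "pdftotext", "dcraw")


def missing_tools(
    tools: Iterable[str] = ASSUMED_TOOLS,
    which: Callable[[str], Optional[str]] = shutil.which,
) -> List[str]:
    """Return the names from `tools` that cannot be found on the PATH."""
    return [tool for tool in tools if which(tool) is None]


# Example: print(missing_tools()) lists whatever is not installed yet.
```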
