doctext

Extract text from all kinds of documents. Delegates the heavy lifting to other libraries and tools such as Apache Tika, Tesseract, and many more.

Usage

#!/usr/bin/env python
from doctext.Processor import Processor

# Extract the text of a document
p = Processor()
print(p.run('/Users/me/some.pptx'))

# Or pass an OpenAI API key to transcribe audio with Whisper (see https://openai.com/pricing)
p = Processor(openai_api_key='your-openai-api-key')
print(p.run('/Users/me/some.m4a'))
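
To process several files in one go, the same `run` call can be wrapped in a small helper. The sketch below is not part of doctext: `extract_all` and `TextExtractor` are hypothetical names, and it assumes only what the usage above shows, namely that the processor has a `run(path)` method whose result can be printed. How `run` reports a failure is not documented here, so the helper assumes it may raise and records `None` for that file.

```python
from pathlib import Path
from typing import Iterable, Protocol


class TextExtractor(Protocol):
    """Anything with a run(path) method, for example doctext's Processor."""

    def run(self, path: str): ...


def extract_all(processor: TextExtractor, paths: Iterable[str]) -> dict:
    """Run the processor on each path and collect the results by file name.

    A file that raises is recorded as None, so one bad document does not
    abort the whole batch. Files sharing a name overwrite each other.
    """
    results = {}
    for path in paths:
        name = Path(path).name
        try:
            results[name] = processor.run(str(path))
        except Exception:  # broad on purpose: failure modes are an assumption
            results[name] = None
    return results


# Intended use with doctext (needs the package and its system tools):
#   from doctext.Processor import Processor
#   texts = extract_all(Processor(), ['/Users/me/some.pptx', '/Users/me/some.m4a'])
```

Keying the result by file name keeps the output easy to read. If files from different directories can share a name, keying by the full path would be the safer choice.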

Introduction

Why yet another library for extracting text from documents? Because textract seems to be more or less abandoned and requires outdated versions of its dependencies. It also does not support all the file formats I need. Apache Tika is great but, surprisingly, did not support some of the file formats I needed either. So I decided to write a wrapper around a wrapper.

Installation

pip install doctext
brew install ffmpeg imagemagick poppler libheif dcraw
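
Since doctext relies on external tools, it can help to check that they are on the PATH before the first run. The following is a minimal sketch, not something doctext ships: `missing_tools` and `ASSUMED_TOOLS` are hypothetical names, and the executable names are assumptions about what each Homebrew package installs (the package name and the binary name differ for some of them). Which of these binaries doctext actually invokes is not stated above.

```python
import shutil

# Assumed executable for each Homebrew package from the install command.
# These names are guesses at the usual binaries, not taken from doctext.
ASSUMED_TOOLS = {
    "ffmpeg": "ffmpeg",
    "imagemagick": "magick",
    "poppler": "pdftotext",
    "libheif": "heif-convert",
    "dcraw": "dcraw",
}


def missing_tools(tools=ASSUMED_TOOLS):
    """Return the package names whose executable is not found on PATH."""
    return [pkg for pkg, exe in tools.items() if shutil.which(exe) is None]


# Example: report anything that still needs installing.
#   for pkg in missing_tools():
#       print(f"not found on PATH: {pkg}")
```

An empty list means every assumed executable was found. It does not prove that doctext will accept the installed versions.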
