doctext

Extract text from all kinds of documents. Delegates the heavy lifting to other libraries and tools such as Apache Tika, Tesseract, and many more.

Usage

#!/usr/bin/env python
from doctext.Processor import Processor
import logging

# set logging level to INFO to see what's going on
logging.basicConfig(level=logging.INFO)

p = Processor()
print(p.run('/Users/me/some.pptx'))

# specify the language for tesseract
p = Processor()
print(p.run('/Users/me/some-german.png', tesseract_lang='deu'))

# or transcribe audio with Whisper; this needs an OpenAI API key and is billed (see https://openai.com/pricing)
p = Processor(openai_api_key='your-openai-api-key')
print(p.run('/Users/me/some.m4a'))
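
To process more than one file, a small helper can collect the results and keep going when a single file fails. The sketch below is not part of doctext: `extract_all` is a hypothetical helper that takes the extraction function as a parameter, so `Processor().run` from the usage above can be passed in. It assumes that a failed extraction raises an exception, which the README does not state.

```python
import logging
from typing import Callable, Dict, Iterable, Optional

logger = logging.getLogger(__name__)


def extract_all(
    paths: Iterable[str],
    extract: Callable[[str], str],
) -> Dict[str, Optional[str]]:
    """Run `extract` on every path; map each path to its text, or None on failure."""
    results: Dict[str, Optional[str]] = {}
    for path in paths:
        try:
            results[path] = extract(path)
        except Exception as exc:  # assumption: failures surface as exceptions
            logger.warning("could not extract %s: %s", path, exc)
            results[path] = None
    return results
```

With doctext installed, this could be called as `extract_all(['/Users/me/some.pptx', '/Users/me/some.m4a'], Processor().run)`.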

Introduction

Why yet another library for extracting text from documents? Because textract seems to be more or less abandoned and requires outdated versions of its dependencies. It also does not support all the file formats I need. Apache Tika is great but, surprisingly, did not support some of the file formats I needed. So I decided to write a wrapper around a wrapper.

Update: I added support for docling as the default first choice for every file. If you do not need to extract text from audio or video files, docling alone should be enough, and you do not need my library as a wrapper around it.
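
The "first choice, then fall back" idea can be sketched generically. This is an illustration of the pattern only, not doctext's actual implementation: `first_successful` and the extractor callables are hypothetical names, and treating an exception or an empty result as "try the next backend" is an assumption.

```python
from typing import Callable, Optional, Sequence

# A backend takes a file path and returns the extracted text (or None).
Extractor = Callable[[str], Optional[str]]


def first_successful(path: str, extractors: Sequence[Extractor]) -> Optional[str]:
    """Return the first non-empty result from `extractors`, tried in order."""
    for extractor in extractors:
        try:
            text = extractor(path)
        except Exception:
            continue  # this backend cannot handle the file; try the next one
        if text and text.strip():
            return text
    return None  # no backend produced any text
```

The order of the sequence expresses the preference, so a preferred backend such as docling would go first and more specialised tools after it.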

Installation

pip install doctext
brew install ffmpeg imagemagick poppler libheif dcraw ocrmypdf
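
After installing, it can help to check that the external tools are actually on the `PATH`. The following is a minimal sketch using the standard library's `shutil.which`. The executable names are assumptions about what each Homebrew package typically provides (for example `magick` or `convert` for imagemagick, `pdftotext` for poppler); which of them doctext really invokes is not stated here.

```python
import shutil
from typing import Callable, Dict, List, Optional, Sequence

# Assumed executable names per Homebrew package; adjust to your setup.
TOOLS: Dict[str, Sequence[str]] = {
    "ffmpeg": ("ffmpeg",),
    "imagemagick": ("magick", "convert"),
    "poppler": ("pdftotext",),
    "libheif": ("heif-convert", "heif-dec"),
    "dcraw": ("dcraw",),
    "ocrmypdf": ("ocrmypdf",),
}


def missing_packages(
    tools: Dict[str, Sequence[str]] = TOOLS,
    which: Callable[[str], Optional[str]] = shutil.which,
) -> List[str]:
    """Return the packages for which none of the candidate executables is on PATH."""
    return [
        package
        for package, names in tools.items()
        if not any(which(name) for name in names)
    ]
```

Calling `missing_packages()` returns an empty list when every package has at least one of its assumed executables available.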
