
Multi-column PDF to Text

mc-pdf2txt

Convert multi-column pdf to text with poppler and tesseract.

Install

(1) Install dependencies:

Install poppler.

sudo apt install poppler-utils

Install tesseract-ocr

sudo apt install tesseract-ocr

with the language data files of your choice, e.g.,

sudo apt install tesseract-ocr-jpn
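Before moving on, it may be worth confirming that both external tools are reachable. The following is a minimal sketch, not part of mc-pdf2txt; it assumes the executables are named `pdftoppm` (shipped with poppler-utils) and `tesseract`, and the helper name `missing_tools` is made up for illustration.

```python
import shutil

# Executables assumed to be needed: pdftoppm (poppler-utils) and tesseract.
REQUIRED_TOOLS = ["pdftoppm", "tesseract"]


def missing_tools(tools=REQUIRED_TOOLS):
    """Return the names in `tools` that cannot be found on PATH."""
    return [name for name in tools if shutil.which(name) is None]


if __name__ == "__main__":
    missing = missing_tools()
    if missing:
        print("Not found on PATH:", ", ".join(missing))
    else:
        print("pdftoppm and tesseract are both available.")
```

To see which language data files are installed, `tesseract --list-langs` prints them.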

(2) Install mc-pdf2txt

So that mc-pdf2txt works with either docopt or docopt-ng, neither one is a hard dependency; both are declared as optional extras.

If you know that docopt or docopt-ng is already installed on your system, run:

pip3 install mc-pdf2txt

If you are unsure whether docopt or docopt-ng is installed on your system, run:

pip3 install 'mc-pdf2txt[docopt-ng]'
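If it is unclear which case applies, a short check along these lines may help. It is only a sketch: the function name `docopt_status` is hypothetical, and it relies on the fact that both distributions provide a module named `docopt`.

```python
import importlib.util
from importlib import metadata


def docopt_status():
    """Return (importable, providers).

    importable: True if a module named `docopt` can be imported.
    providers:  which of the distributions docopt / docopt-ng are installed.
    """
    importable = importlib.util.find_spec("docopt") is not None
    providers = []
    for dist in ("docopt", "docopt-ng"):
        try:
            metadata.version(dist)  # raises if the distribution is absent
            providers.append(dist)
        except metadata.PackageNotFoundError:
            pass
    return importable, providers


if __name__ == "__main__":
    importable, providers = docopt_status()
    print("docopt importable:", importable)
    print("installed distributions:", providers or "none")
```

If `importable` is False, installing with the `docopt-ng` extra as shown above should supply the missing module.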

Usage

Usage:
  mc-pdf2txt [options] <input>...

Options:
  -l LANG           Language, such as `eng`, `jpn`, or `eng+jpn`.
  <input>           Input PDF file.
  -o OUTPUT         Output text file.
  -r DPI            Resolution of temporary image file [default: 600].
  --timeout SEC     Timeout in sec to exec `pdftoppm` [default: 60].
  --page-separator LINE     String to be output as page separator [default: ---].
  --psm VALUE       Page segmentation mode of `tesseract-ocr` [default: 3].
  --verbose         Verbose.
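
As an illustration of the options above, here is a hedged Python sketch that assembles a command line and splits the resulting text into pages. The helper names `build_command`, `run`, and `split_pages` are hypothetical, the default language `eng` is an assumption, and the splitting logic assumes the page separator appears alone on its own line, which this description does not state explicitly.

```python
import subprocess


def build_command(inputs, output, lang="eng", dpi=600, psm=3):
    """Assemble an mc-pdf2txt argument list from the documented options."""
    return ["mc-pdf2txt", "-l", lang, "-o", output,
            "-r", str(dpi), "--psm", str(psm), *inputs]


def run(inputs, output, **options):
    """Run mc-pdf2txt; raises CalledProcessError on a non-zero exit status."""
    subprocess.run(build_command(inputs, output, **options), check=True)


def split_pages(text, separator="---"):
    """Split output text into pages (assumes the separator is alone on a line)."""
    pages, current = [], []
    for line in text.splitlines():
        if line.strip() == separator:
            pages.append("\n".join(current))
            current = []
        else:
            current.append(line)
    pages.append("\n".join(current))
    return pages
```

For example, `run(["paper.pdf"], "paper.txt", lang="eng+jpn")` would invoke the tool on a placeholder file `paper.pdf`, after which `split_pages(open("paper.txt", encoding="utf-8").read())` yields one string per page. If the output ends with a separator line, the last element is an empty string.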


