Multi-column PDF to Text

mc-pdf2txt

Convert multi-column PDF files to text with poppler and tesseract.

Install

(1) Install dependencies:

Install poppler.

sudo apt install poppler-utils

Install tesseract-ocr

sudo apt install tesseract-ocr

with the language data files of your choice, e.g.,

sudo apt install tesseract-ocr-jpn
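Before moving on to step (2), it can help to confirm that the dependencies are actually reachable. The following is a minimal sketch, not part of mc-pdf2txt: it assumes the two executables of interest are `pdftoppm` (shipped with poppler-utils) and `tesseract` (shipped with tesseract-ocr), and the helper name `missing_tools` is made up for illustration.

```python
import shutil

# Executables assumed to be needed: pdftoppm comes from poppler-utils,
# tesseract from tesseract-ocr.
REQUIRED_TOOLS = ("pdftoppm", "tesseract")


def missing_tools(tools=REQUIRED_TOOLS, which=shutil.which):
    """Return the names in `tools` that cannot be found on PATH."""
    return [name for name in tools if which(name) is None]


if __name__ == "__main__":
    missing = missing_tools()
    if missing:
        print("Not found on PATH:", ", ".join(missing))
    else:
        print("pdftoppm and tesseract are both available.")
```

To see which language data files are installed, `tesseract --list-langs` prints them.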

(2) Install mc-pdf2txt

To keep mc-pdf2txt compatible with both docopt and docopt-ng, neither is a hard dependency; both are declared as optional extras.

If you know that either docopt or docopt-ng is already installed on your system, run:

pip3 install mc-pdf2txt

If you are unsure whether docopt or docopt-ng is installed on your system, run:

pip3 install 'mc-pdf2txt[docopt-ng]'
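If it is unclear which of the two cases applies, one way to find out is to ask Python directly. This is a minimal sketch rather than anything mc-pdf2txt provides: it relies on the fact that both docopt and docopt-ng install an importable module named `docopt`, and the helper name `module_available` is hypothetical. Run it with the same interpreter that `pip3` installs into.

```python
import importlib.util


def module_available(name):
    """Return True if the top-level module `name` can be imported here."""
    return importlib.util.find_spec(name) is not None


if __name__ == "__main__":
    # Both docopt and docopt-ng provide a module called `docopt`.
    if module_available("docopt"):
        print("docopt is importable: pip3 install mc-pdf2txt")
    else:
        print("docopt is missing: pip3 install 'mc-pdf2txt[docopt-ng]'")
```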

Usage

Usage:
  mc-pdf2txt [options] <input>...

Options:
  -l LANG           Language, such as `eng`, `jpn`, or `eng+jpn`.
  <input>           Input PDF file.
  -o OUTPUT         Output text file.
  -r DPI            Resolution of temporary image file [default: 600].
  --timeout SEC     Timeout in sec to exec `pdftoppm` [default: 60].
  --page-separator LINE     String to be output as page separator [default: ---].
  --psm VALUE       Page segmentation mode of `tesseract-ocr` [default: 3].
  --verbose         Verbose.
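The options above suggest a two-stage pipeline: `pdftoppm` renders each page to a temporary image, and tesseract then reads that image. The sketch below shows how such a pipeline could be wired up with the documented defaults. It is an illustration under assumptions, not the actual mc-pdf2txt implementation: the function names are hypothetical, the choice of PNG as the image format is a guess, and so is the way the page separator is placed on its own line.

```python
import subprocess
import tempfile
from pathlib import Path


def pdftoppm_command(pdf, prefix, dpi=600):
    """Command that renders every page of `pdf` to PNG files named `prefix`-N.png."""
    return ["pdftoppm", "-r", str(dpi), "-png", str(pdf), str(prefix)]


def tesseract_command(image, lang=None, psm=3):
    """Command that OCRs one image and writes the text to standard output."""
    cmd = ["tesseract", str(image), "stdout", "--psm", str(psm)]
    if lang:
        cmd += ["-l", lang]
    return cmd


def join_pages(pages, separator="---"):
    """Join per-page texts with `separator` on its own line (assumed layout)."""
    return f"\n{separator}\n".join(text.strip("\n") for text in pages)


def pdf_to_text(pdf, lang=None, dpi=600, psm=3, timeout=60, separator="---"):
    """Render `pdf` with pdftoppm, OCR each page with tesseract, join the results."""
    with tempfile.TemporaryDirectory() as tmp:
        prefix = Path(tmp) / "page"
        # The timeout applies to the pdftoppm call only, as in the option text.
        subprocess.run(pdftoppm_command(pdf, prefix, dpi), check=True, timeout=timeout)
        # pdftoppm zero-pads page numbers, so a lexical sort keeps page order.
        images = sorted(Path(tmp).glob("page-*.png"))
        pages = [
            subprocess.run(tesseract_command(image, lang, psm), check=True,
                           capture_output=True, text=True).stdout
            for image in images
        ]
    return join_pages(pages, separator)
```

With poppler and tesseract installed, `pdf_to_text("paper.pdf", lang="eng")` would return the recognized text of a hypothetical `paper.pdf` as a single string. Lowering `dpi` makes rendering faster but can reduce OCR accuracy.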

Download files

Source Distribution

mc-pdf2txt-0.2.0.tar.gz (4.3 kB)
