A script built on Tesseract-OCR for converting .pdf to .txt

Project description

MotionPDF

A cross-platform Python tool for converting PDFs to plaintext, built on the open-source Tesseract OCR library and pyPDF-OCR. It was designed to make it easier to scan zines and pamphlets into readable, accessible formats.

Prerequisites

This application is tested with Tesseract-OCR 4.1. Tesseract-OCR 4.1 or higher should therefore be installed and available on the system PATH.
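One way to confirm this prerequisite before running the tool is a small check like the sketch below. It is not part of MotionPDF; the function names are hypothetical, and it assumes the version banner printed by tesseract --version begins with the word "tesseract" followed by the version number.

```python
import re
import shutil
import subprocess


def parse_tesseract_version(banner):
    """Extract (major, minor) from a banner such as 'tesseract 4.1.1'."""
    match = re.search(r"tesseract\s+v?(\d+)\.(\d+)", banner)
    if match is None:
        return None
    return int(match.group(1)), int(match.group(2))


def tesseract_is_ready(minimum=(4, 1)):
    """Return True if tesseract is on PATH and reports at least `minimum`."""
    if shutil.which("tesseract") is None:
        return False
    result = subprocess.run(
        ["tesseract", "--version"], capture_output=True, text=True
    )
    # Some builds write the banner to stderr, so inspect both streams.
    version = parse_tesseract_version(result.stdout + result.stderr)
    return version is not None and version >= minimum


if __name__ == "__main__":
    print("Tesseract ready:", tesseract_is_ready())
```

If the check reports False, the installation or PATH setup described in the Tesseract user manual is the place to look.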

For guidance on how to do this, please see the Tesseract user manual. For installation on Windows machines, see the following resource.

If you would like to use a language other than English (explained below), you must first install it for Tesseract. On macOS or Ubuntu, additional Tesseract languages may be available through Homebrew (brew install tesseract-[lang]) or apt (apt install tesseract-[lang]), respectively. Tesseract language files can also be downloaded and installed manually.
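To see which languages Tesseract already has, its --list-langs option can be queried from Python. The following is a rough sketch rather than anything MotionPDF ships; the helper names are hypothetical, and it assumes the output consists of a "List of available languages" header followed by one language code per line.

```python
import shutil
import subprocess


def parse_language_list(output):
    """Parse `tesseract --list-langs` output into a sorted list of codes."""
    lines = [line.strip() for line in output.splitlines() if line.strip()]
    # Skip the header line, e.g. 'List of available languages (3):'.
    return sorted(
        line for line in lines if not line.lower().startswith("list of")
    )


def missing_languages(requested, installed):
    """Return codes from a '+'-separated request that are not installed."""
    return [code for code in requested.split("+") if code not in installed]


def installed_languages():
    """Ask the tesseract binary on PATH for its installed languages."""
    if shutil.which("tesseract") is None:
        return []
    result = subprocess.run(
        ["tesseract", "--list-langs"], capture_output=True, text=True
    )
    # Depending on the version, the list may appear on stdout or stderr.
    return parse_language_list(result.stdout + result.stderr)


if __name__ == "__main__":
    print(missing_languages("eng+fra", installed_languages()))
```

An empty list from missing_languages suggests that every requested language is installed.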

Installation

The package can be installed from PyPI with pip, using python3 -m pip install motionpdf.

Behavior

MotionPDF converts PDF files with no available plaintext information into a single .txt file.

Usage

This program is simply a user-friendly wrapper for Tesseract. The command-line tool can be used as follows:

motionpdf (path) [-v] [-o path_to_output] [-l languages] [-L line_width]

where square brackets indicate optional arguments and round brackets indicate mandatory, positional arguments.

motionpdf -h

will generate the help page for the command line tool and explain each option in detail.
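For batch jobs, the same invocation can be assembled from Python and passed to subprocess. The sketch below only builds the argument list from the flags shown in the usage line; build_motionpdf_command and its parameter names are hypothetical and not part of the package.

```python
import subprocess


def build_motionpdf_command(path, verbose=False, output=None,
                            languages=None, linewidth=None):
    """Build an argument list for the motionpdf command-line tool."""
    command = ["motionpdf", str(path)]
    if verbose:
        command.append("-v")
    if output is not None:
        command += ["-o", str(output)]
    if languages:
        # Multiple languages are joined with '+', e.g. 'eng+fra'.
        command += ["-l", "+".join(languages)]
    if linewidth is not None:
        command += ["-L", str(linewidth)]
    return command


def run_motionpdf(path, **options):
    """Run motionpdf; raises CalledProcessError on a non-zero exit code."""
    return subprocess.run(build_motionpdf_command(path, **options), check=True)
```

For example, run_motionpdf("zine.pdf", languages=["eng", "fra"], linewidth=80) would execute motionpdf zine.pdf -l eng+fra -L 80, assuming motionpdf is installed and on the PATH.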

Flags

  • -v or --verbose enables verbose mode. Verbose mode will save the images created from the provided PDF files, as well as print more detailed error messages in the event of a failure. By default this option is disabled.

  • -o or --output allows the user to specify the path where the resulting .txt file will be generated. By default, files will be generated in a new directory called "text" in the current working directory.

  • -l or --language passes the given language directly to Tesseract. To specify one language, use its language code. To specify that a PDF contains multiple languages, list all of the language codes in this flag, separated by a + (e.g. eng+fra).

  • -L or --linewidth tries to organize the text so that each line has at most the specified number of characters. By default, this is set to 0, meaning that the program will let Tesseract try to reproduce the line spacing found in the original document.
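To illustrate what a line-width limit does to OCR output, here is a minimal sketch using Python's standard textwrap module. It is not MotionPDF's actual implementation; limit_line_width is a hypothetical helper, and treating blank lines as paragraph breaks is an assumption made for the example.

```python
import textwrap


def limit_line_width(text, width):
    """Re-wrap text so no line exceeds `width`; 0 leaves the text untouched."""
    if width <= 0:
        return text
    wrapped = []
    for paragraph in text.split("\n\n"):
        # Collapse the original line breaks, then wrap to the new width.
        wrapped.append(textwrap.fill(" ".join(paragraph.split()), width=width))
    return "\n\n".join(wrapped)


if __name__ == "__main__":
    sample = "A short line\nfollowed by another one.\n\nA second paragraph."
    print(limit_line_width(sample, 20))
```

With a width of 0 the text passes through unchanged, mirroring the default described above, where line layout is left to Tesseract.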

Flaws

MotionPDF relies on Tesseract-OCR, a powerful, open-source OCR engine. However, its output is generally less accurate than that of commercial OCR engines. For best results, scan zines, pamphlets, and other texts so that the text is flat and undistorted.

When the process is complete, the resulting text will most likely still contain garbage characters, incorrect lines, or out-of-order lines. The user can fix these manually to get clean text, then move that text into whatever format they desire.
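Manual cleanup can be made a little quicker by flagging lines that look like OCR noise. The sketch below is a simple heuristic and not a MotionPDF feature; suspicious_lines and the 0.6 threshold are assumptions chosen for illustration, and the result is only a list of candidates for a human to review.

```python
def suspicious_lines(text, min_ratio=0.6):
    """Return (line_number, line) pairs whose share of letters, digits,
    and spaces falls below `min_ratio`."""
    flagged = []
    for number, line in enumerate(text.splitlines(), start=1):
        stripped = line.strip()
        if not stripped:
            continue  # blank lines are never flagged
        readable = sum(ch.isalnum() or ch.isspace() for ch in stripped)
        if readable / len(stripped) < min_ratio:
            flagged.append((number, line))
    return flagged


if __name__ == "__main__":
    sample = "A clean line of text.\n~~|| ^^ ## ..\n\nAnother fine line"
    for number, line in suspicious_lines(sample):
        print(f"line {number}: {line!r}")
```

Lines that are mostly punctuation or symbols are reported, while ordinary sentences pass. Out-of-order lines cannot be detected this way and still need to be checked by eye.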

Download files

Source Distribution

motionpdf-0.0.1.tar.gz (5.8 kB view details)

Uploaded Source

Built Distribution

motionpdf-0.0.1-py3-none-any.whl (6.4 kB view details)

Uploaded Python 3

File details

Details for the file motionpdf-0.0.1.tar.gz.

File metadata

  • Download URL: motionpdf-0.0.1.tar.gz
  • Upload date:
  • Size: 5.8 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/3.8.0 pkginfo/1.8.2 readme-renderer/32.0 requests/2.22.0 requests-toolbelt/0.9.1 urllib3/1.26.8 tqdm/4.62.0 importlib-metadata/4.11.1 keyring/18.0.1 rfc3986/2.0.0 colorama/0.4.3 CPython/3.8.10

File hashes

Hashes for motionpdf-0.0.1.tar.gz
Algorithm Hash digest
SHA256 b26e37a5f7c3ee1c20d480f57c48c93a3574698e755267f220e6d4a635e853e9
MD5 ba0f71eaf368f187c8f162e1673b82fa
BLAKE2b-256 1ef6ac039e4464a0380290ae3fbbb7ebbd115b0528304f49cc143337db28e463

File details

Details for the file motionpdf-0.0.1-py3-none-any.whl.

File metadata

  • Download URL: motionpdf-0.0.1-py3-none-any.whl
  • Upload date:
  • Size: 6.4 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/3.8.0 pkginfo/1.8.2 readme-renderer/32.0 requests/2.22.0 requests-toolbelt/0.9.1 urllib3/1.26.8 tqdm/4.62.0 importlib-metadata/4.11.1 keyring/18.0.1 rfc3986/2.0.0 colorama/0.4.3 CPython/3.8.10

File hashes

Hashes for motionpdf-0.0.1-py3-none-any.whl
Algorithm Hash digest
SHA256 793e6c08ab7d324a232f3c58ee8d8aa5218dbcd76a54078a6e62a0f800324b00
MD5 a205e3949947c1344daabe6c6d04e9a1
BLAKE2b-256 65482a1eb843f06ab5b319a6a69ce58dbf8e0be31e60c6f05613846c977d682f
