A Python script that iterates over the PDF files in a directory and tries to guess their language with Tesseract OCR.

PLD (PDF Language Detector)

PLD is a Python program that analyzes PDF files, extracts images, processes them using Optical Character Recognition (OCR), and detects the dominant language of the text. It provides language detection information in JSON format and calculates the average confidence coefficient for each language.
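The per-language averaging described above can be illustrated with a minimal sketch. The input shape (a list of `(language, confidence)` pairs, for instance one pair per OCR'd page and candidate language) and the function names `average_confidence` and `dominant_language` are assumptions made for illustration; they are not PLD's actual internals or its JSON schema.

```python
from collections import defaultdict


def average_confidence(page_results):
    """Average the confidence values per language.

    `page_results` is a hypothetical list of (language, confidence) pairs.
    """
    totals = defaultdict(float)
    counts = defaultdict(int)
    for language, confidence in page_results:
        totals[language] += confidence
        counts[language] += 1
    return {language: totals[language] / counts[language] for language in totals}


def dominant_language(page_results):
    """Return the language with the highest average confidence, or None if empty."""
    averages = average_confidence(page_results)
    if not averages:
        return None
    return max(averages, key=averages.get)
```

With two pages scored for English and Spanish, `dominant_language` would pick whichever language has the higher mean score across both pages.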

Requirements

Installation

Install Tesseract OCR and pdftoppm using your package manager. For example, on Ubuntu:

sudo apt install tesseract-ocr tesseract-ocr-all poppler-utils
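To confirm that both external tools are reachable after installation, a small check along these lines may help. It is only a sketch: the helper name `missing_tools` is hypothetical, and it assumes the executables are named `tesseract` and `pdftoppm`, as they are in the Ubuntu packages above.

```python
import shutil


def missing_tools(tools=("tesseract", "pdftoppm")):
    """Return the names from `tools` that cannot be found on the PATH."""
    return [tool for tool in tools if shutil.which(tool) is None]
```

An empty list from `missing_tools()` suggests both binaries are installed and on the PATH.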

From PyPI

Install with pip:

python3 -m pip install --user pdf-language-detector

Then run directly from your terminal:

pld --help

From the sources

Clone the PLD repository:

git clone git@github.com:icij/pld.git

Install the required Python packages with poetry:

poetry install

Then run inside a virtual env managed by poetry:

poetry run pld --help

From Docker

Install with Docker:

docker pull icij/pld

Then run inside a container:

docker run -it icij/pld pld --help

Usage

Detect

This command processes PDF files and detects the dominant language.

pld detect --help

    --language: A list of ISO3 language codes to detect. Repeat the option for each language.
    --input-dir: Path to the input directory containing PDF files. Default is the current directory.
    --output-dir (optional): Path to the output directory. Default is the 'out' directory in the current directory.
    --max-pages (optional): Maximum number of pages to process per PDF file. Default is 5.
    --resume (optional): Skip PDF files already analyzed.
    --skip-images (optional): Skip the extraction of PDF files as images.
    --skip-ocr (optional): Skip the OCR of images from PDF files.
    --parallel (optional): Number of threads to run in parallel.
    --relative-to (optional): Path to the directory relative to which the output directory path is built.
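How `--output-dir`, `--relative-to`, and `--resume` might interact can be sketched as follows. This is an illustration under stated assumptions, not PLD's implementation: the one-JSON-file-per-PDF layout, the `.json` suffix, and the helper names `output_path_for` and `pdfs_to_process` are all hypothetical.

```python
from pathlib import Path


def output_path_for(pdf_path, output_dir, relative_to=None):
    """Build a hypothetical JSON result path for `pdf_path` inside `output_dir`."""
    pdf_path = Path(pdf_path)
    if relative_to is not None:
        # Keep the sub-directory structure found below `relative_to`.
        relative = pdf_path.relative_to(relative_to)
    else:
        # Without a base directory, only the file name is kept.
        relative = Path(pdf_path.name)
    return Path(output_dir) / relative.with_suffix(".json")


def pdfs_to_process(pdf_paths, output_dir, relative_to=None, resume=False):
    """Return the PDFs to analyze, skipping those with an existing result when resuming."""
    selected = []
    for pdf_path in pdf_paths:
        result = output_path_for(pdf_path, output_dir, relative_to)
        if resume and result.exists():
            continue  # assumed meaning of "already analyzed"
        selected.append(Path(pdf_path))
    return selected
```

Under these assumptions, `docs/a/b.pdf` with `relative_to="docs"` and `output_dir="out"` maps to `out/a/b.json`, so the input tree is mirrored in the output directory.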

Report

This command prints a report from the previously detected languages (using the same output directory).

pld report --help

    --output-dir: Path to the output directory. Default is the 'out' directory in the current directory.
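The idea of building a report from an output directory can be sketched as below. The schema is an assumption: this README does not document the JSON that PLD writes, so the `"language"` key, the recursive `*.json` search, and the helper names `load_records` and `count_languages` are hypothetical.

```python
import json
from collections import Counter
from pathlib import Path


def load_records(output_dir):
    """Load every *.json file found under `output_dir`, searched recursively."""
    records = []
    for json_path in sorted(Path(output_dir).rglob("*.json")):
        with json_path.open(encoding="utf-8") as handle:
            records.append(json.load(handle))
    return records


def count_languages(records, key="language"):
    """Count how many records report each language under the hypothetical `key`."""
    counter = Counter(record[key] for record in records if key in record)
    return dict(counter)
```

Calling `count_languages(load_records("out"))` would then give a per-language document count, provided the files follow the assumed shape.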

Test

You can run the test suite (powered by pytest) with this command:

make test

Examples

Process PDF files in the 'documents' directory, detect English and Spanish, and save the results in the 'results' directory:

pld detect --language eng --language spa --input-dir documents --output-dir results

Process PDF files in the 'documents' directory, detect French and Greek, and limit the processing to 3 pages per file:

pld detect --language fra --language ell --input-dir documents --max-pages 3

License

This project is licensed under the MIT License.
