A Python script that iterates over the PDF files in a directory and tries to guess their language with Tesseract OCR.
PLD (PDF Language Detector)
PLD is a Python program that analyzes PDF files, extracts images, processes them using Optical Character Recognition (OCR), and detects the dominant language of the text. It provides language detection information in JSON format and calculates the average confidence coefficient for each language.
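The averaging step can be sketched in a few lines of Python. This is only an illustration: it assumes each OCR'd page yields one confidence value per candidate language. The names `average_confidence` and `dominant_language`, the input shape, and the sample numbers are hypothetical and do not reflect PLD's actual internals or its JSON schema.

```python
from collections import defaultdict
from statistics import mean


def average_confidence(pages):
    """Average the per-language confidence over all pages.

    `pages` is a hypothetical list of dicts, one per page, mapping a
    language code to the OCR confidence measured on that page.
    """
    scores = defaultdict(list)
    for page in pages:
        for lang, conf in page.items():
            scores[lang].append(conf)
    return {lang: mean(values) for lang, values in scores.items()}


def dominant_language(averages):
    """Return the language code with the highest average confidence."""
    return max(averages, key=averages.get)


# Made-up confidences for a two-page document (not real PLD output).
pages = [{"eng": 90.0, "spa": 60.0}, {"eng": 80.0, "spa": 70.0}]
averages = average_confidence(pages)
print(averages)
print(dominant_language(averages))
```

Taking the language with the highest mean is one simple way to pick a "dominant" language; PLD itself may weigh pages differently.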
Requirements
- Python 3.8 or above
- Tesseract OCR
- pdftoppm
Installation
Install Tesseract OCR and pdftoppm using your package manager. For example, on Ubuntu:
sudo apt install tesseract-ocr tesseract-ocr-all poppler-utils
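Before running PLD, it can help to confirm that both executables are reachable on the PATH. The following is a small sketch using only the Python standard library; the helper name `missing_tools` is hypothetical and is not part of PLD.

```python
import shutil


def missing_tools(tools=("tesseract", "pdftoppm")):
    """Return the names of the required executables not found on PATH."""
    return [tool for tool in tools if shutil.which(tool) is None]


if __name__ == "__main__":
    missing = missing_tools()
    if missing:
        print("Missing executables:", ", ".join(missing))
    else:
        print("tesseract and pdftoppm are both available.")
```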
From PyPI
Install with pip:
python3 -m pip install --user pdf-language-detector
Then run directly from your terminal:
pld --help
From the sources
Clone the PLD repository:
git clone git@github.com:icij/pld.git
Install the required Python packages with poetry:
poetry install
Then run inside a virtual env managed by poetry:
poetry run pld --help
From Docker
Install with Docker:
docker pull icij/pld
Then run inside a container:
docker run -it icij/pld pld --help
Usage
Detect
This command processes PDF files and detects the dominant language.
pld detect --help
--language: A list of ISO3 (three-letter) language codes to detect. Repeat the option once per language.
--input-dir: Path to the input directory containing PDF files. Default is the current directory.
--output-dir (optional): Path to the output directory. Default is the 'out' directory in the current directory.
--max-pages (optional): Maximum number of pages to process per PDF file. Default is 5.
--resume (optional): Skip PDF files already analyzed.
--skip-images (optional): Skip the extraction of PDF files as images.
--skip-ocr (optional): Skip the OCR of images from PDF files.
--parallel (optional): Number of threads to run in parallel.
--relative-to (optional): Path to the directory relative to which the output directory path is built.
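To give a feel for what these options could translate to, here is a hedged sketch that only builds command lines and paths, without running anything. The `pdftoppm` flags (`-png`, `-f`, `-l`) and the `tesseract` arguments (`stdout`, `-l`) are real, but the way PLD actually invokes these tools is an assumption, as are all three helper names. In particular, `output_dir_for` shows just one plausible reading of `--relative-to`.

```python
from pathlib import Path


def pdftoppm_command(pdf, out_prefix, max_pages=5):
    """Build a pdftoppm call rendering pages 1..max_pages as PNG images."""
    return ["pdftoppm", "-png", "-f", "1", "-l", str(max_pages),
            str(pdf), str(out_prefix)]


def tesseract_command(image, language):
    """Build a tesseract call that OCRs one image to stdout in one language."""
    return ["tesseract", str(image), "stdout", "-l", language]


def output_dir_for(pdf, output_dir, relative_to):
    """Mirror the PDF's path (taken relative to `relative_to`) under output_dir.

    Assumed behavior, for illustration only.
    """
    return Path(output_dir) / Path(pdf).relative_to(relative_to).with_suffix("")


print(pdftoppm_command("documents/report.pdf", "out/report/page", max_pages=3))
print(tesseract_command("out/report/page-1.png", "eng"))
print(output_dir_for("documents/2023/report.pdf", "out", "documents"))
```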
Report
This command prints a report from the previously detected languages (using the same output directory).
pld report --help
--output-dir: Path to the output directory. Default is the 'out' directory in the current directory.
Test
You can run the test suite (powered by pytest) with this command:
make test
Examples
Process PDF files in the 'documents' directory, detect English and Spanish languages, and save the results in the 'results' directory:
pld detect --language eng --language spa --input-dir documents --output-dir results
Process PDF files in the 'documents' directory, detect French and Greek languages, and limit the processing to 3 pages per file:
pld detect --language fra --language ell --input-dir documents --max-pages 3
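When calling PLD from another Python script, the argument list can be assembled programmatically. The sketch below uses only the options documented in the Usage section; the helper `build_detect_args` is hypothetical and not part of PLD. It prints the command rather than running it.

```python
import shlex


def build_detect_args(languages, input_dir=None, output_dir=None, max_pages=None):
    """Assemble a `pld detect` argument list from documented options."""
    args = ["pld", "detect"]
    for lang in languages:
        # --language is repeated once per language code.
        args += ["--language", lang]
    if input_dir is not None:
        args += ["--input-dir", str(input_dir)]
    if output_dir is not None:
        args += ["--output-dir", str(output_dir)]
    if max_pages is not None:
        args += ["--max-pages", str(max_pages)]
    return args


args = build_detect_args(["fra", "ell"], input_dir="documents", max_pages=3)
print(shlex.join(args))
```

Passing the resulting list to `subprocess.run(args, check=True)` would execute the command, provided `pld` is installed and on the PATH.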
License
This project is licensed under the MIT License.