A Python script that iterates over the PDF files in a directory and tries to guess their language with Tesseract OCR.
PLD (PDF Language Detector)
PLD is a Python program that analyzes PDF files, extracts images, processes them using Optical Character Recognition (OCR), and detects the dominant language of the text. It provides language detection information in JSON format and calculates the average confidence coefficient for each language.
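The aggregation step described above can be sketched as follows. This is a minimal illustration, not PLD's actual implementation: the `summarize_detections` function, the `(language, confidence)` input format, and the output keys are all hypothetical, and "dominant" is assumed to mean the language detected on the most pages.

```python
import json
from collections import defaultdict
from statistics import mean


def summarize_detections(page_results):
    """Average the confidence per language and pick the dominant one.

    `page_results` is a hypothetical list of (language, confidence) pairs,
    one per OCR'd page. PLD's real internal format may differ.
    """
    if not page_results:
        return {"dominant_language": None, "average_confidence": {}}

    # Group the confidence scores by detected language.
    by_language = defaultdict(list)
    for language, confidence in page_results:
        by_language[language].append(confidence)

    averages = {lang: mean(scores) for lang, scores in by_language.items()}

    # Assumption: the dominant language is the one found on the most pages,
    # with the average confidence used to break ties.
    dominant = max(
        by_language,
        key=lambda lang: (len(by_language[lang]), averages[lang]),
    )
    return {"dominant_language": dominant, "average_confidence": averages}


pages = [("eng", 0.5), ("eng", 1.0), ("spa", 0.25)]
print(json.dumps(summarize_detections(pages), indent=2))
```

The report is serialized with `json.dumps` to match the JSON output mentioned above. The key names are placeholders.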
Requirements
- Python 3.8 or above
- Tesseract OCR
- pdftoppm
Installation
Install Tesseract OCR and pdftoppm using your package manager. For example, on Ubuntu:
sudo apt install tesseract-ocr tesseract-ocr-all poppler-utils
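To confirm that both external tools are reachable after installation, a quick check along these lines may help. It is a small sketch that assumes the executables are named `tesseract` and `pdftoppm` and are expected on the `PATH`. The `missing_tools` helper is hypothetical and not part of PLD.

```python
import shutil


def missing_tools(tools=("tesseract", "pdftoppm")):
    """Return the names of required executables that are not on PATH."""
    return [tool for tool in tools if shutil.which(tool) is None]


missing = missing_tools()
if missing:
    print("Missing executables:", ", ".join(missing))
else:
    print("Tesseract OCR and pdftoppm are both available.")
```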
From PyPI
Install with pip:
python3 -m pip install --user pdf-language-detector
From the sources
Clone the PLD repository:
git clone git@github.com:icij/pld.git
Install the required Python packages with poetry:
poetry install
Usage
pld --help
--language: A comma-separated list of ISO3 (three-letter) language codes to detect, such as eng or spa.
--input-dir: Path to the input directory containing PDF files. Default is the current directory.
--output-dir (optional): Path to the output directory. Default is the 'out' directory in the current directory.
--max-pages (optional): Maximum number of pages to process per PDF file. Default is 5.
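For readers who want to see how these options and their defaults fit together, here is a sketch of an equivalent parser built with `argparse`. It is only an illustration: PLD's real command-line code may use a different library. The `build_parser` and `parse_languages` helpers are hypothetical, and accepting both repeated and comma-separated `--language` values is an assumption.

```python
import argparse
from pathlib import Path


def build_parser():
    """Build a parser mirroring the documented options and defaults."""
    parser = argparse.ArgumentParser(prog="pld")
    # Assumption: the option may be repeated and/or hold comma-separated codes.
    parser.add_argument("--language", action="append", default=None,
                        help="ISO3 language code(s) to detect")
    parser.add_argument("--input-dir", type=Path, default=Path("."),
                        help="directory containing the PDF files")
    parser.add_argument("--output-dir", type=Path, default=Path("out"),
                        help="directory where results are written")
    parser.add_argument("--max-pages", type=int, default=5,
                        help="maximum number of pages to process per PDF")
    return parser


def parse_languages(values):
    """Flatten repeated and comma-separated values into one list of codes."""
    return [
        code.strip()
        for value in (values or [])
        for code in value.split(",")
        if code.strip()
    ]


args = build_parser().parse_args(["--language", "eng,spa", "--language", "fra"])
print(parse_languages(args.language), args.input_dir, args.output_dir, args.max_pages)
```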
Examples
Process PDF files in the 'documents' directory, detect English and Spanish, and save the results in the 'results' directory:
pld --language eng --language spa --input-dir documents --output-dir results
Process PDF files in the 'documents' directory, detect French and Greek, and limit the processing to 3 pages per file:
pld --language fra --language ell --input-dir documents --max-pages 3
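The same commands can be assembled from Python, for instance to call PLD from a larger script. The sketch below only builds the argument list from the flags documented above. The `build_pld_command` helper is hypothetical, and it assumes the `pld` executable is on the `PATH` when the command is eventually run.

```python
import shlex


def build_pld_command(languages, input_dir, output_dir=None, max_pages=None):
    """Build the argument list for a pld invocation from documented flags."""
    command = ["pld"]
    # One --language flag per code, as in the examples above.
    for code in languages:
        command += ["--language", code]
    command += ["--input-dir", str(input_dir)]
    if output_dir is not None:
        command += ["--output-dir", str(output_dir)]
    if max_pages is not None:
        command += ["--max-pages", str(max_pages)]
    return command


command = build_pld_command(["fra", "ell"], "documents", max_pages=3)
print(shlex.join(command))
# To actually run it: subprocess.run(command, check=True)
```

Passing the list to `subprocess.run` rather than a single string avoids shell quoting issues with directory names that contain spaces.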
License
This project is licensed under the MIT License.
Download files
Source Distribution
Hashes for pdf_language_detector-0.0.1.tar.gz
Algorithm | Hash digest
---|---
SHA256 | 7372df27268c0dba6929930d739c7463d9b629402b27d2499d941086259a8c68
MD5 | 20ae5fa7fadc4f55e327b2fc1cf69841
BLAKE2b-256 | f8ed4d4b4e6c01ac2359cd4462b1f64d898b0407afebc817c65d38dad62fc1fa
Built Distribution
Hashes for pdf_language_detector-0.0.1-py3-none-any.whl
Algorithm | Hash digest
---|---
SHA256 | afb4285a322c3b899c77bd92c0dd4f78912c98c97779154d148d2d7a8aad4864
MD5 | b158749c61193e0fe8d50da67ebc32ba
BLAKE2b-256 | 6f64a898611767edce3fa3798ad5845e20848f336e3fc61ddd32d8cde9de36d5