A robust Python library for extracting information from passports using OCR (Tesseract) and MRZ parsing.

Passport OCR Library

A Python library for extracting information from passports using OCR (Tesseract). It supports parsing MRZ (Machine Readable Zone) and extracting additional fields like "Place of Issue" and "Date of Issue" from the visual zone.

Features

Robust MRZ Parsing: Handles common OCR errors, corrects line lengths, and supports various MRZ formats (TD1, TD2, TD3).
Full Text Extraction: Extracts non-MRZ fields like Place of Issue and Date of Issue.
Data Formatting:
- Dates are standardized to dd-MM-YYYY.
- Country codes are converted to full country names (e.g., VNM -> Vietnam).
- Names are converted to Title Case (e.g., LE TIEN DAT -> Le Tien Dat).
Input Flexibility: Accepts image file paths or Base64 encoded strings.
Fallback Logic: If the Date of Issue cannot be read, the library can estimate it as the Expiry Date minus 10 years. This assumes a 10-year validity period, which does not hold for every passport.
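The formatting and fallback rules listed above can be sketched in a few lines. This is only an illustration of the described behaviour, not the library's actual code: the function names, the tiny COUNTRY_NAMES table, and the leap-day handling are assumptions made for the example.

```python
from datetime import date

# Hypothetical lookup table; a real implementation would cover all ISO 3166-1 alpha-3 codes.
COUNTRY_NAMES = {"VNM": "Vietnam"}


def format_date(d: date) -> str:
    """Render a date as dd-MM-YYYY (day-month-year)."""
    return d.strftime("%d-%m-%Y")


def country_name(code: str) -> str:
    """Map a three-letter country code to a full name, keeping unknown codes as-is."""
    return COUNTRY_NAMES.get(code, code)


def title_case_name(raw: str) -> str:
    """Convert an upper-case name such as 'LE TIEN DAT' to 'Le Tien Dat'."""
    return " ".join(part.capitalize() for part in raw.split())


def infer_issue_date(expiry: date, validity_years: int = 10) -> date:
    """Estimate the issue date as the expiry date minus validity_years."""
    try:
        return expiry.replace(year=expiry.year - validity_years)
    except ValueError:
        # 29 February has no counterpart in a non-leap year; use 28 February (an assumption).
        return expiry.replace(year=expiry.year - validity_years, day=28)
```

For example, an expiry date of 1 June 2035 would give an estimated issue date of 01-06-2025 under the 10-year assumption.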

Prerequisites

Tesseract OCR must be installed on your system before you use the library.

Linux (Ubuntu/Debian)

sudo apt-get update
sudo apt-get install tesseract-ocr libtesseract-dev

Linux (Rocky/RHEL/CentOS)

sudo dnf install epel-release
sudo dnf install tesseract tesseract-devel

macOS

brew install tesseract

Windows

Download and run the installer from the UB-Mannheim/tesseract project.
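After installing, it can help to confirm that the tesseract executable is reachable from Python. The sketch below assumes the executable is found through the system PATH; this document does not say how the library locates Tesseract, and tesseract_version is a hypothetical helper, not part of passport_ocr.

```python
import shutil
import subprocess
from typing import Optional


def tesseract_version() -> Optional[str]:
    """Return the first line of `tesseract --version`, or None if it is not on PATH."""
    exe = shutil.which("tesseract")
    if exe is None:
        return None
    proc = subprocess.run([exe, "--version"], capture_output=True, text=True)
    # Some older builds write the version banner to stderr instead of stdout.
    text = proc.stdout or proc.stderr
    return text.splitlines()[0] if text else ""


if __name__ == "__main__":
    version = tesseract_version()
    if version is None:
        print("Tesseract was not found on PATH")
    else:
        print("Found:", version)
```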

Installation

Clone this repository.
Install Python dependencies:

pip install -r requirements.txt

Usage

Basic Usage (File Path)

from passport_ocr import read_passport

# Path to your passport image
image_path = "path/to/passport.jpg"

result = read_passport(image_path)

print(result)

Advanced Usage (Base64)

You can pass a Base64 string directly (with or without the data:image/...;base64, header).

from passport_ocr import read_passport

# Your base64 string
base64_string = "data:image/jpeg;base64,/9j/4AAQSkZJRg..."

result = read_passport(base64_string)

print(result)
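If you are unsure how to produce such a string, or want to see what "with or without the header" means in practice, the following sketch may help. Both helper functions are hypothetical and use only the standard library; they are not part of passport_ocr, and the library's own decoding may differ.

```python
import base64


def encode_image_file(path: str) -> str:
    """Read an image file and return its contents as a Base64 string (no header)."""
    with open(path, "rb") as f:
        return base64.b64encode(f.read()).decode("ascii")


def decode_image_input(data: str) -> bytes:
    """Decode a Base64 image string, with or without a data-URI header."""
    if data.startswith("data:"):
        # A data URI looks like "data:image/jpeg;base64,<payload>"; keep only the payload.
        _, _, data = data.partition(",")
    return base64.b64decode(data)
```

A string returned by encode_image_file could then be passed to read_passport in place of a file path.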

Output Format

The library returns a dictionary with the extracted fields:

{
    'fullname': '',        # Combined surname + name (Title Case)
    'surname': '',
    'name': '',
    'sex': 'M',            # M or F
    'birth_date': '',      # dd-MM-YYYY
    'expiry_date': '',     # dd-MM-YYYY
    'date_of_issue': '',   # dd-MM-YYYY (extracted or calculated)
    'place_of_issue': '',  # Extracted from the visual zone
    'document_number': '',
    'nationality': '',     # Full country name
    'country': '',         # Issuing country
    'type': '',            # Passport type (TD3 is standard)
    'valid': True,         # True if MRZ checksums are valid
    'raw_mrz': [...]       # List of MRZ lines (for debugging)
}
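The valid flag depends on the MRZ check digits. How the library computes them is not shown in this document; the sketch below follows the check-digit scheme defined in ICAO Doc 9303 (repeating weights 7, 3, 1; digits count as themselves, A-Z as 10-35, and the filler < as 0; the result is the weighted sum modulo 10). The function names are hypothetical.

```python
MRZ_WEIGHTS = (7, 3, 1)


def mrz_char_value(ch: str) -> int:
    """Numeric value of one MRZ character under ICAO Doc 9303."""
    if "0" <= ch <= "9":
        return int(ch)
    if "A" <= ch <= "Z":
        return ord(ch) - ord("A") + 10
    if ch == "<":
        return 0
    raise ValueError(f"invalid MRZ character: {ch!r}")


def mrz_check_digit(field: str) -> int:
    """Weighted sum of the field's character values, modulo 10."""
    total = sum(
        mrz_char_value(ch) * MRZ_WEIGHTS[i % 3] for i, ch in enumerate(field)
    )
    return total % 10


def mrz_field_is_valid(field: str, check_char: str) -> bool:
    """Compare the computed check digit with the one printed in the MRZ."""
    return check_char.isdigit() and mrz_check_digit(field) == int(check_char)
```

With the ICAO specimen document number L898902C3, mrz_check_digit returns 6, which matches the digit printed after that field in the specimen MRZ.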

Notes

Image Quality: High-resolution, glare-free images work best.
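Language Support: The library uses Tesseract's English model (eng) by default. This works well for most passports, including Vietnamese ones, because the MRZ contains only Latin letters, digits, and the < filler character, and the visual-zone labels are usually printed in English alongside the local language.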

Version 0.1.0, released Dec 11, 2025
