A robust Python library for extracting information from passports using OCR (Tesseract) and MRZ parsing.
This is a mockup library used for OCR of identity documents.
Passport
A Python library for extracting information from passports using OCR (Tesseract). It supports parsing MRZ (Machine Readable Zone) and extracting additional fields like "Place of Issue" and "Date of Issue" from the visual zone.
Features
- Robust MRZ Parsing: Handles common OCR errors, corrects line lengths, and supports various MRZ formats (TD1, TD2, TD3).
- Full Text Extraction: Extracts non-MRZ fields like Place of Issue and Date of Issue.
- Data Formatting:
  - Dates are standardized to dd-MM-YYYY.
  - Country codes are converted to full country names (e.g., VNM -> Vietnam).
  - Names are converted to Title Case (e.g., NGUYEN VAN A -> Nguyen Van A).
- Input Flexibility: Accepts image file paths or Base64 encoded strings.
- Fallback Logic: If Date of Issue is missing, it can be inferred from Expiry Date (Expiry - 10 years).
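The formatting and fallback rules listed above can be sketched with the standard library alone. This is an illustration, not the library's actual code: the helper names (format_mrz_date, country_name, title_case_name, infer_issue_date), the explicit century argument, and the one-entry country table are all assumptions made for the example.

```python
from datetime import date

# Hypothetical lookup table; a real one would cover every ICAO country code.
COUNTRY_NAMES = {"VNM": "Vietnam"}


def format_mrz_date(yymmdd: str, century: int) -> str:
    """Convert an MRZ date (YYMMDD) to dd-MM-YYYY using an explicit century base."""
    year = century + int(yymmdd[0:2])
    month = int(yymmdd[2:4])
    day = int(yymmdd[4:6])
    return date(year, month, day).strftime("%d-%m-%Y")


def country_name(code: str) -> str:
    """Map a three-letter code to a full name, returning the code itself if unknown."""
    return COUNTRY_NAMES.get(code, code)


def title_case_name(raw: str) -> str:
    """Turn 'NGUYEN VAN A' into 'Nguyen Van A'."""
    return " ".join(part.capitalize() for part in raw.split())


def infer_issue_date(expiry: str, validity_years: int = 10) -> str:
    """Estimate the issue date as expiry minus validity_years (dd-MM-YYYY in and out)."""
    day, month, year = (int(part) for part in expiry.split("-"))
    year -= validity_years
    try:
        return date(year, month, day).strftime("%d-%m-%Y")
    except ValueError:
        # 29 February has no counterpart in a non-leap year; use 28 February.
        return date(year, month, 28).strftime("%d-%m-%Y")
```

Note that the 10-year rule is only an estimate: some passports (for example, those issued to minors) have shorter validity, so an inferred date should be treated as approximate.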
Prerequisites
You need to have Tesseract OCR installed on your system.
Linux (Ubuntu/Debian)
sudo apt-get update
sudo apt-get install tesseract-ocr libtesseract-dev
Linux (Rocky/RHEL/CentOS)
sudo dnf install epel-release
sudo dnf install tesseract tesseract-devel
macOS
brew install tesseract
Windows
Download and install the installer from UB-Mannheim/tesseract.
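After installing, it may help to confirm that the tesseract executable is visible on your PATH before running any OCR. The check below is a small sketch using only the standard library; the function name find_tesseract is hypothetical and not part of this package.

```python
import shutil
from typing import Optional


def find_tesseract() -> Optional[str]:
    """Return the full path of the tesseract executable on PATH, or None if absent."""
    return shutil.which("tesseract")


path = find_tesseract()
print(f"tesseract found at {path}" if path else "tesseract not found on PATH")
```

On Windows, the installer does not always add Tesseract to PATH, so a "not found" result there may simply mean the install directory needs to be added manually.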
Installation
- Clone this repository.
- Install Python dependencies:
pip install m-ocr-mockup
Usage
Basic Usage (File Path)
from identity_ocr import read_passport
# Path to your passport image
image_path = "path/to/passport.jpg"
result = read_passport(image_path)
print(result)
Advanced Usage (Base64)
You can pass a Base64 string directly (with or without the data:image/...;base64, header).
from identity_ocr import read_passport
# Your base64 string
base64_string = "data:image/jpeg;base64,/9j/4AAQSkZJRg..."
result = read_passport(base64_string)
print(result)
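To illustrate what accepting a string "with or without the header" can involve, here is a minimal sketch that strips an optional data-URI prefix and decodes the payload to raw image bytes. It is an assumption about one reasonable approach, not the library's internal implementation, and decode_image_base64 is a hypothetical name.

```python
import base64


def decode_image_base64(data: str) -> bytes:
    """Decode a Base64 image string, with or without a data-URI header."""
    if data.startswith("data:") and "," in data:
        # Drop everything up to and including the first comma,
        # e.g. "data:image/jpeg;base64,".
        data = data.split(",", 1)[1]
    return base64.b64decode(data)
```

The returned bytes can then be handed to an image loader of your choice.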
Output Format
The library returns a dictionary with the extracted fields:
{
'fullname': '', # Combined Surname + Name (Title Case)
'surname': '',
'name': '',
'sex': 'M', # M or F
'birth_date': '', # dd-MM-YYYY
'expiry_date': '', # dd-MM-YYYY
'date_of_issue': '', # dd-MM-YYYY (Extracted or Calculated)
'place_of_issue': '', # Extracted from visual zone
'document_number': '',
'nationality': '', # Full country name
'country': '', # Issuing country
'type': '', # Passport type (TD3 is standard)
'valid': True, # True if MRZ checksums are valid
'raw_mrz': [...] # List of MRZ lines (for debugging)
}
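The valid flag depends on the MRZ check digits. As background, the sketch below computes a check digit the way ICAO Doc 9303 defines it: digits keep their value, letters A-Z map to 10-35, the filler < counts as 0, and the values are weighted 7, 3, 1 repeatedly before taking the sum modulo 10. The function name mrz_check_digit is hypothetical, and how this library combines the individual checks into valid is not shown here.

```python
def mrz_check_digit(field: str) -> int:
    """Compute the ICAO 9303 check digit for an MRZ field."""
    weights = (7, 3, 1)
    total = 0
    for index, ch in enumerate(field):
        if ch in "0123456789":
            value = int(ch)
        elif "A" <= ch <= "Z":
            value = ord(ch) - ord("A") + 10
        elif ch == "<":
            value = 0
        else:
            raise ValueError(f"invalid MRZ character: {ch!r}")
        total += value * weights[index % 3]
    return total % 10


# Fields from the ICAO 9303 specimen passport.
print(mrz_check_digit("L898902C3"))  # document number -> 6
print(mrz_check_digit("740812"))     # birth date -> 2
print(mrz_check_digit("120415"))     # expiry date -> 9
```

A field passes when the computed digit equals the digit printed immediately after it in the MRZ.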
Notes
- Image Quality: High-resolution, glare-free images work best.
- Language Support: The library uses Tesseract's English model (eng) by default. It works well for most passports (including Vietnamese) as they are bilingual.
Download files
Source Distribution
File details
Details for the file identity_ocr-0.1.1.tar.gz.
File metadata
- Download URL: identity_ocr-0.1.1.tar.gz
- Upload date:
- Size: 15.2 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.13.7
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | d41e4a1307f2001b92171eb37c65d0f7d8504125d2d5bb5f855db72d660fd195 |
| MD5 | bae46f3809a4815187a4d667f96725a6 |
| BLAKE2b-256 | c715ccfd6e23134d4b0468e1f87f09d44279d84c94065c9dc75f282af9621ece |