A robust Python library for extracting information from passports using OCR (Tesseract) and MRZ parsing.

This is a mockup library used for OCR of identity documents.

Passport

A Python library for extracting information from passports using OCR (Tesseract). It supports parsing MRZ (Machine Readable Zone) and extracting additional fields like "Place of Issue" and "Date of Issue" from the visual zone.

Features

  • Robust MRZ Parsing: Handles common OCR errors, corrects line lengths, and supports various MRZ formats (TD1, TD2, TD3).
  • Full Text Extraction: Extracts non-MRZ fields like Place of Issue and Date of Issue.
  • Data Formatting:
    • Dates are standardized to dd-MM-YYYY.
    • Country codes are converted to full country names (e.g., VNM -> Vietnam).
    • Names are converted to Title Case (e.g., NGUYEN VAN A -> Nguyen Van A).
  • Input Flexibility: Accepts image file paths or Base64 encoded strings.
  • Fallback Logic: If Date of Issue cannot be read, the library can infer it as Expiry Date minus 10 years. This assumes a 10-year validity period.
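The formatting and fallback rules above can be sketched with the standard library alone. This is an illustration, not the library's own code: the helper names (format_date, country_name, title_case_name, infer_issue_date) and the small COUNTRY_NAMES table are assumptions made for this example.

```python
from datetime import date

# Illustrative subset only (an assumption); a real table would cover every
# three-letter code that can appear in an MRZ.
COUNTRY_NAMES = {"VNM": "Vietnam", "FRA": "France"}


def format_date(d: date) -> str:
    """Render a date as dd-MM-YYYY, e.g. 17-05-2030."""
    return d.strftime("%d-%m-%Y")


def country_name(code: str) -> str:
    """Map a three-letter code to a country name, falling back to the code itself."""
    return COUNTRY_NAMES.get(code.upper(), code)


def title_case_name(raw: str) -> str:
    """Convert an upper-case MRZ name to Title Case, word by word."""
    return " ".join(part.capitalize() for part in raw.split())


def infer_issue_date(expiry: date, validity_years: int = 10) -> date:
    """Estimate the issue date as expiry minus the validity period."""
    try:
        return expiry.replace(year=expiry.year - validity_years)
    except ValueError:
        # 29 February does not exist in the target year; fall back to 28 February.
        return expiry.replace(year=expiry.year - validity_years, day=28)
```

The inferred issue date is only an estimate: passports issued with a shorter validity period (for example to minors in some countries) will produce a wrong value, so it is worth treating a calculated date differently from an extracted one.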

Prerequisites

You need to have Tesseract OCR installed on your system.

Linux (Ubuntu/Debian)

sudo apt-get update
sudo apt-get install tesseract-ocr libtesseract-dev

Linux (Rocky/RHEL/CentOS)

sudo dnf install epel-release
sudo dnf install tesseract

Note: If you encounter version conflicts with language packs, installing just tesseract is often sufficient, as it usually includes the English data. If you need other languages, make sure the language pack version matches the installed tesseract version.

macOS

brew install tesseract

Windows

Download and run the installer from UB-Mannheim/tesseract.
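Whichever platform you use, it can help to confirm that the tesseract executable is on your PATH before calling the library. The following is a minimal sketch using only the standard library; find_tesseract is a hypothetical helper name and is not part of this package.

```python
import shutil
from typing import Optional


def find_tesseract() -> Optional[str]:
    """Return the full path of the tesseract executable, or None if it is not on PATH."""
    return shutil.which("tesseract")


if __name__ == "__main__":
    path = find_tesseract()
    if path is None:
        print("Tesseract not found on PATH; install it using the instructions above.")
    else:
        print(f"Tesseract found at {path}")
```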

Installation

Install the package from PyPI:
pip install m_identify_ocr

Usage

Basic Usage (File Path)

from m_identify_ocr import read_passport

# Path to your passport image
image_path = "path/to/passport.jpg"

result = read_passport(image_path)

print(result)
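In practice you will usually want to check the result before trusting it. The sketch below assumes the dictionary keys listed under Output Format (valid, fullname, document_number, expiry_date); summarize_passport is a hypothetical helper written for this example, and the sample values are made up.

```python
def summarize_passport(result: dict) -> str:
    """Build a one-line summary, refusing to trust data whose MRZ checksums failed."""
    if not result.get("valid", False):
        return "MRZ checksums failed; treat the extracted fields as unverified"
    return (
        f"{result.get('fullname', '')} "
        f"({result.get('document_number', '')}), "
        f"expires {result.get('expiry_date', '')}"
    )


if __name__ == "__main__":
    # Made-up sample shaped like the documented output, not real OCR output.
    sample = {
        "fullname": "Nguyen Van A",
        "document_number": "X0000000",
        "expiry_date": "01-01-2030",
        "valid": True,
    }
    print(summarize_passport(sample))
```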

Advanced Usage (Base64)

You can pass a Base64 string directly (with or without the data:image/...;base64, header).

from m_identify_ocr import read_passport

# Your base64 string
base64_string = "data:image/jpeg;base64,/9j/4AAQSkZJRg..."

result = read_passport(base64_string)

print(result)
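To illustrate what "with or without the header" means, here is one way such input could be produced and decoded using Python's base64 module. It is a sketch under assumptions, not the library's implementation: to_data_uri and decode_base64_image are hypothetical names introduced for this example.

```python
import base64


def to_data_uri(image_bytes: bytes, mime: str = "image/jpeg") -> str:
    """Encode raw image bytes as a data URI with a Base64 payload."""
    payload = base64.b64encode(image_bytes).decode("ascii")
    return f"data:{mime};base64,{payload}"


def decode_base64_image(data: str) -> bytes:
    """Decode a Base64 string, dropping an optional data-URI header first."""
    if data.startswith("data:") and "," in data:
        # Keep only the part after the first comma, i.e. the Base64 payload.
        data = data.split(",", 1)[1]
    return base64.b64decode(data)
```

To build the input from a file, read it in binary mode and pass the bytes to to_data_uri; the resulting string has the same shape as the base64_string shown above.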

Output Format

The library returns a dictionary with the extracted fields:

{
    'fullname': '',          # Combined Surname + Name (Title Case)
    'surname': '',
    'name': '',
    'sex': 'M',              # M or F
    'birth_date': '',        # dd-MM-YYYY
    'expiry_date': '',       # dd-MM-YYYY
    'date_of_issue': '',     # dd-MM-YYYY (Extracted or Calculated)
    'place_of_issue': '',    # Extracted from visual zone
    'document_number': '',
    'nationality': '',       # Full country name
    'country': '',           # Issuing country
    'type': '',              # Passport type (TD3 is standard)
    'valid': True,           # True if MRZ checksums are valid
    'raw_mrz': [...]         # List of MRZ lines (for debugging)
}
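For readers debugging a result with valid set to False, the sketch below shows the standard ICAO Doc 9303 check-digit calculation (repeating weights 7, 3, 1; digits count as themselves, A to Z as 10 to 35, and the filler < as 0) and the usual line-count and line-length signatures of the TD1, TD2, and TD3 formats. How the library applies these checks internally is not shown in this document; check_digit and detect_format are hypothetical helpers for inspecting raw_mrz yourself.

```python
from typing import List, Optional

# (number of lines, characters per line) for each MRZ format.
MRZ_FORMATS = {(3, 30): "TD1", (2, 36): "TD2", (2, 44): "TD3"}


def char_value(ch: str) -> int:
    """Numeric value of a single MRZ character for check-digit purposes."""
    if ch in "0123456789":
        return int(ch)
    if "A" <= ch <= "Z":
        return ord(ch) - ord("A") + 10
    if ch == "<":
        return 0
    raise ValueError(f"invalid MRZ character: {ch!r}")


def check_digit(field: str) -> int:
    """Weighted sum of the field's characters (weights 7, 3, 1 repeating), modulo 10."""
    weights = (7, 3, 1)
    return sum(char_value(c) * weights[i % 3] for i, c in enumerate(field)) % 10


def detect_format(lines: List[str]) -> Optional[str]:
    """Guess TD1/TD2/TD3 from line count and length; None if nothing matches."""
    if not lines or len({len(line) for line in lines}) != 1:
        return None
    return MRZ_FORMATS.get((len(lines), len(lines[0])))
```

For example, the document number L898902C3 yields a check digit of 6, which is the value that must follow it in the MRZ for that field to be considered valid.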

Notes

  • Image Quality: High-resolution, glare-free images work best.
  • Language Support: The library uses Tesseract's English model (eng) by default. This works well for most passports, including Vietnamese ones, because their printed field labels are bilingual and include English.
