A Python package for extracting information from Panamanian identity documents (Cédulas) and passports

Document Analyzer

Document Analyzer is a Python package for extracting structured information from identity documents using PaddleOCR. It supports Panamanian ID cards (Cédulas) in Spanish, and passports with standard ICAO Machine Readable Zones (MRZ) in Spanish or English. The package automatically detects the document type and language, loading the appropriate OCR instance accordingly. It is specifically designed to work with mobile phone photos of documents rather than scans or PDFs, and includes automatic image preprocessing to improve extraction accuracy from lower-quality images.

Version

Current version: 0.1.0 — Initial release.

Features

  • Cédula Extraction — Extract ID number, date of birth, place of birth, and expiry date from Panamanian identity cards, and detect the handwritten signature
  • Passport Extraction — Extract ID number, date of birth, place of birth, nationality, and expiry date from passports with standard ICAO Machine Readable Zones (MRZ). Works with any country's passport that follows the ICAO standard format.
  • Automatic Document Detection — Detect whether an image contains a Cédula or a passport
  • Image Preprocessing — Automatically enhance poor quality images before OCR processing
  • CLI Support — Full command-line interface for document analysis without writing code
  • JSON Output — Structured JSON results for easy integration into other systems
  • Multi-Language Support — Cédulas are processed in Spanish only. Passports support automatic language detection between Spanish and English, with the appropriate PaddleOCR instance loaded based on detected language.

Requirements

  • Python 3.8 to 3.12
  • PaddleOCR 3.2.0

Installation

pip install document-analyzer

CLI Usage

The package includes a command-line interface accessible via the document-analyzer command.

Basic Usage with Auto-Detection

Analyze a document with automatic type detection:

document-analyzer analyze photo.jpg

The output is printed as JSON to stdout.
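
Because the result goes to stdout as JSON, another program can call the CLI and parse what it prints. The sketch below shows one way to do that from Python. `run_json_command` is a hypothetical helper, not part of the package, and it assumes stdout carries a single JSON document and nothing else.

```python
import json
import subprocess


def run_json_command(command):
    """Run a command and parse its stdout as JSON.

    Hypothetical helper, not part of document-analyzer. Assumes the
    command prints one JSON document to stdout and nothing else.
    """
    completed = subprocess.run(
        command,
        capture_output=True,  # collect stdout and stderr instead of printing them
        text=True,            # decode the output to str
        check=True,           # raise CalledProcessError on a non-zero exit code
    )
    return json.loads(completed.stdout)
```

With the package installed, `run_json_command(["document-analyzer", "analyze", "photo.jpg"])` should return the same dictionary the CLI prints, provided the assumption about stdout holds.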

Specify Document Type

If you know the document type, you can skip auto-detection for faster processing:

document-analyzer analyze cedula.jpg --type cedula
document-analyzer analyze passport.jpg --type passport

Save Output to File

Save analysis results to a JSON file instead of printing to stdout:

document-analyzer analyze photo.jpg --save result.json

Verbose Mode

Enable debug-level logging to see detailed processing information:

document-analyzer analyze photo.jpg -v

Combine with --save for logging while saving results:

document-analyzer analyze photo.jpg --save result.json -v

Help

View all available options:

document-analyzer analyze --help

Library Usage

You can use Document Analyzer as a Python library in your own code. Here are examples for the main use cases.

Auto-Detection with DocumentAnalyzer

from document_analyzer import DocumentAnalyzer

# Initialize with image path
analyzer = DocumentAnalyzer("photo.jpg")

# Detect document type
doc_type = analyzer.detect_document_type()
print(f"Detected: {doc_type}")  # "cedula" or "passport" or "unknown"

Extract from Cédula

from document_analyzer import CedulaAnalyzer

# Initialize with image path
analyzer = CedulaAnalyzer("cedula.jpg")

# Analyze the document
results = analyzer.analyze_cedula()
print(results)

# Optional: provide user email for logging context
analyzer = CedulaAnalyzer("cedula.jpg", user_email="user@example.com")

Extract from Passport

from document_analyzer import PassportAnalyzer

# Initialize with image path
analyzer = PassportAnalyzer("passport.jpg")

# Analyze the document
results = analyzer.analyze_passport()
print(results)

# Optional: provide user email for logging context
analyzer = PassportAnalyzer("passport.jpg", user_email="user@example.com")

Convenience Functions

You can also use high-level functions for simpler code:

from document_analyzer import analyze_document, analyze_cedula, analyze_passport

# Auto-detect and analyze
result = analyze_document("photo.jpg")

# Analyze specific document type
cedula_result = analyze_cedula("cedula.jpg")
passport_result = analyze_passport("passport.jpg")
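
For more than one image, the convenience functions can be wrapped in a small loop. The sketch below is hypothetical and not part of the package: `analyze_folder` takes the analysis function as an argument (for example `analyze_document`) and assumes that function accepts a path string, as in the examples above. Whether the package raises an exception or returns a result with "success": "none" on a bad image is not stated here, so the loop guards against both by catching exceptions and leaving the result dictionaries untouched.

```python
from pathlib import Path


def analyze_folder(folder, analyze):
    """Apply `analyze` to every file in `folder`.

    `analyze` is any callable that takes an image path string and returns
    a result dict. Returns (results, errors), both keyed by file name.
    """
    results, errors = {}, {}
    for path in sorted(Path(folder).iterdir()):
        if not path.is_file():
            continue  # skip sub-directories
        try:
            results[path.name] = analyze(str(path))
        except Exception as exc:  # keep going if one image fails
            errors[path.name] = str(exc)
    return results, errors
```

A call such as `analyze_folder("photos", analyze_document)` would then return one result per file plus a separate dictionary of failures.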

Output

Analysis results are returned as dictionaries containing structured information about the extracted data. Below are example outputs for both document types with realistic but fictional Panamanian data.

Cédula Output Example

{
    "success": "both",
    "cedula_info": {
        "type": "cedula",
        "id_number": "8-123-456",
        "dob": "15-May-1990",
        "pob": "Panama",
        "nationality": "Panamanian",
        "expiry": "22-Mar-2030"
    },
    "signature": "iVBORw0KGgoAAAANSUhEUgAAAAEAAAABCAYAAAAfFcSJAAAADUlEQVR42mNk+M9QDwADhgGAWjR9awAAAABJRU5ErkJggg=="
}

The success field can be "both" (all info + signature), "cedula_info" (all info but no signature), "signature" (signature only), or "none" (extraction failed).
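
The four success values can be handled in one place. The sketch below is a hypothetical helper, not part of the package. It assumes the signature value is a base64-encoded image, which is an inference from the example above (the string starts with the base64 form of the PNG file signature) rather than documented behavior.

```python
import base64


def unpack_cedula_result(result):
    """Return (info, signature_bytes) from a cédula result dict.

    Either element is None when the `success` field says it is missing.
    Assumes `signature` holds base64-encoded image data.
    """
    success = result.get("success", "none")

    info = None
    if success in ("both", "cedula_info"):
        info = result.get("cedula_info")

    signature = None
    if success in ("both", "signature") and result.get("signature"):
        signature = base64.b64decode(result["signature"])

    return info, signature
```

The decoded bytes can be written to a file opened in binary mode if the signature image needs to be stored.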

Passport Output Example

{
    "success": "passport_info",
    "passport_info": {
        "type": "passport",
        "id_number": "PA123456789",
        "dob": "20-Nov-1988",
        "pob": "Colón",
        "nationality": "PAN",
        "expiry": "10-Sep-2032"
    },
    "signature": null
}

The success field can be "passport_info" (extraction successful) or "none" (extraction failed).
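
Dates in both examples are strings such as "15-May-1990". The sketch below turns them into `datetime.date` objects and checks whether a document has expired. Both functions are hypothetical helpers, and they assume the format is always day, three-letter English month abbreviation, and four-digit year, as in the examples. A fixed month table is used instead of `strptime` with `%b` so the result does not depend on the system locale.

```python
from datetime import date

# English month abbreviations, as they appear in the example outputs.
_MONTHS = {
    "Jan": 1, "Feb": 2, "Mar": 3, "Apr": 4, "May": 5, "Jun": 6,
    "Jul": 7, "Aug": 8, "Sep": 9, "Oct": 10, "Nov": 11, "Dec": 12,
}


def parse_document_date(text):
    """Parse a 'DD-Mon-YYYY' string into a date (assumed format)."""
    day, month, year = text.split("-")
    return date(int(year), _MONTHS[month], int(day))


def is_expired(info, today=None):
    """Return True if the `expiry` date in an info dict is before `today`."""
    if today is None:
        today = date.today()
    return parse_document_date(info["expiry"]) < today
```

Here `info` is the inner dictionary, that is `result["cedula_info"]` or `result["passport_info"]`.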

Image Requirements

Document Analyzer is designed to work with mobile phone photos of documents. Here are the technical requirements:

  • Supported Formats — JPEG, PNG, BMP, TIFF, GIF
  • Orientation — Portrait orientation works best
  • Quality — Mobile phone camera quality is acceptable; the package includes automatic preprocessing to handle lower quality images
  • Coverage — Entire document should be visible in the frame
  • Lighting — Avoid strong shadows or glare across the document

The package includes automatic image preprocessing that attempts to enhance poor quality images before OCR processing. This can help improve accuracy for images with:

  • Low contrast
  • Poor lighting conditions
  • Motion blur
  • Dust or slight damage

Note on PDFs: PDF files have not been tested and are not officially supported, so they may not work as expected. Use image files (JPG, PNG, etc.) for best results.
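
A cheap pre-check on the file extension can reject PDFs and other unsupported files before any OCR work starts. The sketch below is a hypothetical helper, not part of the package. The extension set is an assumption that maps the supported formats listed above to their usual file extensions, and the check does not inspect the file contents.

```python
from pathlib import Path

# Usual extensions for JPEG, PNG, BMP, TIFF, and GIF (assumed mapping).
SUPPORTED_EXTENSIONS = {".jpg", ".jpeg", ".png", ".bmp", ".tif", ".tiff", ".gif"}


def has_supported_extension(path):
    """Return True if the file name ends in one of the assumed extensions."""
    return Path(path).suffix.lower() in SUPPORTED_EXTENSIONS
```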

GPU Acceleration

PaddleOCR supports GPU acceleration via CUDA for significantly faster processing on NVIDIA GPUs. However, Document Analyzer has only been tested and validated on CPU hardware (Intel i5, 10th generation).

If you want to experiment with GPU acceleration, you will need to:

  1. Configure PaddleOCR to use your CUDA-enabled GPU according to the PaddleOCR documentation
  2. Ensure your system has CUDA and cuDNN properly configured
  3. Test thoroughly in your environment before deploying to production

CPU processing is stable and recommended for production use.

Logging

Document Analyzer uses Python's standard logging module with the logger namespace document_analyzer. This allows you to configure logging behavior in your own applications.

Basic Configuration

import logging

# Show debug-level messages from all loggers, including document_analyzer
logging.basicConfig(level=logging.DEBUG)

Django Configuration

If you're using Django and want to capture logs from Document Analyzer, add this to your settings.py:

LOGGING = {
    'version': 1,
    'disable_existing_loggers': False,
    'handlers': {
        'console': {
            'class': 'logging.StreamHandler',
        },
        'file': {
            'class': 'logging.FileHandler',
            'filename': 'document_analyzer.log',
        },
    },
    'loggers': {
        'document_analyzer': {
            'handlers': ['console', 'file'],
            'level': 'DEBUG',
        },
    },
}

Flask Configuration

For Flask applications:

import logging
from logging.handlers import RotatingFileHandler

from flask import Flask

app = Flask(__name__)

if not app.debug:
    # Rotate the log file at roughly 10 MB, keeping 10 backups
    handler = RotatingFileHandler('document_analyzer.log', maxBytes=10000000, backupCount=10)
    handler.setLevel(logging.DEBUG)
    app.logger.addHandler(handler)

    # Send document_analyzer's log records to the same file
    doc_logger = logging.getLogger('document_analyzer')
    doc_logger.addHandler(handler)
    doc_logger.setLevel(logging.DEBUG)

Limitations

Be aware of the following limitations when using Document Analyzer:

  • Cédula Support — Cédula extraction is specifically designed for Panamanian identity cards in Spanish only. Non-Panamanian identity documents are not supported. Passport extraction works with any standard ICAO MRZ passport regardless of country.

  • Cédula Language — Panamanian Cédulas are processed in Spanish only. English or other languages are not supported for Cédulas.

  • Image Quality Dependency — Extraction accuracy depends on image quality. Very poor lighting, severe blur, or damaged documents may produce incomplete or inaccurate results. While the package includes preprocessing to improve poor quality images, there are limits to what can be recovered.

  • PDF Support Not Tested — PDFs are not officially supported and have not been tested. The package is designed for and tested with image files (JPG, PNG, etc.).

  • Passport MRZ Dependency — Passport extraction relies primarily on the Machine Readable Zone (MRZ) at the bottom of the document page. If the MRZ is obscured, cut off, or damaged in the photo, extraction accuracy will be significantly affected. Ensure the entire document including the bottom strip is clearly visible in the frame.

  • Place of Birth for Non-Panamanian Passports — Place of birth is the only passport field extracted from the document's written fields rather than the MRZ. This works reliably for Panamanian passports. For other countries it may be inaccurate or missing depending on how that country formats and labels the biographical page of their passport.

  • CPU Testing Only — The package has only been tested on CPU hardware (Intel i5, 10th generation). GPU acceleration via CUDA may work but is not officially supported or validated.
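
On the MRZ dependency: the ICAO 9303 MRZ format protects the document number, date of birth, and expiry date with check digits, which is one way a misread character in a damaged MRZ can be caught. The sketch below computes that check digit using the standard 7-3-1 weighting. It illustrates the standard only: whether Document Analyzer performs this validation internally is not stated here, and the result dictionaries shown earlier do not include the raw MRZ text.

```python
def mrz_check_digit(field):
    """Compute the ICAO 9303 check digit for an MRZ field.

    Digits count as their value, letters A-Z as 10-35, and the filler
    character '<' as 0. Values are weighted 7, 3, 1 (repeating), summed,
    and reduced modulo 10.
    """
    weights = (7, 3, 1)
    total = 0
    for index, char in enumerate(field):
        if "0" <= char <= "9":
            value = int(char)
        elif "A" <= char <= "Z":
            value = ord(char) - ord("A") + 10
        elif char == "<":
            value = 0
        else:
            raise ValueError(f"invalid MRZ character: {char!r}")
        total += value * weights[index % 3]
    return total % 10
```

A field is consistent when the digit printed after it in the MRZ equals `mrz_check_digit(field)`. For example, a date of birth field of "740812" should be followed by the digit 2.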

License

This project is licensed under the Apache License 2.0. See the LICENSE file for details.
