A tool for verifying PDF statements from institutions in Tanzania and beyond

Statement Verification

A Python package for verifying PDF statements from financial institutions. Extracts metadata, detects the issuing institution, and provides verification scores.

Installation

# From PyPI (recommended)
pip install sverification

# Or from source
git clone https://github.com/Tausi-Africa/statement-verification.git
cd statement-verification
pip install -e .

Quick Start

# Command line usage with font comparison
verify-statement path/to/statement.pdf --brands statements_metadata.json --font-data statements_font_data.json

# Python API - Simple verification
import sverification

result = sverification.verify_statement_verbose("statement.pdf")
print(f"Brand: {result['detected_brand']}, Score: {result['combined_score']:.1f}%")
print(f"Metadata: {result['verification_score']:.1f}%, Font: {result['font_score']:.1f}%")

📚 Function Reference

1. verify_statement_verbose() - Complete Verification with Font Analysis

Purpose: Performs complete statement verification with metadata and font comparison.

import sverification

# Basic usage with both metadata and font verification
result = sverification.verify_statement_verbose("statement.pdf")

# With custom template files
result = sverification.verify_statement_verbose(
    pdf_path="statement.pdf",
    brands_json_path="custom_brands.json",
    font_data_json_path="custom_font_data.json"
)

# Access comprehensive results
print(f"Detected Brand: {result['detected_brand']}")
print(f"Combined Score: {result['combined_score']:.1f}%")
print(f"Metadata Score: {result['verification_score']:.1f}%")
print(f"Font Score: {result['font_score']:.1f}%")

# Check metadata fields
for field in result['field_results']:
    status = "✓" if field['match'] else "✗"
    print(f"[{status}] {field['field']}: {field['actual']} (expected: {field['expected']})")

# Check font fields
for font_field in result['font_results']:
    status = "✓" if font_field['match'] else "✗"
    print(f"[{status}] {font_field['field']}: {font_field['actual']} (expected: {font_field['expected']})")

Returns: Dictionary with complete verification data

  • detected_brand: Institution name
  • combined_score: Overall score combining metadata and font analysis
  • verification_score: Metadata verification score (0-100)
  • font_score: Font comparison score (0-100)
  • field_results: List of metadata field comparisons
  • font_results: List of font field comparisons
  • total_fields: Number of metadata fields checked
  • matched_fields: Number of matching metadata fields
  • total_font_fields: Number of font fields checked
  • matched_font_fields: Number of matching font fields
  • summary: Human-readable summary
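
When a score is below 100%, the entries in field_results and font_results show which checks failed. The sketch below collects them into one list. It assumes only the keys listed above and the entry keys used in the earlier loop (field, expected, actual, match); failed_checks is a hypothetical helper, not part of the package, and the sample dictionary is hand-written rather than real output.

```python
def failed_checks(result):
    """Return (kind, field, expected, actual) for every non-matching check."""
    failures = []
    for kind, key in (("metadata", "field_results"), ("font", "font_results")):
        for entry in result.get(key, []):
            if not entry["match"]:
                failures.append(
                    (kind, entry["field"], entry["expected"], entry["actual"])
                )
    return failures


# Hand-written sample shaped like the documented return value (not real output).
sample_result = {
    "detected_brand": "selcom",
    "field_results": [
        {"field": "pdf_version", "expected": "1.4", "actual": "1.4", "match": True},
        {"field": "creator", "expected": "Selcom", "actual": "Selcom", "match": True},
    ],
    "font_results": [
        {"field": "font_pdf_version", "expected": "PDF-1.7",
         "actual": "PDF-1.4", "match": False},
    ],
}

for kind, field, expected, actual in failed_checks(sample_result):
    print(f"{kind}: {field} expected={expected!r} actual={actual!r}")
```

In real use, pass the dictionary returned by verify_statement_verbose() instead of sample_result.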

2. print_verification_report() - Formatted Output with Font Analysis

Purpose: Prints a formatted verification report including font comparison.

import sverification

# Get verification results
result = sverification.verify_statement_verbose("statement.pdf")

# Print formatted report (same as CLI output)
sverification.print_verification_report(result)

# Output example:
# ========================================================================
# PDF: statement.pdf
# Detected brand: selcom
# Template used: selcom
# Metadata fields checked: 5
# Metadata fields matched: 5
# Metadata score: 100.0%
# Font fields checked: 4
# Font fields matched: 3
# Font score: 75.0%
# Combined verification score: 91.7%
# ------------------------------------------------------------------------
# Metadata Comparison (expected vs. actual):
#   [✓] pdf_version      expected='1.4'  actual='1.4'
#   [✓] creator          expected='Selcom'  actual='Selcom'
# ------------------------------------------------------------------------
# Font Comparison (expected vs. actual):
#   [✗] font_pdf_version expected='PDF-1.7'  actual='PDF-1.4'
#   [✓] font_count       expected=2  actual=2
#   [✓] font_names       expected=['Helvetica']  actual=['Helvetica']
# ------------------------------------------------------------------------
# Metadata Score: 100.0% | Font Score: 75.0% | Combined: 91.7%
# ========================================================================

3. extract_all() - PDF Metadata Extraction

Purpose: Extracts comprehensive metadata from PDF files.

import sverification

# Extract metadata
metadata = sverification.extract_all("statement.pdf")

# Access specific metadata
print(f"PDF Version: {metadata['pdf_version']}")
print(f"Creator: {metadata['creator']}")
print(f"Producer: {metadata['producer']}")
print(f"Creation Date: {metadata['creationdate']}")
print(f"Modification Date: {metadata['moddate']}")
print(f"EOF Markers: {metadata['eof_markers']}")
print(f"PDF Versions: {metadata['pdf_versions']}")

# Check for potential issues
if metadata['eof_markers'] > 1:
    print("⚠️  Multiple EOF markers detected")

if metadata['creationdate'] != metadata['moddate']:
    print("⚠️  Creation and modification dates differ")

Returns: Dictionary with extracted metadata

  • pdf_version: PDF specification version
  • creator: Application that created the PDF
  • producer: Software that produced the PDF
  • creationdate: When PDF was created
  • moddate: When PDF was last modified
  • eof_markers: Number of %%EOF markers in the file (more than one may indicate incremental saves or tampering)
  • pdf_versions: Number of PDF versions
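
Comparing creationdate and moddate as raw strings can flag two values that describe the same instant in different time zones. The sketch below assumes the values are raw PDF date strings of the form D:YYYYMMDDHHmmSS followed by an optional offset; the README does not say what format extract_all() returns, so treat that as an assumption. parse_pdf_date is a hypothetical helper, and treating a missing offset as UTC is a simplification.

```python
import re
from datetime import datetime, timedelta, timezone

# D:YYYY[MM[DD[HH[mm[SS]]]]] followed by Z or +HH'mm' / -HH'mm' (all optional).
_PDF_DATE = re.compile(
    r"^D:(\d{4})(\d{2})?(\d{2})?(\d{2})?(\d{2})?(\d{2})?"
    r"(?:(Z)|([+-])(\d{2})'?(\d{2})?'?)?$"
)


def parse_pdf_date(value):
    """Parse a PDF date string into an aware datetime, or return None."""
    m = _PDF_DATE.match(value.strip())
    if not m:
        return None
    defaults = (0, 1, 1, 0, 0, 0)
    parts = [int(g) if g else d for g, d in zip(m.groups()[:6], defaults)]
    if m.group(7) or not m.group(8):
        tz = timezone.utc  # assumption: a missing offset is treated as UTC
    else:
        offset = timedelta(hours=int(m.group(9)), minutes=int(m.group(10) or 0))
        tz = timezone(offset if m.group(8) == "+" else -offset)
    try:
        return datetime(*parts, tzinfo=tz)
    except ValueError:  # e.g. month 13
        return None


created = parse_pdf_date("D:20240115103000+03'00'")
modified = parse_pdf_date("D:20240115073000Z")
print(created == modified)  # same instant, different offsets
```

If either value fails to parse, falling back to the plain string comparison keeps the original behaviour.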

4. get_company_name() - Institution Detection

Purpose: Automatically detects the financial institution from PDF content.

import sverification

# Detect institution
company = sverification.get_company_name("statement.pdf")
print(f"Detected Institution: {company}")

# Handle unknown institutions
if company == "unknown":
    print("⚠️  Institution not recognized")
    print("Consider adding detection rules for this institution")

# Examples of detected institutions:
# "selcom", "vodacom", "airtel", "absa", "crdb", "nmb", etc.

Returns: String with institution code

  • Returns standardized institution codes (e.g., "selcom", "vodacom")
  • Returns "unknown" if institution cannot be detected

5. extract_pdf_font_data() - Font Information Extraction

Purpose: Extracts comprehensive font information from PDF files.

import sverification

# Extract font data
font_data = sverification.extract_pdf_font_data("statement.pdf")

# Access font information
print(f"PDF Version: {font_data['pdf_version']}")
print(f"Total Fonts: {font_data['total_no_of_fonts']}")
print(f"Font Names: {font_data['font_names']}")
print(f"Info Object: {font_data['info_object']}")

# Example output:
# {
#   'pdf_version': 'PDF-1.4',
#   'total_no_of_fonts': 2,
#   'font_names': ['Helvetica', 'AZHGJL+ArialMT'],
#   'info_object': '20 0 R'
# }

Returns: Dictionary with font information

  • pdf_version: PDF version string read during font extraction (e.g., 'PDF-1.4')
  • total_no_of_fonts: Number of fonts used in the PDF
  • font_names: List of font names/identifiers
  • info_object: PDF info object reference
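
A name such as 'AZHGJL+ArialMT' in the example output follows the PDF convention for font subsets: six uppercase letters, a plus sign, then the base font name. The tag is arbitrary, so it can help to separate it from the base name before comparing fonts by hand. The sketch below does that; split_subset_tag and base_font_names are hypothetical helpers, and the README does not say whether compare_font_data() treats the tag this way.

```python
import re

# Six uppercase letters followed by '+' mark a subset font in PDF.
_SUBSET_TAG = re.compile(r"^([A-Z]{6})\+(.+)$")


def split_subset_tag(font_name):
    """Return (tag, base_name); tag is None when there is no subset prefix."""
    m = _SUBSET_TAG.match(font_name)
    if m:
        return m.group(1), m.group(2)
    return None, font_name


def base_font_names(font_names):
    """Return the sorted, de-duplicated base names without subset tags."""
    return sorted({split_subset_tag(name)[1] for name in font_names})


print(base_font_names(["Helvetica", "AZHGJL+ArialMT"]))
# ['ArialMT', 'Helvetica']
```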

6. compare_font_data() - Font Comparison

Purpose: Compares extracted font data against expected font template.

import sverification

# Extract font data and load templates
font_data = sverification.extract_pdf_font_data("statement.pdf")
font_templates = sverification.load_font_data("statements_font_data.json")
company = sverification.get_company_name("statement.pdf")

# Get expected font template
# Fall back to an empty template if the brand is missing or has no templates
expected_font = (font_templates.get(company.lower()) or [{}])[0]

# Compare font data
font_results, font_score = sverification.compare_font_data(font_data, expected_font)

print(f"Font Score: {font_score:.1f}%")
print("\nFont comparison results:")

for field_name, expected_val, actual_val, is_match in font_results:
    status = "✓ PASS" if is_match else "✗ FAIL"
    print(f"{status} {field_name}")
    print(f"  Expected: {expected_val}")
    print(f"  Actual:   {actual_val}")
    print()

Returns: Tuple of (results_list, percentage_score)

  • results_list: List of tuples (field, expected, actual, match_bool)
  • percentage_score: Float between 0-100
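
The percentage can be cross-checked against the results list. The sketch below assumes the score is simply matched fields divided by total fields, which agrees with the sample report (3 of 4 font fields gives 75.0%) but is not stated by the package; score_from_results is a hypothetical helper and the sample tuples are hand-written.

```python
def score_from_results(results):
    """Percentage of (field, expected, actual, match) tuples whose match is true."""
    if not results:
        return 0.0
    matched = sum(1 for _field, _expected, _actual, is_match in results if is_match)
    return 100.0 * matched / len(results)


# Hand-written sample in the documented tuple shape (not real output).
sample_font_results = [
    ("pdf_version", "PDF-1.7", "PDF-1.4", False),
    ("total_no_of_fonts", 2, 2, True),
    ("font_names", ["Helvetica"], ["Helvetica"], True),
    ("info_object", "20 0 R", "20 0 R", True),
]

print(f"{score_from_results(sample_font_results):.1f}%")  # 75.0%
```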

7. load_font_data() - Font Template Management

Purpose: Loads font templates for comparison.

import sverification

# Load font templates
font_data = sverification.load_font_data("statements_font_data.json")

# Check available font templates
print("Available font templates:")
for brand_code, templates in font_data.items():
    print(f"  - {brand_code}: {len(templates)} template(s)")

# Get font template for specific institution
selcom_font_templates = font_data.get("selcom", [])
if selcom_font_templates:
    template = selcom_font_templates[0]  # Use first template
    print(f"Expected PDF version: {template.get('pdf_version')}")
    print(f"Expected font count: {template.get('total_no_of_fonts')}")
    print(f"Expected fonts: {template.get('font_names')}")

Returns: Dictionary mapping institution codes to font template lists
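
For a custom font template file, the shape implied above is a JSON object whose keys are institution codes and whose values are lists of templates. The sketch below writes a small file in that shape and checks it with the standard json module. The key names come from the template.get() calls above; whether the bundled file carries further keys is unknown, and validate_font_templates is a hypothetical helper rather than a package function.

```python
import json
import tempfile
from pathlib import Path

EXPECTED_KEYS = ("pdf_version", "total_no_of_fonts", "font_names")

# Illustrative values only; real templates come from genuine statements.
sample_templates = {
    "selcom": [
        {
            "pdf_version": "PDF-1.4",
            "total_no_of_fonts": 2,
            "font_names": ["Helvetica", "AZHGJL+ArialMT"],
        }
    ]
}


def validate_font_templates(data):
    """Return a list of human-readable problems; empty means the shape looks fine."""
    if not isinstance(data, dict):
        return ["top level must be a JSON object"]
    problems = []
    for brand, templates in data.items():
        if not isinstance(templates, list) or not templates:
            problems.append(f"{brand}: expected a non-empty list of templates")
            continue
        for index, template in enumerate(templates):
            if not isinstance(template, dict):
                problems.append(f"{brand}[{index}]: template must be an object")
                continue
            missing = [key for key in EXPECTED_KEYS if key not in template]
            if missing:
                problems.append(f"{brand}[{index}]: missing {missing}")
    return problems


with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "custom_font_data.json"
    path.write_text(json.dumps(sample_templates, indent=2), encoding="utf-8")
    loaded = json.loads(path.read_text(encoding="utf-8"))

print(validate_font_templates(loaded) or "template file looks well-formed")
```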

8. load_brands() - Metadata Template Management

Purpose: Loads institution templates for comparison.

import sverification

# Load default templates
brands = sverification.load_brands("statements_metadata.json")

# Check available institutions
print("Available institutions:")
for brand_code, templates in brands.items():
    print(f"  - {brand_code}: {len(templates)} template(s)")

# Get template for specific institution
selcom_templates = brands.get("selcom", [])
if selcom_templates:
    template = selcom_templates[0]  # Use first template
    print(f"Expected PDF version for Selcom: {template.get('pdf_version')}")
    print(f"Expected creator: {template.get('creator')}")

Returns: Dictionary mapping institution codes to template lists
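
The example above uses the first template, but an institution can have several. One possible approach, sketched below, is to pick the template that agrees with the extracted metadata on the most fields. This uses plain equality and is only an illustration; best_template is a hypothetical helper, and the package may select templates differently.

```python
def count_matches(actual, template):
    """Number of template fields whose value equals the extracted value."""
    return sum(1 for key, expected in template.items() if actual.get(key) == expected)


def best_template(actual, templates):
    """Return (template, match_count); the earliest template wins a tie."""
    if not templates:
        return None, 0
    best = max(templates, key=lambda template: count_matches(actual, template))
    return best, count_matches(actual, best)


# Hand-written sample data (not real templates).
metadata = {"pdf_version": "1.4", "creator": "Selcom", "producer": "ExamplePDF"}
templates = [
    {"pdf_version": "1.7", "creator": "Selcom"},
    {"pdf_version": "1.4", "creator": "Selcom"},
]

template, matches = best_template(metadata, templates)
print(f"Best template matches {matches} of {len(template)} fields")
```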

9. compare_fields() - Metadata Field Comparison

Purpose: Compares extracted metadata against expected template.

import sverification

# Extract metadata and load templates
metadata = sverification.extract_all("statement.pdf")
brands = sverification.load_brands("statements_metadata.json")
company = sverification.get_company_name("statement.pdf")

# Get expected template
# Fall back to an empty template if the brand is missing or has no templates
expected = (brands.get(company.lower()) or [{}])[0]

# Compare fields
results, score = sverification.compare_fields(metadata, expected)

print(f"Overall Score: {score:.1f}%")
print("\nField-by-field results:")

for field_name, expected_val, actual_val, is_match in results:
    status = "✓ PASS" if is_match else "✗ FAIL"
    print(f"{status} {field_name}")
    print(f"  Expected: {expected_val}")
    print(f"  Actual:   {actual_val}")
    print()

Returns: Tuple of (results_list, percentage_score)

  • results_list: List of tuples (field, expected, actual, match_bool)
  • percentage_score: Float between 0-100
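
The tuples returned here carry the same four values as the dictionaries in field_results from verify_statement_verbose(). If downstream code expects the dictionary form, a small adapter is enough. The sketch below assumes the dictionary keys field, expected, actual and match shown in the first example; results_to_dicts is a hypothetical helper.

```python
def results_to_dicts(results):
    """Convert (field, expected, actual, match) tuples into dictionaries."""
    return [
        {"field": field, "expected": expected, "actual": actual, "match": bool(match)}
        for field, expected, actual, match in results
    ]


# Hand-written sample in the documented tuple shape (not real output).
sample_results = [
    ("pdf_version", "1.4", "1.4", True),
    ("creator", "Selcom", "Unknown", False),
]

for entry in results_to_dicts(sample_results):
    status = "PASS" if entry["match"] else "FAIL"
    print(f"{status} {entry['field']}")
```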

🔄 Common Workflows

Batch Processing with Font Analysis

import sverification
import os

def process_directory_with_fonts(pdf_directory):
    """Process all PDFs in a directory with font analysis"""
    results = []
    
    for filename in os.listdir(pdf_directory):
        if filename.lower().endswith('.pdf'):  # also match .PDF
            pdf_path = os.path.join(pdf_directory, filename)
            
            try:
                result = sverification.verify_statement_verbose(pdf_path)
                results.append({
                    'file': filename,
                    'brand': result['detected_brand'],
                    'combined_score': result['combined_score'],
                    'metadata_score': result['verification_score'],
                    'font_score': result['font_score']
                })
                print(f"✓ {filename}: Combined {result['combined_score']:.1f}% (Meta: {result['verification_score']:.1f}%, Font: {result['font_score']:.1f}%)")
            except Exception as e:
                print(f"✗ {filename}: Error - {e}")
    
    return results

# Process all PDFs with enhanced analysis
results = process_directory_with_fonts("./statements/")
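
The list returned by the batch function can be saved for review. The sketch below writes it as CSV with the standard csv module, lowest combined score first so that the most suspicious files appear at the top. write_results_csv is a hypothetical helper, the rows are hand-written samples in the shape built above, and the output goes to an in-memory buffer so the example runs without touching disk.

```python
import csv
import io

FIELDNAMES = ["file", "brand", "combined_score", "metadata_score", "font_score"]


def write_results_csv(results, stream):
    """Write batch results to an open text stream, lowest combined score first."""
    writer = csv.DictWriter(stream, fieldnames=FIELDNAMES)
    writer.writeheader()
    for row in sorted(results, key=lambda r: r["combined_score"]):
        writer.writerow(row)


# Hand-written sample rows (not real output).
sample_rows = [
    {"file": "a.pdf", "brand": "selcom", "combined_score": 91.7,
     "metadata_score": 100.0, "font_score": 75.0},
    {"file": "b.pdf", "brand": "crdb", "combined_score": 62.5,
     "metadata_score": 60.0, "font_score": 67.5},
]

buffer = io.StringIO()
write_results_csv(sample_rows, buffer)
print(buffer.getvalue())
```

To write a file instead, open it with open("results.csv", "w", newline="", encoding="utf-8") and pass the file object as the stream.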

Font Quality Analysis

import sverification

def analyze_font_quality(pdf_path):
    """Analyze font quality and consistency"""
    try:
        font_data = sverification.extract_pdf_font_data(pdf_path)
        company = sverification.get_company_name(pdf_path)
        
        issues = []
        
        # Check for subset fonts: a six-letter tag plus '+' (e.g. 'AZHGJL+ArialMT').
        # Subsetting is common, so treat this as something to compare against the template.
        subset_fonts = [f for f in font_data.get('font_names', []) if '+' in f]
        if subset_fonts:
            issues.append(f"Subset fonts detected: {subset_fonts}")
        
        # Check for unusual font count
        font_count = font_data.get('total_no_of_fonts', 0)
        if font_count > 5:
            issues.append(f"High font count: {font_count} fonts")
        elif font_count == 0:
            issues.append("No fonts detected")
        
        return {
            'company': company,
            'font_data': font_data,
            'issues': issues
        }
    except Exception as e:
        return {'error': str(e)}

# Analyze font quality
analysis = analyze_font_quality("statement.pdf")
if 'error' not in analysis:
    print(f"Institution: {analysis['company']}")
    print(f"Font Count: {analysis['font_data']['total_no_of_fonts']}")
    if analysis['issues']:
        print("⚠️  Font issues:")
        for issue in analysis['issues']:
            print(f"  - {issue}")
    else:
        print("✓ No font issues detected")

Custom Analysis

import sverification

def analyze_statement_quality(pdf_path):
    """Analyze statement quality indicators"""
    metadata = sverification.extract_all(pdf_path)
    company = sverification.get_company_name(pdf_path)
    
    issues = []
    
    # Check for multiple EOF markers (potential tampering)
    if metadata['eof_markers'] > 1:
        issues.append("Multiple EOF markers detected")
    
    # Check for date inconsistencies
    if metadata['creationdate'] != metadata['moddate']:
        issues.append("Creation and modification dates differ")
    
    # Check for unknown institution
    if company == "unknown":
        issues.append("Institution not recognized")
    
    return {
        'company': company,
        'issues': issues,
        'metadata': metadata
    }

# Analyze a statement
analysis = analyze_statement_quality("statement.pdf")
print(f"Institution: {analysis['company']}")
if analysis['issues']:
    print("⚠️  Issues found:")
    for issue in analysis['issues']:
        print(f"  - {issue}")
else:
    print("✓ No issues detected")

🔍 What's Verified

Metadata Analysis

  • PDF Version: Document format version
  • Creation/Modification Dates: Timestamp consistency
  • Creator/Producer: Software used to generate the PDF
  • EOF Markers: Security indicators (multiple markers may indicate tampering)
  • Document Properties: Author, subject, keywords, trapped status

Font Analysis (NEW!)

  • Font Count: Number of fonts used in the document
  • Font Names: Specific fonts and their identifiers
  • Font Embedding: Detection of embedded vs. system fonts
  • PDF Version Consistency: Cross-verification with metadata
  • Font Info Objects: Internal PDF reference validation

Combined Scoring

The package now provides three types of scores:

  • Metadata Score: Traditional metadata verification (0-100%)
  • Font Score: Font consistency verification (0-100%)
  • Combined Score: Weighted combination of both analyses
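
The weights are not documented here. As a worked check, the figures in the sample report (metadata 100.0%, font 75.0%, combined 91.7%) are consistent with a weight of 2/3 on metadata and 1/3 on fonts. The sketch below reproduces that arithmetic; the weights are inferred from that single example, combine_scores is a hypothetical helper, and the package's actual formula may differ.

```python
def combine_scores(metadata_score, font_score, metadata_weight=2 / 3):
    """Weighted average of the two scores; the default weight is an inference."""
    return metadata_weight * metadata_score + (1 - metadata_weight) * font_score


print(f"{combine_scores(100.0, 75.0):.1f}%")                       # 91.7%
print(f"{combine_scores(100.0, 75.0, metadata_weight=0.5):.1f}%")  # 87.5%
```

Equal weighting gives 87.5% for the same inputs, which does not match the sample report.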

🏦 Supported Institutions

Banks: ABSA, CRDB, DTB, Exim, NMB, NBC, TCB, UBA
Mobile Money: Airtel, Tigo, Vodacom, Halotel, Selcom
Others: Azam Pesa, PayMaart, and more...

🧪 Testing

# Run tests
pytest

# Run with coverage
pytest --cov=sverification

📄 License

Proprietary software licensed by Black Swan AI Global. See LICENSE for details.

Download files

Source Distribution

sverification-0.1.2.tar.gz (27.7 kB view details)

Uploaded Source

Built Distribution

sverification-0.1.2-py3-none-any.whl (27.1 kB view details)

Uploaded Python 3

File details

Details for the file sverification-0.1.2.tar.gz.

File metadata

  • Download URL: sverification-0.1.2.tar.gz
  • Upload date:
  • Size: 27.7 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.1.0 CPython/3.13.3

File hashes

Hashes for sverification-0.1.2.tar.gz
Algorithm Hash digest
SHA256 1c470b60ef438438e790df1f55cd5eadc39bc75d67562bc5247214d8666741e2
MD5 31742f398bff76d8156f75ae3bae3e16
BLAKE2b-256 7dd6fb67a78bc2f3a164b4f0564fb63668445a35f13b67a28cee7b5d2644c264

File details

Details for the file sverification-0.1.2-py3-none-any.whl.

File metadata

  • Download URL: sverification-0.1.2-py3-none-any.whl
  • Upload date:
  • Size: 27.1 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.1.0 CPython/3.13.3

File hashes

Hashes for sverification-0.1.2-py3-none-any.whl
Algorithm Hash digest
SHA256 2eaf7a49025795f2398919cc8ec5af3b18b147338d49a4bd5b28cf6289862672
MD5 109a86eb472c33f986885b818db88443
BLAKE2b-256 fc124973e5f20f0b714f66c952210b94115000d70d44c48dd708f1fd50fef699
