
File type identification and validation for document processing workflows


Document Processing Hub

Requires Python 3.10+. License: MIT.

A Python library for file type identification and validation in document processing workflows. It detects and classifies files by matching their names against naming patterns and validation rules; file contents are not read.

🌟 Features

  • Intelligent File Type Identification: Automatically detect file types based on naming patterns
  • Format Validation: Validate file naming conventions for specific formats:
    • ZJ Format: ZJ{YYMMDD}{DD}-{version} with strict validation (e.g., ZJ26042804-8170.xlsx)
    • Manpower Files: Detect and classify Manpower Budget, Documents, and general types
    • Real-time Production: Identify real-time production reports
  • Duplicate File Handling: Support for files with duplicate suffixes like (1), (2), etc.
  • Comprehensive Error Handling: Clear classification of invalid files (e.g., ZJ_NO_CLASIFICADO_FALTA_DE_DATOS)
  • Extensible Architecture: Easy to add new file type validators
  • Zero Dependencies: Pure Python implementation with no external dependencies

📦 Installation

From PyPI

pip install documentprocessinghub-ljd

From GitHub

git clone https://github.com/LJD-UwU/Document-Processing-Hub.git
cd Document-Processing-Hub
pip install .

Development Install

git clone https://github.com/LJD-UwU/Document-Processing-Hub.git
cd Document-Processing-Hub
pip install -e .

🚀 Quick Start

Basic Usage

from documentprocessinghub import identify_file_type, get_file_info

# Identify file type
file_type = identify_file_type("ZJ26042804-8170.xlsx")
print(file_type)  # Output: "ZJ"

# Get detailed file information
info = get_file_info("Manpower Budget-Rev 18.1.xlsx")
print(info)
# Output: {
#     "file_name": "Manpower Budget-Rev 18.1.xlsx",
#     "file_type": "MANPOWER_BUDGET",
#     "recognized": True
# }

Batch Processing

from documentprocessinghub import identify_file_type

files = [
    "ZJ26042804-8170.xlsx",
    "Manpower Budget-Rev 18.1.xlsx",
    "Manpower Documents Q1 2026.xlsx",
    "Real-time production report.xlsx",
]

for file in files:
    file_type = identify_file_type(file)
    print(f"{file}: {file_type}")

# Output:
# ZJ26042804-8170.xlsx: ZJ
# Manpower Budget-Rev 18.1.xlsx: MANPOWER_BUDGET
# Manpower Documents Q1 2026.xlsx: MANPOWER_DOCUMENTS
# Real-time production report.xlsx: REAL_TIME_PRODUCTION
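
For larger batches, a per-type tally can be more useful than one line per file. The following is a minimal sketch, assuming the classifier is passed in as a callable (in practice that would be identify_file_type). The names summarize_types and fake_classifier are hypothetical and exist only so the snippet runs without the package installed.

```python
from collections import Counter
from typing import Callable, Iterable


def summarize_types(files: Iterable[str], classify: Callable[[str], str]) -> Counter:
    """Count how many files fall under each type returned by `classify`."""
    return Counter(classify(name) for name in files)


def fake_classifier(name: str) -> str:
    # Stand-in for identify_file_type, used only to keep this sketch self-contained.
    return "ZJ" if name.upper().startswith("ZJ") else "ARCHIVO_NO_CONOCIDO"


counts = summarize_types(
    ["ZJ26042804-8170.xlsx", "notes.txt", "zj26042810-1234.pdf"],
    fake_classifier,
)
print(counts["ZJ"], counts["ARCHIVO_NO_CONOCIDO"])  # 2 1
```

Passing the classifier as an argument keeps the tally logic independent of the library and easy to test.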

📋 Supported File Types

ZJ Format

Files following the strict pattern: ZJ{YYMMDD}{DD}-{version}{.ext}

  • Format Components:
    • ZJ: Mandatory prefix at the start
    • YYMMDD: Date components (6 digits) - Year (2), Month (2), Day (2)
    • DD: Version prefix (2 digits)
    • -: Mandatory separator
    • {version}: Version identifier (numbers only, no letters)
    • .{ext}: File extension (optional)

Valid Examples:

  • ZJ26042804-8170.xlsx
  • ZJ25120115-4320.csv
  • ZJ26042810-1234.pdf
  • ZJ26042804-8170 (1).xlsx ✓ (with duplicate suffix)
  • ZJ26042804-8170 (999).xlsx ✓ (with duplicate suffix)

Invalid Examples:

  • ZJ26043007-v2.pdf ✗ (letters in version)
  • ZJ26042804-04-8170.xlsx ✗ (more than one dash separator)
  • ZJ260428-8170.xlsx ✗ (missing the 2-digit version prefix)
  • ZJProduction.csv ✗ (missing date and version)
  • ZJ26042804-8170(1).xlsx ✗ (missing space before parenthesis)

Classification Result: "ZJ" or "ZJ_NO_CLASIFICADO_FALTA_DE_DATOS"
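
The rules above can be approximated with a single regular expression. This is a sketch reconstructed from the documented components and examples, not the library's own validator, which may apply further checks (for example on the date). ZJ_SKETCH and looks_like_zj are hypothetical names.

```python
import re
from pathlib import PurePath

# Hypothetical pattern built from the documented rules; the real validator may differ.
ZJ_SKETCH = re.compile(
    r"^ZJ"
    r"(?P<date>\d{6})"          # YYMMDD
    r"(?P<prefix>\d{2})"        # 2-digit version prefix
    r"-(?P<version>\d+)"        # digits-only version identifier
    r"(?: \((?P<dup>\d+)\))?"   # optional duplicate suffix, leading space required
    r"(?:\.[A-Z0-9]+)?$"        # optional extension (name is upper-cased first)
)


def looks_like_zj(file_path: str) -> bool:
    """Return True if the file name matches the documented ZJ naming pattern."""
    name = PurePath(file_path).name.upper()
    return ZJ_SKETCH.match(name) is not None


print(looks_like_zj("ZJ26042804-8170 (1).xlsx"))  # True
print(looks_like_zj("ZJ26042804-8170(1).xlsx"))   # False
```

The sketch only checks that the date is six digits; it does not verify that the month and day are in range.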

Manpower Files

Detected by "MANPOWER" keyword in filename, with subtype classification.

Subtypes:

  1. MANPOWER_BUDGET - Contains "BUDGET" keyword

    • Examples: Manpower Budget-Rev 18.1.xlsx, Manpower Budget (1).xlsx
  2. MANPOWER_DOCUMENTS - Contains "DOCUMENTS" keyword

    • Examples: Manpower Documents Q1 2026.xlsx, Manpower Documents Q1 2026 (2).xlsx
  3. MANPOWER - Generic Manpower file

    • Examples: MANPOWER_Extraction_Data.xlsx, Manpower_Q2_Report.xlsx

Duplicate Handling: All Manpower files support duplicate suffixes

  • Manpower Budget (1).xlsx → MANPOWER_BUDGET
  • Manpower Documents Q1 (2).xlsx → MANPOWER_DOCUMENTS
  • MANPOWER_Data (1).xlsx → MANPOWER

Classification Results: "MANPOWER_BUDGET", "MANPOWER_DOCUMENTS", "MANPOWER"
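
The keyword rules for Manpower files can be sketched as a short chain of substring checks. This is an illustration of the documented behaviour, not the library's code. The function name sketch_manpower_type is hypothetical, and checking "BUDGET" before "DOCUMENTS" is an assumption about precedence when both keywords appear.

```python
from typing import Optional


def sketch_manpower_type(file_name: str) -> Optional[str]:
    """Return a Manpower subtype, or None if "MANPOWER" is absent from the name."""
    name = file_name.upper()  # matching is case-insensitive
    if "MANPOWER" not in name:
        return None
    if "BUDGET" in name:       # assumed to take precedence over DOCUMENTS
        return "MANPOWER_BUDGET"
    if "DOCUMENTS" in name:
        return "MANPOWER_DOCUMENTS"
    return "MANPOWER"


print(sketch_manpower_type("Manpower Budget (1).xlsx"))  # MANPOWER_BUDGET
print(sketch_manpower_type("MANPOWER_Data (1).xlsx"))    # MANPOWER
```

Because the checks are substring matches, a duplicate suffix such as (1) does not affect the result.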

Real-time Production

Files with both "REAL-TIME" and "PRODUCTION" keywords.

Examples:

  • Real-time production report.xlsx
  • Real-time Production Data 2026-04.csv

Classification Result: "REAL_TIME_PRODUCTION"

Unrecognized Files

Any file that doesn't match the above patterns.

Classification Result: "ARCHIVO_NO_CONOCIDO"

🔧 API Reference

identify_file_type(file_path: str) -> str

Identifies the type of a file based on its name.

Parameters:

  • file_path (str): Full path or filename

Returns:

  • str: File type identifier

Possible Return Values:

  • "ZJ" - Valid ZJ format file
  • "ZJ_NO_CLASIFICADO_FALTA_DE_DATOS" - File with the ZJ prefix that fails ZJ format validation
  • "MANPOWER_BUDGET" - Manpower Budget file
  • "MANPOWER_DOCUMENTS" - Manpower Documents file
  • "MANPOWER" - Generic Manpower file
  • "REAL_TIME_PRODUCTION" - Real-time production report
  • "ARCHIVO_NO_CONOCIDO" - Unrecognized file type

Example:

from documentprocessinghub import identify_file_type

result = identify_file_type("ZJ26042804-8170.xlsx")
print(result)  # Output: "ZJ"

result = identify_file_type("invalid-file.txt")
print(result)  # Output: "ARCHIVO_NO_CONOCIDO"

get_file_info(file_path: str) -> dict

Gets detailed information about a file including recognition status.

Parameters:

  • file_path (str): Full path or filename

Returns:

  • dict: Dictionary containing:
    • file_name (str): The filename
    • file_type (str): The identified type
    • recognized (bool): Whether the file type was recognized

Example:

from documentprocessinghub import get_file_info

info = get_file_info("Manpower Budget.xlsx")
print(info)
# Output: {
#     "file_name": "Manpower Budget.xlsx",
#     "file_type": "MANPOWER_BUDGET",
#     "recognized": True
# }

info = get_file_info("unknown.txt")
print(info)
# Output: {
#     "file_name": "unknown.txt",
#     "file_type": "ARCHIVO_NO_CONOCIDO",
#     "recognized": False
# }
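
When the returned dictionaries are passed around in a larger codebase, a typed description of their shape can help. The sketch below assumes only the three documented keys. FileInfo and unrecognized_names are hypothetical names, not part of the library, and the sample data is copied from the examples above so the snippet runs without the package installed.

```python
from typing import TypedDict


class FileInfo(TypedDict):
    # Hypothetical type name; the keys mirror the documented return value.
    file_name: str
    file_type: str
    recognized: bool


def unrecognized_names(infos: list[FileInfo]) -> list[str]:
    """Return the names of files whose info marks them as unrecognized."""
    return [info["file_name"] for info in infos if not info["recognized"]]


sample: list[FileInfo] = [
    {"file_name": "Manpower Budget.xlsx", "file_type": "MANPOWER_BUDGET", "recognized": True},
    {"file_name": "unknown.txt", "file_type": "ARCHIVO_NO_CONOCIDO", "recognized": False},
]
print(unrecognized_names(sample))  # ['unknown.txt']
```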

Validator Functions

Advanced users can access individual validators directly:

from documentprocessinghub import (
    validate_zj_format,
    validate_manpower_budget,
    validate_manpower_documents,
    classify_manpower_type
)

# Validate ZJ format
is_valid = validate_zj_format("ZJ26042804-8170.XLSX")
# Returns: True

is_valid = validate_zj_format("ZJ26043007-v2.XLSX")
# Returns: False

# Classify Manpower type
manpower_type = classify_manpower_type("MANPOWER BUDGET.XLSX")
# Returns: "MANPOWER_BUDGET"

# Check specific type
is_budget = validate_manpower_budget("BUDGET_REPORT.XLSX")
# Returns: True

is_documents = validate_manpower_documents("DOCUMENTS_2026.XLSX")
# Returns: True

🧪 Testing

The package includes comprehensive test suites to verify all functionality.

Run All Tests

python -m pytest tests/ -v

Run Specific Test Suite

# Test basic file type identification
python tests/test_manpower_types.py

# Test duplicate suffix handling
python tests/test_duplicates.py

Test Coverage

The test suite covers:

  • Valid and invalid ZJ format cases
  • Manpower file classification (Budget, Documents, Generic)
  • Duplicate suffix handling (1), (2), etc.
  • Real-time production file detection
  • Unrecognized file handling
  • Error classification for malformed files

💡 Use Cases

1. Document Processing Pipeline

from documentprocessinghub import get_file_info
from pathlib import Path

# Process files in a directory
input_dir = Path("documents/")
for file in input_dir.glob("*"):
    if not file.is_file():
        continue  # skip subdirectories

    info = get_file_info(file.name)

    if info["recognized"]:
        file_type = info["file_type"]
        # Route to the appropriate processor based on file type.
        # process_budget_file and process_zj_file are placeholders for your own handlers.
        if file_type == "MANPOWER_BUDGET":
            process_budget_file(file)
        elif file_type == "ZJ":
            process_zj_file(file)

2. File Validation and Routing

from documentprocessinghub import identify_file_type

def validate_and_route_file(file_path):
    file_type = identify_file_type(file_path)
    
    if file_type == "ZJ_NO_CLASIFICADO_FALTA_DE_DATOS":
        # Handle invalid ZJ format
        log_error(f"Invalid ZJ format: {file_path}")
        move_to_quarantine(file_path)
    elif file_type == "ARCHIVO_NO_CONOCIDO":
        # Handle unknown file type
        move_to_unknown_folder(file_path)
    else:
        # Process recognized file
        process_file(file_path, file_type)

3. Automated File Organization

from documentprocessinghub import identify_file_type
from pathlib import Path
import shutil

def organize_files(source_dir, target_dir):
    source = Path(source_dir)
    target = Path(target_dir)

    # Snapshot the listing first so that moving files does not disturb the iteration
    for file in list(source.iterdir()):
        if not file.is_file():
            continue  # leave subdirectories where they are

        file_type = identify_file_type(file.name)

        # Create subdirectory for each file type
        type_dir = target / file_type
        type_dir.mkdir(parents=True, exist_ok=True)

        # Move file to type-specific directory
        shutil.move(str(file), str(type_dir / file.name))
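
Before moving anything, it can be useful to preview where each file would go. The following is a minimal dry-run sketch, assuming the classifier is supplied as a callable (in practice identify_file_type). plan_moves is a hypothetical helper, not part of the library.

```python
from pathlib import Path
from typing import Callable


def plan_moves(source_dir: str, target_dir: str,
               classify: Callable[[str], str]) -> dict[Path, Path]:
    """Map each regular file in source_dir to its would-be destination.

    Nothing is created or moved; the result is only a plan.
    """
    target = Path(target_dir)
    plan: dict[Path, Path] = {}
    for file in sorted(Path(source_dir).iterdir()):
        if file.is_file():  # subdirectories are ignored
            plan[file] = target / classify(file.name) / file.name
    return plan
```

Printing the returned mapping shows the destination of every file, and two sources that map to the same destination would reveal a name collision before any data is overwritten.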

🏗️ Project Structure

document-processing-hub/
├── documentprocessinghub/          # Main package
│   ├── __init__.py                # Package initialization and exports
│   ├── validators.py              # Format validation functions
│   └── scanNameFiles.py           # File type identification logic
├── tests/                         # Comprehensive test suite
│   ├── test_manpower_types.py    # Manpower and ZJ format tests
│   └── test_duplicates.py        # Duplicate suffix handling tests
├── examples/                      # Usage examples
│   └── main.py                   # Example script
├── pyproject.toml                # Project configuration for PyPI
├── README.md                      # This file
├── LICENSE                        # MIT License
└── .gitignore                     # Git ignore rules

🔧 Advanced Configuration

Adding Custom File Types

You can extend the library by adding custom validators:

# In documentprocessinghub/validators.py
import re

def validate_custom_format(file_name: str) -> bool:
    """Validate custom file format: CUSTOM{YYYY}-{version}, with an optional extension."""
    # The optional group lets names that still carry an extension
    # (e.g. CUSTOM2026-3.XLSX) pass validation.
    pattern = r'^CUSTOM\d{4}-\d+(\.\w+)?$'
    return bool(re.match(pattern, file_name))

# In documentprocessinghub/scanNameFiles.py
from .validators import validate_custom_format

# Add this case to the match statement inside identify_file_type():
case _ if file_name.startswith("CUSTOM"):
    if validate_custom_format(file_name):
        return "CUSTOM"
    else:
        return "CUSTOM_NO_CLASIFICADO"

find_latest_file(folder_path: str, file_type: str) -> Optional[str]

Finds the most recent file of a given type in a folder.

Parameters:

  • folder_path (str): Path of the folder to search
  • file_type (str): File type ("ZJ", "MANPOWER_BUDGET", "MANPOWER_DOCUMENTS", "MANPOWER", "REAL_TIME_PRODUCTION")

Returns:

  • str: Full path of the most recent file
  • None: If no file of the specified type is found

Selection criteria by type:

ZJ Format:

  • Sorts by date (YYMMDD)
  • If the dates are equal, sorts by version number
  • If there are duplicates with a (N) suffix, picks the highest number
  • Example: ZJ26042912-8105(4) > ZJ26042912-8105(1) > ZJ26042822-8005
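
One way to express this ordering is a tuple sort key. The sketch below is an illustration under assumptions, not the library's implementation: zj_sort_key is a hypothetical name, the 2-digit prefix is compared before the version identifier, and the duplicate suffix is accepted with or without a leading space because the example above writes it without one.

```python
import re
from typing import Optional

# Hypothetical parser for the ZJ name components.
_ZJ_PARTS = re.compile(
    r"^ZJ(\d{6})(\d{2})-(\d+)(?: ?\((\d+)\))?(?:\.\w+)?$", re.IGNORECASE
)


def zj_sort_key(file_name: str) -> Optional[tuple[int, int, int, int]]:
    """Return (date, prefix, version, duplicate), or None if the name does not parse."""
    m = _ZJ_PARTS.match(file_name)
    if m is None:
        return None
    date, prefix, version, dup = m.groups()
    return int(date), int(prefix), int(version), int(dup or 0)


names = ["ZJ26042822-8005.xlsx", "ZJ26042912-8105(1).xlsx", "ZJ26042912-8105(4).xlsx"]
parsed = [n for n in names if zj_sort_key(n) is not None]
latest = max(parsed, key=zj_sort_key)
print(latest)  # ZJ26042912-8105(4).xlsx
```

Tuples compare element by element, so the date decides first and the duplicate number only breaks ties between otherwise identical names.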

MANPOWER_BUDGET:

  • Sorts by highest version (Rev 18.1 > Rev 17)
  • If the versions are equal, sorts by month
  • Version takes priority over month
  • Example: Rev 18.1 > Rev 16.1 April > Rev 15 April
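
The revision comparison can be sketched by parsing the number after "Rev" into a tuple of integers. rev_key is a hypothetical helper written for illustration; it covers only the version part and leaves out the month tie-break, and the library may parse revisions differently.

```python
import re


def rev_key(file_name: str) -> tuple[int, ...]:
    """Extract the revision after "Rev" as a tuple of ints, e.g. "Rev 18.1" -> (18, 1).

    Returns an empty tuple when no revision is present, which sorts lowest.
    """
    m = re.search(r"REV\s*(\d+(?:\.\d+)*)", file_name.upper())
    if m is None:
        return ()
    return tuple(int(part) for part in m.group(1).split("."))


names = [
    "Manpower Budget-Rev 17.xlsx",
    "Manpower Budget-Rev 18.1.xlsx",
    "Manpower Budget-Rev 16.1 April.xlsx",
]
print(max(names, key=rev_key))  # Manpower Budget-Rev 18.1.xlsx
```

Comparing integer tuples rather than floats keeps a revision such as 18.10 above 18.9.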

MANPOWER_DOCUMENTS:

  • Sorts by year
  • If the years are equal, sorts by month
  • Supports formats such as "Q1 2026", "April 2026", etc.

MANPOWER:

  • Sorts by date (Month_Day)
  • Example: Manpower_April_29 > Manpower_April_28 > Manpower_March_28
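
A Month_Day ordering can be sketched with a small lookup of English month names. month_day_key is a hypothetical helper and not the library's code; since the names carry no year, this sketch cannot order files across a year boundary.

```python
import re

MONTHS = ("JANUARY", "FEBRUARY", "MARCH", "APRIL", "MAY", "JUNE",
          "JULY", "AUGUST", "SEPTEMBER", "OCTOBER", "NOVEMBER", "DECEMBER")
_MONTH_DAY = re.compile(r"(" + "|".join(MONTHS) + r")_(\d{1,2})")


def month_day_key(file_name: str) -> tuple[int, int]:
    """Return (month, day) parsed from a "Month_Day" fragment, or (0, 0) if absent."""
    m = _MONTH_DAY.search(file_name.upper())
    if m is None:
        return (0, 0)
    return MONTHS.index(m.group(1)) + 1, int(m.group(2))


names = ["Manpower_March_28.xlsx", "Manpower_April_28.xlsx", "Manpower_April_29.xlsx"]
print(max(names, key=month_day_key))  # Manpower_April_29.xlsx
```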

Example:

from documentprocessinghub import find_latest_file

# Find the most recent ZJ file in a folder
latest_zj = find_latest_file("/path/to/folder", "ZJ")
print(latest_zj)  # Output: "/path/to/folder/ZJ26042912-8105(4).xlsx"

# Find the most recent Manpower Budget
latest_budget = find_latest_file("/path/to/documents", "MANPOWER_BUDGET")
print(latest_budget)  # Output: "/path/to/documents/副本Manpower Budget-Rev 18.1.xlsx"

# When no file of the requested type exists
result = find_latest_file("/path/to/folder", "REAL_TIME_PRODUCTION")
print(result)  # Output: None

📝 Changelog

Version 0.2.0 (2026-04-29)

New Features

Features:

  • find_latest_file(): Finds the most recent file of a given type in a folder
  • File selection based on date and version
  • ZJ: Sorts by date and version number
  • MANPOWER_BUDGET: Sorts by file version
  • MANPOWER_DOCUMENTS: Sorts by year and month
  • MANPOWER: Sorts by date (Month_Day)
  • Handles duplicates with (N) suffixes
  • Returns None if no file of the requested type is found

Version 0.1.0 (2026-04-29)

Initial Release

Features:

  • ZJ format validation with strict pattern matching
  • Manpower file classification (Budget, Documents, Generic)
  • Real-time production file detection
  • Duplicate file suffix support (1), (2), etc.
  • Comprehensive error handling and classification
  • Full API documentation and examples
  • Extensive test coverage

📄 License

MIT License - See LICENSE file for details

👨‍💻 Author

LJD-UwU

🤝 Contributing

Contributions are welcome! Please feel free to submit a Pull Request.

Development Setup

  1. Fork the repository
  2. Create a feature branch (git checkout -b feature/amazing-feature)
  3. Make your changes and add tests
  4. Run tests: python -m pytest tests/ -v
  5. Commit changes (git commit -m 'Add amazing feature')
  6. Push to branch (git push origin feature/amazing-feature)
  7. Open a Pull Request

❓ FAQ

Q: What happens if a file has an invalid ZJ format?
A: The file is classified as "ZJ_NO_CLASIFICADO_FALTA_DE_DATOS" (ZJ unclassified due to missing data).

Q: Does the library handle both Windows and Linux file paths?
A: Yes, it uses pathlib.Path, which is cross-platform.

Q: Can I use this for real-time file monitoring?
A: Yes. The library has no dependencies and only performs pattern matching on filenames, so it can be integrated into file monitoring systems.

Q: Is the library case-sensitive?
A: No, it converts filenames to uppercase before processing, so ZJ26042804-8170.xlsx and zj26042804-8170.xlsx are treated the same.

Q: How can I contribute a new file type validator?
A: Fork the repository, add your validator to validators.py, add tests, and submit a Pull Request.


Made with care for document processing automation
