File type identification and validation for document processing workflows

Document Processing Hub

A Python library that identifies and validates file types in document processing workflows. It classifies each file by matching its name against naming patterns and validation rules; file contents are not inspected.

🌟 Features

Intelligent File Type Identification: Automatically detect file types based on naming patterns
Format Validation: Validate file naming conventions for specific formats:
- ZJ Format: ZJ{YYMMDD}{DD}-{version} with strict validation (e.g., ZJ26042804-8170.xlsx)
- Manpower Files: Detect and classify Manpower Budget, Documents, and general types
- Real-time Production: Identify real-time production reports
Duplicate File Handling: Support for files with duplicate suffixes like (1), (2), etc.
Comprehensive Error Handling: Clear classification of invalid files (e.g., ZJ_NO_CLASIFICADO_FALTA_DE_DATOS)
Extensible Architecture: Easy to add new file type validators
Zero Dependencies: Pure Python implementation with no external dependencies

📦 Installation

From PyPI

pip install documentprocessinghub-ljd

From GitHub

git clone https://github.com/LJD-UwU/Document-Processing-Hub.git
cd Document-Processing-Hub
pip install .

Development Install

git clone https://github.com/LJD-UwU/Document-Processing-Hub.git
cd Document-Processing-Hub
pip install -e .

🚀 Quick Start

Basic Usage

from documentprocessinghub import identify_file_type, get_file_info

# Identify file type
file_type = identify_file_type("ZJ26042804-8170.xlsx")
print(file_type)  # Output: "ZJ"

# Get detailed file information
info = get_file_info("Manpower Budget-Rev 18.1.xlsx")
print(info)
# Output: {
#     "file_name": "Manpower Budget-Rev 18.1.xlsx",
#     "file_type": "MANPOWER_BUDGET",
#     "recognized": True
# }

Batch Processing

from documentprocessinghub import identify_file_type

files = [
    "ZJ26042804-8170.xlsx",
    "Manpower Budget-Rev 18.1.xlsx",
    "Manpower Documents Q1 2026.xlsx",
    "Real-time production report.xlsx",
]

for file in files:
    file_type = identify_file_type(file)
    print(f"{file}: {file_type}")

# Output:
# ZJ26042804-8170.xlsx: ZJ
# Manpower Budget-Rev 18.1.xlsx: MANPOWER_BUDGET
# Manpower Documents Q1 2026.xlsx: MANPOWER_DOCUMENTS
# Real-time production report.xlsx: REAL_TIME_PRODUCTION
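
The batch loop above can also collect names per type instead of printing them. The following is a minimal sketch, not part of the library: `group_by_type` and `fake_classify` are hypothetical names, and the stand-in classifier exists only so the snippet runs without the package installed. In real use you would pass `identify_file_type` as the `classify` argument.

```python
from collections import defaultdict
from typing import Callable, Dict, Iterable, List


def group_by_type(names: Iterable[str], classify: Callable[[str], str]) -> Dict[str, List[str]]:
    """Group file names under the label returned by `classify`."""
    groups: Dict[str, List[str]] = defaultdict(list)
    for name in names:
        groups[classify(name)].append(name)
    return dict(groups)


# Stand-in classifier (hypothetical) so the sketch is self-contained.
# Replace it with documentprocessinghub.identify_file_type in real code.
def fake_classify(name: str) -> str:
    return "ZJ" if name.upper().startswith("ZJ") else "ARCHIVO_NO_CONOCIDO"


grouped = group_by_type(["ZJ26042804-8170.xlsx", "notes.txt"], fake_classify)
print(grouped)
```

Taking the classifier as a parameter keeps the grouping logic testable on its own.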

📋 Supported File Types

ZJ Format

Files following the strict pattern: ZJ{YYMMDD}{DD}-{version}{.ext}

Format Components:
- ZJ: Mandatory prefix at the start
- YYMMDD: Date components (6 digits) - Year (2), Month (2), Day (2)
- DD: Version prefix (2 digits)
- -: Mandatory separator
- {version}: Version identifier (numbers only, no letters)
- .{ext}: File extension (optional)

Valid Examples:

ZJ26042804-8170.xlsx ✓
ZJ25120115-4320.csv ✓
ZJ26042810-1234.pdf ✓
ZJ26042804-8170 (1).xlsx ✓ (with duplicate suffix)
ZJ26042804-8170 (999).xlsx ✓ (with duplicate suffix)

Invalid Examples:

ZJ26043007-v2.pdf ✗ (letters in version)
ZJ26042804-04-8170.xlsx ✗ (two dash separators)
ZJ260428-8170.xlsx ✗ (missing the 2-digit version prefix)
ZJProduction.csv ✗ (missing date and version)
ZJ26042804-8170(1).xlsx ✗ (missing space before parenthesis)

Classification Result: "ZJ" or "ZJ_NO_CLASIFICADO_FALTA_DE_DATOS"
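
The component list and the examples above can be expressed as a single regular expression. The sketch below is an illustration built only from that description; `ZJ_SKETCH` and `looks_like_zj` are hypothetical names, and the library's own `validate_zj_format` may be stricter (for instance, this sketch checks only the number of date digits, not whether the date exists).

```python
import re

# Hypothetical pattern derived from the documented components.
ZJ_SKETCH = re.compile(
    r"ZJ"                # mandatory prefix
    r"\d{6}"             # YYMMDD (digit count only, no calendar check)
    r"\d{2}"             # 2-digit version prefix
    r"-"                 # mandatory separator
    r"\d+"               # numeric version, no letters
    r"(?: \(\d+\))?"     # optional duplicate suffix, leading space required
    r"(?:\.[A-Z0-9]+)?",  # optional extension
    re.IGNORECASE,
)


def looks_like_zj(file_name: str) -> bool:
    """Return True when the whole name matches the sketched ZJ pattern."""
    return ZJ_SKETCH.fullmatch(file_name) is not None


for name in ["ZJ26042804-8170 (1).xlsx", "ZJ26042804-8170(1).xlsx", "ZJ26043007-v2.pdf"]:
    print(name, looks_like_zj(name))
```

`fullmatch` is used so that trailing characters, such as a second dash segment, cause a rejection rather than a partial match.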

Manpower Files

Detected by "MANPOWER" keyword in filename, with subtype classification.

Subtypes:

MANPOWER_BUDGET - Contains "BUDGET" keyword
- Examples: Manpower Budget-Rev 18.1.xlsx, Manpower Budget (1).xlsx
MANPOWER_DOCUMENTS - Contains "DOCUMENTS" keyword
- Examples: Manpower Documents Q1 2026.xlsx, Manpower Documents Q1 2026 (2).xlsx
MANPOWER - Generic Manpower file
- Examples: MANPOWER_Extraction_Data.xlsx, Manpower_Q2_Report.xlsx

Duplicate Handling: All Manpower files support duplicate suffixes

Manpower Budget (1).xlsx → MANPOWER_BUDGET
Manpower Documents Q1 (2).xlsx → MANPOWER_DOCUMENTS
MANPOWER_Data (1).xlsx → MANPOWER

Classification Results: "MANPOWER_BUDGET", "MANPOWER_DOCUMENTS", "MANPOWER"
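
The keyword rules above amount to a few substring checks on the uppercased name. This is a minimal sketch under stated assumptions, not the library's `classify_manpower_type`: the function name is hypothetical, and checking "BUDGET" before "DOCUMENTS" when both appear is an assumption the source does not confirm.

```python
from typing import Optional


def sketch_classify_manpower(file_name: str) -> Optional[str]:
    """Keyword-based classification; returns None when "MANPOWER" is absent."""
    upper = file_name.upper()
    if "MANPOWER" not in upper:
        return None
    # Assumption: BUDGET wins over DOCUMENTS if a name contains both.
    if "BUDGET" in upper:
        return "MANPOWER_BUDGET"
    if "DOCUMENTS" in upper:
        return "MANPOWER_DOCUMENTS"
    return "MANPOWER"


print(sketch_classify_manpower("Manpower Budget (1).xlsx"))
```

Because the checks are substring tests, a duplicate suffix such as (1) needs no special handling in this sketch.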

Real-time Production

Files with both "REAL-TIME" and "PRODUCTION" keywords.

Examples:

Real-time production report.xlsx
Real-time Production Data 2026-04.csv

Classification Result: "REAL_TIME_PRODUCTION"
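
The two-keyword rule above could be checked as follows. This is only a sketch with a hypothetical function name; whether the library also accepts spellings such as "Realtime" is not stated, so the sketch requires the hyphenated form.

```python
def looks_like_real_time_production(file_name: str) -> bool:
    """True when both keywords appear, ignoring case."""
    upper = file_name.upper()
    return "REAL-TIME" in upper and "PRODUCTION" in upper


print(looks_like_real_time_production("Real-time production report.xlsx"))
```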

Unrecognized Files

Any file that doesn't match the above patterns.

Classification Result: "ARCHIVO_NO_CONOCIDO"

🔧 API Reference

`identify_file_type(file_path: str) -> str`

Identifies the type of a file based on its name.

Parameters:

file_path (str): Full path or filename

Returns:

str: File type identifier

Possible Return Values:

"ZJ" - Valid ZJ format file
"ZJ_NO_CLASIFICADO_FALTA_DE_DATOS" - File that starts with ZJ but fails ZJ format validation
"MANPOWER_BUDGET" - Manpower Budget file
"MANPOWER_DOCUMENTS" - Manpower Documents file
"MANPOWER" - Generic Manpower file
"REAL_TIME_PRODUCTION" - Real-time production report
"ARCHIVO_NO_CONOCIDO" - Unrecognized file type

Example:

from documentprocessinghub import identify_file_type

result = identify_file_type("ZJ26042804-8170.xlsx")
print(result)  # Output: "ZJ"

result = identify_file_type("invalid-file.txt")
print(result)  # Output: "ARCHIVO_NO_CONOCIDO"
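
Since the function returns plain strings, callers may want to guard against typos when comparing labels. One option is to mirror the documented return values in an enum. The `FileType` enum and `to_file_type` helper below are hypothetical and not exported by the library; they cover only the seven labels listed above.

```python
from enum import Enum


class FileType(str, Enum):
    """Hypothetical mirror of the documented return values."""
    ZJ = "ZJ"
    ZJ_INVALID = "ZJ_NO_CLASIFICADO_FALTA_DE_DATOS"
    MANPOWER_BUDGET = "MANPOWER_BUDGET"
    MANPOWER_DOCUMENTS = "MANPOWER_DOCUMENTS"
    MANPOWER = "MANPOWER"
    REAL_TIME_PRODUCTION = "REAL_TIME_PRODUCTION"
    UNKNOWN = "ARCHIVO_NO_CONOCIDO"


def to_file_type(label: str) -> FileType:
    """Convert a returned label; raises ValueError for a label outside the list."""
    return FileType(label)


print(to_file_type("ARCHIVO_NO_CONOCIDO") is FileType.UNKNOWN)
```

Because the enum subclasses `str`, members still compare equal to the raw labels, so existing string comparisons keep working.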

`get_file_info(file_path: str) -> dict`

Gets detailed information about a file including recognition status.

Parameters:

file_path (str): Full path or filename

Returns:

dict: Dictionary containing:
- file_name (str): The filename
- file_type (str): The identified type
- recognized (bool): Whether the file type was recognized

Example:

from documentprocessinghub import get_file_info

info = get_file_info("Manpower Budget.xlsx")
print(info)
# Output: {
#     "file_name": "Manpower Budget.xlsx",
#     "file_type": "MANPOWER_BUDGET",
#     "recognized": True
# }

info = get_file_info("unknown.txt")
print(info)
# Output: {
#     "file_name": "unknown.txt",
#     "file_type": "ARCHIVO_NO_CONOCIDO",
#     "recognized": False
# }

Validator Functions

Advanced users can access individual validators directly:

from documentprocessinghub import (
    validate_zj_format,
    validate_manpower_budget,
    validate_manpower_documents,
    classify_manpower_type
)

# Validate ZJ format
is_valid = validate_zj_format("ZJ26042804-8170.XLSX")
# Returns: True

is_valid = validate_zj_format("ZJ26043007-v2.XLSX")
# Returns: False

# Classify Manpower type
manpower_type = classify_manpower_type("MANPOWER BUDGET.XLSX")
# Returns: "MANPOWER_BUDGET"

# Check specific type
is_budget = validate_manpower_budget("BUDGET_REPORT.XLSX")
# Returns: True

is_documents = validate_manpower_documents("DOCUMENTS_2026.XLSX")
# Returns: True

🧪 Testing

The package includes comprehensive test suites to verify all functionality.

Run All Tests

python -m pytest tests/ -v

Run Specific Test Suite

# Test basic file type identification
python tests/test_manpower_types.py

# Test duplicate suffix handling
python tests/test_duplicates.py

Test Coverage

The test suite covers:

Valid and invalid ZJ format cases
Manpower file classification (Budget, Documents, Generic)
Duplicate suffix handling (1), (2), etc.
Real-time production file detection
Unrecognized file handling
Error classification for malformed files

💡 Use Cases

1. Document Processing Pipeline

from documentprocessinghub import get_file_info
from pathlib import Path

# process_budget_file() and process_zj_file() are placeholders for your own handlers.
# Process files in a directory
input_dir = Path("documents/")
for file in input_dir.glob("*"):
    if not file.is_file():
        continue  # skip subdirectories

    info = get_file_info(file.name)

    if info["recognized"]:
        file_type = info["file_type"]
        # Route to appropriate processor based on file type
        if file_type == "MANPOWER_BUDGET":
            process_budget_file(file)
        elif file_type == "ZJ":
            process_zj_file(file)

2. File Validation and Routing

from documentprocessinghub import identify_file_type

def validate_and_route_file(file_path):
    file_type = identify_file_type(file_path)
    
    if file_type == "ZJ_NO_CLASIFICADO_FALTA_DE_DATOS":
        # Handle invalid ZJ format
        log_error(f"Invalid ZJ format: {file_path}")
        move_to_quarantine(file_path)
    elif file_type == "ARCHIVO_NO_CONOCIDO":
        # Handle unknown file type
        move_to_unknown_folder(file_path)
    else:
        # Process recognized file
        process_file(file_path, file_type)

3. Automated File Organization

from documentprocessinghub import identify_file_type
from pathlib import Path
import shutil

def organize_files(source_dir, target_dir):
    source = Path(source_dir)
    target = Path(target_dir)

    for file in source.iterdir():
        if not file.is_file():
            continue  # leave subdirectories in place

        file_type = identify_file_type(file.name)

        # Create subdirectory for each file type
        type_dir = target / file_type
        type_dir.mkdir(parents=True, exist_ok=True)

        # Move file to type-specific directory
        shutil.move(str(file), str(type_dir / file.name))

🏗️ Project Structure

document-processing-hub/
├── documentprocessinghub/          # Main package
│   ├── __init__.py                # Package initialization and exports
│   ├── validators.py              # Format validation functions
│   └── scanNameFiles.py           # File type identification logic
├── tests/                         # Comprehensive test suite
│   ├── test_manpower_types.py    # Manpower and ZJ format tests
│   └── test_duplicates.py        # Duplicate suffix handling tests
├── examples/                      # Usage examples
│   └── main.py                   # Example script
├── pyproject.toml                # Project configuration for PyPI
├── README.md                      # This file
├── LICENSE                        # MIT License
└── .gitignore                     # Git ignore rules

🔧 Advanced Configuration

Adding Custom File Types

You can extend the library by adding custom validators:

# In documentprocessinghub/validators.py
import re

def validate_custom_format(file_name: str) -> bool:
    """Validate custom file format: CUSTOM{YYYY}-{version}"""
    pattern = r'^CUSTOM\d{4}-\d+$'
    return bool(re.match(pattern, file_name))

# In documentprocessinghub/scanNameFiles.py
from .validators import validate_custom_format

# Add to identify_file_type() function:
case _ if file_name.startswith("CUSTOM"):
    if validate_custom_format(file_name):
        return "CUSTOM"
    else:
        return "CUSTOM_NO_CLASIFICADO"
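
Note that the anchored pattern in the validator above matches only a bare name such as CUSTOM2026-12; a name that still carries its extension would be rejected. If that is not what you want, one option is to strip the extension first. The sketch below is a standalone illustration with hypothetical names (`CUSTOM_STEM`, `validate_custom_name`), and it assumes the version contains no dot, since a dotted version would be mistaken for an extension.

```python
import re
from pathlib import PurePath

# Hypothetical pattern for the CUSTOM{YYYY}-{version} format, applied to the stem.
CUSTOM_STEM = re.compile(r"CUSTOM\d{4}-\d+")


def validate_custom_name(file_name: str) -> bool:
    """Validate the name with its final extension removed, ignoring case."""
    stem = PurePath(file_name).stem  # "CUSTOM2026-12.xlsx" -> "CUSTOM2026-12"
    return CUSTOM_STEM.fullmatch(stem.upper()) is not None


print(validate_custom_name("CUSTOM2026-12.xlsx"))
```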

`find_latest_file(folder_path: str, file_type: str) -> Optional[str]`

Finds the most recent file of a given type in a folder.

Parameters:

folder_path (str): Path of the folder to search
file_type (str): File type ("ZJ", "MANPOWER_BUDGET", "MANPOWER_DOCUMENTS", "MANPOWER", "REAL_TIME_PRODUCTION")

Returns:

str: Full path of the most recent file
None: If no file of the given type is found

Selection criteria by type:

ZJ Format:

Sorts by date (YYMMDD)
If the dates are equal, sorts by version number
If there are duplicates with an (N) suffix, picks the highest number
Example: ZJ26042912-8105 (4) > ZJ26042912-8105 (1) > ZJ26042822-8005
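
The ZJ ordering can be pictured as a tuple sort key. The sketch below is a hypothetical illustration, not the library's implementation: `ZJ_PARTS` and `zj_sort_key` are invented names, placing the 2-digit version prefix ahead of the post-dash version is an assumption, and the parser tolerates a duplicate suffix with or without a leading space.

```python
import re
from typing import Optional, Tuple

# Hypothetical parser: date, version prefix, version, optional duplicate number.
ZJ_PARTS = re.compile(
    r"ZJ(\d{6})(\d{2})-(\d+)(?: ?\((\d+)\))?(?:\.[A-Z0-9]+)?",
    re.IGNORECASE,
)


def zj_sort_key(file_name: str) -> Optional[Tuple[int, int, int, int]]:
    """Return (date, prefix, version, duplicate) or None for a non-ZJ name."""
    m = ZJ_PARTS.fullmatch(file_name)
    if m is None:
        return None
    date, prefix, version, dup = m.groups()
    # A missing duplicate suffix sorts below (1), (2), ...
    return (int(date), int(prefix), int(version), int(dup or 0))


names = ["ZJ26042822-8005.xlsx", "ZJ26042912-8105 (1).xlsx", "ZJ26042912-8105 (4).xlsx"]
candidates = [n for n in names if zj_sort_key(n) is not None]
latest = max(candidates, key=zj_sort_key)
print(latest)  # ZJ26042912-8105 (4).xlsx
```

Comparing YYMMDD as an integer orders dates correctly only while all files fall in the same century.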

MANPOWER_BUDGET:

Sorts by highest version (Rev 18.1 > Rev 17)
If the versions are equal, sorts by month
Version takes priority over month
Example: Rev 18.1 > Rev 16.1 April > Rev 15 April
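
A version-then-month ordering like this one can be sketched with a two-part key. Everything below is an assumption for illustration: the names `REV`, `MONTHS`, and `budget_sort_key` are hypothetical, the revision is assumed to follow the text "Rev", and months are assumed to be spelled out in English.

```python
import re
from typing import Tuple

MONTHS = ("JANUARY", "FEBRUARY", "MARCH", "APRIL", "MAY", "JUNE",
          "JULY", "AUGUST", "SEPTEMBER", "OCTOBER", "NOVEMBER", "DECEMBER")
REV = re.compile(r"REV\s*(\d+(?:\.\d+)*)")


def budget_sort_key(file_name: str) -> Tuple[Tuple[int, ...], int]:
    """Return (version parts, month number); month is 0 when none is found."""
    upper = file_name.upper()
    m = REV.search(upper)
    # "18.1" becomes (18, 1) so that versions compare numerically, not as text.
    version = tuple(int(part) for part in m.group(1).split(".")) if m else ()
    # Plain substring test; a short name like "MAY" could match inside other words.
    month = next((i for i, name in enumerate(MONTHS, start=1) if name in upper), 0)
    return (version, month)


names = [
    "Manpower Budget-Rev 15 April.xlsx",
    "Manpower Budget-Rev 18.1.xlsx",
    "Manpower Budget-Rev 16.1 April.xlsx",
]
print(max(names, key=budget_sort_key))  # Manpower Budget-Rev 18.1.xlsx
```

Putting the version first in the tuple is what makes it take priority over the month.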

MANPOWER_DOCUMENTS:

Sorts by year
If the years are equal, sorts by month
Supports formats such as "Q1 2026" and "April 2026"

MANPOWER:

Sorts by date (Month_Day)
Example: Manpower_April_29 > Manpower_April_28 > Manpower_March_28

Example:

from documentprocessinghub import find_latest_file

# Find the most recent ZJ file in a folder
latest_zj = find_latest_file("/path/to/folder", "ZJ")
print(latest_zj)  # Output: "/path/to/folder/ZJ26042912-8105 (4).xlsx"

# Find the most recent Manpower Budget
latest_budget = find_latest_file("/path/to/documents", "MANPOWER_BUDGET")
print(latest_budget)  # Output: "/path/to/documents/副本Manpower Budget-Rev 18.1.xlsx"

# When no file of the given type exists
result = find_latest_file("/path/to/folder", "REAL_TIME_PRODUCTION")
print(result)  # Output: None

📝 Changelog

Version 0.2.0 (2026-04-29)

New Features

Features:

find_latest_file(): Finds the most recent file of a given type in a folder
Smart file selection based on date and version
ZJ: Sorts by date and version number
MANPOWER_BUDGET: Sorts by file version
MANPOWER_DOCUMENTS: Sorts by year and month
MANPOWER: Sorts by date (Month_Day)
Handles duplicates with (N) suffixes
Returns None if no file of the given type is found

Version 0.1.0 (2026-04-29)

Initial Release

Features:

ZJ format validation with strict pattern matching
Manpower file classification (Budget, Documents, Generic)
Real-time production file detection
Duplicate file suffix support (1), (2), etc.
Comprehensive error handling and classification
Full API documentation and examples
Extensive test coverage

📄 License

MIT License - See LICENSE file for details

👨‍💻 Author

LJD-UwU

Email: himexpe.interns@hisense.com
GitHub: @LJD-UwU

🤝 Contributing

Contributions are welcome! Please feel free to submit a Pull Request.

Development Setup

Fork the repository
Create a feature branch (git checkout -b feature/amazing-feature)
Make your changes and add tests
Run tests: python -m pytest tests/ -v
Commit changes (git commit -m 'Add amazing feature')
Push to branch (git push origin feature/amazing-feature)
Open a Pull Request

❓ FAQ

Q: What happens if a file has an invalid ZJ format? A: The file is classified as "ZJ_NO_CLASIFICADO_FALTA_DE_DATOS" (ZJ unclassified due to missing data).

Q: Does the library handle both Windows and Linux file paths? A: Yes, it uses pathlib.Path which is cross-platform compatible.

Q: Can I use this for real-time file monitoring? A: Yes, the library is fast and can be integrated into file monitoring systems. It has no dependencies and performs pattern matching on filenames only.

Q: Is the library case-sensitive? A: No, it converts filenames to uppercase before processing, so ZJ26042804-8170.xlsx and zj26042804-8170.xlsx are treated the same.

Q: How can I contribute a new file type validator? A: Fork the repository, add your validator to validators.py, add tests, and submit a Pull Request.

Made with care for document processing automation
