File type identification and validation for document processing workflows
Document Processing Hub
A Python library that identifies and validates file types for document processing workflows. It classifies files from their names alone, using naming-pattern matching and validation rules.
🌟 Features
- Intelligent File Type Identification: Automatically detect file types based on naming patterns
- Format Validation: Validate file naming conventions for specific formats:
  - ZJ Format: ZJ{YYMMDD}{DD}-{version} with strict validation (e.g., ZJ26042804-8170.xlsx)
  - Manpower Files: Detect and classify Manpower Budget, Documents, and general types
  - Real-time Production: Identify real-time production reports
- Duplicate File Handling: Support for files with duplicate suffixes like (1), (2), etc.
- Comprehensive Error Handling: Clear classification of invalid files (e.g., ZJ_NO_CLASIFICADO_FALTA_DE_DATOS)
- Extensible Architecture: Easy to add new file type validators
- Zero Dependencies: Pure Python implementation with no external dependencies
📦 Installation
From PyPI
pip install documentprocessinghub-ljd
From GitHub
git clone https://github.com/LJD-UwU/Document-Processing-Hub.git
cd Document-Processing-Hub
pip install .
Development Install
git clone https://github.com/LJD-UwU/Document-Processing-Hub.git
cd Document-Processing-Hub
pip install -e .
🚀 Quick Start
Basic Usage
from documentprocessinghub import identify_file_type, get_file_info
# Identify file type
file_type = identify_file_type("ZJ26042804-8170.xlsx")
print(file_type) # Output: "ZJ"
# Get detailed file information
info = get_file_info("Manpower Budget-Rev 18.1.xlsx")
print(info)
# Output: {
# "file_name": "Manpower Budget-Rev 18.1.xlsx",
# "file_type": "MANPOWER_BUDGET",
# "recognized": True
# }
Batch Processing
from documentprocessinghub import identify_file_type
files = [
    "ZJ26042804-8170.xlsx",
    "Manpower Budget-Rev 18.1.xlsx",
    "Manpower Documents Q1 2026.xlsx",
    "Real-time production report.xlsx",
]
for file in files:
    file_type = identify_file_type(file)
    print(f"{file}: {file_type}")
# Output:
# ZJ26042804-8170.xlsx: ZJ
# Manpower Budget-Rev 18.1.xlsx: MANPOWER_BUDGET
# Manpower Documents Q1 2026.xlsx: MANPOWER_DOCUMENTS
# Real-time production report.xlsx: REAL_TIME_PRODUCTION
📋 Supported File Types
ZJ Format
Files following the strict pattern: ZJ{YYMMDD}{DD}-{version}{.ext}
- Format Components:
  - ZJ: Mandatory prefix at the start
  - YYMMDD: Date components (6 digits): Year (2), Month (2), Day (2)
  - DD: Version prefix (2 digits)
  - -: Mandatory separator
  - {version}: Version identifier (numbers only, no letters)
  - .{ext}: File extension (optional)
Valid Examples:
- ZJ26042804-8170.xlsx ✓
- ZJ25120115-4320.csv ✓
- ZJ26042810-1234.pdf ✓
- ZJ26042804-8170 (1).xlsx ✓ (with duplicate suffix)
- ZJ26042804-8170 (999).xlsx ✓ (with duplicate suffix)
Invalid Examples:
- ZJ26043007-v2.pdf ✗ (letters in version)
- ZJ26042804-04-8170.xlsx ✗ (double dash)
- ZJ260428-8170.xlsx ✗ (missing 2 version digits)
- ZJProduction.csv ✗ (missing date and version)
- ZJ26042804-8170(1).xlsx ✗ (missing space before parenthesis)
Classification Result: "ZJ" or "ZJ_NO_CLASIFICADO_FALTA_DE_DATOS"
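The rules above can be condensed into one regular expression. The following is a minimal sketch, not the library's actual implementation: the name `looks_like_zj` and the pattern are assumptions reconstructed from the format description and the examples listed above. It does not check that the six date digits form a real calendar date.

```python
import re

# Hypothetical pattern reconstructed from the documented rules:
# "ZJ" + 6 date digits + 2-digit version prefix + "-" + numeric version,
# then an optional " (N)" duplicate suffix and an optional extension.
ZJ_PATTERN = re.compile(r"^ZJ\d{6}\d{2}-\d+(?: \(\d+\))?(?:\.[A-Z0-9]+)?$")


def looks_like_zj(file_name: str) -> bool:
    """Return True if file_name follows the documented ZJ naming pattern."""
    # Upper-case first so that matching ignores case.
    return ZJ_PATTERN.match(file_name.upper()) is not None


for name in [
    "ZJ26042804-8170.xlsx",
    "ZJ26042804-8170 (1).xlsx",
    "ZJ26043007-v2.pdf",
    "ZJ26042804-8170(1).xlsx",
]:
    print(name, looks_like_zj(name))
```

The duplicate suffix is matched with a literal leading space, which is what makes ZJ26042804-8170(1).xlsx fail in this sketch.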
Manpower Files
Detected by "MANPOWER" keyword in filename, with subtype classification.
Subtypes:
- MANPOWER_BUDGET - Contains "BUDGET" keyword
  - Examples: Manpower Budget-Rev 18.1.xlsx, Manpower Budget (1).xlsx
- MANPOWER_DOCUMENTS - Contains "DOCUMENTS" keyword
  - Examples: Manpower Documents Q1 2026.xlsx, Manpower Documents Q1 2026 (2).xlsx
- MANPOWER - Generic Manpower file
  - Examples: MANPOWER_Extraction_Data.xlsx, Manpower_Q2_Report.xlsx
Duplicate Handling: All Manpower files support duplicate suffixes
- Manpower Budget (1).xlsx → MANPOWER_BUDGET
- Manpower Documents Q1 (2).xlsx → MANPOWER_DOCUMENTS
- MANPOWER_Data (1).xlsx → MANPOWER
Classification Results: "MANPOWER_BUDGET", "MANPOWER_DOCUMENTS", "MANPOWER"
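The keyword checks described in this section can be sketched as a few substring tests. This is an illustration only: `classify_manpower_sketch` is a hypothetical name, and checking "BUDGET" before "DOCUMENTS" when a name contains both is an assumption that the description above does not settle.

```python
from typing import Optional


def classify_manpower_sketch(file_name: str) -> Optional[str]:
    """Hypothetical stand-in for the keyword-based Manpower classification."""
    upper = file_name.upper()  # keyword matching ignores case
    if "MANPOWER" not in upper:
        return None  # not a Manpower file at all
    if "BUDGET" in upper:
        return "MANPOWER_BUDGET"
    if "DOCUMENTS" in upper:
        return "MANPOWER_DOCUMENTS"
    return "MANPOWER"


print(classify_manpower_sketch("Manpower Budget (1).xlsx"))
print(classify_manpower_sketch("MANPOWER_Data (1).xlsx"))
```

Because the test is plain substring containment, a duplicate suffix such as (1) does not affect the result.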
Real-time Production
Files with both "REAL-TIME" and "PRODUCTION" keywords.
Examples:
- Real-time production report.xlsx
- Real-time Production Data 2026-04.csv
Classification Result: "REAL_TIME_PRODUCTION"
Unrecognized Files
Any file that doesn't match the above patterns.
Classification Result: "ARCHIVO_NO_CONOCIDO"
🔧 API Reference
identify_file_type(file_path: str) -> str
Identifies the type of a file based on its name.
Parameters:
- file_path (str): Full path or filename
Returns:
- str: File type identifier
Possible Return Values:
- "ZJ" - Valid ZJ format file
- "ZJ_NO_CLASIFICADO_FALTA_DE_DATOS" - ZJ format file with invalid format
- "MANPOWER_BUDGET" - Manpower Budget file
- "MANPOWER_DOCUMENTS" - Manpower Documents file
- "MANPOWER" - Generic Manpower file
- "REAL_TIME_PRODUCTION" - Real-time production report
- "ARCHIVO_NO_CONOCIDO" - Unrecognized file type
Example:
from documentprocessinghub import identify_file_type
result = identify_file_type("ZJ26042804-8170.xlsx")
print(result) # Output: "ZJ"
result = identify_file_type("invalid-file.txt")
print(result) # Output: "ARCHIVO_NO_CONOCIDO"
get_file_info(file_path: str) -> dict
Gets detailed information about a file including recognition status.
Parameters:
- file_path (str): Full path or filename
Returns:
- dict: Dictionary containing:
  - file_name (str): The filename
  - file_type (str): The identified type
  - recognized (bool): Whether the file type was recognized
Example:
from documentprocessinghub import get_file_info
info = get_file_info("Manpower Budget.xlsx")
print(info)
# Output: {
# "file_name": "Manpower Budget.xlsx",
# "file_type": "MANPOWER_BUDGET",
# "recognized": True
# }
info = get_file_info("unknown.txt")
print(info)
# Output: {
# "file_name": "unknown.txt",
# "file_type": "ARCHIVO_NO_CONOCIDO",
# "recognized": False
# }
Validator Functions
Advanced users can access individual validators directly:
from documentprocessinghub import (
    validate_zj_format,
    validate_manpower_budget,
    validate_manpower_documents,
    classify_manpower_type,
)
# Validate ZJ format
is_valid = validate_zj_format("ZJ26042804-8170.XLSX")
# Returns: True
is_valid = validate_zj_format("ZJ26043007-v2.XLSX")
# Returns: False
# Classify Manpower type
manpower_type = classify_manpower_type("MANPOWER BUDGET.XLSX")
# Returns: "MANPOWER_BUDGET"
# Check specific type
is_budget = validate_manpower_budget("BUDGET_REPORT.XLSX")
# Returns: True
is_documents = validate_manpower_documents("DOCUMENTS_2026.XLSX")
# Returns: True
🧪 Testing
The package includes comprehensive test suites to verify all functionality.
Run All Tests
python -m pytest tests/ -v
Run Specific Test Suite
# Test basic file type identification
python tests/test_manpower_types.py
# Test duplicate suffix handling
python tests/test_duplicates.py
Test Coverage
The test suite covers:
- Valid and invalid ZJ format cases
- Manpower file classification (Budget, Documents, Generic)
- Duplicate suffix handling: (1), (2), etc.
- Real-time production file detection
- Unrecognized file handling
- Error classification for malformed files
💡 Use Cases
1. Document Processing Pipeline
from documentprocessinghub import get_file_info
from pathlib import Path
# Process files in a directory
input_dir = Path("documents/")
for file in input_dir.glob("*"):
    info = get_file_info(file.name)
    if info["recognized"]:
        file_type = info["file_type"]
        # Route to appropriate processor based on file type
        if file_type == "MANPOWER_BUDGET":
            process_budget_file(file)
        elif file_type == "ZJ":
            process_zj_file(file)
2. File Validation and Routing
from documentprocessinghub import identify_file_type
def validate_and_route_file(file_path):
    file_type = identify_file_type(file_path)
    if file_type == "ZJ_NO_CLASIFICADO_FALTA_DE_DATOS":
        # Handle invalid ZJ format
        log_error(f"Invalid ZJ format: {file_path}")
        move_to_quarantine(file_path)
    elif file_type == "ARCHIVO_NO_CONOCIDO":
        # Handle unknown file type
        move_to_unknown_folder(file_path)
    else:
        # Process recognized file
        process_file(file_path, file_type)
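As the number of special cases grows, the same routing can be written as a lookup table instead of an if/elif chain. The sketch below is one possible arrangement, not part of the library: `route`, `SPECIAL_HANDLERS`, and the three handler functions are hypothetical, and the handlers only record what they would do so that the example runs on its own. The `file_type` argument stands for a value previously returned by identify_file_type().

```python
from typing import Callable, Dict, List

# Record of actions taken, so the sketch runs without real handlers.
events: List[str] = []


def quarantine(path: str) -> None:
    events.append(f"quarantine:{path}")


def park_unknown(path: str) -> None:
    events.append(f"unknown:{path}")


def process(path: str) -> None:
    events.append(f"process:{path}")


# Classification results that need special treatment (hypothetical mapping).
SPECIAL_HANDLERS: Dict[str, Callable[[str], None]] = {
    "ZJ_NO_CLASIFICADO_FALTA_DE_DATOS": quarantine,
    "ARCHIVO_NO_CONOCIDO": park_unknown,
}


def route(path: str, file_type: str) -> None:
    """Dispatch on a classification result; recognized types fall through to process()."""
    handler = SPECIAL_HANDLERS.get(file_type, process)
    handler(path)
```

Adding a new error classification then means adding one dictionary entry rather than another branch.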
3. Automated File Organization
from documentprocessinghub import identify_file_type
from pathlib import Path
import shutil
def organize_files(source_dir, target_dir):
    source = Path(source_dir)
    target = Path(target_dir)
    for file in source.iterdir():
        # Skip subdirectories (including target_dir if it lives inside source_dir)
        if not file.is_file():
            continue
        file_type = identify_file_type(file.name)
        # Create subdirectory for each file type
        type_dir = target / file_type
        type_dir.mkdir(parents=True, exist_ok=True)
        # Move file to type-specific directory
        shutil.move(str(file), str(type_dir / file.name))
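Before moving anything, it can help to preview where each file would end up. The following dry-run sketch computes the planned moves without touching the filesystem. `plan_moves` is a hypothetical helper, and the classifier is passed in as a parameter (in real use, identify_file_type) so that the example runs without the package installed.

```python
from pathlib import Path
from typing import Callable, List, Tuple


def plan_moves(
    source_dir, target_dir, classify: Callable[[str], str]
) -> List[Tuple[Path, Path]]:
    """Return (source, destination) pairs; nothing is created or moved."""
    source = Path(source_dir)
    target = Path(target_dir)
    moves = []
    for file in sorted(source.iterdir()):
        if not file.is_file():
            continue  # ignore subdirectories
        # Destination mirrors the layout used above: target/<type>/<name>
        moves.append((file, target / classify(file.name) / file.name))
    return moves
```

A call such as plan_moves("incoming", "sorted", identify_file_type) would list the moves for review before organize_files performs them.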
🏗️ Project Structure
document-processing-hub/
├── documentprocessinghub/ # Main package
│ ├── __init__.py # Package initialization and exports
│ ├── validators.py # Format validation functions
│ └── scanNameFiles.py # File type identification logic
├── tests/ # Comprehensive test suite
│ ├── test_manpower_types.py # Manpower and ZJ format tests
│ └── test_duplicates.py # Duplicate suffix handling tests
├── examples/ # Usage examples
│ └── main.py # Example script
├── pyproject.toml # Project configuration for PyPI
├── README.md # This file
├── LICENSE # MIT License
└── .gitignore # Git ignore rules
🔧 Advanced Configuration
Adding Custom File Types
You can extend the library by adding custom validators:
# In documentprocessinghub/validators.py
import re
def validate_custom_format(file_name: str) -> bool:
    """Validate custom file format: CUSTOM{YYYY}-{version}"""
    pattern = r'^CUSTOM\d{4}-\d+$'
    return bool(re.match(pattern, file_name))
# In documentprocessinghub/scanNameFiles.py
from .validators import validate_custom_format
# Add to identify_file_type() function:
case _ if file_name.startswith("CUSTOM"):
    if validate_custom_format(file_name):
        return "CUSTOM"
    else:
        return "CUSTOM_NO_CLASIFICADO"
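One detail to watch when writing a validator like this: the pattern is anchored with `$`, so a name that still carries its extension (for example CUSTOM2026-3.xlsx) does not match. Whether the library hands validators the name with or without the extension is not stated here, so the sketch below takes the cautious route and strips the extension first. `validate_custom_format_with_ext` is a hypothetical name used only for this illustration.

```python
import re
from pathlib import PurePath

CUSTOM_PATTERN = re.compile(r"^CUSTOM\d{4}-\d+$")


def validate_custom_format_with_ext(file_name: str) -> bool:
    """Validate CUSTOM{YYYY}-{version}, ignoring case and any file extension."""
    # PurePath(...).stem drops the final extension, e.g. ".xlsx".
    stem = PurePath(file_name).stem.upper()
    return CUSTOM_PATTERN.match(stem) is not None


# The bare pattern rejects a name that still has its extension:
print(CUSTOM_PATTERN.match("CUSTOM2026-3.xlsx") is None)
print(validate_custom_format_with_ext("CUSTOM2026-3.xlsx"))
```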
find_latest_file(folder_path: str, file_type: str) -> Optional[str]
Finds the most recent file of a given type in a folder.
Parameters:
- folder_path (str): Path of the folder to search
- file_type (str): File type ("ZJ", "MANPOWER_BUDGET", "MANPOWER_DOCUMENTS", "MANPOWER", "REAL_TIME_PRODUCTION")
Returns:
- str: Full path of the most recent file
- None: If no files of the given type are found
Selection criteria by type:
ZJ Format:
- Sorts by date (YYMMDD)
- If the date is the same, sorts by version number
- If there are duplicates with (N), picks the highest number
- Example: ZJ26042912-8105(4) > ZJ26042912-8105(1) > ZJ26042822-8005
MANPOWER_BUDGET:
- Sorts by highest revision (Rev 18.1 > Rev 17)
- If the revision is the same, sorts by month
- Revision takes priority over month
- Example: Rev 18.1 > Rev 16.1 April > Rev 15 April
MANPOWER_DOCUMENTS:
- Sorts by year
- If the year is the same, sorts by month
- Supports formats such as "Q1 2026" and "April 2026"
MANPOWER:
- Sorts by date (Month_Day)
- Example: Manpower_April_29 > Manpower_April_28 > Manpower_March_28
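The ZJ ordering above maps naturally onto a tuple sort key. The sketch below is an assumption about how such a key could look, not the code behind find_latest_file(): `zj_sort_key` and `pick_latest_zj` are hypothetical names, placing the two-digit prefix between the date and the version is a guess, and the duplicate suffix is accepted with or without a leading space.

```python
import re
from typing import List, Optional, Tuple

# Hypothetical parser: date, 2-digit prefix, numeric version, optional (N).
ZJ_PARTS = re.compile(r"^ZJ(\d{6})(\d{2})-(\d+)\s*(?:\((\d+)\))?")


def zj_sort_key(file_name: str) -> Optional[Tuple[int, int, int, int]]:
    """Return (date, prefix, version, duplicate) or None if the name is not ZJ-like."""
    m = ZJ_PARTS.match(file_name.upper())
    if m is None:
        return None
    date, prefix, version, dup = m.groups()
    # A missing duplicate suffix counts as 0, so "(1)" ranks above no suffix.
    return (int(date), int(prefix), int(version), int(dup or 0))


def pick_latest_zj(names: List[str]) -> Optional[str]:
    """Return the name with the greatest sort key, or None if none qualify."""
    keyed = [(zj_sort_key(n), n) for n in names]
    keyed = [(k, n) for k, n in keyed if k is not None]
    return max(keyed)[1] if keyed else None


print(pick_latest_zj([
    "ZJ26042822-8005.xlsx",
    "ZJ26042912-8105(1).xlsx",
    "ZJ26042912-8105(4).xlsx",
]))
```

Comparing YYMMDD as an integer orders dates correctly only while all files fall within the same century.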
Example:
from documentprocessinghub import find_latest_file
# Find the most recent ZJ file in a folder
latest_zj = find_latest_file("/path/to/folder", "ZJ")
print(latest_zj) # Output: "/path/to/folder/ZJ26042912-8105(4).xlsx"
# Find the most recent Manpower Budget
latest_budget = find_latest_file("/path/to/documents", "MANPOWER_BUDGET")
print(latest_budget) # Output: "/path/to/documents/副本Manpower Budget-Rev 18.1.xlsx"
# If there are no files of the given type
result = find_latest_file("/path/to/folder", "REAL_TIME_PRODUCTION")
print(result) # Output: None
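The MANPOWER_BUDGET rule (higher revision first, month as tie-breaker) can be sketched the same way. This is an illustration under stated assumptions rather than the library's implementation: `budget_sort_key` is a hypothetical name, the revision is read from a "Rev X.Y" fragment, and the month is detected by looking for a full English month name in the file name.

```python
import re
from typing import Tuple

MONTHS = [
    "JANUARY", "FEBRUARY", "MARCH", "APRIL", "MAY", "JUNE", "JULY",
    "AUGUST", "SEPTEMBER", "OCTOBER", "NOVEMBER", "DECEMBER",
]
# "REV 18.1" -> major 18, minor 1; "REV 15" -> major 15, no minor.
REV = re.compile(r"REV\s*(\d+)(?:\.(\d+))?")


def budget_sort_key(file_name: str) -> Tuple[int, int, int]:
    """Return (major, minor, month); revision outranks month in tuple order."""
    upper = file_name.upper()
    m = REV.search(upper)
    major, minor = (int(m.group(1)), int(m.group(2) or 0)) if m else (0, 0)
    # Month number 1-12, or 0 when no month name appears in the file name.
    month = next((i + 1 for i, name in enumerate(MONTHS) if name in upper), 0)
    return (major, minor, month)


names = [
    "Manpower Budget-Rev 15 April.xlsx",
    "Manpower Budget-Rev 18.1.xlsx",
    "Manpower Budget-Rev 16.1 April.xlsx",
]
print(sorted(names, key=budget_sort_key, reverse=True))
```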
📝 Changelog
Version 0.2.0 (2026-04-29)
New Features
Features:
- find_latest_file(): Finds the most recent file of a given type in a folder
- Smart file selection based on date and version
  - ZJ: Sorts by date and version number
  - MANPOWER_BUDGET: Sorts by file revision
  - MANPOWER_DOCUMENTS: Sorts by year and month
  - MANPOWER: Sorts by date (Month_Day)
- Handles duplicates with (N) suffixes
- Returns None if no files of the given type are found
Version 0.1.0 (2026-04-29)
Initial Release
Features:
- ZJ format validation with strict pattern matching
- Manpower file classification (Budget, Documents, Generic)
- Real-time production file detection
- Duplicate file suffix support: (1), (2), etc.
- Comprehensive error handling and classification
- Full API documentation and examples
- Extensive test coverage
📄 License
MIT License - See LICENSE file for details
👨💻 Author
LJD-UwU
- Email: himexpe.interns@hisense.com
- GitHub: @LJD-UwU
🤝 Contributing
Contributions are welcome! Please feel free to submit a Pull Request.
Development Setup
- Fork the repository
- Create a feature branch (git checkout -b feature/amazing-feature)
- Make your changes and add tests
- Run tests: python -m pytest tests/ -v
- Commit changes (git commit -m 'Add amazing feature')
- Push to branch (git push origin feature/amazing-feature)
- Open a Pull Request
❓ FAQ
Q: What happens if a file has an invalid ZJ format?
A: The file is classified as "ZJ_NO_CLASIFICADO_FALTA_DE_DATOS" (ZJ unclassified due to missing data).
Q: Does the library handle both Windows and Linux file paths?
A: Yes, it uses pathlib.Path which is cross-platform compatible.
Q: Can I use this for real-time file monitoring?
A: Yes, the library is fast and can be integrated into file monitoring systems. It has no dependencies and performs pattern matching on filenames only.
Q: Is the library case-sensitive?
A: No, it converts filenames to uppercase before processing, so ZJ26042804-8170.xlsx and zj26042804-8170.xlsx are treated the same.
Q: How can I contribute a new file type validator?
A: Fork the repository, add your validator to validators.py, add tests, and submit a Pull Request.
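The answers about file paths and case sensitivity both come down to normalizing a path into a bare, upper-cased file name before any pattern is applied. How the library does this internally is not shown in this document, so the sketch below is an assumption: `normalized_name` is a hypothetical helper built on `pathlib.PureWindowsPath`, which treats both backslashes and forward slashes as separators on any host OS.

```python
from pathlib import PureWindowsPath


def normalized_name(file_path: str) -> str:
    """Hypothetical helper: drop the directories, then upper-case the name."""
    # PureWindowsPath does no filesystem access and splits on "\\" and "/".
    return PureWindowsPath(file_path).name.upper()


print(normalized_name(r"C:\docs\zj26042804-8170.xlsx"))
print(normalized_name("/srv/docs/ZJ26042804-8170.xlsx"))
```

One trade-off of this choice: a POSIX file name that itself contains a backslash would be split at that character.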
Made with care for document processing automation