File type identification and validation for document processing workflows
Document Processing Hub
A Python library for intelligent file type identification and validation for document processing workflows.
Features
- File Type Identification: Automatically detect file types based on naming patterns
- Format Validation: Validate file naming conventions for specific formats:
  - ZJ Format: ZJ{YYMMDD}{DD}-{version} (e.g., ZJ26042804-8170.xlsx)
  - Manpower Files: Detect and classify Manpower Budget, Documents, and general types
  - Real-time Production: Identify real-time production reports
- Duplicate Handling: Support for files with duplicate suffixes like (1), (2), etc.
- Extensible: Easy to add new file type validators
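As an illustration of the duplicate handling mentioned above, here is a minimal sketch of how a trailing copy marker such as (1) could be removed before a name is validated. The function name `strip_duplicate_suffix` and the regular expression are assumptions made for this example, not the library's actual implementation.

```python
import re
from pathlib import PurePath

# Hypothetical pattern: optional whitespace followed by "(n)" at the end of the stem.
_DUPLICATE_SUFFIX = re.compile(r"\s*\(\d+\)$")


def strip_duplicate_suffix(file_name: str) -> str:
    """Return the file name without a trailing "(n)" copy marker.

    Only the final path component is returned, so any directory part is dropped.
    """
    path = PurePath(file_name)
    stem = _DUPLICATE_SUFFIX.sub("", path.stem)
    return stem + path.suffix


print(strip_duplicate_suffix("ZJ26042804-8170 (1).xlsx"))  # ZJ26042804-8170.xlsx
print(strip_duplicate_suffix("Manpower Budget-Rev 18.1.xlsx"))  # Manpower Budget-Rev 18.1.xlsx
```

Normalizing the name first means the format checks only need to handle one spelling of each file.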
Installation
From PyPI
pip install documentprocessinghub
From GitHub
git clone https://github.com/LJD-UwU/Document-Processing-Hub.git
cd Document-Processing-Hub
pip install .
Development Install
git clone https://github.com/LJD-UwU/Document-Processing-Hub.git
cd Document-Processing-Hub
pip install -e .
Quick Start
from documentprocessinghub import identify_file_type, get_file_info
# Identify file type
file_type = identify_file_type("ZJ26042804-8170.xlsx")
print(file_type) # Output: "ZJ"
# Get detailed file information
info = get_file_info("Manpower Budget-Rev 18.1.xlsx")
print(info)
# Output: {
# "file_name": "Manpower Budget-Rev 18.1.xlsx",
# "file_type": "MANPOWER_BUDGET",
# "recognized": True
# }
Supported File Types
ZJ Format
Files following the pattern: ZJ{YYMMDD}{DD}-{version}{.ext}
Examples:
- ZJ26042804-8170.xlsx ✓ Valid
- ZJ26042804-8170 (1).xlsx ✓ Valid (with duplicate suffix)
- ZJ26043007-v2.pdf ✗ Invalid (letters in version)
- ZJ26042804-04-8170.xlsx ✗ Invalid (double dash)
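The four examples above can be reproduced with a single regular expression. This is a sketch under stated assumptions, not the library's own validator: it assumes the version is digits only (implied by the "letters in version" example), that the duplicate suffix sits just before the extension, and it does not check that the six date digits form a real calendar date. `ZJ_PATTERN` and `is_valid_zj` are names invented for this example.

```python
import re

# Hypothetical pattern for ZJ{YYMMDD}{DD}-{version}{.ext}
ZJ_PATTERN = re.compile(
    r"^ZJ"
    r"(?P<date>\d{6})"     # YYMMDD
    r"(?P<dd>\d{2})"       # the two further digits ({DD} in the pattern)
    r"-(?P<version>\d+)"   # version, assumed to be digits only
    r"(?:\s*\(\d+\))?"     # optional duplicate suffix such as " (1)"
    r"\.(?P<ext>\w+)$"     # file extension
)


def is_valid_zj(file_name: str) -> bool:
    """Return True if file_name matches the assumed ZJ naming pattern."""
    return ZJ_PATTERN.match(file_name) is not None


print(is_valid_zj("ZJ26042804-8170.xlsx"))      # True
print(is_valid_zj("ZJ26042804-8170 (1).xlsx"))  # True
print(is_valid_zj("ZJ26043007-v2.pdf"))         # False
print(is_valid_zj("ZJ26042804-04-8170.xlsx"))   # False
```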
Manpower Files
Detected by the "MANPOWER" keyword in the filename.
Types:
- MANPOWER_BUDGET: Contains "BUDGET" keyword
- MANPOWER_DOCUMENTS: Contains "DOCUMENTS" keyword
- MANPOWER: Generic Manpower file
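The keyword rules above could be sketched as follows. The function `classify_manpower` is hypothetical. Case-insensitive matching is inferred from the Quick Start example ("Manpower Budget-Rev 18.1.xlsx" is reported as MANPOWER_BUDGET), and checking "BUDGET" before "DOCUMENTS" is an assumption, since the source does not say which keyword wins when both appear.

```python
from typing import Optional


def classify_manpower(file_name: str) -> Optional[str]:
    """Return a Manpower type identifier, or None if this is not a Manpower file."""
    upper = file_name.upper()
    if "MANPOWER" not in upper:
        return None
    if "BUDGET" in upper:  # assumed precedence: BUDGET is checked first
        return "MANPOWER_BUDGET"
    if "DOCUMENTS" in upper:
        return "MANPOWER_DOCUMENTS"
    return "MANPOWER"  # generic Manpower file


print(classify_manpower("Manpower Budget-Rev 18.1.xlsx"))  # MANPOWER_BUDGET
print(classify_manpower("Manpower Documents 2026.pdf"))    # MANPOWER_DOCUMENTS
print(classify_manpower("Manpower overview.xlsx"))         # MANPOWER
print(classify_manpower("ZJ26042804-8170.xlsx"))           # None
```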
Real-time Production
Files whose names contain both the "REAL-TIME" and "PRODUCTION" keywords
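A minimal sketch of this two-keyword check, assuming case-insensitive matching as with the Manpower keywords. The helper `is_realtime_production` is an illustrative name. The identifier the library returns for these files is not documented here, so the sketch returns only a boolean.

```python
def is_realtime_production(file_name: str) -> bool:
    """Return True if the name contains both REAL-TIME and PRODUCTION."""
    upper = file_name.upper()
    return "REAL-TIME" in upper and "PRODUCTION" in upper


print(is_realtime_production("Real-Time Production Report.xlsx"))  # True
print(is_realtime_production("Production plan.xlsx"))              # False
```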
API Reference
identify_file_type(file_path: str) -> str
Identifies the type of a file based on its name.
Parameters:
file_path (str): Full path or filename
Returns:
str: File type identifier or "ARCHIVO_NO_CONOCIDO" if unrecognized
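Because unrecognized files come back as the "ARCHIVO_NO_CONOCIDO" sentinel rather than raising an error, callers may want to separate them explicitly. The sketch below groups paths by their reported type. `group_by_type` and the stand-in `fake_identify` are hypothetical and exist only so the example runs without the package installed.

```python
from collections import defaultdict
from typing import Callable, Dict, Iterable, List

UNKNOWN = "ARCHIVO_NO_CONOCIDO"  # sentinel documented for unrecognized files


def group_by_type(
    paths: Iterable[str], identify: Callable[[str], str]
) -> Dict[str, List[str]]:
    """Group paths by the type identifier that identify() reports for each one."""
    groups: Dict[str, List[str]] = defaultdict(list)
    for path in paths:
        groups[identify(path)].append(path)
    return dict(groups)


def fake_identify(path: str) -> str:
    """Stand-in for identify_file_type, used only to make this sketch runnable."""
    return "ZJ" if path.startswith("ZJ") else UNKNOWN


groups = group_by_type(["ZJ26042804-8170.xlsx", "notes.txt"], fake_identify)
unknown = groups.pop(UNKNOWN, [])
print(groups)   # {'ZJ': ['ZJ26042804-8170.xlsx']}
print(unknown)  # ['notes.txt']
```

With the package installed, `identify_file_type` could be passed in place of `fake_identify`.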
get_file_info(file_path: str) -> dict
Gets detailed information about a file.
Parameters:
file_path (str): Full path or filename
Returns:
dict: Dictionary with file_name, file_type, and recognized status
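To show how the three fields plausibly relate, here is a sketch that assembles such a dictionary from a type-identifying callable. It assumes that `file_name` is the final path component and that `recognized` is simply "type is not the unknown sentinel". Both are guesses based on the Quick Start output, and `build_file_info` is a hypothetical name.

```python
from pathlib import PurePath
from typing import Callable, Dict, Union

UNKNOWN = "ARCHIVO_NO_CONOCIDO"  # sentinel documented for unrecognized files


def build_file_info(
    file_path: str, identify: Callable[[str], str]
) -> Dict[str, Union[str, bool]]:
    """Assemble a file_name / file_type / recognized dictionary for one path."""
    file_type = identify(file_path)
    return {
        "file_name": PurePath(file_path).name,  # assumed: final path component only
        "file_type": file_type,
        "recognized": file_type != UNKNOWN,     # assumed relationship
    }


def stub_identify(path: str) -> str:
    """Stand-in classifier so the sketch runs without the package installed."""
    return "MANPOWER_BUDGET" if "BUDGET" in path.upper() else UNKNOWN


print(build_file_info("inbox/Manpower Budget-Rev 18.1.xlsx", stub_identify))
```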
Testing
python -m pytest tests/
python tests/test_manpower_types.py
python tests/test_duplicates.py
License
MIT License - See LICENSE file for details
Author
Leonardo J. Diaz (ssalvarezleonardoaa@gmail.com)
Download files
File details
Details for the file documentprocessinghub_ljd-0.1.0.tar.gz.
File metadata
- Download URL: documentprocessinghub_ljd-0.1.0.tar.gz
- Upload date:
- Size: 5.6 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.11.9
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | ebf0c287a5ece11e92ec11079166870cb297f467e9fc1527b63033ea6ba4ec2f |
| MD5 | 621589dd23f205c823e8767ac9258853 |
| BLAKE2b-256 | 04d0a4a02b4a084080fd248919d6fd2ed767a59527efbaa827bfae5590bb1605 |
File details
Details for the file documentprocessinghub_ljd-0.1.0-py3-none-any.whl.
File metadata
- Download URL: documentprocessinghub_ljd-0.1.0-py3-none-any.whl
- Upload date:
- Size: 5.6 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.11.9
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | ee30f0d17f99e8bd0dffefa74a2663c1cb2506debf894e38c0a974f6309e9f42 |
| MD5 | be3e8bd6b4fa9373629defa9d0a82dc8 |
| BLAKE2b-256 | 7d8e65240912249dcb7cde9f59382c5cb9d7dbcc6134a5c2c39388525e4fd0c7 |