File type identification and validation for document processing workflows
Document Processing Hub
A Python library for searching, copying, and moving document files intelligently. Automatically find the latest file of a specific type and perform operations on it.
Features
- Smart File Search: Find the latest file of a specific type in predefined locations or custom folders
- Intelligent Sorting: Files are ordered by relevance (date, version, etc.) per type
- Copy & Move Operations: Perform file operations with simple method chaining
- Two Search Modes:
  - search_file.exists - Search in predefined system locations
  - search_file.local - Search in user-specified folders
- Multiple File Types: Support for ZJ, Manpower Budget, Manpower Documents, Manpower, Real-time Production
- Zero Dependencies: Pure Python implementation with no external dependencies
Installation
From PyPI
pip install documentprocessinghub-ljd
From GitHub
git clone https://github.com/LJD-UwU/Document-Processing-Hub.git
cd Document-Processing-Hub
pip install .
Quick Start
from documentprocessinghub import search_file
# Find latest file in predefined locations
ruta = search_file.exists.manpower_budget()
print(f"Found: {ruta}")
# Find and copy to another location
resultado = search_file.exists.manpower_budget(r"C:\Backup").copy()
print(f"Copied to: {resultado}")
# Find in local folder and move to another
resultado = search_file.local.zj(r"C:\Documentos", r"C:\Procesados").move()
print(f"Moved to: {resultado}")
API Reference: search_file
Overview
The search_file API provides two modes for finding and manipulating files:
- search_file.exists - Search in predefined system locations (fast & automatic)
- search_file.local - Search in folders you specify (flexible & explicit)
Return types:
- Without destination: Returns str (file path)
- With destination: Returns FileResult object (for .copy() or .move())
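The two return shapes can be pictured with a small stand-in built on the standard library. This is a sketch, not the library's implementation: `FileResultSketch` and `find_latest` are hypothetical names, choosing the latest file by modification time is a simplification, and returning `None` from `copy()`/`move()` when nothing was found is an assumption inferred from the `if resultado:` checks used in the examples.

```python
import shutil
from pathlib import Path
from typing import Optional, Union


class FileResultSketch:
    """Hypothetical stand-in for FileResult: a found file plus a destination folder."""

    def __init__(self, source: Optional[Path], destination: Path) -> None:
        self.source = source
        self.destination = destination

    def _target(self) -> Path:
        # Assumption: create the destination folder if it is missing.
        self.destination.mkdir(parents=True, exist_ok=True)
        return self.destination / self.source.name

    def copy(self) -> Optional[str]:
        # Copy the file; the original stays in place.
        if self.source is None:
            return None
        return str(shutil.copy2(self.source, self._target()))

    def move(self) -> Optional[str]:
        # Move the file; the original is removed from its folder.
        if self.source is None:
            return None
        return str(shutil.move(str(self.source), str(self._target())))


def find_latest(folder: str, destination: Optional[str] = None) -> Union[str, FileResultSketch, None]:
    """Return a path string without a destination, or a result object with one."""
    files = [p for p in Path(folder).iterdir() if p.is_file()]
    # Simplification: newest modification time wins.
    latest = max(files, key=lambda p: p.stat().st_mtime, default=None)
    if destination is None:
        return str(latest) if latest is not None else None
    return FileResultSketch(latest, Path(destination))
```

Deferring the file operation to `.copy()` or `.move()` keeps the search call free of side effects until the caller picks one explicitly.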
Mode 1: search_file.exists - Predefined Locations
Only manpower_budget is available in exists mode.
Example 1: Get File Path Only
from documentprocessinghub import search_file
# Returns the path of the latest Manpower Budget file found in predefined locations
ruta = search_file.exists.manpower_budget()
if ruta:
    print(f"Latest file: {ruta}")
    # Output: C:\Sistema\Archivos\Manpower Budget Rev 18.2.xlsx
    # Use it in your code (procesar_archivo stands for your own function)
    procesar_archivo(ruta)
else:
    print("No file found")
Example 2: Find and Copy
# Find latest file and copy it (original remains in place)
backup_path = search_file.exists.manpower_budget(r"C:\Backups").copy()
print(f"Backed up to: {backup_path}")
# Output: C:\Backups\Manpower Budget Rev 18.2.xlsx
# Practical use: Daily backup
from datetime import date
today = date.today().strftime("%Y%m%d")
backup_folder = f"C:\\Backups\\{today}"
search_file.exists.manpower_budget(backup_folder).copy()
Example 3: Find and Move
# Find latest file and move it to another location
procesado_path = search_file.exists.manpower_budget(r"C:\Procesados").move()
print(f"Moved to: {procesado_path}")
# Output: C:\Procesados\Manpower Budget Rev 18.2.xlsx
# Note: The file is removed from original location
Mode 2: search_file.local - Custom Folders
All file types are available in local mode:
- manpower_budget
- manpower_documents
- zj
- manpower
- real_time_production
Example 1: Get File Path from Local Folder
# Search in a specific folder and get the latest file
ruta = search_file.local.zj(r"C:\Documentos")
if ruta:
    print(f"Latest ZJ file: {ruta}")
    # Output: C:\Documentos\ZJ26042912-8105.xlsx
else:
    print("No ZJ files found in folder")
# Multiple searches
zj_file = search_file.local.zj(r"C:\Docs")
budget_file = search_file.local.manpower_budget(r"C:\Docs")
docs_file = search_file.local.manpower_documents(r"C:\Docs")
print(f"ZJ: {zj_file}")
print(f"Budget: {budget_file}")
print(f"Documents: {docs_file}")
Example 2: Search Local and Copy
# Find in one folder and copy to another
copia_path = search_file.local.manpower_budget(
    r"C:\Documentos\Entrada",
    r"C:\Documentos\Copia"
).copy()
print(f"Copied to: {copia_path}")
# Output: C:\Documentos\Copia\Manpower Budget Rev 18.1.xlsx
# Practical use: Process and backup
search_file.local.zj(
    r"C:\Entrada",
    r"C:\Backup"
).copy()  # Backup before processing
Example 3: Search Local and Move
# Find in source folder and move to destination
procesado_path = search_file.local.manpower_documents(
    r"C:\Documentos\Entrada",
    r"C:\Documentos\Procesados"
).move()
print(f"Moved to: {procesado_path}")
# Output: C:\Documentos\Procesados\Manpower Documents Q1 2026.xlsx
# Practical use: Processing pipeline
for carpeta_entrada in [r"C:\Q1", r"C:\Q2", r"C:\Q3"]:
    resultado = search_file.local.manpower_documents(
        carpeta_entrada,
        r"C:\Procesados"
    ).move()
    if resultado:
        print(f"Processed: {resultado}")
Example 4: Process Multiple File Types
# Search for different types in the same folder
origen = r"C:\Documentos"
destino = r"C:\Procesados"
# Process each type
zj = search_file.local.zj(origen, destino).move()
budget = search_file.local.manpower_budget(origen, destino).move()
docs = search_file.local.manpower_documents(origen, destino).move()
# Log results
if zj:
    print(f"ZJ: {zj}")
if budget:
    print(f"Budget: {budget}")
if docs:
    print(f"Documents: {docs}")
File Selection Criteria
The "latest" file is selected based on the file type:
ZJ Files: By date, version, and duplicate count
ZJ26042912-8105(4) > ZJ26042912-8105(1) > ZJ26042822-8005
(newer first; on the same date, the higher duplicate count wins; the older date ranks last)
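One way to express this ordering is a tuple sort key. The sketch below assumes ZJ names look like `ZJ<digits>-<digits>` with an optional `(n)` duplicate suffix, as in the examples above. The regular expression and the name `zj_sort_key` are illustrative and are not taken from the library.

```python
import re
from typing import Optional, Tuple

# Assumed pattern: "ZJ" + numeric block, "-", numeric block, optional "(n)" suffix.
ZJ_PATTERN = re.compile(r"^ZJ(\d+)-(\d+)(?:\((\d+)\))?")


def zj_sort_key(name: str) -> Optional[Tuple[int, int, int]]:
    """Return (first block, second block, duplicate count), or None if not a ZJ name."""
    match = ZJ_PATTERN.match(name)
    if match is None:
        return None
    first, second, duplicate = match.groups()
    return int(first), int(second), int(duplicate or 0)


names = ["ZJ26042822-8005.xlsx", "ZJ26042912-8105(1).xlsx", "ZJ26042912-8105(4).xlsx"]
candidates = [n for n in names if zj_sort_key(n) is not None]
latest = max(candidates, key=zj_sort_key)
print(latest)  # ZJ26042912-8105(4).xlsx
```

Tuples compare element by element, so the duplicate count only matters when the two numeric blocks are equal.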
MANPOWER_BUDGET: By version and month
Rev 18.2 > Rev 18.1 > Rev 17.0 April
(newer version first; Rev 18.2 and Rev 18.1 share the same major version)
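A comparable sketch for revision-style names follows. It assumes the version appears as `Rev <major>.<minor>`, optionally followed by an English month name, and it uses the month only as a tiebreaker; how the library actually weighs the month is not stated here. `budget_sort_key` is a hypothetical helper.

```python
import re
from typing import Tuple

# English month names, hard-coded so the result does not depend on the locale.
MONTHS = ["january", "february", "march", "april", "may", "june",
          "july", "august", "september", "october", "november", "december"]
MONTH_NUMBERS = {name: number for number, name in enumerate(MONTHS, start=1)}

# Assumed pattern: "Rev <major>.<minor>" optionally followed by a month name.
REV_PATTERN = re.compile(r"Rev\s*(\d+)\.(\d+)(?:\s+([A-Za-z]+))?")


def budget_sort_key(name: str) -> Tuple[int, int, int]:
    """Return (major, minor, month number); unparseable names get the lowest key."""
    match = REV_PATTERN.search(name)
    if match is None:
        return (-1, -1, 0)
    major, minor, month = match.groups()
    return int(major), int(minor), MONTH_NUMBERS.get((month or "").lower(), 0)


names = ["Rev 17.0 April", "Rev 18.2", "Rev 18.1"]
ordered = sorted(names, key=budget_sort_key, reverse=True)
print(ordered)  # ['Rev 18.2', 'Rev 18.1', 'Rev 17.0 April']
```

Parsing major and minor as integers avoids the string-comparison trap where "Rev 9.0" would sort above "Rev 18.0".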
MANPOWER_DOCUMENTS: By year, month, and quarter
2026 Q2 > 2026 Q1 > 2025 Q4
(newer year first; within the same year, the later quarter wins)
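The year and quarter can appear in either order in a file name (the earlier output example shows "Q1 2026"), so a sketch of this ordering is easier to write by searching for each part independently. The patterns and the name `documents_sort_key` are assumptions, and month handling is left out because the naming for months is not shown here.

```python
import re
from typing import Tuple

YEAR = re.compile(r"(?<!\d)(20\d{2})(?!\d)")        # a standalone four-digit year 20xx
QUARTER = re.compile(r"\bQ([1-4])\b", re.IGNORECASE)  # Q1 .. Q4


def documents_sort_key(name: str) -> Tuple[int, int]:
    """Return (year, quarter); missing parts count as 0."""
    year = YEAR.search(name)
    quarter = QUARTER.search(name)
    return (int(year.group(1)) if year else 0,
            int(quarter.group(1)) if quarter else 0)


names = ["2025 Q4", "Manpower Documents Q1 2026.xlsx", "2026 Q2"]
ordered = sorted(names, key=documents_sort_key, reverse=True)
print(ordered)  # ['2026 Q2', 'Manpower Documents Q1 2026.xlsx', '2025 Q4']
```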
MANPOWER: By month and day
April_29 > April_28 > March_28
(newer first; within the same month, the later day wins)
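A month-and-day ordering can be sketched by mapping English month names to numbers. The file names below are hypothetical, built around the `Month_Day` fragments shown above. Note that a key without a year cannot order files correctly across a year boundary (December versus January); how the library deals with that case is not stated here.

```python
import re
from typing import Tuple

MONTHS = ["january", "february", "march", "april", "may", "june",
          "july", "august", "september", "october", "november", "december"]
MONTH_NUMBERS = {name: number for number, name in enumerate(MONTHS, start=1)}

# Assumed fragment: "<MonthName>_<day>", e.g. "April_29".
MONTH_DAY = re.compile(r"([A-Za-z]+)_(\d{1,2})")


def manpower_sort_key(name: str) -> Tuple[int, int]:
    """Return (month number, day) for the first recognizable Month_Day fragment."""
    for month, day in MONTH_DAY.findall(name):
        number = MONTH_NUMBERS.get(month.lower())
        if number is not None:
            return number, int(day)
    return (0, 0)


names = ["Manpower_March_28.xlsx", "Manpower_April_29.xlsx", "Manpower_April_28.xlsx"]
latest = max(names, key=manpower_sort_key)
print(latest)  # Manpower_April_29.xlsx
```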
REAL_TIME_PRODUCTION: By modification date (most recent first)
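Selecting by modification date needs no name parsing at all. A minimal standard-library sketch, assuming the candidates are `.xlsx` files in a single folder (`newest_by_mtime` and the default pattern are illustrative):

```python
from pathlib import Path
from typing import Optional


def newest_by_mtime(folder: str, pattern: str = "*.xlsx") -> Optional[Path]:
    """Return the most recently modified file matching pattern, or None if there is none."""
    files = [p for p in Path(folder).glob(pattern) if p.is_file()]
    return max(files, key=lambda p: p.stat().st_mtime, default=None)
```

Modification time changes whenever a file is saved or rewritten, so this criterion follows the last write rather than anything encoded in the file name.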
Common Patterns
Pattern 1: Daily Backup
from documentprocessinghub import search_file
from datetime import date
def daily_backup():
    today = date.today().strftime("%Y%m%d")
    backup_folder = f"C:\\Backups\\{today}"
    ruta = search_file.exists.manpower_budget(backup_folder).copy()
    if ruta:
        print(f"Backup successful: {ruta}")
    else:
        print("No file found to back up")
# Run daily
daily_backup()
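Whether the library creates a missing destination folder is not stated here, so a cautious variant builds the dated folder up front. The helper below is a standard-library sketch; `dated_backup_folder` is a hypothetical name.

```python
from datetime import date
from pathlib import Path
from typing import Optional


def dated_backup_folder(root: str, day: Optional[date] = None) -> Path:
    """Create (if needed) and return root/YYYYMMDD for the given day, default today."""
    day = day or date.today()
    folder = Path(root) / day.strftime("%Y%m%d")
    folder.mkdir(parents=True, exist_ok=True)  # no error if it already exists
    return folder
```

The returned path can be converted with `str(...)` and passed as the destination argument in the pattern above.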
Pattern 2: Processing Pipeline
from documentprocessinghub import search_file
def process_documents():
    entrada = r"C:\Entrada"
    procesados = r"C:\Procesados"
    # Process each type
    for tipo in ["zj", "manpower_budget", "manpower_documents"]:
        # Get the function dynamically
        search_func = getattr(search_file.local, tipo)
        resultado = search_func(entrada, procesados).move()
        if resultado:
            print(f"Processed ({tipo}): {resultado}")
process_documents()
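With plain `getattr(obj, name)`, a misspelled type name raises `AttributeError` partway through the loop. The sketch below resolves all names first and reports the unknown ones. `LocalSearchStub` is a hypothetical stand-in for `search_file.local` so the example runs without the library, and `resolve_finders` is an illustrative helper.

```python
from typing import Callable, Dict, Iterable, List, Tuple


class LocalSearchStub:
    """Hypothetical stand-in exposing two finder methods."""

    def zj(self, source: str, destination: str) -> str:
        return f"zj: {source} -> {destination}"

    def manpower_budget(self, source: str, destination: str) -> str:
        return f"manpower_budget: {source} -> {destination}"


def resolve_finders(namespace: object, type_names: Iterable[str]) -> Tuple[Dict[str, Callable], List[str]]:
    """Split type names into callable attributes of namespace and unknown names."""
    found: Dict[str, Callable] = {}
    missing: List[str] = []
    for name in type_names:
        func = getattr(namespace, name, None)  # None instead of AttributeError
        if callable(func):
            found[name] = func
        else:
            missing.append(name)
    return found, missing


found, missing = resolve_finders(LocalSearchStub(), ["zj", "manpower_budget", "zj_typo"])
print(sorted(found), missing)  # ['manpower_budget', 'zj'] ['zj_typo']
```

Resolving every name before any file is moved means a typo is caught before the pipeline has touched anything.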
Pattern 3: Safe Backup Before Processing
from documentprocessinghub import search_file
def safe_process(carpeta_entrada):
    # Step 1: Backup the file
    backup = search_file.local.manpower_budget(
        carpeta_entrada,
        r"C:\Backup"
    ).copy()
    if not backup:
        print("Error: No file found")
        return
    # Step 2: Process the original
    procesado = search_file.local.manpower_budget(
        carpeta_entrada,
        r"C:\Procesados"
    ).move()
    print(f"Backed up: {backup}")
    print(f"Processed: {procesado}")
safe_process(r"C:\Entrada")
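Pattern 3 runs two separate searches, so the second call could select a different file if the folder changes between them. A stricter variant resolves the path once, verifies the backup, and only then moves the original. The sketch below uses only the standard library and does not call the package; `sha256_of` and `backup_then_move` are hypothetical helpers.

```python
import hashlib
import shutil
from pathlib import Path


def sha256_of(path: Path) -> str:
    """Hash a file in 1 MiB chunks so large workbooks are not loaded at once."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()


def backup_then_move(source: Path, backup_dir: Path, processed_dir: Path) -> Path:
    """Copy source to backup_dir, verify the copy, then move source to processed_dir."""
    backup_dir.mkdir(parents=True, exist_ok=True)
    processed_dir.mkdir(parents=True, exist_ok=True)
    backup = Path(shutil.copy2(source, backup_dir / source.name))
    if sha256_of(backup) != sha256_of(source):
        raise OSError(f"Backup of {source} does not match the original")
    return Path(shutil.move(str(source), str(processed_dir / source.name)))
```

The path passed as `source` could come from a single `search_file.local` call without a destination, which returns the file path as a string.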
Project Structure
document-processing-hub/
├── documentprocessinghub/     # Main package
│   ├── __init__.py            # Package initialization
│   ├── fileSelector.py        # search_file API implementation
│   ├── scanNameFiles.py       # File type identification
│   ├── validators.py          # Format validation
│   └── paths_config.py        # Predefined paths configuration
├── examples/                  # Usage examples
│   ├── main.py                # Interactive examples
│   └── USAGE.md               # Detailed usage guide
├── pyproject.toml             # Project configuration
├── README.md                  # This file
├── LICENSE                    # MIT License
└── .gitignore                 # Git ignore rules
Changelog
Version 0.4.0 (2026-04-30)
Major Changes
- Renamed API: find_latest_file → search_file (clearer intent)
- Simplified behavior:
  - Without destination: Returns str (file path)
  - With destination: Returns FileResult for .copy() or .move()
- Restricted exists mode: Only manpower_budget available in search_file.exists
- Enhanced documentation: Complete docstrings for IDE support
- All types in local: All file types available in search_file.local
Version 0.3.1 (2026-04-30)
Fixes
- Fixed FileResult.copy() missing destination argument
- Improved API parameter handling
Version 0.3.0 (2026-04-29)
New Features
- Fluent API with dynamic methods for each file type
- FileResult class for file operations
Version 0.2.0 (2026-04-29)
New Features
- Initial file search functionality
- Support for multiple file types
- Smart file selection based on date and version
License
MIT License - See LICENSE file for details
Author
LJD-UwU
- Email: himexpe.interns@hisense.com
- GitHub: @LJD-UwU
Contributing
Contributions are welcome! Please feel free to submit a Pull Request.
Data Processing: process_file
The process_file module provides tools for cleaning and processing Excel data.
clean_sheet() - Clean and Flatten Excel Data
Automatically detects headers, cleans data, and creates a structured "Datos_Limpios" sheet.
Features:
- Automatic header detection
- Data normalization and cleaning
- Professional formatting (colors, borders, frozen headers)
- Intelligent column width adjustment
- Removes empty rows and duplicate columns
- Returns pandas DataFrame for further analysis
Usage:
from documentprocessinghub import clean_sheet
# Option 1: Clean and overwrite
df = clean_sheet("datos.xlsx")
# Option 2: Save to new file
df = clean_sheet("entrada.xlsx", output_path="salida.xlsx")
# Option 3: Process specific sheet
df = clean_sheet("datos.xlsx", nombre_hoja="Producción")
# Result is a pandas DataFrame
print(df.shape) # (rows, columns)
print(df.columns) # Column names
print(df.head()) # First rows
What It Does:
- Reads the Excel file
- Detects headers and data rows
- Cleans data (removes nulls, normalizes columns)
- Applies professional formatting
- Creates "Datos_Limpios" sheet with cleaned data
- Returns DataFrame for analysis
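The detection and cleaning steps can be illustrated on plain lists of rows. This is a sketch only: the library's actual header-detection logic is not documented here, the heuristic (first row with no blank cells) is an assumption, and the example works on in-memory lists rather than an Excel file or a pandas DataFrame. `clean_rows` and `is_blank` are hypothetical names.

```python
from typing import Any, List, Tuple


def is_blank(cell: Any) -> bool:
    return cell is None or str(cell).strip() == ""


def clean_rows(rows: List[List[Any]]) -> Tuple[List[str], List[List[Any]]]:
    """Drop empty rows, pick a header row, and remove duplicate columns."""
    # Step 1: remove rows in which every cell is blank.
    non_empty = [row for row in rows if not all(is_blank(c) for c in row)]
    if not non_empty:
        return [], []
    # Step 2: assumed heuristic, the header is the first row with no blank cells.
    header_index = next(
        (i for i, row in enumerate(non_empty) if all(not is_blank(c) for c in row)), 0
    )
    header = ["" if is_blank(c) else str(c).strip() for c in non_empty[header_index]]
    # Step 3: keep only the first column for each header name.
    keep: List[int] = []
    seen = set()
    for index, name in enumerate(header):
        if name not in seen:
            seen.add(name)
            keep.append(index)
    # Step 4: project the data rows onto the kept columns (short rows padded with None).
    data = [
        [row[i] if i < len(row) else None for i in keep]
        for row in non_empty[header_index + 1:]
    ]
    return [header[i] for i in keep], data


raw = [
    [None, None, None],
    ["Report", None, None],
    ["Line", "Qty", "Line"],
    ["A", 1, "A"],
    [None, None, None],
    ["B", 2, "B"],
]
columns, data = clean_rows(raw)
print(columns, data)  # ['Line', 'Qty'] [['A', 1], ['B', 2]]
```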
Made with care for document processing automation