Document Processing Hub

File type identification and validation for document processing workflows.

PyPI version | Python 3.10+ | License: MIT

A Python library for searching, copying, and moving document files intelligently. Automatically find the latest file of a specific type and perform operations on it.

🌟 Features

  • Smart File Search: Find the latest file of a specific type in predefined locations or custom folders
  • Intelligent Sorting: Files are ordered by relevance (date, version, etc.) per type
  • Copy & Move Operations: Perform file operations with simple method chaining
  • Two Search Modes:
    • search_file.exists - Search in predefined system locations
    • search_file.local - Search in user-specified folders
  • Multiple File Types: Support for ZJ, Manpower Budget, Manpower Documents, Manpower, Real-time Production
  • Zero Dependencies: Pure Python implementation with no external dependencies

📦 Installation

From PyPI

pip install documentprocessinghub-ljd

From GitHub

git clone https://github.com/LJD-UwU/Document-Processing-Hub.git
cd Document-Processing-Hub
pip install .

🚀 Quick Start

from documentprocessinghub import search_file

# Find latest file in predefined locations
ruta = search_file.exists.manpower_budget()
print(f"Found: {ruta}")

# Find and copy to another location
resultado = search_file.exists.manpower_budget(r"C:\Backup").copy()
print(f"Copied to: {resultado}")

# Find in local folder and move to another
resultado = search_file.local.zj(r"C:\Documentos", r"C:\Procesados").move()
print(f"Moved to: {resultado}")

๐Ÿ” API Reference: search_file

Overview

The search_file API provides two modes for finding and manipulating files:

  1. search_file.exists - Search in predefined system locations (fast & automatic)
  2. search_file.local - Search in folders you specify (flexible & explicit)

Return types:

  • Without destination: Returns str (file path)
  • With destination: Returns FileResult object (for .copy() or .move())
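The deferred-operation design can be pictured with a minimal stand-in for FileResult. This is a hypothetical sketch of the pattern the API documents (find now, act on .copy()/.move()), not the library's actual implementation:

```python
import shutil
from pathlib import Path

class FileResultSketch:
    """Hypothetical stand-in for FileResult: remembers the found source
    file and the destination folder, and defers the actual file
    operation until .copy() or .move() is called."""

    def __init__(self, source, destination):
        self.source = Path(source)
        self.destination = Path(destination)

    def _target(self):
        # Create the destination folder if needed; keep the file name
        self.destination.mkdir(parents=True, exist_ok=True)
        return self.destination / self.source.name

    def copy(self):
        # The original stays in place
        return str(shutil.copy2(self.source, self._target()))

    def move(self):
        # The original is removed from its location
        return str(shutil.move(str(self.source), str(self._target())))
```

Calling the search without a destination short-circuits to a plain path string; only the destination form needs this deferred object.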

Mode 1: search_file.exists - Predefined Locations

Only manpower_budget is available in exists mode.

Example 1: Get File Path Only

from documentprocessinghub import search_file

# Returns the path of the latest Manpower Budget file found in predefined locations
ruta = search_file.exists.manpower_budget()

if ruta:
    print(f"Latest file: {ruta}")
    # Output: C:\Sistema\Archivos\Manpower Budget Rev 18.2.xlsx
else:
    print("No file found")

# Use it in your code (procesar_archivo stands in for your own function)
if ruta:
    procesar_archivo(ruta)

Example 2: Find and Copy

# Find latest file and copy it (original remains in place)
backup_path = search_file.exists.manpower_budget(r"C:\Backups").copy()

print(f"Backed up to: {backup_path}")
# Output: C:\Backups\Manpower Budget Rev 18.2.xlsx

# Practical use: Daily backup
from datetime import date
today = date.today().strftime("%Y%m%d")
backup_folder = f"C:\\Backups\\{today}"
search_file.exists.manpower_budget(backup_folder).copy()

Example 3: Find and Move

# Find latest file and move it to another location
procesado_path = search_file.exists.manpower_budget(r"C:\Procesados").move()

print(f"Moved to: {procesado_path}")
# Output: C:\Procesados\Manpower Budget Rev 18.2.xlsx

# Note: The file is removed from original location

Mode 2: search_file.local - Custom Folders

All file types are available in local mode:

  • manpower_budget
  • manpower_documents
  • zj
  • manpower
  • real_time_production

Example 1: Get File Path from Local Folder

# Search in a specific folder and get the latest file
ruta = search_file.local.zj(r"C:\Documentos")

if ruta:
    print(f"Latest ZJ file: {ruta}")
    # Output: C:\Documentos\ZJ26042912-8105.xlsx
else:
    print("No ZJ files found in folder")

# Multiple searches
zj_file = search_file.local.zj(r"C:\Docs")
budget_file = search_file.local.manpower_budget(r"C:\Docs")
docs_file = search_file.local.manpower_documents(r"C:\Docs")

print(f"ZJ: {zj_file}")
print(f"Budget: {budget_file}")
print(f"Documents: {docs_file}")

Example 2: Search Local and Copy

# Find in one folder and copy to another
copia_path = search_file.local.manpower_budget(
    r"C:\Documentos\Entrada",
    r"C:\Documentos\Copia"
).copy()

print(f"Copied to: {copia_path}")
# Output: C:\Documentos\Copia\Manpower Budget Rev 18.1.xlsx

# Practical use: Process and backup
search_file.local.zj(
    r"C:\Entrada",
    r"C:\Backup"
).copy()  # Backup before processing

Example 3: Search Local and Move

# Find in source folder and move to destination
procesado_path = search_file.local.manpower_documents(
    r"C:\Documentos\Entrada",
    r"C:\Documentos\Procesados"
).move()

print(f"Moved to: {procesado_path}")
# Output: C:\Documentos\Procesados\Manpower Documents Q1 2026.xlsx

# Practical use: Processing pipeline
for carpeta_entrada in [r"C:\Q1", r"C:\Q2", r"C:\Q3"]:
    resultado = search_file.local.manpower_documents(
        carpeta_entrada,
        r"C:\Procesados"
    ).move()
    if resultado:
        print(f"Processed: {resultado}")

Example 4: Process Multiple File Types

# Search for different types in the same folder
origen = r"C:\Documentos"
destino = r"C:\Procesados"

# Process each type
zj = search_file.local.zj(origen, destino).move()
budget = search_file.local.manpower_budget(origen, destino).move()
docs = search_file.local.manpower_documents(origen, destino).move()

# Log results
if zj:
    print(f"ZJ: {zj}")
if budget:
    print(f"Budget: {budget}")
if docs:
    print(f"Documents: {docs}")

File Selection Criteria

The "latest" file is selected based on the file type:

ZJ Files: By date, version, and duplicate count

ZJ26042912-8105(4) > ZJ26042912-8105(1) > ZJ26042822-8005
     ↑ newer        ↑ same date         ↑ older date
                    ↑ higher duplicate

MANPOWER_BUDGET: By version and month

Rev 18.2 > Rev 18.1 > Rev 17.0 April
 ↑ newer   ↑ same major version

MANPOWER_DOCUMENTS: By year, month, and quarter

2026 Q2 > 2026 Q1 > 2025 Q4
↑ newer   ↑ newer in same year

MANPOWER: By month and day

April_29 > April_28 > March_28
 ↑ newer    ↑ newer in month

REAL_TIME_PRODUCTION: By modification date (most recent first)
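As an illustration, version-based ordering like the Manpower Budget case can be sketched with a sort key. The regex and tie-breaking below are assumptions for this example only, not the library's actual selection rules:

```python
import re

def budget_sort_key(filename):
    # Extract "Rev <major>.<minor>" from a name like
    # "Manpower Budget Rev 18.2.xlsx"; unmatched names sort last.
    m = re.search(r"Rev\s+(\d+)\.(\d+)", filename)
    if not m:
        return (-1, -1)
    return (int(m.group(1)), int(m.group(2)))

files = [
    "Manpower Budget Rev 17.0 April.xlsx",
    "Manpower Budget Rev 18.2.xlsx",
    "Manpower Budget Rev 18.1.xlsx",
]
latest = max(files, key=budget_sort_key)
print(latest)  # Manpower Budget Rev 18.2.xlsx
```

The ZJ and MANPOWER types would use analogous keys built from date, duplicate count, or month/day fields.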


💡 Common Patterns

Pattern 1: Daily Backup

from documentprocessinghub import search_file
from datetime import date

def daily_backup():
    today = date.today().strftime("%Y%m%d")
    backup_folder = f"C:\\Backups\\{today}"
    
    ruta = search_file.exists.manpower_budget(backup_folder).copy()
    if ruta:
        print(f"✓ Backup successful: {ruta}")
    else:
        print("✗ No file found to backup")

# Run daily
daily_backup()

Pattern 2: Processing Pipeline

from documentprocessinghub import search_file

def process_documents():
    entrada = r"C:\Entrada"
    procesados = r"C:\Procesados"
    
    # Process each type
    for tipo in ["zj", "manpower_budget", "manpower_documents"]:
        # Get the function dynamically
        search_func = getattr(search_file.local, tipo)
        
        resultado = search_func(entrada, procesados).move()
        if resultado:
            print(f"Processed ({tipo}): {resultado}")

process_documents()

Pattern 3: Safe Backup Before Processing

from documentprocessinghub import search_file

def safe_process(carpeta_entrada):
    # Step 1: Backup the file
    backup = search_file.local.manpower_budget(
        carpeta_entrada,
        r"C:\Backup"
    ).copy()
    
    if not backup:
        print("✗ Error: No file found")
        return
    
    # Step 2: Process the original
    procesado = search_file.local.manpower_budget(
        carpeta_entrada,
        r"C:\Procesados"
    ).move()
    
    print(f"✓ Backed up: {backup}")
    print(f"✓ Processed: {procesado}")

safe_process(r"C:\Entrada")

๐Ÿ—๏ธ Project Structure

document-processing-hub/
├── documentprocessinghub/          # Main package
│   ├── __init__.py                # Package initialization
│   ├── fileSelector.py            # search_file API implementation
│   ├── scanNameFiles.py           # File type identification
│   ├── validators.py              # Format validation
│   └── paths_config.py            # Predefined paths configuration
├── examples/                       # Usage examples
│   ├── main.py                    # Interactive examples
│   └── USAGE.md                   # Detailed usage guide
├── pyproject.toml                 # Project configuration
├── README.md                       # This file
├── LICENSE                         # MIT License
└── .gitignore                      # Git ignore rules

๐Ÿ“ Changelog

Version 0.4.0 (2026-04-30)

Major Changes

  • Renamed API: find_latest_file → search_file (clearer intent)
  • Simplified behavior:
    • Without destination: Returns str (file path)
    • With destination: Returns FileResult for .copy() or .move()
  • Restricted exists mode: Only manpower_budget available in search_file.exists
  • Enhanced documentation: Complete docstrings for IDE support
  • All types in local: All file types available in search_file.local

Version 0.3.1 (2026-04-30)

Fixes

  • Fixed FileResult.copy() missing destination argument
  • Improved API parameter handling

Version 0.3.0 (2026-04-29)

New Features

  • Fluent API with dynamic methods for each file type
  • FileResult class for file operations

Version 0.2.0 (2026-04-29)

New Features

  • Initial file search functionality
  • Support for multiple file types
  • Smart file selection based on date and version

📄 License

MIT License - See LICENSE file for details

👨‍💻 Author

LJD-UwU

๐Ÿค Contributing

Contributions are welcome! Please feel free to submit a Pull Request.


🔧 Data Processing: process_file

The process_file module provides tools for cleaning and processing Excel data.

clean_sheet() - Clean and Flatten Excel Data

Automatically detects headers, cleans data, and creates a structured "Datos_Limpios" sheet.

Features:

  • ✅ Automatic header detection
  • ✅ Data normalization and cleaning
  • ✅ Professional formatting (colors, borders, frozen headers)
  • ✅ Intelligent column width adjustment
  • ✅ Removes empty rows and duplicate columns
  • ✅ Returns pandas DataFrame for further analysis

Usage:

from documentprocessinghub import clean_sheet

# Option 1: Clean and overwrite
df = clean_sheet("datos.xlsx")

# Option 2: Save to new file
df = clean_sheet("entrada.xlsx", output_path="salida.xlsx")

# Option 3: Process specific sheet
df = clean_sheet("datos.xlsx", nombre_hoja="Producción")

# Result is a pandas DataFrame
print(df.shape)      # (rows, columns)
print(df.columns)    # Column names
print(df.head())     # First rows

What It Does:

  1. Reads the Excel file
  2. Detects headers and data rows
  3. Cleans data (removes nulls, normalizes columns)
  4. Applies professional formatting
  5. Creates "Datos_Limpios" sheet with cleaned data
  6. Returns DataFrame for analysis
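Step 2 (header detection) can be imagined with a simple density heuristic. This is an illustrative sketch under assumed rules, not clean_sheet's actual algorithm:

```python
def detect_header_row(rows):
    """Pick the first row where most cells are non-empty strings.
    A simplified stand-in for an (undocumented) header heuristic."""
    for i, row in enumerate(rows):
        filled = [c for c in row if isinstance(c, str) and c.strip()]
        if row and len(filled) / len(row) >= 0.8:
            return i
    return 0  # fall back to the first row

sheet = [
    [None, None, None],          # blank padding row
    ["Date", "Line", "Units"],   # header row
    ["2026-04-29", "A", 120],
    ["2026-04-30", "B", 95],
]
idx = detect_header_row(sheet)
print(idx)  # 1
```

Everything above the detected row would be discarded, and the rows below become the cleaned data written to "Datos_Limpios".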

Made with care for document processing automation

Download files

Source distribution: documentprocessinghub_ljd-0.5.0.tar.gz (19.7 kB)

  SHA256      c12d8abe0fc414c19d32160a58d9c3f8520a5eb1a190725571ed518b5864611c
  MD5         348bc7999e7d0eaba709eafc8994e24e
  BLAKE2b-256 d791ee3c708a09135f0a882fa32c921a982084c87bc463cb85a04c2110e74ee8

Built distribution: documentprocessinghub_ljd-0.5.0-py3-none-any.whl (16.9 kB)

  SHA256      578b88cd12fdd85dfcb7fad2764f0e63b793973cf3fb5e8811a36a8867fe4e2f
  MD5         20d9a722ce3af771025bf9a2237068f6
  BLAKE2b-256 f4e57a4018501a71748a65faa4aa5a86679efc2b2eb62c469ececfef4880e6a6
