Skip to main content

A Python toolkit for Robotic Process Automation (RPA) and ETL pipelines providing solutions to common automation challenges.

Project description

rpatoolkit

A Python toolkit for Robotic Process Automation (RPA) and ETL pipelines providing solutions to common automation challenges.

Table of Contents

Overview

rpatoolkit is a comprehensive Python library designed to simplify common RPA and ETL tasks. It provides utilities for reading Excel files, manipulating data frames, and handling file system operations with a focus on automation workflows.

Features

  • Excel Reading Utilities: Enhanced Excel reading capabilities with automatic column name cleaning, data type casting, and header row detection
  • Data Frame Utilities: Functions for renaming, reordering, and validating columns in Polars DataFrames/LazyFrames
  • File System Utilities: Tools for creating uniquely named directories with timestamp and UUID
  • Header Detection: Automatic detection of header rows in Excel files with offset data
  • Column Name Cleaning: Automatic cleaning and standardization of column names
  • Multiple Sheet Reading: Support for reading all sheets from an Excel workbook at once

Installation

To install rpatoolkit, you need Python 3.10 or higher. You can install it using pip:

pip install rpatoolkit

Or if you're using uv (recommended):

uv add rpatoolkit

Development Installation

For development purposes:

# Clone the repository
git clone https://github.com/fnxL/rpatoolkit.git
cd rpatoolkit

# Install in development mode
uv sync --dev

Usage

Data Frame Utilities

Reading Excel Files

The main function for reading Excel files with enhanced capabilities:

import polars as pl
from rpatoolkit.df import read_excel

# Basic usage
df = read_excel("data.xlsx")

# Reading a specific sheet
df = read_excel("data.xlsx", sheet_name="Sheet1")

# Casting specific columns to certain data types
df = read_excel("data.xlsx", cast={"date": pl.Date, "value": pl.Float64})

# Reading all sheets at once
all_sheets = read_excel("data.xlsx", read_all_sheets=True)

# Reading Excel with header row at an offset
df = read_excel("data.xlsx", header_row=3)

Finding Header Row

Automatically detect the header row in Excel files where headers are not at the top:

from rpatoolkit.df import find_header_row, read_excel

# Find the header row
header_row_index = find_header_row("data.xlsx")
print(f"Header row found at index: {header_row_index}")

# Read the Excel file using the detected header row
df = read_excel("data.xlsx", header_row=header_row_index)

# Find header row with expected keywords
header_row_index = find_header_row("data.xlsx", expected_keywords=["name", "age", "city"])

Column Utilities

Utilities for manipulating DataFrame columns:

from rpatoolkit.df import rename_columns, reorder_columns, get_missing_columns

# Rename columns
df = rename_columns(df, {"old_name": "new_name", "id": "identifier"})

# Reorder columns
df = reorder_columns(df, ["name", "age", "city"])

# Check for missing columns
required_columns = ["name", "email", "phone"]
missing = get_missing_columns(df, required_columns)
if missing:
    print(f"Missing columns: {missing}")

File System Utilities

Creating Unique Directories

Create uniquely named directories with timestamps and UUIDs:

from rpatoolkit.fs import make_unique_dir

# Create a unique directory with default settings
unique_dir = make_unique_dir(prefix="backup")

# Create a directory with custom settings
unique_dir = make_unique_dir(
    base_path="./exports",
    prefix="report",
    include_date=True,
    include_time=True,
    use_12h_format=False  # 24-hour format
)

# Create a directory without automatically creating it on filesystem
path_only = make_unique_dir(create=False, prefix="temp")

API Documentation

Data Frame Module (rpatoolkit.df)

read_excel

Reads an Excel file into a Polars LazyFrame with enhanced functionality.

def read_excel(
    source: FileSource,
    *,
    sheet_id: int | None = None,
    sheet_name: str | None = None,
    table_name: str | None = None,
    engine: ExcelSpreadsheetEngine = "calamine",
    engine_options: dict[str, Any] | None = None,
    read_options: dict[str, Any] | None = None,
    has_header: bool = True,
    columns: Sequence[int] | Sequence[str] | str | None = None,
    schema_overrides: SchemaDict | None = None,
    infer_schema_length: int | None = 100,
    include_file_paths: str | None = None,
    drop_empty_rows: bool = False,
    drop_empty_cols: bool = False,
    raise_if_empty: bool = True,
    header_row: int | None = None,
    cast: dict[str, pl.DataType] | None = None,
    read_all_sheets: bool = False,
    lower_column_names: bool = True,
    clean_column_names: bool = False,
) -> pl.LazyFrame | dict[str, pl.LazyFrame]:

Parameters:

  • source: Path to the Excel file or file-like object to read
  • sheet_id: Sheet number to read (cannot be used with sheet_name)
  • sheet_name: Sheet name to read (cannot be used with sheet_id)
  • table_name: Name of a specific table to read
  • engine: Library used to parse the spreadsheet file ('calamine', 'openpyxl', 'xlsx2csv')
  • engine_options: Additional options passed to the underlying engine
  • read_options: Options passed to the underlying engine method that reads the sheet data
  • has_header: Whether the sheet has a header row
  • columns: Columns to read from the sheet
  • schema_overrides: Support type specification or override of one or more columns
  • infer_schema_length: Number of rows to infer the schema from
  • drop_empty_rows: Remove empty rows from the result
  • drop_empty_cols: Remove empty columns from the result
  • raise_if_empty: Raise an exception if the resulting DataFrame is empty
  • header_row: Row number to use as header (0-indexed)
  • cast: Dictionary mapping column names to desired data types for casting
  • read_all_sheets: Read all sheets in the Excel workbook
  • lower_column_names: Convert column names to lowercase
  • clean_column_names: Clean column names by stripping punctuation

Returns:

  • LazyFrame or dict[str, LazyFrame] if reading multiple sheets

find_header_row

Finds the header row in an Excel file by identifying the first row with maximum consecutive non-null values.

def find_header_row(
    source: FileSource,
    sheet_id: int | None = None,
    sheet_name: str | None = None,
    max_rows: int = 200,
    expected_keywords: list[str] | None = None,
) -> int:

Parameters:

  • source: Path to the Excel file or file-like object to read
  • sheet_id: Sheet number to read (cannot be used with sheet_name)
  • sheet_name: Sheet name to read (cannot be used with sheet_id)
  • max_rows: Maximum number of rows to scan for header identification
  • expected_keywords: List of keywords to look for in the header row

Returns:

  • int: The zero-based index of the header row

rename_columns

Rename columns of a Polars DataFrame or LazyFrame.

def rename_columns(
    df: pl.DataFrame | pl.LazyFrame,
    columns_map: dict,
    strict: bool = True
) -> pl.DataFrame | pl.LazyFrame:

reorder_columns

Reorder columns of a Polars DataFrame or LazyFrame.

def reorder_columns(
    df: pl.DataFrame | pl.LazyFrame,
    columns_order: list[str]
) -> pl.DataFrame | pl.LazyFrame:

get_missing_columns

Check if a Polars DataFrame or LazyFrame contains all required columns.

def get_missing_columns(
    df: pl.DataFrame | pl.LazyFrame,
    required_columns: list[str]
) -> list[str]:

File System Module (rpatoolkit.fs)

make_unique_dir

Creates a unique directory with an optionally formatted name based on current date/time and UUID.

def make_unique_dir(
    base_path: str | Path = ".",
    prefix: str | None = "",
    suffix: str | None = str(uuid4()),
    include_date: bool = True,
    include_time: bool = True,
    use_12h_format: bool = True,
    separator: str = "_",
    time_separator: str = ".",
    date_separator: str = ".",
    create: bool = True,
) -> Path:

Parameters:

  • base_path: The base directory path where the unique directory will be created
  • prefix: A prefix to add to the directory name
  • suffix: A suffix to add to the directory name (defaults to UUID4)
  • include_date: Whether to include the current date in the directory name
  • include_time: Whether to include the current time in the directory name
  • use_12h_format: Whether to use 12-hour format for time (with AM/PM)
  • separator: The separator to use between different parts of the directory name
  • time_separator: The separator to use between time components
  • date_separator: The separator to use between date components
  • create: Whether to actually create the directory on the filesystem

Dependencies

This project uses the following dependencies:

  • fastexcel (>=0.16.0) - For efficient Excel file reading
  • polars (>=1.35.2) - For DataFrame operations
  • xlsxwriter (>=3.2.9) - For Excel file writing capabilities

Development dependencies:

  • pytest (>=9.0.1) - For testing

Contributing

Contributions are welcome! Here's how you can contribute:

  1. Fork the repository
  2. Create a feature branch (git checkout -b feature/amazing-feature)
  3. Make your changes
  4. Add tests if applicable
  5. Run the tests: pytest
  6. Commit your changes (git commit -m 'Add amazing feature')
  7. Push to the branch (git push origin feature/amazing-feature)
  8. Open a Pull Request

Running Tests

To run the tests:

pytest

Development Setup

# Install development dependencies
uv sync --dev

# Run tests
pytest

License

This project is licensed under the MIT License. See the LICENSE file for details.

Copyright (c) 2025 fnxL

Project details


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

rpatoolkit-0.1.0.tar.gz (8.9 kB view details)

Uploaded Source

Built Distribution

If you're not sure about the file name format, learn more about wheel file names.

rpatoolkit-0.1.0-py3-none-any.whl (12.4 kB view details)

Uploaded Python 3

File details

Details for the file rpatoolkit-0.1.0.tar.gz.

File metadata

  • Download URL: rpatoolkit-0.1.0.tar.gz
  • Upload date:
  • Size: 8.9 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: uv/0.8.2

File hashes

Hashes for rpatoolkit-0.1.0.tar.gz
Algorithm Hash digest
SHA256 7e0a8f9a505c9679556996d68dfffc5b6c479cbc3633ea05792834417eb983fc
MD5 daf499f9f366f2e065c5b0be129b9605
BLAKE2b-256 ff737ab1628e0d916b37fa4c10a76c8ad6c5fe73cf300f961fce871d1c8a9dbe

See more details on using hashes here.

File details

Details for the file rpatoolkit-0.1.0-py3-none-any.whl.

File metadata

  • Download URL: rpatoolkit-0.1.0-py3-none-any.whl
  • Upload date:
  • Size: 12.4 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: uv/0.8.2

File hashes

Hashes for rpatoolkit-0.1.0-py3-none-any.whl
Algorithm Hash digest
SHA256 8c65fb6262965a6d45d9ccdef9e1274eb62206327f0a7f79a82d7a9bb3545ce3
MD5 defe09b60d6210ebd22854be34cc6421
BLAKE2b-256 f42237097e506465cababe9010291fd95c10e76b6defed3af7855e7bee18cbac

See more details on using hashes here.

Supported by

AWS Cloud computing and Security Sponsor Datadog Monitoring Depot Continuous Integration Fastly CDN Google Download Analytics Pingdom Monitoring Sentry Error logging StatusPage Status page