A Python toolkit for Robotic Process Automation (RPA) and ETL pipelines providing solutions to common automation challenges.

These details have not been verified by PyPI

Project links

Project description

rpatoolkit

A Python toolkit for Robotic Process Automation (RPA) and ETL pipelines providing solutions to common automation challenges.

Overview
Features
Installation
Usage
Data Frame Utilities
File System Utilities
API Documentation
Dependencies
Contributing
License

Overview

rpatoolkit is a comprehensive Python library designed to simplify common RPA and ETL tasks. It provides utilities for reading Excel files, manipulating data frames, and handling file system operations with a focus on automation workflows.

Features

Excel Reading Utilities: Enhanced Excel reading capabilities with automatic column name cleaning, data type casting, and header row detection
Data Frame Utilities: Functions for renaming, reordering, and validating columns in Polars DataFrames/LazyFrames
File System Utilities: Tools for creating uniquely named directories with timestamp and UUID
Header Detection: Automatic detection of header rows in Excel files with offset data
Column Name Cleaning: Automatic cleaning and standardization of column names
Multiple Sheet Reading: Support for reading all sheets from an Excel workbook at once

Installation

To install rpatoolkit, you need Python 3.10 or higher. You can install it using pip:

pip install rpatoolkit

Or if you're using uv (recommended):

uv add rpatoolkit

Development Installation

For development purposes:

# Clone the repository
git clone https://github.com/fnxL/rpatoolkit.git
cd rpatoolkit

# Install in development mode
uv sync --dev

Usage

Data Frame Utilities

Reading Excel Files

The main function for reading Excel files with enhanced capabilities:

import polars as pl
from rpatoolkit.df import read_excel

# Basic usage
df = read_excel("data.xlsx")

# Reading a specific sheet
df = read_excel("data.xlsx", sheet_name="Sheet1")

# Casting specific columns to certain data types
df = read_excel("data.xlsx", cast={"date": pl.Date, "value": pl.Float64})

# Reading all sheets at once
all_sheets = read_excel("data.xlsx", read_all_sheets=True)

# Reading Excel with header row at an offset
df = read_excel("data.xlsx", header_row=3)

Finding Header Row

Automatically detect the header row in Excel files where headers are not at the top:

from rpatoolkit.df import find_header_row, read_excel

# Find the header row
header_row_index = find_header_row("data.xlsx")
print(f"Header row found at index: {header_row_index}")

# Read the Excel file using the detected header row
df = read_excel("data.xlsx", header_row=header_row_index)

# Find header row with expected keywords
header_row_index = find_header_row("data.xlsx", expected_keywords=["name", "age", "city"])

Column Utilities

Utilities for manipulating DataFrame columns:

from rpatoolkit.df import rename_columns, reorder_columns, get_missing_columns

# Rename columns
df = rename_columns(df, {"old_name": "new_name", "id": "identifier"})

# Reorder columns
df = reorder_columns(df, ["name", "age", "city"])

# Check for missing columns
required_columns = ["name", "email", "phone"]
missing = get_missing_columns(df, required_columns)
if missing:
    print(f"Missing columns: {missing}")

File System Utilities

Creating Unique Directories

Create uniquely named directories with timestamps and UUIDs:

from rpatoolkit.fs import make_unique_dir

# Create a unique directory with default settings
unique_dir = make_unique_dir(prefix="backup")

# Create a directory with custom settings
unique_dir = make_unique_dir(
    base_path="./exports",
    prefix="report",
    include_date=True,
    include_time=True,
    use_12h_format=False  # 24-hour format
)

# Create a directory without automatically creating it on filesystem
path_only = make_unique_dir(create=False, prefix="temp")

API Documentation

Data Frame Module (`rpatoolkit.df`)

`read_excel`

Reads an Excel file into a Polars LazyFrame with enhanced functionality.

def read_excel(
    source: FileSource,
    *,
    sheet_id: int | None = None,
    sheet_name: str | None = None,
    table_name: str | None = None,
    engine: ExcelSpreadsheetEngine = "calamine",
    engine_options: dict[str, Any] | None = None,
    read_options: dict[str, Any] | None = None,
    has_header: bool = True,
    columns: Sequence[int] | Sequence[str] | str | None = None,
    schema_overrides: SchemaDict | None = None,
    infer_schema_length: int | None = 100,
    include_file_paths: str | None = None,
    drop_empty_rows: bool = False,
    drop_empty_cols: bool = False,
    raise_if_empty: bool = True,
    header_row: int | None = None,
    cast: dict[str, pl.DataType] | None = None,
    read_all_sheets: bool = False,
    lower_column_names: bool = True,
    clean_column_names: bool = False,
) -> pl.LazyFrame | dict[str, pl.LazyFrame]:

Parameters:

source: Path to the Excel file or file-like object to read
sheet_id: Sheet number to read (cannot be used with sheet_name)
sheet_name: Sheet name to read (cannot be used with sheet_id)
table_name: Name of a specific table to read
engine: Library used to parse the spreadsheet file ('calamine', 'openpyxl', 'xlsx2csv')
engine_options: Additional options passed to the underlying engine
read_options: Options passed to the underlying engine method that reads the sheet data
has_header: Whether the sheet has a header row
columns: Columns to read from the sheet
schema_overrides: Support type specification or override of one or more columns
infer_schema_length: Number of rows to infer the schema from
drop_empty_rows: Remove empty rows from the result
drop_empty_cols: Remove empty columns from the result
raise_if_empty: Raise an exception if the resulting DataFrame is empty
header_row: Row number to use as header (0-indexed)
cast: Dictionary mapping column names to desired data types for casting
read_all_sheets: Read all sheets in the Excel workbook
lower_column_names: Convert column names to lowercase
clean_column_names: Clean column names by stripping punctuation

Returns:

LazyFrame or dict[str, LazyFrame] if reading multiple sheets

`find_header_row`

Finds the header row in an Excel file by identifying the first row with maximum consecutive non-null values.

def find_header_row(
    source: FileSource,
    sheet_id: int | None = None,
    sheet_name: str | None = None,
    max_rows: int = 200,
    expected_keywords: list[str] | None = None,
) -> int:

Parameters:

source: Path to the Excel file or file-like object to read
sheet_id: Sheet number to read (cannot be used with sheet_name)
sheet_name: Sheet name to read (cannot be used with sheet_id)
max_rows: Maximum number of rows to scan for header identification
expected_keywords: List of keywords to look for in the header row

Returns:

int: The zero-based index of the header row

`rename_columns`

Rename columns of a Polars DataFrame or LazyFrame.

def rename_columns(
    df: pl.DataFrame | pl.LazyFrame,
    columns_map: dict,
    strict: bool = True
) -> pl.DataFrame | pl.LazyFrame:

`reorder_columns`

Reorder columns of a Polars DataFrame or LazyFrame.

def reorder_columns(
    df: pl.DataFrame | pl.LazyFrame,
    columns_order: list[str]
) -> pl.DataFrame | pl.LazyFrame:

`get_missing_columns`

Check if a Polars DataFrame or LazyFrame contains all required columns.

def get_missing_columns(
    df: pl.DataFrame | pl.LazyFrame,
    required_columns: list[str]
) -> list[str]:

File System Module (`rpatoolkit.fs`)

`make_unique_dir`

Creates a unique directory with an optionally formatted name based on current date/time and UUID.

def make_unique_dir(
    base_path: str | Path = ".",
    prefix: str | None = "",
    suffix: str | None = str(uuid4()),
    include_date: bool = True,
    include_time: bool = True,
    use_12h_format: bool = True,
    separator: str = "_",
    time_separator: str = ".",
    date_separator: str = ".",
    create: bool = True,
) -> Path:

Parameters:

base_path: The base directory path where the unique directory will be created
prefix: A prefix to add to the directory name
suffix: A suffix to add to the directory name (defaults to UUID4)
include_date: Whether to include the current date in the directory name
include_time: Whether to include the current time in the directory name
use_12h_format: Whether to use 12-hour format for time (with AM/PM)
separator: The separator to use between different parts of the directory name
time_separator: The separator to use between time components
date_separator: The separator to use between date components
create: Whether to actually create the directory on the filesystem

Dependencies

This project uses the following dependencies:

fastexcel (>=0.16.0) - For efficient Excel file reading
polars (>=1.35.2) - For DataFrame operations
xlsxwriter (>=3.2.9) - For Excel file writing capabilities

Development dependencies:

pytest (>=9.0.1) - For testing

Contributing

Contributions are welcome! Here's how you can contribute:

Fork the repository
Create a feature branch (git checkout -b feature/amazing-feature)
Make your changes
Add tests if applicable
Run the tests: pytest
Commit your changes (git commit -m 'Add amazing feature')
Push to the branch (git push origin feature/amazing-feature)
Open a Pull Request

Running Tests

To run the tests:

pytest

Development Setup

# Install development dependencies
uv sync --dev

# Run tests
pytest

License

This project is licensed under the MIT License. See the LICENSE file for details.

Project details

These details have not been verified by PyPI

Project links

Release history Release notifications | RSS feed

0.2.6

Jan 16, 2026

0.2.5

Jan 16, 2026

0.2.4

Jan 13, 2026

0.2.3

Jan 13, 2026

0.2.2

Jan 13, 2026

0.2.1

Jan 12, 2026

0.2.0

Jan 9, 2026

0.1.5

Dec 3, 2025

0.1.4

Dec 3, 2025

0.1.3

Dec 3, 2025

This version

0.1.2

Nov 19, 2025

0.1.1a3 pre-release

Nov 19, 2025

0.1.1a2 pre-release

Nov 19, 2025

0.1.1a1 pre-release

Nov 18, 2025

0.1.0

Nov 15, 2025

Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

rpatoolkit-0.1.2.tar.gz (10.5 kB view details)

Uploaded Nov 19, 2025 Source

Built Distribution

If you're not sure about the file name format, learn more about wheel file names.

The dropdown lists show the available interpreters, ABIs, and platforms. Enable javascript to be able to filter the list of wheel files.

rpatoolkit-0.1.2-py3-none-any.whl (14.6 kB view details)

Uploaded Nov 19, 2025 Python 3

File details

Details for the file rpatoolkit-0.1.2.tar.gz.

File metadata

Download URL: rpatoolkit-0.1.2.tar.gz
Upload date: Nov 19, 2025
Size: 10.5 kB
Tags: Source
Uploaded using Trusted Publishing? No
Uploaded via: uv/0.8.2

File hashes

Hashes for rpatoolkit-0.1.2.tar.gz
Algorithm	Hash digest
SHA256	`1a442afc1acde27ce8c5444126beedd52836e06c49d7eede185d7fa12b20ee6f`
MD5	`60da11f9eb3da702635e058ab47bb483`
BLAKE2b-256	`86a718d43f6c1c22f66c70a4ced0528bebf567e22bc4a4b576bd603c8fc89342`

See more details on using hashes here.

File details

Details for the file rpatoolkit-0.1.2-py3-none-any.whl.

File metadata

Download URL: rpatoolkit-0.1.2-py3-none-any.whl
Upload date: Nov 19, 2025
Size: 14.6 kB
Tags: Python 3
Uploaded using Trusted Publishing? No
Uploaded via: uv/0.8.2

File hashes

Hashes for rpatoolkit-0.1.2-py3-none-any.whl
Algorithm	Hash digest
SHA256	`119bfab0ac353a0493ae8e7c86d8e07655a9a683b6542f6510d37ae198077513`
MD5	`8c46e54cff9f711390dc2f66b27d7cb7`
BLAKE2b-256	`f39dc86511053395a72ae457c3498c6a4b65094209c244cf5e0f40edb9b89298`

See more details on using hashes here.

rpatoolkit 0.1.2

Navigation

Verified details

Maintainers

Unverified details

Project links

Meta

Project description

rpatoolkit

Table of Contents

Overview

Features

Installation

Development Installation

Usage

Data Frame Utilities

Reading Excel Files

Finding Header Row

Column Utilities

File System Utilities

Creating Unique Directories

API Documentation

Data Frame Module (rpatoolkit.df)

read_excel

find_header_row

rename_columns

reorder_columns

get_missing_columns

File System Module (rpatoolkit.fs)

make_unique_dir

Dependencies

Contributing

Running Tests

Development Setup

License

Project details

Verified details

Maintainers

Unverified details

Project links

Meta

Release history Release notifications | RSS feed

Download files

Source Distribution

Built Distribution

File details

File metadata

File hashes

File details

File metadata

File hashes

Data Frame Module (`rpatoolkit.df`)

`read_excel`

`find_header_row`

`rename_columns`

`reorder_columns`

`get_missing_columns`

File System Module (`rpatoolkit.fs`)

`make_unique_dir`