additory

Elegant data operations for DataFrames

A Rust-powered Python library for intuitive data transformations, lookups, and synthetic data generation with Polars and Pandas.

Features

  • 🔗 Intuitive Lookups - Add columns from external sources with simple syntax
  • Powerful Transforms - Calculate, filter, sort, aggregate with mode-based operations
  • 🎲 Synthetic Data - Generate realistic test data or augment existing datasets
  • 📊 Lineage Tracking - Track data transformations and view operation history
  • 🔍 Data Scanning - Analyze data quality and inspect DataFrames
  • 🚀 Rust Performance - Built with Rust for blazing-fast operations
  • 🐼 Polars & Pandas - Works seamlessly with both DataFrame libraries
  • 📚 Expression Library - 179 built-in expressions for medical, finance, physics, and more

Installation

pip install additory

Requirements:

  • Python 3.8+
  • Polars (required)
  • Pandas (optional)

Quick Start

import additory as add
import polars as pl

# Add data from external sources
orders = pl.DataFrame({'id': [1, 2], 'customer_id': [101, 102]})
customers = pl.DataFrame({'customer_id': [101, 102], 'name': ['Alice', 'Bob']})
result = add.to(orders, bring_from=customers, bring=['name'], against='customer_id')

# Transform data
df = pl.DataFrame({'x': [1, 2, 3]})
result = add.transform('@calc', df, strategy={'x_squared': 'x ** 2'})

# Generate synthetic data
result = add.synthetic('@new', n=100, strategy={'age': 'normal(40, 10)'})

Core Functions

add.to() - Add Data from External Sources

result = add.to(bring_to, bring_from=reference_df, bring=['column'], against='key',
                lineage=False)

Perfect for lookups and joins. Enable lineage=True to track data sources.
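
To make the lookup idea concrete, here is a minimal plain-Python sketch of what a call such as add.to(orders, bring_from=customers, bring=['name'], against='customer_id') is described as doing. The lookup helper is hypothetical and is not part of additory. Filling unmatched keys with None (left-join behaviour) and taking the first match are assumptions made for this sketch, not documented behaviour.

```python
# Illustrative only: `lookup` is a hypothetical helper, not additory's API.

def lookup(bring_to, bring_from, bring, against):
    """Copy the `bring` columns from `bring_from` onto `bring_to`, matching on `against`."""
    # Index the reference rows by key; the first row seen for a key wins (assumption).
    index = {}
    for row in bring_from:
        index.setdefault(row[against], row)

    result = []
    for row in bring_to:
        match = index.get(row[against])
        # Unmatched keys get None for every brought column (assumption).
        extra = {col: (match[col] if match else None) for col in bring}
        result.append({**row, **extra})
    return result


orders = [
    {'id': 1, 'customer_id': 101},
    {'id': 2, 'customer_id': 102},
    {'id': 3, 'customer_id': 999},  # no matching customer
]
customers = [
    {'customer_id': 101, 'name': 'Alice'},
    {'customer_id': 102, 'name': 'Bob'},
]

looked_up = lookup(orders, customers, bring=['name'], against='customer_id')
for row in looked_up:
    print(row)
```

The row count of bring_to is preserved in this sketch, which is the usual expectation for a lookup as opposed to a general join.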

add.transform() - Transform Data

result = add.transform(mode, df, lineage=False, **parameters)

Available modes:

  • @calc - Calculate new columns with expressions
  • @filter - Filter rows and select columns
  • @sort - Sort data by columns
  • @aggregate - Group and aggregate data
  • @harmonize - Harmonize units (10 sub-modes)
  • @round - Round numbers (creates NEW columns)
  • @transpose - Transpose DataFrame
  • @extract - Extract patterns from text/dates
  • @onehotencode - One-hot encode categorical columns
  • @deduce - Fill missing values (7 methods)

add.synthetic() - Synthetic Data

result = add.synthetic(mode, df_or_n, lineage=False, **parameters)

Available modes:

  • @new - Create synthetic DataFrames from scratch
  • @augment - Add synthetic rows to existing data

add.scan() - Inspect and Analyze DataFrames

result = add.scan(mode, df)

Available modes:

  • @analyze / @analyse - Analyze data quality and distributions
  • @lineage - View lineage tracking reports (requires lineage=True in operations)

Strategy Parameter

The strategy parameter provides fine-grained control over operations in add.to(), add.transform(), and add.synthetic().

add.to() Strategy

Control aggregation, renaming, and positioning for brought columns:

Simple form (aggregation only):

strategy={'amount': 'sum', 'date': 'last'}

Complex form (full control):

strategy={
    'amount': {
        'mode': 'sum',
        'rename': 'total_spent',
        'position': 'after:customer_id'
    }
}

Aggregation modes: first, last, sum, count, average, min, max, concat, concat[sep], most_common, least_common, median, std, variance, unique_count
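
The two strategy forms can be read as the same thing at different levels of detail: the simple form names only the aggregation mode, and the complex form adds options such as rename. The sketch below shows one way that could work when several reference rows match a single key. The names AGGREGATORS, normalize, and aggregate_matches are hypothetical, only a subset of the listed modes is covered, the position option is ignored, and the sample dates are made up for illustration.

```python
from statistics import median

# Hypothetical subset of the aggregation modes listed above.
AGGREGATORS = {
    'first': lambda values: values[0],
    'last': lambda values: values[-1],
    'sum': sum,
    'count': len,
    'average': lambda values: sum(values) / len(values),
    'min': min,
    'max': max,
    'median': median,
    'unique_count': lambda values: len(set(values)),
}


def normalize(strategy):
    """Expand the simple form ('sum') into the complex form ({'mode': 'sum'})."""
    return {
        col: ({'mode': spec} if isinstance(spec, str) else dict(spec))
        for col, spec in strategy.items()
    }


def aggregate_matches(matches, strategy):
    """Collapse several matching rows into one value per brought column."""
    out = {}
    for col, spec in normalize(strategy).items():
        values = [row[col] for row in matches]
        # 'rename' is optional; fall back to the original column name.
        out[spec.get('rename', col)] = AGGREGATORS[spec['mode']](values)
    return out


# Two reference rows that match the same key (sample data).
matches = [
    {'amount': 100, 'date': '2026-01-05'},
    {'amount': 150, 'date': '2026-02-01'},
]

simple = aggregate_matches(matches, {'amount': 'sum', 'date': 'last'})
renamed = aggregate_matches(
    matches, {'amount': {'mode': 'sum', 'rename': 'total_spent'}}
)
print(simple)
print(renamed)
```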

add.transform() Strategy

Mode-specific configuration:

@calc - Expressions for new columns:

strategy={'total': 'price * quantity', 'discount': 'total * 0.1'}

@sort - Sort order:

strategy={'order': 'desc'}  # or 'asc'

@aggregate - Aggregation functions:

strategy={'amount': 'sum', 'count': 'count'}

@round - Custom naming and positioning:

strategy={
    'price': {'name': 'price_clean', 'position': 'after:price'}
}

@deduce - KNN parameters:

strategy={'k': 5, 'weights': 'distance'}
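
In the @calc strategy shown earlier in this section, the 'discount' expression refers to 'total', a column created by the previous entry. That suggests expressions are applied in order, each seeing the columns produced before it. The sketch below illustrates that reading in plain Python. The calc helper is hypothetical, in-order evaluation is an assumption, and Python's eval is only a stand-in for the library's expression engine (it should not be used on untrusted input).

```python
# Illustrative only: `calc` is a hypothetical helper, not additory's API.

def calc(rows, strategy):
    """Add one column per strategy entry, evaluating entries in insertion order."""
    out = []
    for row in rows:
        new_row = dict(row)
        for name, expression in strategy.items():
            # Builtins are disabled; only the row's columns are visible to the expression.
            new_row[name] = eval(expression, {'__builtins__': {}}, new_row)
        out.append(new_row)
    return out


rows = [
    {'price': 100, 'quantity': 2},
    {'price': 200, 'quantity': 3},
]
calculated = calc(rows, {'total': 'price * quantity', 'discount': 'total * 0.1'})
for row in calculated:
    print(row)
```

Reversing the two strategy entries would fail in this sketch, because 'total' would not exist yet when 'discount' is evaluated.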

add.synthetic() Strategy

Column generation specifications:

Simple form:

strategy={'id': 'increment', 'age': 'normal(40, 10)'}

Complex form:

strategy={
    'name': {'type': 'choice', 'values': ['Alice', 'Bob', 'Charlie']},
    'age': {'type': 'normal', 'mean': 35, 'std': 10}
}

Generation types: increment, pattern, choice, normal, uniform, lognormal, exponential, poisson, categorical
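
As a rough illustration of how simple-form specs such as 'increment' and 'normal(40, 10)' could map to generated columns, here is a standard-library sketch. The functions generate_column and synthetic_new are hypothetical, only three generation types are handled, and starting increment at 1 is an assumption. The seed default of 42 mirrors the changelog entry about reproducibility; the values produced here will not match additory's own output.

```python
import random
import re

# Matches specs of the form name(a, b), e.g. 'normal(40, 10)'.
SPEC_PATTERN = re.compile(r'(\w+)\(\s*([-\d.]+)\s*,\s*([-\d.]+)\s*\)')


def generate_column(spec, n, rng):
    """Generate n values for one simple-form spec."""
    if spec == 'increment':
        return list(range(1, n + 1))  # starting at 1 is an assumption
    match = SPEC_PATTERN.fullmatch(spec)
    if not match:
        raise ValueError(f'unsupported spec: {spec!r}')
    kind, a, b = match.group(1), float(match.group(2)), float(match.group(3))
    if kind == 'normal':
        return [rng.gauss(a, b) for _ in range(n)]  # a = mean, b = std
    if kind == 'uniform':
        return [rng.uniform(a, b) for _ in range(n)]  # a = low, b = high
    raise ValueError(f'unsupported generation type: {kind!r}')


def synthetic_new(n, strategy, seed=42):
    """Build a column-oriented table (dict of lists) from a strategy."""
    rng = random.Random(seed)
    return {col: generate_column(spec, n, rng) for col, spec in strategy.items()}


table = synthetic_new(5, {'id': 'increment', 'age': 'normal(40, 10)'})
print(table['id'])
print([round(age, 1) for age in table['age']])
```

Because the generator is seeded, calling synthetic_new again with the same arguments returns the same values, which is the property a fixed default seed is meant to provide.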

Lineage Tracking

Track data transformations across operations to understand data provenance and transformation history.

Enable Lineage Tracking

import additory as add
import pandas as pd

# Enable lineage in any operation
# (customers and orders are existing DataFrames that share a customer_id column)
result = add.to(customers, bring_from=orders, bring=['amount'], 
                against='customer_id', lineage=True)

# Lineage is preserved across operations
result = add.transform('@calc', result, expression='amount * 1.1', 
                       name='total', lineage=True)

# View lineage report
lineage_report = add.scan('@lineage', result)
print(lineage_report)

Lineage Features

  • Operation History - Track all transformations applied to data
  • Column Sources - See where each column came from
  • Row Mappings - Track how rows were filtered or aggregated
  • Session-Only - Lineage is stored in-memory (not persisted to disk)
  • Mutual Exclusion - Cannot use lineage=True with as_type parameter

Lineage Example

# Multi-step workflow with lineage
customers = pd.DataFrame({'id': [1, 2, 3], 'name': ['Alice', 'Bob', 'Carol']})
orders = pd.DataFrame({'id': [1, 1, 2, 3, 3], 'amount': [100, 150, 200, 175, 125]})

# Step 1: Bring data
df = add.to(customers, bring_from=orders, bring=['amount'], against='id',
            strategy={'amount': 'sum'}, lineage=True)

# Step 2: Calculate
df = add.transform('@calc', df, expression='amount * 1.1', name='total', lineage=True)

# Step 3: Filter
df = add.transform('@filter', df, where='total > 200', lineage=True)

# View complete lineage
report = add.scan('@lineage', df)
# Shows: 3 operations, column sources, row transformations
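
To see what the three steps do to the sample data, the workflow can be traced by hand in plain Python. This is only a sketch of the expected values under conventional semantics (sum per key, multiply, then filter); it does not use additory and does not reproduce the lineage report format.

```python
customers = [
    {'id': 1, 'name': 'Alice'},
    {'id': 2, 'name': 'Bob'},
    {'id': 3, 'name': 'Carol'},
]
orders = [
    {'id': 1, 'amount': 100},
    {'id': 1, 'amount': 150},
    {'id': 2, 'amount': 200},
    {'id': 3, 'amount': 175},
    {'id': 3, 'amount': 125},
]

# Step 1: bring 'amount' onto customers, summing the orders that share an id.
amount_by_id = {}
for order in orders:
    amount_by_id[order['id']] = amount_by_id.get(order['id'], 0) + order['amount']
step1 = [{**c, 'amount': amount_by_id.get(c['id'])} for c in customers]

# Step 2: calculate total = amount * 1.1.
step2 = [{**row, 'total': row['amount'] * 1.1} for row in step1]

# Step 3: keep rows where total > 200.
step3 = [row for row in step2 if row['total'] > 200]

# Row counts after each step.
row_counts = [len(step1), len(step2), len(step3)]
print([row['amount'] for row in step1])
print(row_counts)
```

With this data the summed amounts are 250, 200, and 300, so every total is above 200 and the filter in step 3 keeps all three customers.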

Important Notes

  • Lineage is session-only by design (follows "no file I/O" philosophy)
  • Lineage metadata is lost when DataFrames are saved with native methods
  • Cannot use lineage=True with as_type parameter (metadata would be lost during conversion)
  • Lineage overhead is minimal (<3ms per operation)

Documentation

📚 Complete documentation is available in the /docs directory. See docs/README.md for the documentation index.

Examples

Lookup Example

import additory as add
import polars as pl

# Orders with customer IDs
orders = pl.DataFrame({
    'order_id': [1, 2, 3],
    'customer_id': [101, 102, 101],
    'amount': [100, 200, 150]
})

# Customer reference data
customers = pl.DataFrame({
    'customer_id': [101, 102],
    'name': ['Alice', 'Bob'],
    'city': ['NYC', 'LA']
})

# Add customer info to orders
result = add.to(orders, bring_from=customers, bring=['name', 'city'], against='customer_id')

Transform Example

# Calculate with expressions
df = pl.DataFrame({'price': [100, 200, 300], 'quantity': [2, 3, 1]})
result = add.transform('@calc', df, strategy={'total': 'price * quantity'})

# Filter data
result = add.transform('@filter', df, where='price > 150')

# Sort data
result = add.transform('@sort', df, by='price', strategy={'order': 'desc'})

# Aggregate data
df = pl.DataFrame({'category': ['A', 'B', 'A'], 'value': [10, 20, 30]})
result = add.transform('@aggregate', df, by='category', strategy={'value': 'sum'})

# Round numbers (creates NEW columns)
df = pl.DataFrame({'price': [10.567, 20.123, 30.999]})
result = add.transform('@round:2', df, columns='price')  # Creates price_round

# Fill missing values
df = pl.DataFrame({'age': [25, None, 35, None, 45]})
result = add.transform('@deduce', df, columns='age', method='mean')
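
For readers who want to sanity-check results, the following standard-library sketch computes what the aggregate, round, and mean-fill calls above would be expected to produce on the same sample data, assuming conventional semantics. It does not call additory. The column name price_round is taken from the comment in the example; everything else is plain Python.

```python
from statistics import mean

# @aggregate: sum 'value' per 'category'.
rows = [
    {'category': 'A', 'value': 10},
    {'category': 'B', 'value': 20},
    {'category': 'A', 'value': 30},
]
sums = {}
for row in rows:
    sums[row['category']] = sums.get(row['category'], 0) + row['value']

# @round:2: keep 'price' and add a new rounded column next to it.
prices = [10.567, 20.123, 30.999]
rounded = [{'price': p, 'price_round': round(p, 2)} for p in prices]

# @deduce with method='mean': fill missing ages with the mean of the known ones.
ages = [25, None, 35, None, 45]
fill_value = mean(a for a in ages if a is not None)
ages_filled = [fill_value if a is None else a for a in ages]

print(sums)
print([r['price_round'] for r in rounded])
print(ages_filled)
```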

Synthetic Data Example

# Create synthetic data
result = add.synthetic('@new', n=1000, strategy={
    'age': 'normal(40, 10)',
    'salary': 'normal(75000, 15000)',
    'score': 'uniform(0, 100)'
})

# Augment existing data
df = pl.DataFrame({'x': [1, 2, 3], 'y': [4, 5, 6]})
result = add.synthetic('@augment', df, n=100)

# Analyze data quality (the @analyze mode belongs to add.scan())
result = add.scan('@analyze', df)

Version

Current version: 0.1.3 (Stable Alpha)

What's New in v0.1.3

  • Lineage Tracking - Track data transformations with lineage=True parameter
  • add.scan() Function - Unified interface for @analyze and @lineage modes
  • ~95% Rust Implementation - Optimized code distribution for performance
  • Mutual Exclusion Validation - Clear error messages for lineage + as_type
  • Helper Functions - Internal utilities for lineage tracking
  • Bug Fixes - Fixed add.to() parameter mapping bug
  • Code Cleanup - Removed orphan files and dead code
  • 341/341 Tests Passing - 100% pass rate

Development

Building from Source

# Clone the repository
git clone https://github.com/YOUR_USERNAME/additory.git
cd additory

# Install Rust (if not already installed)
curl --proto '=https' --tlsv1.2 -sSf https://sh.rustup.rs | sh

# Build the package
cd rust-core
export PYO3_USE_ABI3_FORWARD_COMPATIBILITY=1
maturin build --release

# Install locally
pip install target/wheels/*.whl

Running Tests

# Run comprehensive test suite
python test_all_modes_comprehensive.py

# Run specific tests
pytest tests/

Contributing

Contributions are welcome! Please feel free to submit a Pull Request.

License

MIT License - see LICENSE file for details

Changelog

v0.1.3 (March 9, 2026)

  • Lineage Tracking - Track data transformations across operations
  • add.scan() Function - Unified scanning interface (@analyze, @lineage)
  • ~95% Rust Implementation - Optimized Python/Rust code distribution
  • Bug Fixes - Fixed add.to() parameter mapping, cleaned up code
  • 341/341 Tests Passing - All tests pass

v0.1.3a9 (March 4, 2026)

  • Updated API signatures for natural language (bring_to, bring_from, bring)
  • Lists everywhere instead of tuples
  • @round creates NEW columns (philosophy compliant)
  • @deduce mode for missing value imputation
  • @extract merged with datetime parsing
  • Removed add.set() and add.deduce() functions
  • Default seed=42 for reproducibility
  • 100% philosophy compliance

v0.1.3a3 (February 9, 2026)

  • Made pandas optional
  • Added cross-platform build scripts
  • Fixed pandas import issues
  • 100% test pass rate

v0.1.3a2 (February 9, 2026)

  • Added banker's rounding (@bankers_round mode)
  • Expanded expression library to 179 expressions
  • Fixed mode detection issues
  • Fixed power operator (**) support

v0.1.3a1 (February 2026)

  • Initial alpha release
  • Rust core with PyO3 bindings
  • Three-function API (to, transform, synthetic)

Support

For issues, questions, or contributions, please visit:

  • GitHub Issues: [Coming Soon]
  • Documentation: [Coming Soon]

Credits

Made with ❤️ for the data science community
