Elegant data operations for DataFrames - add.to(), add.transform(), add.synthetic()

additory

Elegant data operations for DataFrames with Rust-powered performance

Python 3.9+ | License: MIT


Overview

additory provides three simple, powerful functions for DataFrame operations:

  • add.to() - Add data FROM external sources (lookup, join, merge)
  • add.transform() - Transform data WITHIN DataFrames (filter, calculate, aggregate)
  • add.synthetic() - Create or augment with synthetic data

Built with Rust for performance, additory works with both pandas and polars DataFrames.


Installation

# Basic installation (includes polars)
pip install additory

# With pandas support (recommended for pandas users)
pip install additory[pandas]

Requirements:

  • Python 3.9 or higher
  • polars 0.19.0+ (included automatically)
  • pandas 1.5.0+ (optional, install with pip install additory[pandas])

Note: additory uses polars internally for high-performance operations. pandas DataFrames are accepted as well and are converted automatically.
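
To check an environment against the requirements above, a small standard-library script can report what is installed. This is only a sketch: installed_version is a hypothetical helper name, and it reports versions without comparing them against the stated minimums.

```python
import sys
from importlib import metadata


def installed_version(package):
    """Return the installed version string of a package, or None if absent."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


# Interpreter requirement stated above: Python 3.9 or higher.
python_ok = sys.version_info >= (3, 9)

# Collect what is installed; None means the package was not found.
report = {name: installed_version(name) for name in ("additory", "polars", "pandas")}

print("Python OK:", python_ok)
for name, version in report.items():
    print(f"{name}: {version or 'not installed'}")
```

If pandas shows as not installed, pip install additory[pandas] from the section above adds it.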


Quick Start

import pandas as pd
import additory as add

# Create sample data
customers = pd.DataFrame({
    'id': [1, 2, 3],
    'name': ['Alice', 'Bob', 'Charlie']
})

orders = pd.DataFrame({
    'id': [1, 2, 3],
    'total': [100, 200, 150]
})

# Add data from another DataFrame
result = add.to(customers, fetch_from=orders, fetch=['total'], by='id')
# Result: customers with 'total' column added

# Transform data
result = add.transform('@calc', customers, expression='id * 10', as_='customer_code')
# Result: customers with calculated 'customer_code' column

# Generate synthetic data
synthetic = add.synthetic('@new', n=1000, fetch={
    'age': 'normal(35, 10)',
    'salary': 'lognormal(11, 0.5)'
})
# Result: 1000 rows of synthetic data

Works with Polars too! Simply replace import pandas as pd with import polars as pl and use pl.DataFrame() instead of pd.DataFrame().


Features

add.to() - Data Integration

Add columns from external sources with intelligent joining:

# Single column lookup
result = add.to(target, fetch_from=reference, fetch=['age'], by='id')

# Multiple columns
result = add.to(target, fetch_from=reference, fetch=['age', 'city'], by='id')

# Multiple join keys
result = add.to(target, fetch_from=reference, fetch=['amount'], by=('customer_id', 'date'))

# With aggregation
result = add.to(target, fetch_from=reference, fetch=['amount'], by='id',
                strategy={'mode': 'sum'})

Supported modes:

  • Lookup (default) - Add columns by joining on keys
  • Aggregation - Sum, mean, first, last, concat, etc.
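
To make the lookup mode concrete, here is a plain-Python sketch of what a keyed lookup with multiple join keys does conceptually. It does not call additory and is not its implementation: the lookup helper is hypothetical, rows are plain dicts, and keeping the first reference row on duplicate keys is an assumption.

```python
def lookup(target_rows, reference_rows, fetch, by):
    """Copy the `fetch` columns from the first reference row matching on `by`."""
    keys = (by,) if isinstance(by, str) else tuple(by)
    index = {}
    for row in reference_rows:
        # setdefault keeps the first reference row seen for each key (assumption)
        index.setdefault(tuple(row[k] for k in keys), row)
    result = []
    for row in target_rows:
        match = index.get(tuple(row[k] for k in keys))
        # Target rows without a match keep their columns and get None
        extra = {col: (match[col] if match else None) for col in fetch}
        result.append({**row, **extra})
    return result


target = [
    {"customer_id": 1, "date": "2024-01-01"},
    {"customer_id": 2, "date": "2024-01-02"},
]
reference = [
    {"customer_id": 1, "date": "2024-01-01", "amount": 100},
    {"customer_id": 1, "date": "2024-01-05", "amount": 999},
]

rows = lookup(target, reference, fetch=["amount"], by=("customer_id", "date"))
print(rows)
```

Every target row is preserved. The second row has no match on the composite key, so its amount is None.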

add.transform() - Data Transformation

Transform data with 10+ modes:

# Filter rows
result = add.transform('@filter', df, where='age > 25')

# Calculate new columns
result = add.transform('@calc', df, expression='price * quantity', as_='total')

# Sort data
result = add.transform('@sort', df, by='date', as_='asc')

# Aggregate data
result = add.transform('@aggregate', df, by='category', 
                       fetch=['sales'], strategy={'mode': 'sum'})

# One-hot encoding
result = add.transform('@onehot', df, fetch=['category'])

# KNN imputation
result = add.transform('@knn', df, fetch=['age'], strategy={'k': 5})
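
For readers unfamiliar with KNN imputation, the idea behind the last call can be sketched in plain Python: a missing value is replaced by the mean of that column over the k most similar complete rows. The knn_impute helper, the Euclidean distance, and the choice of feature columns are assumptions for illustration. The source does not say how additory measures similarity.

```python
import math


def knn_impute(rows, target, features, k=5):
    """Fill missing `target` values with the mean of the k nearest complete rows."""
    complete = [r for r in rows if r[target] is not None]
    filled = []
    for row in rows:
        if row[target] is not None:
            filled.append(dict(row))
            continue
        # Rank complete rows by Euclidean distance over the feature columns
        nearest = sorted(
            complete,
            key=lambda other: math.dist(
                [row[f] for f in features], [other[f] for f in features]
            ),
        )[:k]
        estimate = sum(n[target] for n in nearest) / len(nearest)
        filled.append({**row, target: estimate})
    return filled


people = [
    {"income": 30, "age": 25},
    {"income": 32, "age": 27},
    {"income": 80, "age": 55},
    {"income": 31, "age": None},
]
imputed = knn_impute(people, target="age", features=["income"], k=2)
print(imputed[-1])  # {'income': 31, 'age': 26.0}
```

The two nearest rows by income have ages 25 and 27, so the missing age becomes 26.0. The sketch assumes at least one complete row exists.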

Supported modes:

  • @filter - Filter rows and select columns
  • @calc - Calculate new columns from expressions
  • @sort - Sort by column(s)
  • @aggregate - Group and aggregate
  • @transpose - Transpose DataFrame
  • @split - Split text columns
  • @extract - Extract datetime components
  • @onehot - One-hot encoding
  • @label - Label encoding
  • @harmonize - Unit conversions
  • @knn - K-Nearest Neighbors imputation
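
The two encoding modes in the list can be illustrated without any DataFrame library. The sketch below shows the general techniques only. The output column naming (category_a, category_b) and the alphabetical code order are assumptions, not additory's documented behavior.

```python
def one_hot(rows, column):
    """Replace `column` with one 0/1 indicator column per distinct value."""
    categories = sorted({row[column] for row in rows})
    encoded = []
    for row in rows:
        new_row = {k: v for k, v in row.items() if k != column}
        for cat in categories:
            new_row[f"{column}_{cat}"] = int(row[column] == cat)
        encoded.append(new_row)
    return encoded


def label_encode(rows, column):
    """Replace each distinct value in `column` with an integer code."""
    codes = {cat: i for i, cat in enumerate(sorted({row[column] for row in rows}))}
    return [{**row, column: codes[row[column]]} for row in rows]


data = [{"category": "b"}, {"category": "a"}, {"category": "b"}]
hot = one_hot(data, "category")
labels = label_encode(data, "category")
print(hot)     # [{'category_a': 0, 'category_b': 1}, {'category_a': 1, 'category_b': 0}, ...]
print(labels)  # [{'category': 1}, {'category': 0}, {'category': 1}]
```

One-hot encoding widens the table by one column per category, while label encoding keeps a single column of integer codes.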

add.synthetic() - Synthetic Data Generation

Create or augment data with statistical distributions:

# Create new synthetic data
result = add.synthetic('@new', n=1000, fetch={
    'age': 'normal(50, 10)',           # Normal distribution
    'salary': 'lognormal(11, 0.5)',    # Lognormal distribution
    'score': 'uniform(0, 100)',        # Uniform distribution
    'status': 'categorical'            # Categorical data
})

# Augment existing data
result = add.synthetic(df, n=500)  # Add 500 synthetic rows

# Analyze data quality
analysis = add.synthetic('@analyze', df)  # Get statistics

Supported distributions:

  • Normal, Lognormal, Uniform, Exponential, Poisson, Binomial, Beta
  • Categorical (simple and weighted)
  • Sequences, Date/Time ranges
  • Patterns (email, phone, UUID, regex)
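
As a rough illustration of what several of these distributions produce, the standard library's random module can draw from the same families. This is a sketch of the distributions themselves, not of additory's generator. The seed, the status labels, and the 3:1 weights are assumptions.

```python
import random

rng = random.Random(42)  # seeded so the sketch is reproducible


def make_row():
    """Draw one synthetic record from a few of the listed distribution families."""
    return {
        "age": rng.normalvariate(50, 10),       # normal(50, 10)
        "salary": rng.lognormvariate(11, 0.5),  # lognormal(11, 0.5)
        "score": rng.uniform(0, 100),           # uniform(0, 100)
        # Weighted categorical: 'active' is three times as likely as 'inactive'
        "status": rng.choices(["active", "inactive"], weights=[3, 1])[0],
    }


synthetic_rows = [make_row() for _ in range(1000)]
mean_age = sum(r["age"] for r in synthetic_rows) / len(synthetic_rows)
print(len(synthetic_rows), round(mean_age, 1))
```

With 1000 draws the sample mean of age lands close to 50, and every salary is positive because a lognormal variable cannot be zero or negative.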

Performance

additory is built with Rust for high performance:

  • 3-5x faster than pure Python for transformations
  • 5-10x faster for data joining operations
  • 10-20x faster for synthetic data generation

Efficient memory usage with Arrow IPC serialization and vectorized operations.
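
Speedup figures like the ones above depend on data size and hardware, so it can be worth measuring on your own workload. The harness below is a minimal sketch with hypothetical helper names (best_time, speedup). The two stand-in workloads are placeholders and do not call additory.

```python
import time


def best_time(func, repeat=5):
    """Return the fastest wall-clock time over `repeat` runs of func()."""
    timings = []
    for _ in range(repeat):
        start = time.perf_counter()
        func()
        timings.append(time.perf_counter() - start)
    return min(timings)


def speedup(baseline, candidate, repeat=5):
    """Ratio of baseline time to candidate time (values above 1 mean faster)."""
    return best_time(baseline, repeat) / best_time(candidate, repeat)


# Stand-in workloads; replace them with the two implementations to compare.
values = list(range(100_000))


def loop_double_sum():
    return sum([v * 2 for v in values])


def builtin_double_sum():
    return sum(values) * 2


print(f"speedup: {speedup(loop_double_sum, builtin_double_sum):.1f}x")
```

Taking the minimum over several runs reduces noise from other processes on the machine.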

DataFrame Support: Works with both pandas and polars DataFrames. Polars is required (installed automatically); pandas DataFrames are converted automatically so the same high-performance operations apply.


Documentation

API Reference

add.to()

add.to(fetch_to, fetch_from, fetch, against, position=None, *, 
       strategy=None, join_type='lookup', as_type=None)

Parameters:

  • fetch_to: Target DataFrame
  • fetch_from: Reference DataFrame
  • fetch: Column(s) to add (str or list)
  • against: Join key(s) (str or tuple)
  • position: Column position (optional)
  • strategy: Aggregation strategy (optional)
  • join_type: Join type ('lookup', 'left', 'inner', 'outer')
  • as_type: Output format ('polars', 'pandas', or None)
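
The 'left', 'inner', and 'outer' values of join_type follow standard relational join terminology. The sketch below shows which key values survive under each, using plain lists. The joined_keys helper is hypothetical, and 'lookup' is left out because the source does not spell out how it differs from the others.

```python
def joined_keys(target_keys, reference_keys, join_type):
    """Return the key values that survive a join, in first-seen order."""
    target_set, reference_set = set(target_keys), set(reference_keys)
    if join_type == "left":
        keep = target_set                   # every target key, matched or not
    elif join_type == "inner":
        keep = target_set & reference_set   # only keys present on both sides
    elif join_type == "outer":
        keep = target_set | reference_set   # keys from either side
    else:
        raise ValueError(f"unsupported join_type: {join_type!r}")
    ordered = list(dict.fromkeys(list(target_keys) + list(reference_keys)))
    return [k for k in ordered if k in keep]


customer_ids = [1, 2, 3, 4]
order_ids = [1, 2, 3, 5]
for kind in ("left", "inner", "outer"):
    print(kind, joined_keys(customer_ids, order_ids, kind))
# left  [1, 2, 3, 4]
# inner [1, 2, 3]
# outer [1, 2, 3, 4, 5]
```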

add.transform()

add.transform(mode, df, expression=None, *, where=None, by=None, 
              fetch=None, strategy=None, as_=None, fetch_at='end', 
              logging=False)

Parameters:

  • mode: Transform mode (e.g., '@calc', '@filter', '@sort')
  • df: Input DataFrame
  • expression: Expression(s) for @calc mode
  • where: Filter condition
  • by: Grouping/sorting column(s)
  • fetch: Column(s) to transform
  • strategy: Advanced options
  • as_: New column name(s) or sort order
  • fetch_at: Position for new columns
  • logging: Enable detailed logging
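
To show how by, fetch, and a sort order fit together, here is a plain-Python sketch of a grouped sum followed by a descending sort. The helpers aggregate_sum and sort_rows are hypothetical stand-ins for the general techniques, not additory functions.

```python
from collections import defaultdict


def aggregate_sum(rows, by, fetch):
    """Group rows on `by` and sum the `fetch` column within each group."""
    totals = defaultdict(int)
    for row in rows:
        totals[row[by]] += row[fetch]
    return [{by: key, fetch: total} for key, total in totals.items()]


def sort_rows(rows, by, order="asc"):
    """Sort rows on `by`; order is 'asc' or 'desc'."""
    return sorted(rows, key=lambda row: row[by], reverse=(order == "desc"))


sales = [
    {"category": "toys", "sales": 10},
    {"category": "books", "sales": 5},
    {"category": "toys", "sales": 7},
]
summary = sort_rows(
    aggregate_sum(sales, by="category", fetch="sales"), by="sales", order="desc"
)
print(summary)  # [{'category': 'toys', 'sales': 17}, {'category': 'books', 'sales': 5}]
```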

add.synthetic()

add.synthetic(mode_or_df=None, df=None, **kwargs)

Parameters:

  • mode_or_df: Mode string ('@new', '@analyze') or a DataFrame (to augment it)
  • df: DataFrame (for @analyze mode)

Keyword arguments (passed through **kwargs):

  • n: Number of rows to generate
  • fetch: Column specifications (for @new mode)
  • strategy: Advanced options
  • as_type: Output format ('pandas' by default, or 'polars')
  • logging: Enable detailed logging
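
Column specifications in fetch are written as short strings such as 'normal(35, 10)'. The sketch below shows one way such a string could be split into a distribution name and numeric parameters. It is an assumption for illustration: parse_spec and SPEC_PATTERN are hypothetical names, and additory's own parser may accept a richer syntax.

```python
import re

# A name, optionally followed by a parenthesised, comma-separated argument list
SPEC_PATTERN = re.compile(r"^\s*(\w+)\s*(?:\((.*)\))?\s*$")


def parse_spec(spec):
    """Split a spec like 'normal(35, 10)' into ('normal', [35.0, 10.0])."""
    match = SPEC_PATTERN.match(spec)
    if match is None:
        raise ValueError(f"unrecognised spec: {spec!r}")
    name, args = match.group(1), match.group(2)
    params = [float(a) for a in args.split(",")] if args and args.strip() else []
    return name, params


print(parse_spec("normal(35, 10)"))      # ('normal', [35.0, 10.0])
print(parse_spec("lognormal(11, 0.5)"))  # ('lognormal', [11.0, 0.5])
print(parse_spec("categorical"))         # ('categorical', [])
```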

Examples

Data Integration Example

import pandas as pd
import additory as add

# Customer data
customers = pd.DataFrame({
    'customer_id': [1, 2, 3, 4],
    'name': ['Alice', 'Bob', 'Charlie', 'David']
})

# Order data
orders = pd.DataFrame({
    'customer_id': [1, 1, 2, 3, 3, 3],
    'amount': [100, 150, 200, 50, 75, 125]
})

# Add total order amount per customer
result = add.to(customers, fetch_from=orders, 
                fetch=['amount'], by='customer_id',
                strategy={'mode': 'sum'})

print(result)
# customer_id | name    | amount
# 1           | Alice   | 250
# 2           | Bob     | 200
# 3           | Charlie | 250
# 4           | David   | NaN
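
The table above can be checked by hand. The plain-Python sketch below reproduces the same two steps, summing per customer and then attaching the total, without using additory. Missing totals appear as None here rather than NaN.

```python
from collections import defaultdict

customers = [
    {"customer_id": 1, "name": "Alice"},
    {"customer_id": 2, "name": "Bob"},
    {"customer_id": 3, "name": "Charlie"},
    {"customer_id": 4, "name": "David"},
]
orders = [
    {"customer_id": 1, "amount": 100},
    {"customer_id": 1, "amount": 150},
    {"customer_id": 2, "amount": 200},
    {"customer_id": 3, "amount": 50},
    {"customer_id": 3, "amount": 75},
    {"customer_id": 3, "amount": 125},
]

# Step 1: collapse orders to one total per customer
totals = defaultdict(int)
for order in orders:
    totals[order["customer_id"]] += order["amount"]

# Step 2: attach the total; customers with no orders get None
result = [{**c, "amount": totals.get(c["customer_id"])} for c in customers]
for row in result:
    print(row)
```

Aggregating before joining is what keeps the result at one row per customer. A plain join on customer_id would instead repeat Alice twice and Charlie three times.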

Data Transformation Example

import pandas as pd
import additory as add

# Sales data
sales = pd.DataFrame({
    'date': ['2024-01-01', '2024-01-02', '2024-01-03'],
    'product': ['A', 'B', 'A'],
    'quantity': [10, 15, 20],
    'price': [100, 200, 100]
})

# Calculate total sales
result = add.transform('@calc', sales, 
                       expression='quantity * price', 
                       as_='total')

# Filter high-value sales
result = add.transform('@filter', result, where='total > 1500')

print(result)
# date       | product | quantity | price | total
# 2024-01-02 | B       | 15       | 200   | 3000
# 2024-01-03 | A       | 20       | 100   | 2000
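
The same calculation and filter can be verified in plain Python. This is a sketch of the arithmetic only, with rows as dicts; it does not use additory's expression parser.

```python
sales = [
    {"date": "2024-01-01", "product": "A", "quantity": 10, "price": 100},
    {"date": "2024-01-02", "product": "B", "quantity": 15, "price": 200},
    {"date": "2024-01-03", "product": "A", "quantity": 20, "price": 100},
]

# Equivalent of expression='quantity * price', as_='total'
with_total = [{**row, "total": row["quantity"] * row["price"]} for row in sales]

# Equivalent of where='total > 1500'
high_value = [row for row in with_total if row["total"] > 1500]

for row in high_value:
    print(row["date"], row["product"], row["total"])
# 2024-01-02 B 3000
# 2024-01-03 A 2000
```

The first row has a total of 1000, so it is dropped by the filter.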

Synthetic Data Example

import additory as add

# Generate synthetic customer data
customers = add.synthetic('@new', n=10000, fetch={
    'age': 'normal(35, 12)',
    'income': 'lognormal(10.5, 0.5)',
    'credit_score': 'uniform(300, 850)',
    'segment': 'categorical'
})

# Analyze the generated data
analysis = add.synthetic('@analyze', customers)
print(analysis)
# Shows statistics: mean, std, min, max, null count, etc.

Note: Synthetic data is returned as a pandas DataFrame by default. Use as_type='polars' if you prefer polars.
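
The statistics named in the comment above (mean, std, min, max, null count) can be computed for a single column with the standard library. The summarize helper is hypothetical and treats None as a null. The exact fields and layout of additory's analysis output are not documented here.

```python
import statistics


def summarize(values):
    """Basic per-column statistics, ignoring None when computing numbers."""
    present = [v for v in values if v is not None]
    return {
        "mean": statistics.mean(present),
        "std": statistics.stdev(present),  # sample standard deviation
        "min": min(present),
        "max": max(present),
        "null_count": len(values) - len(present),
    }


ages = [25, 35, None, 45]
summary = summarize(ages)
print(summary)
```

For this column the mean is 35, the sample standard deviation is 10, and one value is null.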


Development Status

Current Version: 0.1.3a5 (Beta; the a5 suffix marks a pre-release version)

Status: Production-ready for core features

Test Coverage:

  • 106 Rust tests passing (100%)
  • Comprehensive integration tests
  • All three functions fully tested

Roadmap:

  • ✅ Core functionality (add.to, add.transform, add.synthetic)
  • ✅ Rust-powered performance
  • ✅ Polars and Pandas support
  • ✅ Comprehensive test coverage
  • 🔄 Additional transform modes
  • 🔄 Enhanced expression parsing
  • 🔄 Extended documentation

Contributing

Contributions are welcome! Please feel free to submit issues or pull requests.

Repository: https://github.com/sekarkrishna/additory


License

MIT License - see LICENSE file for details


Author

Krishnamoorthy Sankaran
Email: krishnamoorthy.sankaran@sekrad.org
GitHub: https://github.com/sekarkrishna/additory


Acknowledgments

Built with:

  • Rust - Performance and safety
  • Polars - Fast DataFrame operations
  • PyO3 - Python-Rust bindings
  • Maturin - Build system

Made with ❤️ for the data science community
