Additory v0.1.3a9

Elegant data operations for DataFrames


Overview

Additory is a data transformation library that provides a unified API for common data operations with support for both Polars and Pandas DataFrames.

Note: This is an alpha release (v0.1.3a9) with new scanning and lineage tracking capabilities.

Key Features

  • 🔄 Flexible - Works seamlessly with both Polars and Pandas
  • 🎯 Type-safe - Strong typing with clear, actionable error messages
  • 🧪 Tested - Comprehensive test coverage
  • 📚 Documented - Complete API documentation and usage examples
  • 🚀 Rust-powered - Rust acceleration for performance
  • 📝 Natural Language - English-like parameter names (bring_to, bring_from, bring)
  • 📋 Lists Everywhere - Use lists for multiple values, not tuples
  • 🔍 Data Scanning - Statistical profiling and lineage tracking with add.scan()
  • 📊 Lineage Tracking - Optional operation tracking for debugging and auditing

Installation

pip install additory==0.1.3a9

Requirements

  • Python 3.8+
  • Polars >= 0.19.0
  • NumPy >= 1.20.0

Optional Dependencies

# For development
pip install additory[dev]

Quick Start

import additory as add
import polars as pl

# Add columns from external sources
orders = pl.DataFrame({'order_id': [1, 2], 'customer_id': [101, 102],
                       'price': [10.0, 20.0], 'quantity': [2, 3]})
customers = pl.DataFrame({'customer_id': [101, 102], 'name': ['Alice', 'Bob']})
result = add.to(orders, bring_from=customers, bring=['name'], against='customer_id')

# Transform data
df = pl.DataFrame({'price': [10.567, 20.123, 30.999]})
result = add.transform('@round:2', df, columns='price')  # Creates price_round

# Fill missing values
df = pl.DataFrame({'age': [25, None, 35, None, 45]})
result = add.transform('@deduce', df, columns='age', strategy={'method': 'mean'})

# Generate synthetic data
result = add.synthetic('@new', n=100, strategy={'age': 'normal(40, 10)'}, seed=42)

# Analyze data with statistical profiling
stats = add.scan('@analyze', df)

# Track lineage for debugging
result = add.to(orders, bring_from=customers, bring=['name'], against='customer_id', lineage=True)
result = add.transform('@calc', result, strategy={'total': 'price * quantity'}, lineage=True)
lineage_report = add.scan('@lineage', result)

print(result)

Features (v0.1.3a9)

1. add.to() - Bring Columns from External Sources

Bring columns from one DataFrame to another based on matching keys:

orders = pl.DataFrame({'order_id': [1, 2], 'customer_id': [101, 102], 'amount': [50.0, 75.0]})
customers = pl.DataFrame({'customer_id': [101, 102], 'name': ['Alice', 'Bob'],
                          'email': ['alice@example.com', 'bob@example.com']})

# Basic lookup
result = add.to(orders, bring_from=customers, bring=['name'], against='customer_id')

# Multiple columns (use lists!)
result = add.to(orders, bring_from=customers, bring=['name', 'email'], against='customer_id')

# With aggregation
result = add.to(customers, bring_from=orders, bring='amount', against='customer_id',
                strategy={'amount': 'sum'})

# With lineage tracking
result = add.to(orders, bring_from=customers, bring=['name'], against='customer_id', lineage=True)
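To make the matching semantics concrete, here is a minimal plain-Python sketch of a key-based lookup with optional aggregation. It is not Additory's implementation: the helper name `lookup`, the use of plain dicts as rows, and the choice to fill unmatched keys with `None` are all assumptions made for illustration.

```python
from collections import defaultdict

def lookup(target_rows, source_rows, bring, against, agg=None):
    """Copy column `bring` from source_rows onto target_rows by matching `against`.

    Hypothetical helper: rows are plain dicts; `agg` is an optional callable
    applied when several source rows share one key (default: first match).
    """
    # Group the source values by key so each target row needs one dict lookup
    grouped = defaultdict(list)
    for row in source_rows:
        grouped[row[against]].append(row[bring])

    result = []
    for row in target_rows:
        matches = grouped.get(row[against], [])
        if not matches:
            value = None  # assumption: unmatched keys yield a null
        elif agg is not None:
            value = agg(matches)
        else:
            value = matches[0]
        result.append({**row, bring: value})
    return result

orders = [
    {'order_id': 1, 'customer_id': 101, 'amount': 50.0},
    {'order_id': 2, 'customer_id': 102, 'amount': 75.0},
    {'order_id': 3, 'customer_id': 101, 'amount': 25.0},
]
customers = [
    {'customer_id': 101, 'name': 'Alice'},
    {'customer_id': 102, 'name': 'Bob'},
]

# One-to-one lookup: every order row gains a name
with_names = lookup(orders, customers, bring='name', against='customer_id')

# One-to-many lookup: several orders per customer are collapsed with sum
totals = lookup(customers, orders, bring='amount', against='customer_id', agg=sum)
```

In this sketch the target keeps its row count in both calls, which is why the one-to-many case needs an aggregation.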

2. add.transform() - Transform DataFrames

Transform data using 10 modes:

# @calc - Calculate new columns
result = add.transform('@calc', df, strategy={'total': 'price * quantity'})

# @filter - Filter rows
result = add.transform('@filter', df, where='age > 18')

# @sort - Sort rows
result = add.transform('@sort', df, by='date', strategy={'order': 'desc'})

# @aggregate - Group and aggregate
result = add.transform('@aggregate', df, by='category', strategy={'amount': 'sum'})

# @round - Round numbers (creates NEW columns)
result = add.transform('@round:2', df, columns='price')  # Creates price_round

# @deduce - Fill missing values
result = add.transform('@deduce', df, columns='age', strategy={'method': 'mean'})

# @extract - Extract patterns
result = add.transform('@extract', df, columns='date', strategy={'date': 'dd-MM-yyyy'})

# @onehotencode - One-hot encode
result = add.transform('@onehotencode', df, columns='category')

# @harmonize - Harmonize units
result = add.transform('@harmonize:weight', df)  # Creates weight_kg

# @transpose - Transpose DataFrame
result = add.transform('@transpose', df)

# With lineage tracking
result = add.transform('@calc', df, strategy={'total': 'price * quantity'}, lineage=True)

3. add.synthetic() - Generate Synthetic Data

Generate synthetic data using 3 modes:

# @new - Create new DataFrame
result = add.synthetic('@new', n=1000, strategy={
    'age': 'normal(40, 10)',
    'salary': 'normal(75000, 15000)'
}, seed=42)

# @augment - Add synthetic rows
result = add.synthetic('@augment', df, n=100, seed=42)

# @analyze / @analyse - Analyze data (DEPRECATED - use add.scan('@analyze') instead)
result = add.synthetic('@analyze', df)  # Emits deprecation warning

4. add.scan() - Data Scanning and Lineage Tracking (NEW!)

Scan DataFrames for statistical profiling and lineage tracking:

# @analyze - Statistical profiling
stats = add.scan('@analyze', df)
# Returns: count, missing, unique, mean, std, min, max, quartiles for each column

# Focus on specific aspects
outliers = add.scan('@analyze', df, focus='outliers')
correlations = add.scan('@analyze', df, focus='correlations')
distributions = add.scan('@analyze', df, focus='distributions')

# @lineage - Track operation history (assumes orders also has price and quantity columns)
result = add.to(orders, bring_from=customers, bring=['name'], against='customer_id', lineage=True)
result = add.transform('@calc', result, strategy={'total': 'price * quantity'}, lineage=True)
lineage_report = add.scan('@lineage', result)
# Shows: operation sequence, row count changes, column sources, data quality warnings

# Focus on specific lineage aspects
null_analysis = add.scan('@lineage', result, focus='nulls')
excluded_rows = add.scan('@lineage', result, focus='excluded')
source_analysis = add.scan('@lineage', result, focus='source:customers')

# Cell-level tracing
cell_trace = add.scan('@lineage', result, trace=[2, 5])  # Trace column 2, row 5
# Shows: complete transformation history for that specific cell

# Filter lineage output
lineage = add.scan('@lineage', result, columns=['total', 'price'])  # Only these columns
lineage = add.scan('@lineage', result, rows='first:10')  # Only first 10 rows
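For orientation, the per-column statistics listed for @analyze (count, missing, unique, mean, std, min, max, quartiles) can be approximated with the standard library alone. This is only a sketch: the function name `profile_column` is hypothetical, and treating `count` as the number of non-missing values and `std` as the sample standard deviation are assumptions, not documented behavior.

```python
import statistics

def profile_column(values):
    """Summarize one numeric column given as a list that may contain None."""
    present = [v for v in values if v is not None]
    q1, q2, q3 = statistics.quantiles(present, n=4)  # three cut points
    return {
        'count': len(present),                   # assumption: non-missing values only
        'missing': len(values) - len(present),
        'unique': len(set(present)),
        'mean': statistics.mean(present),
        'std': statistics.stdev(present),        # assumption: sample standard deviation
        'min': min(present),
        'max': max(present),
        'quartiles': (q1, q2, q3),
    }

profile = profile_column([25, None, 35, None, 45])
```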

What's New in v0.1.3a9

✅ New Features

  • add.scan() Function: New fourth core function for data scanning and lineage tracking

    • @analyze mode: Statistical profiling with focus modes (outliers, correlations, distributions)
    • @lineage mode: Operation history tracking with focus modes (nulls, excluded, source)
    • Cell-level tracing: Trace individual cell transformations through the pipeline
    • Filtering support: Filter lineage output by columns, rows, or conditions
  • Lineage Tracking: Optional operation tracking across all core functions

    • Add lineage=True to add.to(), add.transform(), add.synthetic()
    • Tracks operation sequence, row count changes, column sources
    • Identifies data quality issues (nulls, excluded rows)
    • Dependency tracking for calculated columns
    • Performance optimized: <15% execution overhead, <25% memory overhead
  • Deprecation: add.synthetic('@analyze') now emits deprecation warning

    • Use add.scan('@analyze') instead for statistical profiling

✅ Previous Changes

  • Natural Language Parameters: bring_to, bring_from, bring (not fetch)
  • Lists Everywhere: Use lists for multiple values, not tuples
  • @round Creates NEW Columns: Original columns are left unchanged, following the library's No Deletion principle
  • @deduce Mode: Moved from add.deduce() to add.transform('@deduce')
  • Removed Functions: add.set() and add.deduce() no longer exist
  • Default Seed: seed=42 for reproducible synthetic data
  • @extract Merged: @datetime functionality merged into @extract

Strategy Parameter Structure

The strategy parameter provides fine-grained control over operations in add.to(), add.transform(), and add.synthetic().

add.to() Strategy

Control aggregation, renaming, and positioning for brought columns.

Simple Form (Aggregation Only)

strategy={'col': 'mode'}

Example:

strategy={
    'amount': 'sum',
    'date': 'last'
}

Complex Form (Full Control)

strategy={
    'col': {
        'mode': 'aggregation_mode',
        'rename': 'new_column_name',
        'position': 'position_spec'
    }
}

Example:

strategy={
    'amount': {
        'mode': 'sum',
        'rename': 'total_spent',
        'position': 'after:customer_id'
    },
    'date': {
        'mode': 'last',
        'rename': 'last_order'
    }
}

Aggregation Modes (15)

  • first - First value
  • last - Last value
  • sum - Sum of values
  • count - Count of values
  • average - Average of values
  • min - Minimum value
  • max - Maximum value
  • concat - Concatenate values (comma-separated)
  • concat[sep] - Concatenate with custom separator (e.g., concat[;])
  • most_common - Most common value
  • least_common - Least common value
  • median - Median value
  • std - Standard deviation
  • variance - Variance
  • unique_count - Count of unique values

add.transform() Strategy

Mode-specific configuration options.

@calc Mode

Expressions for calculating new columns:

strategy={'new_column': 'expression'}

Example:

strategy={
    'total': 'price * quantity',
    'discount': 'total * 0.1',
    'final': 'total - discount'
}
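The example above only works if expressions are applied in order, so that `discount` can see `total`. The sketch below shows that idea with a tiny arithmetic evaluator over rows stored as dicts. It is an assumption about the semantics, not Additory's expression engine; `calc` and `_eval` are hypothetical names, and only `+`, `-`, `*`, and `/` are handled.

```python
import ast
import operator

# Supported binary operators for this sketch
_OPS = {
    ast.Add: operator.add,
    ast.Sub: operator.sub,
    ast.Mult: operator.mul,
    ast.Div: operator.truediv,
}

def _eval(node, row):
    """Evaluate a small arithmetic AST against one row (a dict)."""
    if isinstance(node, ast.BinOp) and type(node.op) in _OPS:
        return _OPS[type(node.op)](_eval(node.left, row), _eval(node.right, row))
    if isinstance(node, ast.Name):
        if node.id not in row:
            raise KeyError(f"Column {node.id!r} not found")
        return row[node.id]
    if isinstance(node, ast.Constant) and isinstance(node.value, (int, float)):
        return node.value
    raise ValueError("Unsupported expression")

def calc(rows, strategy):
    """Apply expressions in dict order so later ones can use earlier results."""
    out = []
    for row in rows:
        new_row = dict(row)  # leave the input rows untouched
        for name, expr in strategy.items():
            tree = ast.parse(expr, mode='eval')
            new_row[name] = _eval(tree.body, new_row)
        out.append(new_row)
    return out

rows = [{'price': 100, 'quantity': 2}, {'price': 200, 'quantity': 3}]
result = calc(rows, {
    'total': 'price * quantity',
    'discount': 'total * 0.1',
    'final': 'total - discount',
})
```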

@sort Mode

Sort order specification:

strategy={'order': 'asc' | 'desc'}

Example:

strategy={'order': 'desc'}

@aggregate Mode

Aggregation functions per column:

strategy={'column': 'function'}

Example:

strategy={
    'amount': 'sum',
    'count': 'count',
    'price': 'average'
}

@round Mode

Custom naming and positioning for rounded columns:

strategy={
    'column': {
        'name': 'new_column_name',
        'position': 'position_spec'
    }
}

Example:

strategy={
    'price': {
        'name': 'price_clean',
        'position': 'after:price'
    },
    'tax': {
        'name': 'tax_clean'
    }
}

@deduce Mode

KNN imputation parameters:

strategy={'k': int, 'weights': 'uniform' | 'distance'}

Example:

strategy={'k': 5, 'weights': 'distance'}

add.synthetic() Strategy

Column generation specifications.

Simple Form

strategy={'column': 'strategy_type'}

Example:

strategy={
    'id': 'increment',
    'age': 'normal(40, 10)',
    'subject_id': 'pattern:subj{increment:3}'
}

Complex Form

strategy={
    'column': {
        'type': 'strategy_type',
        'param1': value1,
        'param2': value2
    }
}

Example:

strategy={
    'name': {
        'type': 'choice',
        'values': ['Alice', 'Bob', 'Charlie']
    },
    'age': {
        'type': 'normal',
        'mean': 35,
        'std': 10
    }
}

Generation Types

Deterministic:

  • increment - Sequential numbers (1, 2, 3, ...)
  • increment:start - Start from specific number (e.g., increment:100)
  • increment:start:step - Custom start and step (e.g., increment:100:5)
  • pattern:text{increment:padding} - Pattern with leading zeros (e.g., pattern:subj{increment:3})

Random:

  • choice - Random choice from list
  • normal - Normal distribution (mean, std)
  • uniform - Uniform distribution (min, max)
  • lognormal - Log-normal distribution
  • exponential - Exponential distribution (lambda)
  • poisson - Poisson distribution (lambda)
  • categorical - Categorical distribution (probabilities)

Special:

  • linked_list - Linked list structure
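The deterministic specs are compact strings, so a short parser helps show what each part means. The sketch below is a plain-Python illustration, not Additory's generator: `deterministic_values` is a hypothetical name, and starting the pattern counter at 1 is an assumption.

```python
import re
from itertools import count, islice

def deterministic_values(spec, n):
    """Generate n values for an increment or pattern spec (illustrative)."""
    match = re.fullmatch(r'pattern:(.*)\{increment:(\d+)\}', spec)
    if match:
        prefix, width = match.group(1), int(match.group(2))
        # Zero-pad the counter to `width` digits, e.g. subj001
        return [f"{prefix}{i:0{width}d}" for i in range(1, n + 1)]

    parts = spec.split(':')
    if parts[0] != 'increment' or len(parts) > 3:
        raise ValueError(f"Unsupported spec: {spec!r}")
    start = int(parts[1]) if len(parts) > 1 else 1
    step = int(parts[2]) if len(parts) > 2 else 1
    return list(islice(count(start, step), n))

ids = deterministic_values('increment:100:5', 3)
subjects = deterministic_values('pattern:subj{increment:3}', 3)
```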

Examples

add.to() - Lookups and Joins

import additory as add
import polars as pl

orders = pl.DataFrame({
    'order_id': [1, 2, 3],
    'customer_id': [101, 102, 101]
})

customers = pl.DataFrame({
    'customer_id': [101, 102],
    'name': ['Alice', 'Bob']
})

# Basic lookup
result = add.to(orders, bring_from=customers, bring=['name'], against='customer_id')

# With aggregation
result = add.to(customers, bring_from=orders, bring='order_id', against='customer_id',
                strategy={'order_id': 'count'})

add.transform() - Transformations

# Calculate new columns
df = pl.DataFrame({'price': [100, 200], 'quantity': [2, 3]})
result = add.transform('@calc', df, strategy={'total': 'price * quantity'})

# Round numbers (creates NEW columns)
df = pl.DataFrame({'price': [10.567, 20.123]})
result = add.transform('@round:2', df, columns='price')  # Creates price_round

# Fill missing values
df = pl.DataFrame({'age': [25, None, 35, None, 45],
                   'salary': [50000, 60000, None, 80000, 90000]})
result = add.transform('@deduce', df, columns='age', strategy={'method': 'mean'})

# KNN imputation
result = add.transform('@deduce', df, columns=['age', 'salary'],
                       strategy={'method': 'knn', 'k': 3})

add.synthetic() - Synthetic Data

# Create new DataFrame
result = add.synthetic('@new', n=1000, strategy={
    'age': 'normal(40, 10)',
    'salary': 'normal(75000, 15000)'
}, seed=42)

# Augment existing data
result = add.synthetic('@augment', df, n=100, seed=42)

API Reference

add.to()

Bring columns from one DataFrame to another.

Signature:

def to(
    bring_to,                                    # DataFrame to bring columns to
    bring_from,                                  # DataFrame to bring columns from
    bring: Union[str, List[str]],                # Column(s) to bring
    against: Union[str, List[str]],              # Key(s) to match against
    position: Optional[Union[str, int]] = None,  # Where to place columns
    *,
    strategy: Optional[Dict[str, Union[str, Dict[str, Any]]]] = None,
    join_type: str = 'lookup',
    logging: Union[bool, str] = 'default',
    as_type: Optional[Literal['pandas', 'polars']] = None,
    lineage: bool = False                        # Enable lineage tracking
) -> DataFrame

Parameters:

  • bring_to (DataFrame): Target DataFrame
  • bring_from (DataFrame): Source DataFrame
  • bring (str | list): Column(s) to bring
  • against (str | list): Key column(s) to match
  • position (str | int): Where to place columns ('start', 'end', 'after:col', 'before:col', or int)
  • strategy (dict): Column-level control (aggregation, rename, position)
  • join_type (str): Join type ('lookup', 'left', 'inner', 'outer')
  • logging (bool | str): Logging level (False, True, 'default')
  • as_type (str): Output format (None, 'pandas', 'polars')
  • lineage (bool): Enable lineage tracking (default: False)

Returns:

  • DataFrame: DataFrame with new columns added
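The position values ('start', 'end', 'after:col', 'before:col', or an int) can be read as instructions for where a new column lands in the column order. A minimal sketch of that reading follows; `place_column` is a hypothetical helper, and treating an int as a zero-based insertion index is an assumption.

```python
def place_column(columns, new_col, position='end'):
    """Return a new column order with new_col inserted per the position spec."""
    cols = list(columns)
    if position == 'start':
        index = 0
    elif position == 'end':
        index = len(cols)
    elif isinstance(position, int):
        index = position  # assumption: zero-based insertion index
    elif isinstance(position, str) and position.startswith('after:'):
        index = cols.index(position[len('after:'):]) + 1
    elif isinstance(position, str) and position.startswith('before:'):
        index = cols.index(position[len('before:'):])
    else:
        raise ValueError(f"Unsupported position spec: {position!r}")
    cols.insert(index, new_col)
    return cols

order = place_column(['order_id', 'customer_id'], 'name', 'after:customer_id')
```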

add.transform()

Transform data within a DataFrame.

Signature:

def transform(
    mode: str,                                   # Transform mode
    df,                                          # DataFrame to transform
    columns: Optional[Union[str, List[str]]] = None,
    *,
    where: Optional[str] = None,
    by: Optional[Union[str, List[str]]] = None,
    position: Union[str, int] = 'end',
    strategy: Optional[Dict[str, Any]] = None,
    logging: Union[bool, str] = 'default',
    as_type: Optional[Literal['pandas', 'polars']] = None,
    lineage: bool = False                        # Enable lineage tracking
) -> DataFrame

Parameters:

  • mode (str): Transform mode ('@calc', '@filter', '@sort', '@aggregate', '@harmonize', '@round', '@transpose', '@extract', '@onehotencode', '@deduce')
  • df (DataFrame): Input DataFrame
  • columns (str | list): Column(s) to operate on
  • where (str): Filter condition (for @filter)
  • by (str | list): Group/sort columns
  • position (str | int): Where to place new columns
  • strategy (dict): Mode-specific options
  • logging (bool | str): Logging level
  • as_type (str): Output format
  • lineage (bool): Enable lineage tracking (default: False)

Returns:

  • DataFrame: Transformed DataFrame

add.synthetic()

Generate synthetic data.

Signature:

def synthetic(
    mode: str,                                   # Synthetic mode
    df: Optional[DataFrame] = None,              # DataFrame (for @augment/@analyze)
    n: Optional[int] = None,                     # Number of rows
    *,
    strategy: Optional[Dict[str, Any]] = None,   # Column generation strategies
    seed: int = 42,                              # Random seed
    logging: Union[bool, str] = 'default',
    as_type: Optional[Literal['pandas', 'polars']] = None,
    lineage: bool = False                        # Enable lineage tracking
) -> DataFrame

Parameters:

  • mode (str): Synthetic mode ('@new', '@augment', '@analyze'/'@analyse')
  • df (DataFrame): Input DataFrame (for @augment/@analyze)
  • n (int): Number of rows to generate
  • strategy (dict): Column generation strategies
  • seed (int): Random seed (default: 42)
  • logging (bool | str): Logging level
  • as_type (str): Output format
  • lineage (bool): Enable lineage tracking (default: False)

Returns:

  • DataFrame: Generated or augmented DataFrame
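The default seed exists so that repeated calls produce the same data. The sketch below shows the general idea with the standard library: parse a spec such as 'normal(40, 10)' and draw from a privately seeded generator. It is not Additory's generator and will not reproduce its values; `sample_column` is a hypothetical name, and the 'uniform(min, max)' string form is assumed by analogy with the documented 'normal(mean, std)' form.

```python
import random
import re

def sample_column(spec, n, seed=42):
    """Draw n values for specs like 'normal(40, 10)' or 'uniform(0, 1)'."""
    match = re.fullmatch(r'(\w+)\(\s*([^,]+?)\s*,\s*([^)]+?)\s*\)', spec)
    if not match:
        raise ValueError(f"Unsupported spec: {spec!r}")
    kind, a, b = match.group(1), float(match.group(2)), float(match.group(3))
    rng = random.Random(seed)  # private generator: same seed, same values
    if kind == 'normal':
        return [rng.gauss(a, b) for _ in range(n)]
    if kind == 'uniform':
        return [rng.uniform(a, b) for _ in range(n)]
    raise ValueError(f"Unsupported distribution: {kind!r}")

first = sample_column('normal(40, 10)', n=5, seed=42)
again = sample_column('normal(40, 10)', n=5, seed=42)
other = sample_column('normal(40, 10)', n=5, seed=7)
```

Here `first` and `again` are identical lists, while `other` differs because it uses a different seed.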

add.scan() (NEW!)

Scan DataFrames for statistical profiling and lineage tracking.

Signature:

def scan(
    mode: str,                                   # Scan mode
    df,                                          # DataFrame to scan
    *,
    columns: Optional[Union[str, List[str]]] = None,  # Column filter
    where: Optional[str] = None,                 # Row filter condition
    rows: Optional[str] = None,                  # Row range (first:N, last:N, M-N)
    trace: Optional[List[int]] = None,           # Cell trace [col_idx, row_idx]
    focus: Optional[str] = None,                 # Focus mode
    as_type: Optional[Literal['dataframe', 'dict', 'text']] = 'text'
) -> Union[DataFrame, Dict, str]

Parameters:

  • mode (str): Scan mode ('@analyze' or '@lineage')
  • df (DataFrame): Input DataFrame
  • columns (str | list): Filter output to specific columns
  • where (str): Filter rows by condition
  • rows (str): Row range specification ('first:10', 'last:5', '10-20')
  • trace (list): Cell coordinates for tracing [column_index, row_index]
  • focus (str): Focus mode for detailed analysis
    • For @analyze: 'outliers', 'correlations', 'distributions'
    • For @lineage: 'nulls', 'excluded', 'source:name'
  • as_type (str): Output format ('text', 'dataframe', 'dict')

Returns:

  • str | DataFrame | dict: Scan results in requested format

Examples:

# Statistical profiling
stats = add.scan('@analyze', df)
outliers = add.scan('@analyze', df, focus='outliers')

# Lineage tracking
lineage = add.scan('@lineage', df)  # Requires lineage=True in operations
null_analysis = add.scan('@lineage', df, focus='nulls')
cell_history = add.scan('@lineage', df, trace=[2, 5])
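The rows parameter accepts three string shapes: 'first:N', 'last:N', and 'M-N'. A small sketch of how such a spec could be turned into a slice is shown below. The helper `select_rows` is hypothetical, and the reading of 'M-N' as a zero-based start with an inclusive end is an assumption; the source does not state the indexing convention.

```python
def select_rows(rows, spec):
    """Return the subset of rows named by a 'first:N', 'last:N', or 'M-N' spec."""
    if spec.startswith('first:'):
        return rows[:int(spec[len('first:'):])]
    if spec.startswith('last:'):
        n = int(spec[len('last:'):])
        return rows[-n:] if n > 0 else []
    start, sep, end = spec.partition('-')
    if not sep:
        raise ValueError(f"Unsupported rows spec: {spec!r}")
    # Assumption: zero-based start, inclusive end
    return rows[int(start):int(end) + 1]

data = list(range(30))
head = select_rows(data, 'first:10')
tail = select_rows(data, 'last:5')
middle = select_rows(data, '10-20')
```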

Transform Modes

@calc - Calculate New Columns

Calculate new columns from expressions.

Example:

result = add.transform('@calc', df, strategy={
    'total': 'price * quantity',
    'discount': 'total * 0.1'
})

@filter - Filter Rows

Filter rows based on conditions.

Example:

result = add.transform('@filter', df, where='age > 18 AND status == "active"')

@sort - Sort DataFrame

Sort DataFrame by columns.

Example:

result = add.transform('@sort', df, by='date', strategy={'order': 'desc'})

@aggregate - Group and Aggregate

Group by columns and aggregate.

Example:

result = add.transform('@aggregate', df, by='category', strategy={'amount': 'sum'})

@harmonize - Harmonize Units

Harmonize units across columns (10 sub-modes).

Example:

result = add.transform('@harmonize:weight', df)  # Creates weight_kg

@round - Round Numbers

Round numbers (creates NEW columns).

Example:

result = add.transform('@round:2', df, columns='price')  # Creates price_round
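The point of creating a new column is that the original values remain available. A minimal sketch of that behavior on a dict of columns follows; `round_column` is a hypothetical helper, and the `_round` suffix is taken from the comment in the example above.

```python
def round_column(table, column, digits, suffix='_round'):
    """Return a copy of table with a rounded copy of `column` added."""
    new_table = dict(table)  # shallow copy; the input dict is not modified
    new_table[column + suffix] = [
        None if v is None else round(v, digits) for v in table[column]
    ]
    return new_table

table = {'price': [10.567, 20.123, 30.999]}
rounded = round_column(table, 'price', 2)
```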

@transpose - Transpose DataFrame

Transpose DataFrame.

Example:

result = add.transform('@transpose', df)

@extract - Extract Patterns

Extract patterns from text or dates.

Example:

result = add.transform('@extract', df, columns='date', strategy={'date': 'dd-MM-yyyy'})

@onehotencode - One-Hot Encode

One-hot encode categorical columns.

Example:

result = add.transform('@onehotencode', df, columns='category')

@deduce - Fill Missing Values

Fill missing values using 7 methods.

Methods: auto, mean, median, mode, forward, backward, knn

Example:

# Mean imputation
result = add.transform('@deduce', df, columns='age', strategy={'method': 'mean'})

# KNN imputation
result = add.transform('@deduce', df, columns=['age', 'salary'],
                       strategy={'method': 'knn', 'k': 3})
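To show what the simpler methods do to a column, here is a plain-Python sketch of mean, median, mode, forward, and backward filling on a list with `None` gaps. It is an illustration of the general techniques rather than Additory's code; `fill_missing` is a hypothetical name, and the `auto` and `knn` methods are not covered.

```python
import statistics

def fill_missing(values, method='mean'):
    """Fill None entries in a list using a simple imputation method."""
    present = [v for v in values if v is not None]
    if method in ('mean', 'median', 'mode'):
        # One fill value computed from the observed entries
        fill = getattr(statistics, method)(present)
        return [fill if v is None else v for v in values]
    if method in ('forward', 'backward'):
        # Backward fill is forward fill on the reversed sequence
        seq = values if method == 'forward' else values[::-1]
        filled, last = [], None
        for v in seq:
            last = v if v is not None else last
            filled.append(last)
        return filled if method == 'forward' else filled[::-1]
    raise ValueError(f"Unsupported method: {method!r}")

ages = [25, None, 35, None, 45]
by_mean = fill_missing(ages, 'mean')
by_forward = fill_missing(ages, 'forward')
by_backward = fill_missing(ages, 'backward')
```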

Error Handling

Additory provides clear, actionable error messages:

try:
    # Using tuple instead of list
    result = add.to(orders, bring_from=customers, bring=['name'], 
                    against=('customer_id', 'date'))
except TypeError as e:
    print(e)
    # Parameter 'against' must be a list, not tuple.
    # Use ['customer_id', 'date'] instead of ('customer_id', 'date')

try:
    # Column not found
    result = add.transform('@calc', df, strategy={'result': 'nonexistent + 5'})
except RuntimeError as e:
    print(e)
    # Column 'nonexistent' not found in DataFrame
    # Available columns: a, b, c

All errors include:

  • Clear description of what went wrong
  • Contextual information (available options, etc.)
  • Actionable suggestions for fixing the problem
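A validation helper in this style might look like the sketch below, which reproduces the wording of the tuple error shown above. The function name `require_list` and the choice to wrap a single string in a list are assumptions for illustration, not part of Additory's API.

```python
def require_list(name, value):
    """Reject tuples with a message that says what to pass instead."""
    if isinstance(value, tuple):
        raise TypeError(
            f"Parameter '{name}' must be a list, not tuple.\n"
            f"Use {list(value)!r} instead of {value!r}"
        )
    # Assumption: a single string is accepted and normalized to a list
    return [value] if isinstance(value, str) else list(value)

try:
    require_list('against', ('customer_id', 'date'))
except TypeError as exc:
    message = str(exc)
```

The message names the parameter, states the problem, and shows the exact replacement value, which covers the three properties listed above.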

Migration from v0.1.3a5

If you're upgrading from v0.1.3a5, here are the key changes:

Parameter Renames

# OLD (v0.1.3a5)
add.to(orders, fetch_from=customers, fetch=['name'], by='customer_id')

# NEW (v0.1.3a9)
add.to(orders, bring_from=customers, bring=['name'], against='customer_id')

Lists Instead of Tuples

# OLD (v0.1.3a5)
add.to(orders, fetch_from=customers, fetch=['name'], by=('id', 'date'))

# NEW (v0.1.3a9)
add.to(orders, bring_from=customers, bring=['name'], against=['id', 'date'])

Removed Functions

# OLD (v0.1.3a5)
add.set(logging=True)
add.deduce(df, 'age', method='mean')

# NEW (v0.1.3a9)
# add.set() removed - use logging parameter per function
add.to(..., logging=True)
add.transform(..., logging=True)

# add.deduce() moved to transform mode
add.transform('@deduce', df, columns='age', strategy={'method': 'mean'})

@round Creates NEW Columns

# OLD (v0.1.3a5)
# @round modified columns in-place

# NEW (v0.1.3a9)
# @round creates NEW columns
result = add.transform('@round:2', df, columns='price')
# Creates: price_round (original price column unchanged)

Development

Running Tests

# Integration tests
cd python-specific
pytest tests/test_integration.py -v

# Rust tests
cd rust-core
cargo test --all

# Benchmarks
cd python-specific
python benchmarks/benchmark_integration.py

Building from Source

# Build Rust module
cd rust-core
cargo build --release

# Install Python package
cd python-specific
pip install -e .

Contributing

Contributions are welcome! Please see CONTRIBUTING.md for guidelines.


License

See LICENSE file for details.


Changelog

See CHANGELOG.md for version history and changes.


Acknowledgments

Built with:

  • Rust - Systems programming language
  • PyO3 - Rust bindings for Python
  • Polars - Fast DataFrame library
  • Pandas - Data analysis library

Status: Alpha Release
Version: 0.1.3a9
Date: March 8, 2026
