Elegant data operations for DataFrames
Additory v0.1.3a9
Overview
Additory is a data transformation library that provides a unified API for common data operations with support for both Polars and Pandas DataFrames.
Note: This is an alpha release (v0.1.3a9) with new scanning and lineage tracking capabilities.
Key Features
- 🔄 Flexible - Works seamlessly with both Polars and Pandas
- 🎯 Type-safe - Strong typing with clear, actionable error messages
- 🧪 Tested - Comprehensive test coverage
- 📚 Documented - Complete API documentation and usage examples
- 🚀 Rust-powered - Rust acceleration for performance
- 📝 Natural Language - English-like parameter names (bring_to, bring_from, bring)
- 📋 Lists Everywhere - Use lists for multiple values, not tuples
- 🔍 Data Scanning - Statistical profiling and lineage tracking with add.scan()
- 📊 Lineage Tracking - Optional operation tracking for debugging and auditing
Installation
pip install additory==0.1.3a9
Requirements
- Python 3.8+
- Polars >= 0.19.0
- NumPy >= 1.20.0
Optional Dependencies
# For development (quotes keep shells such as zsh from expanding the brackets)
pip install "additory[dev]"
Quick Start
import additory as add
import polars as pl
# Add columns from external sources
orders = pl.DataFrame({'order_id': [1, 2], 'customer_id': [101, 102]})
customers = pl.DataFrame({'customer_id': [101, 102], 'name': ['Alice', 'Bob']})
result = add.to(orders, bring_from=customers, bring=['name'], against='customer_id')
# Transform data
df = pl.DataFrame({'price': [10.567, 20.123, 30.999]})
result = add.transform('@round:2', df, columns='price') # Creates price_round
# Fill missing values
df = pl.DataFrame({'age': [25, None, 35, None, 45]})
result = add.transform('@deduce', df, columns='age', strategy={'method': 'mean'})
# Generate synthetic data
result = add.synthetic('@new', n=100, strategy={'age': 'normal(40, 10)'}, seed=42)
# Analyze data with statistical profiling
stats = add.scan('@analyze', df)
# Track lineage for debugging
result = add.to(orders, bring_from=customers, bring=['name'], against='customer_id', lineage=True)
result = add.transform('@calc', result, strategy={'total': 'price * quantity'}, lineage=True)
lineage_report = add.scan('@lineage', result)
print(result)
Features (v0.1.3a9)
1. add.to() - Bring Columns from External Sources
Bring columns from one DataFrame to another based on matching keys:
orders = pl.DataFrame({'order_id': [1, 2], 'customer_id': [101, 102]})
customers = pl.DataFrame({'customer_id': [101, 102], 'name': ['Alice', 'Bob']})
# Basic lookup
result = add.to(orders, bring_from=customers, bring=['name'], against='customer_id')
# Multiple columns (use lists!)
result = add.to(orders, bring_from=customers, bring=['name', 'email'], against='customer_id')
# With aggregation
result = add.to(customers, bring_from=orders, bring='amount', against='customer_id',
                strategy={'amount': 'sum'})
# With lineage tracking
result = add.to(orders, bring_from=customers, bring=['name'], against='customer_id', lineage=True)
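For readers who want to see the lookup semantics spelled out, here is a minimal plain-Python sketch of a key-based lookup on lists of dicts. It is not Additory's implementation: the helper name `lookup`, the "first source row wins" rule, and filling unmatched keys with None are all assumptions made for illustration.

```python
# Plain-Python sketch of a key-based lookup; not Additory's implementation.
def lookup(target_rows, source_rows, bring, against):
    """Copy the `bring` columns from source_rows onto target_rows by key."""
    # Index the source by key; the first row seen for a key wins (assumption).
    index = {}
    for row in source_rows:
        index.setdefault(row[against], row)

    result = []
    for row in target_rows:
        match = index.get(row[against])
        # Unmatched keys get None (assumption; the README does not say).
        extra = {col: (match[col] if match else None) for col in bring}
        result.append({**row, **extra})
    return result


orders = [
    {"order_id": 1, "customer_id": 101},
    {"order_id": 2, "customer_id": 102},
    {"order_id": 3, "customer_id": 999},  # no matching customer
]
customers = [
    {"customer_id": 101, "name": "Alice"},
    {"customer_id": 102, "name": "Bob"},
]

enriched = lookup(orders, customers, bring=["name"], against="customer_id")
for row in enriched:
    print(row)
```

The row count of the target is preserved, which is the property that distinguishes a lookup from an inner join.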
2. add.transform() - Transform DataFrames
Transform data using 10 modes:
# @calc - Calculate new columns
result = add.transform('@calc', df, strategy={'total': 'price * quantity'})
# @filter - Filter rows
result = add.transform('@filter', df, where='age > 18')
# @sort - Sort rows
result = add.transform('@sort', df, by='date', strategy={'order': 'desc'})
# @aggregate - Group and aggregate
result = add.transform('@aggregate', df, by='category', strategy={'amount': 'sum'})
# @round - Round numbers (creates NEW columns)
result = add.transform('@round:2', df, columns='price') # Creates price_round
# @deduce - Fill missing values
result = add.transform('@deduce', df, columns='age', strategy={'method': 'mean'})
# @extract - Extract patterns
result = add.transform('@extract', df, columns='date', strategy={'date': 'dd-MM-yyyy'})
# @onehotencode - One-hot encode
result = add.transform('@onehotencode', df, columns='category')
# @harmonize - Harmonize units
result = add.transform('@harmonize:weight', df) # Creates weight_kg
# @transpose - Transpose DataFrame
result = add.transform('@transpose', df)
# With lineage tracking
result = add.transform('@calc', df, strategy={'total': 'price * quantity'}, lineage=True)
3. add.synthetic() - Generate Synthetic Data
Generate synthetic data using 3 modes:
# @new - Create new DataFrame
result = add.synthetic('@new', n=1000, strategy={
    'age': 'normal(40, 10)',
    'salary': 'normal(75000, 15000)'
}, seed=42)
# @augment - Add synthetic rows
result = add.synthetic('@augment', df, n=100, seed=42)
# @analyze / @analyse - Analyze data (DEPRECATED - use add.scan('@analyze') instead)
result = add.synthetic('@analyze', df) # Emits deprecation warning
4. add.scan() - Data Scanning and Lineage Tracking (NEW!)
Scan DataFrames for statistical profiling and lineage tracking:
# @analyze - Statistical profiling
stats = add.scan('@analyze', df)
# Returns: count, missing, unique, mean, std, min, max, quartiles for each column
# Focus on specific aspects
outliers = add.scan('@analyze', df, focus='outliers')
correlations = add.scan('@analyze', df, focus='correlations')
distributions = add.scan('@analyze', df, focus='distributions')
# @lineage - Track operation history
result = add.to(orders, bring_from=customers, bring=['name'], against='customer_id', lineage=True)
result = add.transform('@calc', result, strategy={'total': 'price * quantity'}, lineage=True)
lineage_report = add.scan('@lineage', result)
# Shows: operation sequence, row count changes, column sources, data quality warnings
# Focus on specific lineage aspects
null_analysis = add.scan('@lineage', result, focus='nulls')
excluded_rows = add.scan('@lineage', result, focus='excluded')
source_analysis = add.scan('@lineage', result, focus='source:customers')
# Cell-level tracing
cell_trace = add.scan('@lineage', result, trace=[2, 5]) # Trace column 2, row 5
# Shows: complete transformation history for that specific cell
# Filter lineage output
lineage = add.scan('@lineage', result, columns=['total', 'price']) # Only these columns
lineage = add.scan('@lineage', result, rows='first:10') # Only first 10 rows
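The per-column statistics listed for @analyze above (count, missing, unique, mean, std, min, max, quartiles) can be illustrated with the standard library alone. This is a sketch of what such a profile might contain, not the output format of add.scan(); the helper name `profile`, the choice of sample standard deviation, and counting missing values inside `count` are assumptions.

```python
import statistics


def profile(values):
    """Summarise one numeric column; None marks a missing value."""
    present = [v for v in values if v is not None]
    q1, q2, q3 = statistics.quantiles(present, n=4)
    return {
        "count": len(values),                   # total rows (assumption)
        "missing": len(values) - len(present),
        "unique": len(set(present)),
        "mean": statistics.mean(present),
        "std": statistics.stdev(present),       # sample standard deviation
        "min": min(present),
        "max": max(present),
        "quartiles": (q1, q2, q3),
    }


ages = [25, None, 35, None, 45]
summary = profile(ages)
print(summary)
```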
What's New in v0.1.3a9
✅ New Features
- add.scan() Function: New fourth core function for data scanning and lineage tracking
  - @analyze mode: Statistical profiling with focus modes (outliers, correlations, distributions)
  - @lineage mode: Operation history tracking with focus modes (nulls, excluded, source)
  - Cell-level tracing: Trace individual cell transformations through the pipeline
  - Filtering support: Filter lineage output by columns, rows, or conditions
- Lineage Tracking: Optional operation tracking across all core functions
  - Add lineage=True to add.to(), add.transform(), add.synthetic()
  - Tracks operation sequence, row count changes, column sources
  - Identifies data quality issues (nulls, excluded rows)
  - Dependency tracking for calculated columns
  - Performance optimized: <15% execution overhead, <25% memory overhead
- Deprecation: add.synthetic('@analyze') now emits a deprecation warning
  - Use add.scan('@analyze') instead for statistical profiling
✅ Previous Changes
- Natural Language Parameters: bring_to, bring_from, bring (not fetch)
- Lists Everywhere: Use lists for multiple values, not tuples
- @round Creates NEW Columns: Philosophy compliant (No Deletion principle)
- @deduce Mode: Moved from add.deduce() to add.transform('@deduce')
- Removed Functions: add.set() and add.deduce() no longer exist
- Default Seed: seed=42 for reproducible synthetic data
- @extract Merged: @datetime functionality merged into @extract
Strategy Parameter Structure
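The strategy parameter provides fine-grained control over operations in the three functions that accept it: add.to(), add.transform(), and add.synthetic(). add.scan() does not take a strategy.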
add.to() Strategy
Control aggregation, renaming, and positioning for brought columns.
Simple Form (Aggregation Only)
strategy={'col': 'mode'}
Example:
strategy={
    'amount': 'sum',
    'date': 'last'
}
Complex Form (Full Control)
strategy={
    'col': {
        'mode': 'aggregation_mode',
        'rename': 'new_column_name',
        'position': 'position_spec'
    }
}
Example:
strategy={
    'amount': {
        'mode': 'sum',
        'rename': 'total_spent',
        'position': 'after:customer_id'
    },
    'date': {
        'mode': 'last',
        'rename': 'last_order'
    }
}
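The relationship between the simple and complex forms can be sketched as a normalization step: a bare string is shorthand for a dict that only sets 'mode'. The helper below is hypothetical (`normalize_to_strategy` is not part of Additory), and the defaults it fills in, as well as the idea that both forms may be mixed in one dict, are assumptions.

```python
def normalize_to_strategy(strategy):
    """Expand simple-form entries into the complex form (hypothetical helper)."""
    normalized = {}
    for column, spec in strategy.items():
        if isinstance(spec, str):
            # Simple form: the string is the aggregation mode.
            spec = {"mode": spec}
        elif not isinstance(spec, dict):
            raise TypeError(
                f"Strategy for '{column}' must be a str or dict, "
                f"not {type(spec).__name__}"
            )
        normalized[column] = {
            "mode": spec.get("mode"),
            "rename": spec.get("rename", column),  # assumed default: keep name
            "position": spec.get("position"),      # None: use caller's default
        }
    return normalized


mixed = {
    "amount": {"mode": "sum", "rename": "total_spent",
               "position": "after:customer_id"},
    "date": "last",
}
normalized = normalize_to_strategy(mixed)
print(normalized)
```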
Aggregation Modes (15)
- first - First value
- last - Last value
- sum - Sum of values
- count - Count of values
- average - Average of values
- min - Minimum value
- max - Maximum value
- concat - Concatenate values (comma-separated)
- concat[sep] - Concatenate with custom separator (e.g., concat[;])
- most_common - Most common value
- least_common - Least common value
- median - Median value
- std - Standard deviation
- variance - Variance
- unique_count - Count of unique values
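To make a few of these modes concrete, the following sketch applies them to a plain Python list. It covers only a subset, it is not how Additory computes them, and the function name `aggregate` is made up for this example. In particular, the bare `concat` mode is assumed to join with a comma and no space.

```python
import statistics
from collections import Counter


def aggregate(values, mode):
    """Apply one aggregation mode to a list of values (illustrative subset)."""
    if mode.startswith("concat"):
        # 'concat' uses a comma; 'concat[;]' takes the separator in brackets.
        sep = mode[len("concat["):-1] if mode.endswith("]") else ","
        return sep.join(str(v) for v in values)

    simple = {
        "first": lambda v: v[0],
        "last": lambda v: v[-1],
        "sum": sum,
        "count": len,
        "average": statistics.mean,
        "min": min,
        "max": max,
        "median": statistics.median,
        "unique_count": lambda v: len(set(v)),
        "most_common": lambda v: Counter(v).most_common(1)[0][0],
    }
    if mode not in simple:
        raise ValueError(f"Unsupported mode in this sketch: {mode!r}")
    return simple[mode](values)


amounts = [10, 20, 20, 50]
for mode in ("sum", "average", "most_common", "unique_count", "concat[;]"):
    print(mode, "->", aggregate(amounts, mode))
```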
add.transform() Strategy
Mode-specific configuration options.
@calc Mode
Expressions for calculating new columns:
strategy={'new_column': 'expression'}
Example:
strategy={
    'total': 'price * quantity',
    'discount': 'total * 0.1',
    'final': 'total - discount'
}
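The example above only works if expressions are evaluated in order, so that 'discount' can see 'total' and 'final' can see both. The sketch below shows that ordering on plain dict rows. It is an illustration under that assumption, not Additory's expression engine, and `calc` is a hypothetical helper.

```python
def calc(rows, strategy):
    """Evaluate each expression per row, in dict order (hypothetical helper)."""
    result = []
    for row in rows:
        scope = dict(row)
        for new_column, expression in strategy.items():
            # eval() keeps the sketch short; builtins are disabled, but this
            # is still not safe for untrusted input.
            scope[new_column] = eval(expression, {"__builtins__": {}}, scope)
        result.append(scope)
    return result


rows = [{"price": 100, "quantity": 2}]
result = calc(rows, {
    "total": "price * quantity",
    "discount": "total * 0.1",
    "final": "total - discount",
})
print(result)
```

Reordering the dict so that 'final' comes before 'total' would raise a NameError in this sketch, which is why key order matters here.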
@sort Mode
Sort order specification:
strategy={'order': 'asc' | 'desc'}
Example:
strategy={'order': 'desc'}
@aggregate Mode
Aggregation functions per column:
strategy={'column': 'function'}
Example:
strategy={
    'amount': 'sum',
    'count': 'count',
    'price': 'average'
}
@round Mode
Custom naming and positioning for rounded columns:
strategy={
    'column': {
        'name': 'new_column_name',
        'position': 'position_spec'
    }
}
Example:
strategy={
    'price': {
        'name': 'price_clean',
        'position': 'after:price'
    },
    'tax': {
        'name': 'tax_clean'
    }
}
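Position specs such as 'after:price' determine where a new column lands in the column order. As a rough sketch of how the documented forms ('start', 'end', 'after:col', 'before:col', or an integer) could be resolved, consider the hypothetical helper below. Treating an integer as a zero-based index is an assumption.

```python
def insert_column(columns, new_column, position="end"):
    """Return a new column order with new_column placed per the spec."""
    columns = list(columns)
    if isinstance(position, int):
        index = position  # assumed to be a zero-based index
    elif position == "start":
        index = 0
    elif position == "end":
        index = len(columns)
    elif position.startswith(("after:", "before:")):
        where, anchor = position.split(":", 1)
        if anchor not in columns:
            raise ValueError(
                f"Column '{anchor}' not found. "
                f"Available columns: {', '.join(columns)}"
            )
        index = columns.index(anchor) + (1 if where == "after" else 0)
    else:
        raise ValueError(f"Unrecognised position spec: {position!r}")
    columns.insert(index, new_column)
    return columns


order = insert_column(["id", "price", "tax"], "price_clean", "after:price")
print(order)
```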
@deduce Mode
KNN imputation parameters:
strategy={'k': int, 'weights': 'uniform' | 'distance'}
Example:
strategy={'k': 5, 'weights': 'distance'}
add.synthetic() Strategy
Column generation specifications.
Simple Form
strategy={'column': 'strategy_type'}
Example:
strategy={
    'id': 'increment',
    'age': 'normal(40, 10)',
    'subject_id': 'pattern:subj{increment:3}'
}
Complex Form
strategy={
    'column': {
        'type': 'strategy_type',
        'param1': value1,
        'param2': value2
    }
}
Example:
strategy={
    'name': {
        'type': 'choice',
        'values': ['Alice', 'Bob', 'Charlie']
    },
    'age': {
        'type': 'normal',
        'mean': 35,
        'std': 10
    }
}
Generation Types
Deterministic:
- increment - Sequential numbers (1, 2, 3, ...)
- increment:start - Start from specific number (e.g., increment:100)
- increment:start:step - Custom start and step (e.g., increment:100:5)
- pattern:text{increment:padding} - Pattern with leading zeros (e.g., pattern:subj{increment:3})
Random:
- choice - Random choice from list
- normal - Normal distribution (mean, std)
- uniform - Uniform distribution (min, max)
- lognormal - Log-normal distribution
- exponential - Exponential distribution (lambda)
- poisson - Poisson distribution (lambda)
- categorical - Categorical distribution (probabilities)
Special:
- linked_list - Linked list structure
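A small standard-library sketch may help show how simple-form specs like these could be interpreted and why a fixed seed makes the output reproducible. It handles only `increment`, `pattern:...{increment:N}` and `normal(mean, std)`. The helper names are hypothetical, the zero-padded output (`subj001`) is an assumed reading of "leading zeros", and the random values will not match what Additory generates.

```python
import itertools
import random
import re


def make_generator(spec, rng):
    """Return a zero-argument callable producing values for one column spec."""
    if spec.startswith("increment"):
        parts = spec.split(":")
        start = int(parts[1]) if len(parts) > 1 else 1
        step = int(parts[2]) if len(parts) > 2 else 1
        counter = itertools.count(start, step)
        return lambda: next(counter)

    match = re.fullmatch(r"pattern:(.*)\{increment:(\d+)\}", spec)
    if match:
        prefix, width = match.group(1), int(match.group(2))
        counter = itertools.count(1)
        # Zero-pad the counter to `width` digits, e.g. subj001.
        return lambda: f"{prefix}{next(counter):0{width}d}"

    match = re.fullmatch(r"normal\(\s*([-\d.]+)\s*,\s*([-\d.]+)\s*\)", spec)
    if match:
        mean, std = float(match.group(1)), float(match.group(2))
        return lambda: rng.gauss(mean, std)

    raise ValueError(f"Unsupported spec in this sketch: {spec!r}")


def generate(n, strategy, seed=42):
    """Build n rows; one shared seeded RNG keeps runs reproducible."""
    rng = random.Random(seed)
    generators = {col: make_generator(spec, rng) for col, spec in strategy.items()}
    return [{col: gen() for col, gen in generators.items()} for _ in range(n)]


strategy = {
    "id": "increment:100:5",
    "subject_id": "pattern:subj{increment:3}",
    "age": "normal(40, 10)",
}
sample = generate(3, strategy)
for row in sample:
    print(row)
```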
Documentation
Complete Documentation
- API Documentation - Complete API reference
- Usage Examples - Real-world usage examples
- CHANGELOG - Version history and changes
Additional Resources
- Performance Benchmarks - Detailed performance analysis
- Error Handling - Error handling guide
- Integration Tests - Test coverage details
Examples
add.to() - Lookups and Joins
import additory as add
import polars as pl
orders = pl.DataFrame({
    'order_id': [1, 2, 3],
    'customer_id': [101, 102, 101]
})
customers = pl.DataFrame({
    'customer_id': [101, 102],
    'name': ['Alice', 'Bob']
})
# Basic lookup
result = add.to(orders, bring_from=customers, bring=['name'], against='customer_id')
# With aggregation
result = add.to(customers, bring_from=orders, bring='order_id', against='customer_id',
                strategy={'order_id': 'count'})
add.transform() - Transformations
# Calculate new columns
df = pl.DataFrame({'price': [100, 200], 'quantity': [2, 3]})
result = add.transform('@calc', df, strategy={'total': 'price * quantity'})
# Round numbers (creates NEW columns)
df = pl.DataFrame({'price': [10.567, 20.123]})
result = add.transform('@round:2', df, columns='price') # Creates price_round
# Fill missing values
df = pl.DataFrame({'age': [25, None, 35, None, 45]})
result = add.transform('@deduce', df, columns='age', method='mean')
# KNN imputation
result = add.transform('@deduce', df, columns=['age', 'salary'], method='knn',
                       strategy={'k': 3})
add.synthetic() - Synthetic Data
# Create new DataFrame
result = add.synthetic('@new', n=1000, strategy={
    'age': 'normal(40, 10)',
    'salary': 'normal(75000, 15000)'
}, seed=42)
# Augment existing data
result = add.synthetic('@augment', df, n=100, seed=42)
API Reference
add.to()
Bring columns from one DataFrame to another.
Signature:
def to(
    bring_to,  # DataFrame to bring columns to
    bring_from,  # DataFrame to bring columns from
    bring: Union[str, List[str]],  # Column(s) to bring
    against: Union[str, List[str]],  # Key(s) to match against
    position: Optional[Union[str, int]] = None,  # Where to place columns
    *,
    strategy: Optional[Dict[str, Union[str, Dict[str, Any]]]] = None,
    join_type: str = 'lookup',
    logging: Union[bool, str] = 'default',
    as_type: Optional[Literal['pandas', 'polars']] = None,
    lineage: bool = False  # Enable lineage tracking
) -> DataFrame
Parameters:
- bring_to (DataFrame): Target DataFrame
- bring_from (DataFrame): Source DataFrame
- bring (str | list): Column(s) to bring
- against (str | list): Key column(s) to match
- position (str | int): Where to place columns ('start', 'end', 'after:col', 'before:col', or int)
- strategy (dict): Column-level control (aggregation, rename, position)
- join_type (str): Join type ('lookup', 'left', 'inner', 'outer')
- logging (bool | str): Logging level (False, True, 'default')
- as_type (str): Output format (None, 'pandas', 'polars')
- lineage (bool): Enable lineage tracking (default: False)
Returns:
- DataFrame: DataFrame with new columns added
add.transform()
Transform data within a DataFrame.
Signature:
def transform(
    mode: str,  # Transform mode
    df,  # DataFrame to transform
    columns: Optional[Union[str, List[str]]] = None,
    *,
    where: Optional[str] = None,
    by: Optional[Union[str, List[str]]] = None,
    position: Union[str, int] = 'end',
    strategy: Optional[Dict[str, Any]] = None,
    logging: Union[bool, str] = 'default',
    as_type: Optional[Literal['pandas', 'polars']] = None,
    lineage: bool = False  # Enable lineage tracking
) -> DataFrame
Parameters:
- mode (str): Transform mode ('@calc', '@filter', '@sort', '@aggregate', '@harmonize', '@round', '@transpose', '@extract', '@onehotencode', '@deduce')
- df (DataFrame): Input DataFrame
- columns (str | list): Column(s) to operate on
- where (str): Filter condition (for @filter)
- by (str | list): Group/sort columns
- position (str | int): Where to place new columns
- strategy (dict): Mode-specific options
- logging (bool | str): Logging level
- as_type (str): Output format
- lineage (bool): Enable lineage tracking (default: False)
Returns:
- DataFrame: Transformed DataFrame
add.synthetic()
Generate synthetic data.
Signature:
def synthetic(
    mode: str,  # Synthetic mode
    df: Optional[DataFrame] = None,  # DataFrame (for @augment/@analyze)
    n: Optional[int] = None,  # Number of rows
    *,
    strategy: Optional[Dict[str, Any]] = None,  # Column generation strategies
    seed: int = 42,  # Random seed
    logging: Union[bool, str] = 'default',
    as_type: Optional[Literal['pandas', 'polars']] = None,
    lineage: bool = False  # Enable lineage tracking
) -> DataFrame
Parameters:
- mode (str): Synthetic mode ('@new', '@augment', '@analyze'/'@analyse')
- df (DataFrame): Input DataFrame (for @augment/@analyze)
- n (int): Number of rows to generate
- strategy (dict): Column generation strategies
- seed (int): Random seed (default: 42)
- logging (bool | str): Logging level
- as_type (str): Output format
- lineage (bool): Enable lineage tracking (default: False)
Returns:
- DataFrame: Generated or augmented DataFrame
add.scan() (NEW!)
Scan DataFrames for statistical profiling and lineage tracking.
Signature:
def scan(
    mode: str,  # Scan mode
    df,  # DataFrame to scan
    *,
    columns: Optional[Union[str, List[str]]] = None,  # Column filter
    where: Optional[str] = None,  # Row filter condition
    rows: Optional[str] = None,  # Row range (first:N, last:N, M-N)
    trace: Optional[List[int]] = None,  # Cell trace [col_idx, row_idx]
    focus: Optional[str] = None,  # Focus mode
    as_type: Optional[Literal['dataframe', 'dict', 'text']] = 'text'
) -> Union[DataFrame, Dict, str]
Parameters:
- mode (str): Scan mode ('@analyze' or '@lineage')
- df (DataFrame): Input DataFrame
- columns (str | list): Filter output to specific columns
- where (str): Filter rows by condition
- rows (str): Row range specification ('first:10', 'last:5', '10-20')
- trace (list): Cell coordinates for tracing [column_index, row_index]
- focus (str): Focus mode for detailed analysis
  - For @analyze: 'outliers', 'correlations', 'distributions'
  - For @lineage: 'nulls', 'excluded', 'source:name'
- as_type (str): Output format ('text', 'dataframe', 'dict')
Returns:
- str | DataFrame | dict: Scan results in requested format
Examples:
# Statistical profiling
stats = add.scan('@analyze', df)
outliers = add.scan('@analyze', df, focus='outliers')
# Lineage tracking
lineage = add.scan('@lineage', df) # Requires lineage=True in operations
null_analysis = add.scan('@lineage', df, focus='nulls')
cell_history = add.scan('@lineage', df, trace=[2, 5])
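The `rows` parameter accepts three spec shapes: 'first:N', 'last:N' and 'M-N'. The sketch below shows one way such specs could map to row indices. It is not Additory's parser, `select_rows` is a hypothetical name, and treating 'M-N' as zero-based and inclusive at both ends is an assumption that this README does not confirm.

```python
def select_rows(n_rows, spec):
    """Translate a row-range spec into a range of zero-based row indices."""
    if spec.startswith("first:"):
        return range(0, min(int(spec[len("first:"):]), n_rows))
    if spec.startswith("last:"):
        return range(max(n_rows - int(spec[len("last:"):]), 0), n_rows)
    if "-" in spec:
        start, stop = (int(part) for part in spec.split("-", 1))
        # Assumption: 'M-N' is zero-based and inclusive of both ends.
        return range(start, min(stop + 1, n_rows))
    raise ValueError(f"Unrecognised rows spec: {spec!r}")


for spec in ("first:10", "last:5", "10-20"):
    picked = select_rows(100, spec)
    print(spec, "->", picked.start, "to", picked.stop - 1)
```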
Transform Modes
@calc - Calculate New Columns
Calculate new columns from expressions.
Example:
result = add.transform('@calc', df, strategy={
    'total': 'price * quantity',
    'discount': 'total * 0.1'
})
@filter - Filter Rows
Filter rows based on conditions.
Example:
result = add.transform('@filter', df, where='age > 18 AND status == "active"')
@sort - Sort DataFrame
Sort DataFrame by columns.
Example:
result = add.transform('@sort', df, by='date', strategy={'order': 'desc'})
@aggregate - Group and Aggregate
Group by columns and aggregate.
Example:
result = add.transform('@aggregate', df, by='category', strategy={'amount': 'sum'})
@harmonize - Harmonize Units
Harmonize units across columns (10 sub-modes).
Example:
result = add.transform('@harmonize:weight', df) # Creates weight_kg
@round - Round Numbers
Round numbers (creates NEW columns).
Example:
result = add.transform('@round:2', df, columns='price') # Creates price_round
@transpose - Transpose DataFrame
Transpose DataFrame.
Example:
result = add.transform('@transpose', df)
@extract - Extract Patterns
Extract patterns from text or dates.
Example:
result = add.transform('@extract', df, columns='date', pattern='dd-MM-yyyy')
@onehotencode - One-Hot Encode
One-hot encode categorical columns.
Example:
result = add.transform('@onehotencode', df, columns='category')
@deduce - Fill Missing Values
Fill missing values using 7 methods.
Methods: auto, mean, median, mode, forward, backward, knn
Example:
# Mean imputation
result = add.transform('@deduce', df, columns='age', method='mean')
# KNN imputation
result = add.transform('@deduce', df, columns=['age', 'salary'], method='knn',
                       strategy={'k': 3})
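To show what the simpler fill methods do to a column, here is a standard-library sketch on a plain list with None for missing values. It covers mean, median, mode, forward and backward only, and it illustrates the usual meaning of those terms rather than Additory's implementation; `fill_missing` is a hypothetical helper.

```python
import statistics


def fill_missing(values, method="mean"):
    """Fill None entries in a list using one simple imputation method."""
    present = [v for v in values if v is not None]
    if method in ("mean", "median", "mode"):
        # statistics.mean / median / mode share the same call shape.
        fill = getattr(statistics, method)(present)
        return [fill if v is None else v for v in values]
    if method in ("forward", "backward"):
        # Backward fill is forward fill applied to the reversed list.
        ordered = values if method == "forward" else values[::-1]
        filled, last = [], None
        for v in ordered:
            last = v if v is not None else last
            filled.append(last)
        return filled if method == "forward" else filled[::-1]
    raise ValueError(f"Unsupported method in this sketch: {method!r}")


ages = [25, None, 35, None, 45]
for method in ("mean", "forward", "backward"):
    print(method, "->", fill_missing(ages, method))
```

A leading None stays None under forward fill (and a trailing one under backward fill), since there is no earlier value to carry over.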
Error Handling
Additory provides clear, actionable error messages:
try:
    # Using tuple instead of list
    result = add.to(orders, bring_from=customers, bring=['name'],
                    against=('customer_id', 'date'))
except TypeError as e:
    print(e)
    # Parameter 'against' must be a list, not tuple.
    # Use ['customer_id', 'date'] instead of ('customer_id', 'date')
try:
    # Column not found
    result = add.transform('@calc', df, strategy={'result': 'nonexistent + 5'})
except RuntimeError as e:
    print(e)
    # Column 'nonexistent' not found in DataFrame
    # Available columns: a, b, c
All errors include:
- Clear description of what went wrong
- Contextual information (available options, etc.)
- Actionable suggestions for fixing the problem
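The three properties above can be seen in a small validation sketch that produces a message in the same shape as the tuple error shown earlier. This is an illustration of the pattern, not Additory's validation code; `require_list` is a hypothetical helper.

```python
def require_list(name, value):
    """Accept a str or list; reject tuples with a corrective message."""
    if isinstance(value, tuple):
        raise TypeError(
            # What went wrong, then a concrete suggestion for fixing it.
            f"Parameter '{name}' must be a list, not tuple.\n"
            f"Use {list(value)!r} instead of {value!r}"
        )
    if isinstance(value, str):
        return [value]  # normalise a single name to a one-item list
    if isinstance(value, list):
        return value
    raise TypeError(
        f"Parameter '{name}' must be a str or list, not {type(value).__name__}"
    )


try:
    require_list("against", ("customer_id", "date"))
except TypeError as exc:
    message = str(exc)
    print(message)
```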
Migration from v0.1.3a5
If you're upgrading from v0.1.3a5, here are the key changes:
Parameter Renames
# OLD (v0.1.3a5)
add.to(orders, fetch_from=customers, fetch=['name'], by='customer_id')
# NEW (v0.1.3a9)
add.to(orders, bring_from=customers, bring=['name'], against='customer_id')
Lists Instead of Tuples
# OLD (v0.1.3a5)
add.to(orders, fetch_from=customers, fetch=['name'], by=('id', 'date'))
# NEW (v0.1.3a9)
add.to(orders, bring_from=customers, bring=['name'], against=['id', 'date'])
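When migrating many call sites, the two changes above (renamed keywords and lists instead of tuples) can be applied mechanically. The helper below is a hypothetical sketch for add.to() keyword arguments only; it is not shipped with Additory. Note that add.transform() still has its own `by` parameter, so this mapping should not be applied there.

```python
# Old add.to() keyword -> new keyword, per the migration notes above.
RENAMED = {"fetch_from": "bring_from", "fetch": "bring", "by": "against"}


def migrate_to_kwargs(old_kwargs):
    """Translate v0.1.3a5-style add.to() keyword arguments to the new names."""
    new_kwargs = {}
    for key, value in old_kwargs.items():
        if isinstance(value, tuple):
            value = list(value)  # lists everywhere, not tuples
        new_kwargs[RENAMED.get(key, key)] = value
    return new_kwargs


old = {"fetch_from": "customers", "fetch": ["name"], "by": ("id", "date")}
new = migrate_to_kwargs(old)
print(new)
```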
Removed Functions
# OLD (v0.1.3a5)
add.set(logging=True)
add.deduce(df, 'age', method='mean')
# NEW (v0.1.3a9)
# add.set() removed - use logging parameter per function
add.to(..., logging=True)
add.transform(..., logging=True)
# add.deduce() moved to transform mode
add.transform('@deduce', df, columns='age', method='mean')
@round Creates NEW Columns
# OLD (v0.1.3a5)
# @round modified columns in-place
# NEW (v0.1.3a9)
# @round creates NEW columns
result = add.transform('@round:2', df, columns='price')
# Creates: price_round (original price column unchanged)
Development
Running Tests
# Integration tests
cd python-specific
pytest tests/test_integration.py -v
# Rust tests
cd rust-core
cargo test --all
# Benchmarks
cd python-specific
python benchmarks/benchmark_integration.py
Building from Source
# Build Rust module
cd rust-core
cargo build --release
# Install Python package
cd python-specific
pip install -e .
Contributing
Contributions are welcome! Please see CONTRIBUTING.md for guidelines.
License
See LICENSE file for details.
Support
For issues or questions:
- Check API Documentation
- Review Usage Examples
- Contact development team
Changelog
See CHANGELOG.md for version history and changes.
Acknowledgments
Built with:
- Rust - Systems programming language
- PyO3 - Rust bindings for Python
- Polars - Fast DataFrame library
- Pandas - Data analysis library
Status: Alpha Release
Version: 0.1.3a9
Date: March 8, 2026