Elegant data operations for DataFrames - add.to(), add.transform(), add.synthetic()
additory
Elegant data operations for DataFrames
A Rust-powered Python library for intuitive data transformations, lookups, and synthetic data generation with Polars and Pandas.
Features
- 🔗 Intuitive Lookups - Add columns from external sources with simple syntax
- ⚡ Powerful Transforms - Calculate, filter, sort, aggregate with mode-based operations
- 🎲 Synthetic Data - Generate realistic test data or augment existing datasets
- 📊 Lineage Tracking - Track data transformations and view operation history
- 🔍 Data Scanning - Analyze data quality and inspect DataFrames
- 🚀 Rust Performance - Built with Rust for blazing-fast operations
- 🐼 Polars & Pandas - Works seamlessly with both DataFrame libraries
- 📚 Expression Library - 179 built-in expressions for medical, finance, physics, and more
Installation
pip install additory
Requirements:
- Python 3.8+
- Polars (required)
- Pandas (optional)
Quick Start
import additory as add
import polars as pl
# Add data from external sources
orders = pl.DataFrame({'id': [1, 2], 'customer_id': [101, 102]})
customers = pl.DataFrame({'customer_id': [101, 102], 'name': ['Alice', 'Bob']})
result = add.to(orders, bring_from=customers, bring=['name'], against='customer_id')
# Transform data
df = pl.DataFrame({'x': [1, 2, 3]})
result = add.transform('@calc', df, strategy={'x_squared': 'x ** 2'})
# Generate synthetic data
result = add.synthetic('@new', n=100, strategy={'age': 'normal(40, 10)'})
Core Functions
add.to() - Add Data from External Sources
result = add.to(bring_to, bring_from=reference_df, bring=['column'], against='key',
lineage=False)
Perfect for lookups and joins. Enable lineage=True to track data sources.
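To make the lookup semantics concrete, here is a conceptual sketch in plain Python of what a key-based lookup does. It is not additory's implementation: the helper name `lookup_rows` is hypothetical, and the choice to fill unmatched keys with `None` (left-join behaviour) is an assumption made for illustration.

```python
from typing import Any, Dict, List

Row = Dict[str, Any]


def lookup_rows(bring_to: List[Row], bring_from: List[Row],
                bring: List[str], against: str) -> List[Row]:
    """Hypothetical helper: copy the `bring` columns onto each row of `bring_to`."""
    # Index the reference rows by key; the first occurrence of a key wins.
    index: Dict[Any, Row] = {}
    for ref in bring_from:
        index.setdefault(ref[against], ref)

    result = []
    for row in bring_to:
        match = index.get(row[against])
        new_row = dict(row)  # never mutate the caller's rows
        for col in bring:
            # Assumption: unmatched keys produce None, as in a left join.
            new_row[col] = match[col] if match is not None else None
        result.append(new_row)
    return result


orders = [{'id': 1, 'customer_id': 101}, {'id': 2, 'customer_id': 103}]
customers = [{'customer_id': 101, 'name': 'Alice'},
             {'customer_id': 102, 'name': 'Bob'}]
looked_up = lookup_rows(orders, customers, bring=['name'], against='customer_id')
```

The second order has no matching customer, so its `name` stays empty in this sketch rather than dropping the row.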
add.transform() - Transform Data
result = add.transform(mode, df, lineage=False, **parameters)
Available modes:
- @calc - Calculate new columns with expressions
- @filter - Filter rows and select columns
- @sort - Sort data by columns
- @aggregate - Group and aggregate data
- @harmonize - Harmonize units (10 sub-modes)
- @round - Round numbers (creates NEW columns)
- @transpose - Transpose DataFrame
- @extract - Extract patterns from text/dates
- @onehotencode - One-hot encode categorical columns
- @deduce - Fill missing values (7 methods)
add.synthetic() - Synthetic Data
result = add.synthetic(mode, df_or_n, lineage=False, **parameters)
Available modes:
- @new - Create synthetic DataFrames from scratch
- @augment - Add synthetic rows to existing data
add.scan() - Inspect and Analyze DataFrames
result = add.scan(mode, df)
Available modes:
- @analyze / @analyse - Analyze data quality and distributions
- @lineage - View lineage tracking reports (requires lineage=True in operations)
Strategy Parameter
The strategy parameter provides fine-grained control over operations in add.to(), add.transform(), and add.synthetic().
add.to() Strategy
Control aggregation, renaming, and positioning for brought columns:
Simple form (aggregation only):
strategy={'amount': 'sum', 'date': 'last'}
Complex form (full control):
strategy={
'amount': {
'mode': 'sum',
'rename': 'total_spent',
'position': 'after:customer_id'
}
}
Aggregation modes: first, last, sum, count, average, min, max, concat, concat[sep], most_common, least_common, median, std, variance, unique_count
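The simple and complex strategy forms can be thought of as two spellings of the same per-column instruction. The sketch below, in plain Python, shows one way such a strategy could collapse several matching rows into one value per key before the columns are brought over. The helper `aggregate_for_lookup`, the `AGGREGATORS` table, and the sample rows are hypothetical; only a subset of the listed modes is covered, and `position` is ignored.

```python
from collections import defaultdict
from typing import Any, Callable, Dict, List

# Hypothetical subset of the aggregation modes listed above.
AGGREGATORS: Dict[str, Callable[[List[Any]], Any]] = {
    'first': lambda values: values[0],
    'last': lambda values: values[-1],
    'sum': sum,
    'count': len,
    'min': min,
    'max': max,
}


def aggregate_for_lookup(rows: List[Dict[str, Any]], against: str,
                         strategy: Dict[str, Any]) -> Dict[Any, Dict[str, Any]]:
    """Hypothetical helper: collapse `rows` to one record per key."""
    grouped = defaultdict(list)
    for row in rows:
        grouped[row[against]].append(row)

    result = {}
    for key, group in grouped.items():
        record = {}
        for column, spec in strategy.items():
            # Normalise the simple form ('sum') to the complex form ({'mode': 'sum'}).
            if isinstance(spec, str):
                spec = {'mode': spec}
            values = [r[column] for r in group]
            out_name = spec.get('rename', column)
            record[out_name] = AGGREGATORS[spec['mode']](values)
        result[key] = record
    return result


orders = [
    {'customer_id': 101, 'amount': 100, 'date': '2026-01-05'},
    {'customer_id': 101, 'amount': 150, 'date': '2026-02-11'},
    {'customer_id': 102, 'amount': 200, 'date': '2026-01-20'},
]
summary = aggregate_for_lookup(
    orders, against='customer_id',
    strategy={'amount': {'mode': 'sum', 'rename': 'total_spent'}, 'date': 'last'},
)
```

Normalising the simple form to the complex one first keeps the aggregation loop free of special cases.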
add.transform() Strategy
Mode-specific configuration:
@calc - Expressions for new columns:
strategy={'total': 'price * quantity', 'discount': 'total * 0.1'}
@sort - Sort order:
strategy={'order': 'desc'} # or 'asc'
@aggregate - Aggregation functions:
strategy={'amount': 'sum', 'count': 'count'}
@round - Custom naming and positioning:
strategy={
'price': {'name': 'price_clean', 'position': 'after:price'}
}
@deduce - KNN parameters:
strategy={'k': 5, 'weights': 'distance'}
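The @calc example above has a second expression (`discount`) that refers to a column created by the first (`total`), which suggests expressions are evaluated in order. The following plain-Python sketch illustrates that idea with a small arithmetic evaluator built on the standard `ast` module. It is an assumption-laden illustration, not additory's expression engine: the functions `evaluate` and `calc` are hypothetical, and only `+`, `-`, `*`, `/`, and `**` are supported.

```python
import ast
import operator
from typing import Any, Dict, List

# Arithmetic operators this sketch understands.
_BINARY_OPS = {
    ast.Add: operator.add,
    ast.Sub: operator.sub,
    ast.Mult: operator.mul,
    ast.Div: operator.truediv,
    ast.Pow: operator.pow,
}


def evaluate(expression: str, row: Dict[str, Any]) -> Any:
    """Hypothetical helper: evaluate an arithmetic expression against one row."""
    def visit(node: ast.AST) -> Any:
        if isinstance(node, ast.Expression):
            return visit(node.body)
        if isinstance(node, ast.Constant) and isinstance(node.value, (int, float)):
            return node.value
        if isinstance(node, ast.Name):
            return row[node.id]  # column lookup by name
        if isinstance(node, ast.BinOp) and type(node.op) in _BINARY_OPS:
            return _BINARY_OPS[type(node.op)](visit(node.left), visit(node.right))
        raise ValueError(f'Unsupported expression element: {ast.dump(node)}')

    return visit(ast.parse(expression, mode='eval'))


def calc(rows: List[Dict[str, Any]], strategy: Dict[str, str]) -> List[Dict[str, Any]]:
    """Hypothetical helper: add one new column per strategy entry, in order."""
    result = []
    for row in rows:
        new_row = dict(row)
        for name, expression in strategy.items():
            # Later expressions can see columns created by earlier ones.
            new_row[name] = evaluate(expression, new_row)
        result.append(new_row)
    return result


rows = [{'price': 100, 'quantity': 2}, {'price': 200, 'quantity': 3}]
calculated = calc(rows, {'total': 'price * quantity', 'discount': 'total / 10'})
```

Walking the syntax tree instead of calling `eval` means anything outside the supported operators is rejected with a `ValueError`.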
add.synthetic() Strategy
Column generation specifications:
Simple form:
strategy={'id': 'increment', 'age': 'normal(40, 10)'}
Complex form:
strategy={
'name': {'type': 'choice', 'values': ['Alice', 'Bob', 'Charlie']},
'age': {'type': 'normal', 'mean': 35, 'std': 10}
}
Generation types: increment, pattern, choice, normal, uniform, lognormal, exponential, poisson, categorical
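As a rough illustration of how simple-form specifications such as 'normal(40, 10)' might be turned into columns, here is a sketch that uses only the standard library. The functions `generate_column` and `generate_table` are hypothetical, only three of the generation types are handled, and starting `increment` at 1 is an assumption. The default seed of 42 mirrors the changelog entry about reproducibility.

```python
import random
import re
from typing import Any, Dict, List

# Matches 'name' or 'name(arg1, arg2, ...)'.
SPEC_PATTERN = re.compile(r'^(\w+)(?:\((.*)\))?$')


def generate_column(spec: str, n: int, rng: random.Random) -> List[Any]:
    """Hypothetical helper: generate n values for one simple-form spec."""
    match = SPEC_PATTERN.match(spec.strip())
    if match is None:
        raise ValueError(f'Unrecognised spec: {spec!r}')
    kind, raw_args = match.group(1), match.group(2)
    args = [float(a) for a in raw_args.split(',')] if raw_args else []

    if kind == 'increment':
        return list(range(1, n + 1))  # assumption: counting starts at 1
    if kind == 'normal':
        mean, std = args
        return [rng.gauss(mean, std) for _ in range(n)]
    if kind == 'uniform':
        low, high = args
        return [rng.uniform(low, high) for _ in range(n)]
    raise ValueError(f'Generation type not covered by this sketch: {kind!r}')


def generate_table(n: int, strategy: Dict[str, str], seed: int = 42) -> Dict[str, List[Any]]:
    """Hypothetical helper: build a column-oriented table from a strategy."""
    rng = random.Random(seed)  # a private generator keeps runs reproducible
    return {name: generate_column(spec, n, rng) for name, spec in strategy.items()}


table = generate_table(5, {'id': 'increment', 'age': 'normal(40, 10)',
                           'score': 'uniform(0, 100)'})
```

Calling `generate_table` twice with the same seed yields identical columns, which is the property a fixed default seed is meant to provide.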
Lineage Tracking
Track data transformations across operations to understand data provenance and transformation history.
Enable Lineage Tracking
import additory as add
import pandas as pd
# Enable lineage in any operation
result = add.to(customers, bring_from=orders, bring=['amount'],
against='customer_id', lineage=True)
# Lineage is preserved across operations
result = add.transform('@calc', result, expression='amount * 1.1',
name='total', lineage=True)
# View lineage report
lineage_report = add.scan('@lineage', result)
print(lineage_report)
Lineage Features
- Operation History - Track all transformations applied to data
- Column Sources - See where each column came from
- Row Mappings - Track how rows were filtered or aggregated
- Session-Only - Lineage is stored in-memory (not persisted to disk)
- Mutual Exclusion - Cannot use lineage=True with the as_type parameter
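The features above can be pictured as an in-memory log that travels with the data. The sketch below is a conceptual model in plain Python, not additory's internal representation: the `TrackedTable` class, its `apply` method, and the fields recorded in each history entry are all hypothetical.

```python
from dataclasses import dataclass, field
from typing import Any, Callable, Dict, List

Rows = List[Dict[str, Any]]


@dataclass
class TrackedTable:
    """Hypothetical container: rows plus an in-memory operation history."""
    rows: Rows
    history: List[Dict[str, Any]] = field(default_factory=list)

    def apply(self, operation: str, func: Callable[[Rows], Rows],
              **details: Any) -> 'TrackedTable':
        new_rows = func(self.rows)
        entry = {
            'operation': operation,
            'rows_in': len(self.rows),
            'rows_out': len(new_rows),
            **details,
        }
        # Return a new object so earlier states keep their own history.
        return TrackedTable(new_rows, self.history + [entry])


table = TrackedTable([
    {'id': 1, 'amount': 250},
    {'id': 2, 'amount': 200},
    {'id': 3, 'amount': 300},
])
table = table.apply(
    '@calc',
    lambda rows: [{**r, 'total': r['amount'] * 2} for r in rows],
    new_columns=['total'],
)
table = table.apply(
    '@filter',
    lambda rows: [r for r in rows if r['total'] > 450],
    where='total > 450',
)
```

Because the history lives only on the Python object, writing the rows to disk would drop it, which matches the session-only behaviour described above.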
Lineage Example
# Multi-step workflow with lineage
customers = pd.DataFrame({'id': [1, 2, 3], 'name': ['Alice', 'Bob', 'Carol']})
orders = pd.DataFrame({'id': [1, 1, 2, 3, 3], 'amount': [100, 150, 200, 175, 125]})
# Step 1: Bring data
df = add.to(customers, bring_from=orders, bring=['amount'], against='id',
strategy={'amount': 'sum'}, lineage=True)
# Step 2: Calculate
df = add.transform('@calc', df, expression='amount * 1.1', name='total', lineage=True)
# Step 3: Filter
df = add.transform('@filter', df, where='total > 200', lineage=True)
# View complete lineage
report = add.scan('@lineage', df)
# Shows: 3 operations, column sources, row transformations
Important Notes
- Lineage is session-only by design (follows "no file I/O" philosophy)
- Lineage metadata is lost when DataFrames are saved with native methods
- Cannot use lineage=True with the as_type parameter (metadata would be lost during conversion)
- Lineage overhead is minimal (<3ms per operation)
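The mutual-exclusion rule amounts to a simple argument check performed before any work is done. A minimal sketch of such a check follows. The function `validate_options`, the error wording, and the example value 'pandas' for as_type are assumptions for illustration; the library's actual message and accepted values may differ.

```python
from typing import Optional


def validate_options(lineage: bool = False, as_type: Optional[str] = None) -> None:
    """Hypothetical check: reject lineage tracking combined with type conversion."""
    if lineage and as_type is not None:
        raise ValueError(
            'lineage=True cannot be combined with as_type: '
            'lineage metadata would be lost during conversion'
        )


# Either option on its own passes the check.
validate_options(lineage=True)
validate_options(as_type='pandas')  # 'pandas' is an assumed example value

# Both together are rejected.
try:
    validate_options(lineage=True, as_type='pandas')
    error_message = None
except ValueError as exc:
    error_message = str(exc)
```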
Documentation
📚 Complete documentation is available in the /docs directory:
- API Reference - Complete function signatures and API documentation
  - Quick Reference - Fast lookup guide
  - Reference Manual - Comprehensive API docs
  - Function Signatures - All signatures with lineage support
- User Guides - Step-by-step tutorials and concepts
  - Migration Guide - Upgrading from older versions
  - Lineage User Story - Understanding lineage tracking
  - Deduce Explained - Missing value imputation guide
- Examples - 20+ Quarto notebooks with runnable examples
  - add.to() examples (5 notebooks)
  - add.transform() examples (5 notebooks)
  - add.synthetic() examples (4 notebooks)
  - add.scan() examples (3 notebooks)
  - Lineage tracking examples (2 notebooks)
- Troubleshooting Guide
See docs/README.md for the complete documentation index.
Examples
Lookup Example
import additory as add
import polars as pl
# Orders with customer IDs
orders = pl.DataFrame({
'order_id': [1, 2, 3],
'customer_id': [101, 102, 101],
'amount': [100, 200, 150]
})
# Customer reference data
customers = pl.DataFrame({
'customer_id': [101, 102],
'name': ['Alice', 'Bob'],
'city': ['NYC', 'LA']
})
# Add customer info to orders
result = add.to(orders, bring_from=customers, bring=['name', 'city'], against='customer_id')
Transform Example
# Calculate with expressions
df = pl.DataFrame({'price': [100, 200, 300], 'quantity': [2, 3, 1]})
result = add.transform('@calc', df, strategy={'total': 'price * quantity'})
# Filter data
result = add.transform('@filter', df, where='price > 150')
# Sort data
result = add.transform('@sort', df, by='price', strategy={'order': 'desc'})
# Aggregate data
df = pl.DataFrame({'category': ['A', 'B', 'A'], 'value': [10, 20, 30]})
result = add.transform('@aggregate', df, by='category', strategy={'value': 'sum'})
# Round numbers (creates NEW columns)
df = pl.DataFrame({'price': [10.567, 20.123, 30.999]})
result = add.transform('@round:2', df, columns='price') # Creates price_round
# Fill missing values
df = pl.DataFrame({'age': [25, None, 35, None, 45]})
result = add.transform('@deduce', df, columns='age', method='mean')
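For readers unfamiliar with mean imputation, the effect of method='mean' on the `age` column above can be sketched in plain Python. The helper `fill_with_mean` is hypothetical, and whether additory writes the result into the same column or a new one is not covered here.

```python
from statistics import mean
from typing import List, Optional


def fill_with_mean(values: List[Optional[float]]) -> List[float]:
    """Hypothetical helper: replace None with the mean of the observed values."""
    observed = [v for v in values if v is not None]
    if not observed:
        raise ValueError('Cannot impute a column with no observed values')
    fill_value = mean(observed)
    return [fill_value if v is None else v for v in values]


ages = [25, None, 35, None, 45]
filled = fill_with_mean(ages)
```

The mean is computed from the three observed ages only, so both gaps receive the same value.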
Synthetic Data Example
# Create synthetic data
result = add.synthetic('@new', n=1000, strategy={
'age': 'normal(40, 10)',
'salary': 'normal(75000, 15000)',
'score': 'uniform(0, 100)'
})
# Augment existing data
df = pl.DataFrame({'x': [1, 2, 3], 'y': [4, 5, 6]})
result = add.synthetic('@augment', df, n=100)
# Analyze data quality
result = add.scan('@analyze', df)
Version
Current version: 0.1.3 (Stable Alpha)
What's New in v0.1.3
- ✅ Lineage Tracking - Track data transformations with the lineage=True parameter
- ✅ add.scan() Function - Unified interface for @analyze and @lineage modes
- ✅ ~95% Rust Implementation - Optimized code distribution for performance
- ✅ Mutual Exclusion Validation - Clear error messages for lineage + as_type
- ✅ Helper Functions - Internal utilities for lineage tracking
- ✅ Bug Fixes - Fixed add.to() parameter mapping bug
- ✅ Code Cleanup - Removed orphan files and dead code
- ✅ 341/341 Tests Passing - 100% pass rate
Development
Building from Source
# Clone the repository
git clone https://github.com/YOUR_USERNAME/additory.git
cd additory
# Install Rust (if not already installed)
curl --proto '=https' --tlsv1.2 -sSf https://sh.rustup.rs | sh
# Build the package
cd rust-core
export PYO3_USE_ABI3_FORWARD_COMPATIBILITY=1
maturin build --release
# Install locally
pip install target/wheels/*.whl
Running Tests
# Run comprehensive test suite
python test_all_modes_comprehensive.py
# Run specific tests
pytest tests/
Contributing
Contributions are welcome! Please feel free to submit a Pull Request.
License
MIT License - see LICENSE file for details
Changelog
v0.1.3 (March 9, 2026)
- Lineage Tracking - Track data transformations across operations
- add.scan() Function - Unified scanning interface (@analyze, @lineage)
- ~95% Rust Implementation - Optimized Python/Rust code distribution
- Bug Fixes - Fixed add.to() parameter mapping, cleaned up code
- 341/341 Tests Passing - 100% pass rate
v0.1.3a9 (March 4, 2026)
- Updated API signatures for natural language (bring_to, bring_from, bring)
- Lists everywhere instead of tuples
- @round creates NEW columns (philosophy compliant)
- @deduce mode for missing value imputation
- @extract merged with datetime parsing
- Removed add.set() and add.deduce() functions
- Default seed=42 for reproducibility
- 100% philosophy compliance
v0.1.3a3 (February 9, 2026)
- Made pandas optional
- Added cross-platform build scripts
- Fixed pandas import issues
- 100% test pass rate
v0.1.3a2 (February 9, 2026)
- Added banker's rounding (@bankers_round mode)
- Expanded expression library to 179 expressions
- Fixed mode detection issues
- Fixed power operator (**) support
v0.1.3a1 (February 2026)
- Initial alpha release
- Rust core with PyO3 bindings
- Three-function API (to, transform, synthetic)
Support
For issues, questions, or contributions, please visit:
- GitHub Issues: [Coming Soon]
- Documentation: [Coming Soon]
Credits
Built with:
Made with ❤️ for the data science community
File details
Details for the file additory-0.1.3a10-cp313-cp313-manylinux_2_39_x86_64.whl.
File metadata
- Download URL: additory-0.1.3a10-cp313-cp313-manylinux_2_39_x86_64.whl
- Upload date:
- Size: 11.9 MB
- Tags: CPython 3.13, manylinux: glibc 2.39+ x86-64
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.13.7
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | b3e75f89641c992863ffff94dd00cef8ea98a485192d8e73b4fc20355c2ca5ca |
| MD5 | d1c9764674b6b75f122698ed688649f6 |
| BLAKE2b-256 | e3ea60e8990c106e33b086dba2858972f8ddfd9c10a21d2b2f29f8b02a5bb2d7 |