Additory

A semantic, extensible dataframe transformation engine with expressions, lookup, synthetic data, and sample-data support.

Python 3.9+ | License: MIT

Author: Krishnamoorthy Sankaran

🛠️ Requirements

  • Python: 3.9+
  • Core dependencies: pandas, polars, numpy, scipy
  • Optional: cuDF (for GPU support)

📦 Installation

pip install additory==0.1.0a1

Optional GPU support:

pip install additory[gpu]==0.1.0a1  # Includes cuDF for GPU acceleration

Development installation:

pip install additory[dev]==0.1.0a1  # Includes testing and development tools
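Because the commands above pin an alpha release, it can help to confirm which version actually ended up in the environment. The following is a minimal sketch using only the standard library; the helper name `installed_version` is hypothetical and not part of Additory.

```python
import sys
from importlib import metadata


def installed_version(package: str):
    """Return the installed version string of `package`, or None if absent."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


# The requirements above state Python 3.9 or newer.
python_ok = sys.version_info >= (3, 9)

print("Python version OK:", python_ok)
print("additory version:", installed_version("additory"))
```

If the second line prints `None`, the package is not installed in the interpreter that ran the script, which is a common symptom of installing into a different virtual environment.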

🎯 Core Functions

Function      | Purpose                     | Example
--------------|-----------------------------|------------------------------------------------------
add.to()      | Lookup/join operations      | add.to(df1, from_df=df2, bring='col', against='key')
add.augment() | Generate additional data    | add.augment(df, n_rows=1000)
add.synth()   | Synthetic data from schemas | add.synth("schema.toml", rows=5000)
add.scan()    | Data profiling & analysis   | add.scan(df, preset="full")

🧬 Available Expressions

Additory includes 12 built-in health and fitness expressions, including:

  • add.bmi() - Body Mass Index
  • add.bsa() - Body Surface Area
  • add.bmr() - Basal Metabolic Rate
  • add.waist_hip_ratio() - Waist-to-Hip Ratio
  • add.body_fat_percentage() - Body Fat Percentage
  • add.ideal_body_weight() - Ideal Body Weight
  • add.blood_pressure_category() - BP Classification
  • add.cholesterol_ratio() - Cholesterol Ratio
  • add.age_category() - Age Classification
  • add.fitness_score() - Overall Fitness Score

import pandas as pd
import additory as add

# Health calculations
patients = pd.DataFrame({
    'weight_kg': [70, 80, 65],  # Weight in kilograms
    'height_m': [1.75, 1.80, 1.60],  # Height in meters
    'age': [25, 35, 45],
    'gender': ['M', 'F', 'M']
})

patients_bmi = add.bmi(patients)
patients_bsa = add.bsa(patients)
fitness_scores = add.fitness_score(patients)

# Chain multiple expressions
result = add.fitness_score(add.bmr(add.bmi(patients)))
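For readers who want to check the numbers by hand, BMI is defined as weight in kilograms divided by the square of height in meters. The sketch below applies that standard formula to the sample patients in plain Python. It is not Additory's implementation, and the name of the column that `add.bmi()` adds is not stated here.

```python
# Standard BMI formula: weight (kg) / height (m) squared.
# Plain-Python illustration; not Additory's internal code.

def bmi(weight_kg: float, height_m: float) -> float:
    if height_m <= 0:
        raise ValueError("height_m must be positive")
    return weight_kg / height_m ** 2


patients = [
    {"weight_kg": 70, "height_m": 1.75},
    {"weight_kg": 80, "height_m": 1.80},
    {"weight_kg": 65, "height_m": 1.60},
]

bmi_values = [round(bmi(p["weight_kg"], p["height_m"]), 1) for p in patients]
print(bmi_values)  # [22.9, 24.7, 25.4]
```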

🔧 DataFrame Support

Additory works seamlessly with multiple DataFrame libraries:

  • pandas - Full support
  • polars - Full support
  • cuDF - GPU acceleration support

import polars as pl
import additory as add

# Works with polars; the dataframe type is detected automatically
df_polars = pl.DataFrame({"a": [1, 2, 3], "b": [4, 5, 6]})
result = add.augment(df_polars, n_rows=100)
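One common way to implement automatic type detection is to look at the top-level module of the object's class. The sketch below shows that idea with a hypothetical `detect_backend` helper; Additory's actual detection logic is not documented here and may differ. A stand-in class is used so the example runs without polars installed.

```python
# Hypothetical sketch of dataframe-backend detection by module name.
# Additory's real detection logic may differ.

KNOWN_BACKENDS = ("pandas", "polars", "cudf")


def detect_backend(df) -> str:
    """Return 'pandas', 'polars', 'cudf', or 'unknown' for the given object."""
    top_level = type(df).__module__.split(".")[0]
    return top_level if top_level in KNOWN_BACKENDS else "unknown"


class FakePolarsFrame:
    """Stand-in so the sketch runs without polars installed."""
    __module__ = "polars.dataframe.frame"


print(detect_backend(FakePolarsFrame()))  # polars
print(detect_backend([1, 2, 3]))          # unknown
```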

✨ Key Features

🔧 Utilities

add.to() - Data Lookup & Joins

Simplified syntax for bringing columns from one dataframe to another.

# Simple lookup
orders_with_prices = add.to(
    orders, 
    from_df=products, 
    bring='price', 
    against='product_id'
)

# Multiple columns and keys
enriched = add.to(
    orders,
    from_df=products,
    bring=['price', 'category'],
    against=['product_id', 'region']
)
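A lookup of this kind can be thought of as a left join: every row of the left table is kept, and the requested columns are copied from the matching row of the reference table. The sketch below illustrates that idea on lists of dicts so it runs without any dataframe library. The function `lookup` and its handling of unmatched keys (filling with `None`) are assumptions made for illustration, not Additory's documented behavior.

```python
def lookup(rows, from_rows, bring, against):
    """Left-join style lookup over lists of dicts (illustrative only)."""
    bring = [bring] if isinstance(bring, str) else list(bring)
    against = [against] if isinstance(against, str) else list(against)

    # Index the reference table by its key tuple; the first match wins.
    index = {}
    for ref in from_rows:
        index.setdefault(tuple(ref[k] for k in against), ref)

    result = []
    for row in rows:
        match = index.get(tuple(row[k] for k in against))
        extra = {col: (match[col] if match else None) for col in bring}
        result.append({**row, **extra})
    return result


orders = [
    {"order_id": 1, "product_id": "A"},
    {"order_id": 2, "product_id": "B"},
    {"order_id": 3, "product_id": "Z"},  # no matching product
]
products = [
    {"product_id": "A", "price": 9.5},
    {"product_id": "B", "price": 20.0},
]

orders_with_prices = lookup(orders, products, bring="price", against="product_id")
for row in orders_with_prices:
    print(row)
```

Passing lists for `bring` and `against` works the same way, with the key becoming a tuple of several column values.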

add.onehotencoding() - Categorical Encoding

Convert categorical columns to one-hot encoded format.

# One-hot encoding (single column)
encoded = add.onehotencoding(df, 'category')
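To make the transformation concrete, here is a minimal plain-Python sketch of one-hot encoding on a list of dicts. The `one_hot` helper, the `category_<value>` column naming, and the choice to drop the original column are all assumptions for illustration; the output layout of `add.onehotencoding()` is not specified here.

```python
def one_hot(rows, column):
    """Return new rows with `column` replaced by 0/1 indicator columns."""
    categories = sorted({row[column] for row in rows})
    encoded = []
    for row in rows:
        new_row = {k: v for k, v in row.items() if k != column}
        for cat in categories:
            # Hypothetical naming scheme: <column>_<value>
            new_row[f"{column}_{cat}"] = 1 if row[column] == cat else 0
        encoded.append(new_row)
    return encoded


items = [
    {"id": 1, "category": "book"},
    {"id": 2, "category": "toy"},
    {"id": 3, "category": "book"},
]

encoded_items = one_hot(items, "category")
for row in encoded_items:
    print(row)
```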

add.harmonize_units() - Unit Standardization

Standardize units across your dataset.

# Unit harmonization
standardized = add.harmonize_units(
    df, 
    value_column='temperature', 
    unit_column='unit',
    target_unit='C'
)
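The idea behind unit harmonization is that each row carries a value and a unit label, and every value is converted to one target unit. The sketch below does this for temperatures in plain Python. The helper `harmonize_to_celsius` and the set of supported unit labels (`C`, `F`, `K`) are assumptions for illustration; which units and labels Additory accepts is not stated here.

```python
# Conversions to Celsius for the unit labels handled in this sketch.
TO_CELSIUS = {
    "C": lambda v: v,
    "F": lambda v: (v - 32) * 5 / 9,
    "K": lambda v: v - 273.15,
}


def harmonize_to_celsius(rows, value_column, unit_column):
    """Return new rows with every value converted to degrees Celsius."""
    out = []
    for row in rows:
        unit = row[unit_column]
        if unit not in TO_CELSIUS:
            raise ValueError(f"unsupported unit: {unit!r}")
        converted = TO_CELSIUS[unit](row[value_column])
        out.append({**row, value_column: converted, unit_column: "C"})
    return out


readings = [
    {"temperature": 212.0, "unit": "F"},
    {"temperature": 273.15, "unit": "K"},
    {"temperature": 25.0, "unit": "C"},
]

standardized_readings = harmonize_to_celsius(readings, "temperature", "unit")
for row in standardized_readings:
    print(row)
```

Raising on an unknown unit label is a deliberate choice in this sketch: silently passing such rows through would leave mixed units in the output.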

🧮 Expressions

Pre-built calculations for health, fitness, and common metrics. Simple examples:

import pandas as pd
import additory as add

# Create patient data with correct column names
patients = pd.DataFrame({
    'weight_kg': [70, 80, 65],  # Weight in kilograms
    'height_m': [1.75, 1.80, 1.60],  # Height in meters
    'age': [25, 35, 45],
    'gender': ['M', 'F', 'M']
})

# Calculate BMI
patients_with_bmi = add.bmi(patients)

# Calculate Body Surface Area
patients_with_bsa = add.bsa(patients)

# Chain multiple expressions
result = add.fitness_score(add.bmr(add.bmi(patients)))
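Chaining works because each expression takes a table and returns a table with one more column, so the output of one call is valid input to the next. The sketch below shows that pattern on lists of dicts. The column names `bmi` and `bmr` are assumptions, and the BMR step uses the Mifflin-St Jeor equation as one common choice; which BMR equation Additory uses is not stated here.

```python
def add_bmi(rows):
    """Return rows with an added 'bmi' column (kg / m^2)."""
    return [{**r, "bmi": r["weight_kg"] / r["height_m"] ** 2} for r in rows]


def add_bmr(rows):
    """Return rows with an added 'bmr' column (Mifflin-St Jeor, kcal/day)."""
    out = []
    for r in rows:
        height_cm = r["height_m"] * 100
        offset = 5 if r["gender"] == "M" else -161
        bmr = 10 * r["weight_kg"] + 6.25 * height_cm - 5 * r["age"] + offset
        out.append({**r, "bmr": bmr})
    return out


sample = [{"weight_kg": 70, "height_m": 1.75, "age": 25, "gender": "M"}]

# Each step preserves earlier columns, so the calls nest cleanly.
chained = add_bmr(add_bmi(sample))
print(chained[0])
```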

🔄 Augment and Synthetic Data

Augment generates more data similar to your existing dataset, while Synthetic creates entirely new datasets from schema definitions.

Key Differences:

  • Augment: Learns patterns from existing data to create similar rows
  • Synthetic: Uses predefined schemas to generate structured data

# Augment existing data (learns from patterns)
more_customers = add.augment(customers, n_rows=1000)

# Create data from scratch with strategies
new_data = add.augment("@new", n_rows=500, strategy={
    'id': 'increment:start=1',
    'name': 'choice:[John,Jane,Bob]',
    'age': 'range:18-65'
})

# Generate from schema file (structured approach)
customers = add.synth("customer_schema.toml", rows=10000)
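The strategy strings in the `"@new"` example follow a `kind:argument` shape. As a rough illustration of how such strings could be turned into value generators, here is a plain-Python sketch. The parser, the helper names `make_generator` and `generate`, and the interpretation of `range:18-65` as inclusive integers are all assumptions; Additory's own parsing rules are not documented here.

```python
import itertools
import random


def make_generator(spec, rng):
    """Turn a strategy string into a zero-argument value generator."""
    kind, _, arg = spec.partition(":")
    if kind == "increment":
        # e.g. 'increment:start=1'
        start = int(arg.split("=", 1)[1])
        counter = itertools.count(start)
        return lambda: next(counter)
    if kind == "choice":
        # e.g. 'choice:[John,Jane,Bob]'
        options = [o.strip() for o in arg.strip("[]").split(",")]
        return lambda: rng.choice(options)
    if kind == "range":
        # e.g. 'range:18-65', treated here as inclusive integers
        low, high = (int(x) for x in arg.split("-"))
        return lambda: rng.randint(low, high)
    raise ValueError(f"unknown strategy: {spec!r}")


def generate(n_rows, strategy, seed=0):
    """Build n_rows dicts, one value per column per row."""
    rng = random.Random(seed)  # seeded for reproducible output
    gens = {col: make_generator(spec, rng) for col, spec in strategy.items()}
    return [{col: gen() for col, gen in gens.items()} for _ in range(n_rows)]


generated = generate(5, {
    "id": "increment:start=1",
    "name": "choice:[John,Jane,Bob]",
    "age": "range:18-65",
})
for row in generated:
    print(row)
```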

🧪 Examples

E-commerce Data Pipeline

import pandas as pd
import additory as add

# Start with small customer sample
customers = pd.DataFrame({
    'customer_id': [1, 2, 3],
    'age': [25, 35, 45],
    'region': ['North', 'South', 'East']
})

# Generate more customers
customers = add.augment(customers, n_rows=10000)

# Add customer tiers
tiers = pd.DataFrame({
    'customer_id': range(1, 4),  # Match original IDs
    'tier': ['Gold', 'Silver', 'Bronze']
})

# Use pipeline approach
result = (customers
    .pipe(add.to, from_df=tiers, bring='tier', against='customer_id')
    .pipe(add.scan, preset="quick"))

print(result.summary())

Healthcare Data Analysis

import additory as add

# Create patient data from scratch
strategy = {
    'patient_id': 'increment:start=1',
    'age': 'range:18-80',
    'weight_kg': 'range:50-120',  # Weight in kg
    'height_cm': 'range:150-200'  # Height in cm
}

patients = add.augment("@new", n_rows=1000, strategy=strategy)

# Convert height to meters for expressions
patients['height_m'] = patients['height_cm'] / 100

# Calculate health metrics using pipeline
result = (patients
    .pipe(add.bmi)
    .pipe(add.scan, preset="correlations"))

print(result.correlations)
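For intuition about what a correlation report contains, the sketch below computes a Pearson correlation coefficient between weight and BMI in plain Python. Using Pearson's coefficient is an assumption; the statistic that the `"correlations"` preset reports is not stated here, and the `pearson` helper is hypothetical.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / math.sqrt(var_x * var_y)


weights = [70, 80, 65]
heights = [1.75, 1.80, 1.60]
bmis = [w / h ** 2 for w, h in zip(weights, heights)]

weight_bmi_corr = pearson(weights, bmis)
print(round(weight_bmi_corr, 3))
```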

📄 License

MIT License - see LICENSE file for details.

🗺️ Roadmap: v0.1.1 (February 2025)

  • Enhanced documentation and tutorials
  • Performance optimizations
  • Additional expressions
  • Advanced synthetic data patterns

Made with ❤️ for data scientists, analysts, and developers who love working with data.
