
A small library with a pandas-like API for function pipeline execution and data transformations.


Dataruns

A Python library for function pipeline execution and convenient data transformations. Build simple pipelines that run a sequence of operations on your data. Built on top of pandas and NumPy.


Features

Core Capabilities:

  • Pipeline Execution: Chain multiple data transformations seamlessly
  • Pandas-Like API: Familiar interface if you know pandas
  • Multiple Data Sources: Load from CSV, Excel, SQLite, and URLs
  • Built-in Transforms: Standard scalers, missing value handlers, column selection
  • NumPy & Pandas Support: Works with both arrays and DataFrames
  • Stateful Operations: Transforms remember their state (mean, std) for consistent results

Installation

pip install dataruns

Or with uv:

uv add dataruns

Quick Start

Basic Pipeline

from dataruns import Pipeline, standard_scaler, fill_na
import pandas as pd

# Create sample data
df = pd.DataFrame({
    'age': [20, 30, 40],
    'salary': [30000, 50000, 70000]
})

# Create a pipeline
pipeline = Pipeline(
    fill_na(strategy='mean'),      # Fill missing values
    standard_scaler()               # Standardize the data
)

# Execute the pipeline
result = pipeline(df)
print(result)

Load Data from Files

from dataruns import CSVSource, XLSsource, SQLiteSource

# From CSV
csv_source = CSVSource('data.csv')
df = csv_source.extract_data()

# From Excel
excel_source = XLSsource('data.xlsx', sheet_name='Sheet1')
df = excel_source.extract_data()

# From SQLite
sqlite_source = SQLiteSource('database.db', 'SELECT * FROM my_table')
df = sqlite_source.extract_data()

# From URL
csv_source = CSVSource(url='https://example.com/data.csv')
df = csv_source.extract_data()

Quick Convenience Functions

from dataruns import load_csv

# Load CSV quickly
data = load_csv('data.csv')

Core Concepts

Pipelines

Pipeline: Execute transforms sequentially

from dataruns import Pipeline

pipeline = Pipeline(transform1, transform2, transform3, verbose=True)
result = pipeline(data)

Make_Pipeline: Builder pattern for dynamic construction

from dataruns import Make_Pipeline

builder = Make_Pipeline()
builder.add(fill_na(strategy='mean'))
builder.add(standard_scaler())
pipeline = builder.build()
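Conceptually, a sequential pipeline is just left-to-right function composition over a list of callables. The sketch below illustrates the builder pattern in plain Python; `TinyPipeline` and `TinyBuilder` are hypothetical names for illustration, not dataruns' actual implementation:

```python
class TinyPipeline:
    """Minimal sequential pipeline: feed each step's output into the next."""
    def __init__(self, *steps):
        self.steps = list(steps)

    def __call__(self, data):
        for step in self.steps:
            data = step(data)
        return data


class TinyBuilder:
    """Builder that accumulates steps, then produces a pipeline."""
    def __init__(self):
        self._steps = []

    def add(self, step):
        self._steps.append(step)
        return self  # allow chaining

    def build(self):
        return TinyPipeline(*self._steps)


# Usage: double every value, then add one
pipe = (TinyBuilder()
        .add(lambda xs: [x * 2 for x in xs])
        .add(lambda xs: [x + 1 for x in xs])
        .build())
print(pipe([1, 2, 3]))  # → [3, 5, 7]
```

The builder form is handy when the step list depends on runtime conditions (e.g. only add a scaler for numeric data).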

Available Transforms

from dataruns.core.transforms import get_transforms

# This lists out all available transforms that have been implemented
print(get_transforms())

Complete Example

from dataruns import Pipeline, load_csv
from dataruns.core.transforms import select_columns, fill_na, standard_scaler
import numpy as np

# Load data
data = load_csv('customers.csv')

# Create comprehensive pipeline
pipeline = Pipeline(
    fill_na(strategy='mean'),           # Handle missing values
    select_columns(['age', 'income']),  # Keep relevant columns
    standard_scaler(),                  # Normalize for ML
    verbose=True                        # Show each step
)

# Process data
result = pipeline(data)

# Use with machine learning models
from sklearn.cluster import KMeans
kmeans = KMeans(n_clusters=3)
kmeans.fit(result)

Data Sources

Supported data sources include CSVSource, XLSsource, and SQLiteSource, with more planned.

from dataruns import CSVSource, XLSsource, SQLiteSource

# CSV
source = CSVSource(file_path='data.csv')
# or from URL
source = CSVSource(url='https://example.com/data.csv')

# Excel
source = XLSsource(file_path='data.xlsx', sheet_name='Sheet1')

# SQLite
source = SQLiteSource(
    connection_string='database.db',
    query='SELECT * FROM users WHERE age > 18'
)

# Extract data
df = source.extract_data()
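The sources above share a common pattern: construct with connection info, then call `extract_data()`. A self-contained sketch of that pattern using only the standard library (`MiniSQLiteSource` is a hypothetical illustration; dataruns' own source returns a pandas DataFrame, while this sketch returns raw rows):

```python
import sqlite3


class MiniSQLiteSource:
    """Hypothetical minimal source: hold connection info, extract on demand."""
    def __init__(self, connection_string, query):
        self.connection_string = connection_string
        self.query = query

    def extract_data(self):
        # Open the connection, run the query, and return all rows
        with sqlite3.connect(self.connection_string) as conn:
            return conn.execute(self.query).fetchall()


# Usage with an in-memory database
source = MiniSQLiteSource(':memory:', 'SELECT 1 + 1')
print(source.extract_data())  # → [(2,)]
```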

Important Notes

Stateful Transforms

Transforms remember their state from the first call:

scaler = standard_scaler()

# First call: learns mean/std from data1
result1 = scaler(data1)

# Second call: reuses data1's statistics
result2 = scaler(data2)  # Normalized using data1's mean/std!

This matches scikit-learn's fit/transform pattern. Create new transform instances for independent scaling:

scaler1 = standard_scaler()  # For data1
result1 = scaler1(data1)

scaler2 = standard_scaler()  # For data2 (fresh state)
result2 = scaler2(data2)
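The fit-on-first-call behaviour can be illustrated with a tiny stateful scaler in plain Python. This is a sketch of the pattern, not dataruns' `standard_scaler`; it works on flat lists and uses the population standard deviation:

```python
from statistics import mean, pstdev


class StatefulScaler:
    """Learns mean/std on the first call, reuses them on every later call."""
    def __init__(self):
        self._mean = None
        self._std = None

    def __call__(self, values):
        if self._mean is None:  # first call: fit
            self._mean = mean(values)
            self._std = pstdev(values)
        # every call: transform with the stored statistics
        return [(v - self._mean) / self._std for v in values]


scaler = StatefulScaler()
print(scaler([1.0, 2.0, 3.0]))  # fitted here: mean=2, std≈0.816
print(scaler([2.0, 2.0, 2.0]))  # reuses mean=2, std≈0.816 → all zeros
```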

Working with Different Data Types

Dataruns operates on pandas DataFrames and NumPy ndarrays:

import numpy as np
import pandas as pd
from dataruns import Pipeline, standard_scaler

pipeline = Pipeline(standard_scaler())

# Works with arrays
array = np.array([[1, 2], [3, 4]])
pipeline(array)

# Works with DataFrames
df = pd.DataFrame({'a': [1, 3], 'b': [2, 4]})
pipeline(df)

# Works with lists (converted to an array)
lst = [[1, 2], [3, 4]]
pipeline(lst)
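List inputs are presumably coerced to arrays before transforms run. A hedged sketch of such a coercion helper (`as_ndarray` is a hypothetical name, assuming NumPy is available, which dataruns already requires):

```python
import numpy as np


def as_ndarray(data):
    """Hypothetical coercion helper: pass ndarrays through, convert lists."""
    if isinstance(data, np.ndarray):
        return data
    return np.asarray(data)


arr = as_ndarray([[1, 2], [3, 4]])
print(arr.shape)  # → (2, 2)
```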

Development

Install development dependencies:

uv add --dev pytest pytest-cov ruff black

Run tests:

uv run pytest

Run with coverage:

uv run pytest --cov=src/dataruns

Lint code:

uv run ruff check src/

Format code:

uv run black src/

License

MIT License - see LICENSE file for details

Author

Daniel Ali

Contributing

Contributions are welcome! Please feel free to submit a Pull Request.

Issues

Note that a small number of tests (about 8) do not currently pass, but they cover very niche cases. Found a bug? Please report it on our issue tracker.

Changelog

See CHANGELOG.md for version history and updates.
