Dataruns
A Python library for function pipeline execution and convenient data transformations. Build simple pipelines to run a sequence of operations on your data. It is built on top of pandas and NumPy.
Features
✨ Core Capabilities:
- Pipeline Execution: Chain multiple data transformations seamlessly
- Pandas-Like API: Familiar interface if you know pandas
- Multiple Data Sources: Load from CSV, Excel, SQLite, and URLs
- Built-in Transforms: Standard scalers, missing value handlers, column selection
- NumPy & Pandas Support: Works with both arrays and DataFrames
- Stateful Operations: Transforms remember their state (mean, std) for consistent results
Installation
```shell
pip install dataruns
```
Or with uv:
```shell
uv add dataruns
```
Quick Start
Basic Pipeline
```python
from dataruns import Pipeline, standard_scaler, fill_na
import pandas as pd

# Create sample data
df = pd.DataFrame({
    'age': [20, 30, 40],
    'salary': [30000, 50000, 70000]
})

# Create a pipeline
pipeline = Pipeline(
    fill_na(strategy='mean'),  # Fill missing values
    standard_scaler()          # Standardize the data
)

# Execute the pipeline
result = pipeline(df)
print(result)
```
Load Data from Files
```python
from dataruns import CSVSource, XLSsource, SQLiteSource

# From CSV
csv_source = CSVSource('data.csv')
df = csv_source.extract_data()

# From Excel
excel_source = XLSsource('data.xlsx', sheet_name='Sheet1')
df = excel_source.extract_data()

# From SQLite
sqlite_source = SQLiteSource('database.db', 'SELECT * FROM my_table')
df = sqlite_source.extract_data()

# From URL
csv_source = CSVSource(url='https://example.com/data.csv')
df = csv_source.extract_data()
```
Quick Convenience Functions
```python
from dataruns import load_csv

# Load CSV quickly
data = load_csv('data.csv')
```
Core Concepts
Pipelines
Pipeline: Execute transforms sequentially
```python
from dataruns import Pipeline

pipeline = Pipeline(transform1, transform2, transform3, verbose=True)
result = pipeline(data)
```
Make_Pipeline: Builder pattern for dynamic construction
```python
from dataruns import Make_Pipeline

builder = Make_Pipeline()
builder.add(fill_na(strategy='mean'))
builder.add(standard_scaler())
pipeline = builder.build()
```
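Under the hood, a pipeline is essentially left-to-right function composition: each transform receives the previous transform's output. A minimal pandas-free sketch of the idea (an illustration only, not Dataruns' actual implementation):

```python
# Minimal sketch of sequential pipeline execution (illustration only,
# not Dataruns' actual implementation).
def run_pipeline(data, *transforms):
    # Apply each transform in order, feeding the output of one
    # into the input of the next.
    for transform in transforms:
        data = transform(data)
    return data

double = lambda xs: [x * 2 for x in xs]
add_one = lambda xs: [x + 1 for x in xs]

result = run_pipeline([1, 2, 3], double, add_one)
# result == [3, 5, 7]
```

The same composition applies whether the transforms are tiny lambdas like these or stateful objects like `standard_scaler()`.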
Available Transforms
```python
from dataruns.core.transforms import get_transforms

# Lists all available transforms that have been implemented
print(get_transforms())
```
Complete Example
```python
from dataruns import Pipeline, load_csv
from dataruns.core.transforms import select_columns, fill_na, standard_scaler

# Load data
data = load_csv('customers.csv')

# Create a comprehensive pipeline
pipeline = Pipeline(
    fill_na(strategy='mean'),           # Handle missing values
    select_columns(['age', 'income']),  # Keep relevant columns
    standard_scaler(),                  # Normalize for ML
    verbose=True                        # Show each step
)

# Process data
result = pipeline(data)

# Use with machine learning models
from sklearn.cluster import KMeans
kmeans = KMeans(n_clusters=3)
kmeans.fit(result)
```
Data Sources
Supported data sources include CSVSource, XLSsource, and SQLiteSource, with more planned.
```python
from dataruns import CSVSource, XLSsource, SQLiteSource

# CSV
source = CSVSource(file_path='data.csv')
# or from URL
source = CSVSource(url='https://example.com/data.csv')

# Excel
source = XLSsource(file_path='data.xlsx', sheet_name='Sheet1')

# SQLite
source = SQLiteSource(
    connection_string='database.db',
    query='SELECT * FROM users WHERE age > 18'
)

# Extract data
df = source.extract_data()
```
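Every source exposes the same `extract_data()` contract: construct the source with its location, then call one method to get the data back. A dependency-free sketch of that shape (the `MiniCSVSource` class here is hypothetical, for illustration, not the library's implementation):

```python
import csv
import io

class MiniCSVSource:
    """Hypothetical illustration of the source / extract_data() pattern."""
    def __init__(self, text):
        self.text = text

    def extract_data(self):
        # Parse CSV text into a list of dict rows keyed by header.
        return list(csv.DictReader(io.StringIO(self.text)))

source = MiniCSVSource("age,salary\n20,30000\n30,50000")
rows = source.extract_data()
# rows[0] == {'age': '20', 'salary': '30000'}
```

Keeping one extraction method across all sources is what lets downstream pipelines stay agnostic about where the data came from.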
Important Notes
Stateful Transforms
Transforms remember their state from the first call:
```python
scaler = standard_scaler()

# First call: learns mean/std from data1
result1 = scaler(data1)

# Second call: reuses data1's statistics
result2 = scaler(data2)  # Normalized using data1's mean/std!
```
This matches scikit-learn's fit/transform pattern. Create new transform instances for independent scaling:
```python
scaler1 = standard_scaler()  # For data1
result1 = scaler1(data1)

scaler2 = standard_scaler()  # For data2 (fresh state)
result2 = scaler2(data2)
```
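The fit-on-first-call pattern is easy to see in a pure-Python sketch (illustrative only; `StatefulScaler` is a hypothetical name, not part of Dataruns):

```python
import statistics

class StatefulScaler:
    """Remembers mean/std from the first call, like a fit/transform scaler."""
    def __init__(self):
        self.mean = None
        self.std = None

    def __call__(self, values):
        if self.mean is None:
            # First call: learn the statistics from this data.
            self.mean = statistics.mean(values)
            self.std = statistics.pstdev(values)
        # Every later call reuses the stored statistics.
        return [(v - self.mean) / self.std for v in values]

scaler = StatefulScaler()
first = scaler([1.0, 3.0])   # learns mean=2.0, std=1.0 -> [-1.0, 1.0]
second = scaler([2.0, 4.0])  # reuses mean=2.0, std=1.0 -> [0.0, 2.0]
```

This is exactly why reusing one scaler across unrelated datasets gives surprising results: the second dataset is judged by the first dataset's statistics.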
Working with Different Data Types
Dataruns is built on pandas `DataFrame` and NumPy `ndarray`, and accepts either (plus plain lists):
```python
import numpy as np
import pandas as pd
from dataruns import Pipeline, standard_scaler

pipeline = Pipeline(standard_scaler())

# Works with arrays
array = np.array([[1, 2], [3, 4]])
pipeline(array)

# Works with DataFrames
df = pd.DataFrame({'a': [1, 3], 'b': [2, 4]})
pipeline(df)

# Works with lists (converted to an array)
lst = [[1, 2], [3, 4]]
pipeline(lst)
```
Development
Install development dependencies:
```shell
uv add --dev pytest pytest-cov ruff black
```
Run tests:
```shell
uv run pytest
```
Run with coverage:
```shell
uv run pytest --cov=src/dataruns
```
Lint code:
```shell
uv run ruff check src/
```
Format code:
```shell
uv run black src/
```
License
MIT License - see LICENSE file for details
Author
Daniel Ali
Contributing
Contributions are welcome! Please feel free to submit a Pull Request.
Issues
Note that a small number of tests (about 8) currently fail; these cover very niche cases. Found a bug? Please report it on our issue tracker.
Changelog
See CHANGELOG.md for version history and updates.