A few useful tools for data wrangling
dutil
A few data utilities to make the life of a data scientist easier
Installation
pip install dutil
Modules
- pipeline (data caching and pipelines)
- stats (statistical functions)
- string (string manipulations)
- transform (data transformations)
- jupyter (tools for jupyter notebooks)
Pipeline
import dutil.pipeline as dpipe
import pandas as pd
import numpy as np
from loguru import logger

# --- Define data transformations via step functions (similar to dask.delayed)

@dpipe.delayed_cached()  # lazy computation + caching on disk
def load_1():
    df = pd.DataFrame({'a': [1., 2.], 'b': [0.1, np.nan]})
    logger.info('Loaded {} records'.format(len(df)))
    return df

@dpipe.delayed_cached()  # lazy computation + caching on disk
def load_2(timestamp):
    df = pd.DataFrame({'a': [0.9, 3.], 'b': [0.001, 1.]})
    logger.info('Loaded {} records'.format(len(df)))
    return df

@dpipe.delayed_cached()  # lazy computation + caching on disk
def compute(x, y, eps):
    assert x.shape == y.shape
    diff = ((x - y).abs() / (y.abs() + eps)).mean().mean()
    logger.info('Difference is computed')
    return diff

# Define pipeline dependencies
ts = pd.Timestamp(2019, 1, 1)
eps = 0.01
s1 = load_1()
s2 = load_2(ts)
diff = compute(s1, s2, eps)

# Trigger pipeline execution
print('diff: {:.3f}'.format(dpipe.delayed_compute((diff, ))[0]))
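As a sanity check, the value produced by the pipeline can be reproduced eagerly with plain pandas. This is only a sketch of what the `compute` step evaluates for the two example frames. It does not use dutil, so it involves no lazy evaluation or caching, and the intermediate name `rel` is illustrative.

```python
import numpy as np
import pandas as pd

# Same example frames as load_1() and load_2() above
x = pd.DataFrame({'a': [1., 2.], 'b': [0.1, np.nan]})
y = pd.DataFrame({'a': [0.9, 3.], 'b': [0.001, 1.]})
eps = 0.01

# Relative absolute difference per cell; eps keeps the denominator away from zero
rel = (x - y).abs() / (y.abs() + eps)

# Column means skip NaN by default, then the two column means are averaged
diff = rel.mean().mean()
print('diff: {:.3f}'.format(diff))  # diff: 4.611
```

The NaN in column `b` is skipped by `mean()`, so that column contributes a single ratio of about 9.0, which dominates the result.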
Stats
from dutil.stats import mean_lower, mean_upper
import pandas as pd
ss = pd.Series([0, 1, 5, -1])
mean_lower(ss) # Compute mean among 50% smallest elements
mean_upper(ss) # Compute mean among 50% biggest elements
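To make the "50% smallest" and "50% biggest" wording concrete, here is a plain-pandas sketch of one possible reading. The helpers `lower_half_mean` and `upper_half_mean` are hypothetical and are not part of dutil. Taking `len(s) // 2` elements is an assumption, and dutil may treat odd lengths, ties, or NaN values differently.

```python
import pandas as pd

def lower_half_mean(s: pd.Series) -> float:
    """Mean of the smallest len(s) // 2 values (hypothetical helper)."""
    return s.nsmallest(len(s) // 2).mean()

def upper_half_mean(s: pd.Series) -> float:
    """Mean of the largest len(s) // 2 values (hypothetical helper)."""
    return s.nlargest(len(s) // 2).mean()

ss = pd.Series([0, 1, 5, -1])
low = lower_half_mean(ss)   # mean of [-1, 0]
high = upper_half_mean(ss)  # mean of [5, 1]
print(low, high)  # -0.5 3.0
```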
String
from dutil.string import compare_companies
compare_companies("Aarons Holdings Company Inc.", "Aaron's, Inc.") # Give match rating for two company names
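The scoring method behind `compare_companies` is not described here. For intuition only, the sketch below shows one common way to rate company-name similarity: normalize both names, then compare them with the standard library's `difflib.SequenceMatcher`. This is a swapped-in technique and not dutil's algorithm. The `LEGAL_WORDS` set and the function names are assumptions made for the example.

```python
import re
from difflib import SequenceMatcher

# Hypothetical list of legal-form words to ignore; not taken from dutil
LEGAL_WORDS = {"inc", "co", "company", "corp", "corporation", "ltd", "llc", "holdings"}

def normalize(name: str) -> str:
    """Lowercase, drop punctuation, and remove common legal-form words."""
    cleaned = re.sub(r"[^a-z0-9\s]", "", name.lower())
    return " ".join(w for w in cleaned.split() if w not in LEGAL_WORDS)

def name_similarity(a: str, b: str) -> float:
    """Similarity in [0, 1] between the normalized names."""
    return SequenceMatcher(None, normalize(a), normalize(b)).ratio()

# Both names normalize to "aarons", so they are rated as identical
same = name_similarity("Aarons Holdings Company Inc.", "Aaron's, Inc.")
# An unrelated name scores lower
other = name_similarity("Aarons Holdings Company Inc.", "Apple Inc.")
print(same, other < same)  # 1.0 True
```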
Transform
from dutil.transform import ht
import pandas as pd
df = pd.DataFrame({'a': [0, 2, 2, 4, 6], 'b': [1, 1, 1, 1, 1]})
ht(df) # Return first and last rows of a DataFrame, a Series, or an array
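For comparison, a head-and-tail preview can be written in plain pandas as below. The helper `head_tail` and its default `n=2` are hypothetical. The number of rows that `ht` returns and its handling of Series or arrays are not stated here, so this sketch covers DataFrames only.

```python
import pandas as pd

def head_tail(df: pd.DataFrame, n: int = 2) -> pd.DataFrame:
    """Return the first n and last n rows (hypothetical helper; n is assumed)."""
    if len(df) <= 2 * n:
        return df  # short frames are returned whole to avoid duplicated rows
    return pd.concat([df.head(n), df.tail(n)])

df = pd.DataFrame({'a': [0, 2, 2, 4, 6], 'b': [1, 1, 1, 1, 1]})
preview = head_tail(df)
print(list(preview.index))  # [0, 1, 3, 4]
```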
Jupyter
from dutil.jupyter import dht
import pandas as pd
df = pd.DataFrame({'a': [0, 2, 2, 4, 6], 'b': [1, 1, 1, 1, 1]})
dht(df) # Display first and last rows of a DataFrame, a Series, or an array in a Jupyter notebook