A few useful tools for data wrangling

dutil

A few data utilities to make the life of a data scientist easier.

Installation

pip install dutil

Modules

  • pipeline (data caching and pipelines)
  • stats (statistical functions)
  • string (string manipulations)
  • transform (data transformations)
  • jupyter (tools for Jupyter notebooks)

Pipeline

import dutil.pipeline as dpipe
import pandas as pd
import numpy as np
from loguru import logger

# --- Define data transformations via step functions (similar to dask.delayed)

@dpipe.delayed_cached()  # lazy computation + caching on disk
def load_1():
    df = pd.DataFrame({'a': [1., 2.], 'b': [0.1, np.nan]})
    logger.info('Loaded {} records'.format(len(df)))
    return df

@dpipe.delayed_cached()  # lazy computation + caching on disk
def load_2(timestamp):
    df = pd.DataFrame({'a': [0.9, 3.], 'b': [0.001, 1.]})
    logger.info('Loaded {} records'.format(len(df)))
    return df

@dpipe.delayed_cached()  # lazy computation + caching on disk
def compute(x, y, eps):
    assert x.shape == y.shape
    diff = ((x - y).abs() / (y.abs()+eps)).mean().mean()
    logger.info('Difference is computed')
    return diff

# Define pipeline dependencies
ts = pd.Timestamp(2019, 1, 1)
eps = 0.01
s1 = load_1()
s2 = load_2(ts)
diff = compute(s1, s2, eps)

# Trigger pipeline execution
print('diff: {:.3f}'.format(dpipe.delayed_compute((diff, ))[0]))
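
To make the idea of lazy steps with on-disk caching more concrete, here is a minimal standard-library sketch. It is not how dutil implements `delayed_cached` or `delayed_compute`; the names `Delayed`, `delayed_cached_sketch`, and `CACHE_DIR` are hypothetical, and using `repr` of the arguments as a cache key is a simplification that would not suit large objects such as DataFrames.

```python
import functools
import hashlib
import pickle
import tempfile
from pathlib import Path

# Assumption: a fresh directory per run keeps the example reproducible.
# A real cache would use a stable path so results survive between runs.
CACHE_DIR = Path(tempfile.mkdtemp(prefix="delayed_cache_sketch_"))


class Delayed:
    """A deferred call: remembers the function and its arguments, runs nothing yet."""

    def __init__(self, func, args):
        self.func = func
        self.args = args

    def key(self):
        # Cache key: function name plus the keys (or reprs) of all arguments,
        # so a change anywhere upstream produces a different key.
        parts = [self.func.__name__]
        for arg in self.args:
            parts.append(arg.key() if isinstance(arg, Delayed) else repr(arg))
        return hashlib.sha256("|".join(parts).encode()).hexdigest()

    def compute(self):
        path = CACHE_DIR / (self.key() + ".pkl")
        if path.exists():
            # Cache hit: skip this step and everything it depends on.
            return pickle.loads(path.read_bytes())
        # Cache miss: resolve upstream steps first, then run this one.
        resolved = [a.compute() if isinstance(a, Delayed) else a for a in self.args]
        result = self.func(*resolved)
        path.write_bytes(pickle.dumps(result))
        return result


def delayed_cached_sketch(func):
    """Turn a function into a factory of Delayed nodes (hypothetical name)."""

    @functools.wraps(func)
    def wrapper(*args):
        return Delayed(func, args)

    return wrapper


calls = []  # records which step functions actually ran


@delayed_cached_sketch
def load(n):
    calls.append("load")
    return list(range(n))


@delayed_cached_sketch
def total(xs, offset):
    calls.append("total")
    return sum(xs) + offset


node = total(load(4), 10)  # builds the dependency graph; nothing has run yet
first = node.compute()     # runs load, then total, and writes both results to disk
second = node.compute()    # read back from the cache; no step function runs
```

After the second `compute()` call, `calls` still holds only the two entries from the first run, which is the behaviour the pipeline example above relies on when it is re-executed.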

Stats

from dutil.stats import mean_lower, mean_upper
import pandas as pd
ss = pd.Series([0, 1, 5, -1])
mean_lower(ss)  # Compute mean among 50% smallest elements
mean_upper(ss)  # Compute mean among 50% biggest elements
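
As a rough illustration of what "mean among the 50% smallest (or biggest) elements" could mean, here is a plain-Python sketch. The function names are hypothetical, and the handling of odd-length input (the middle element counts in both halves) is an assumption; dutil may split the data differently.

```python
import math
from statistics import mean


def mean_lower_sketch(values):
    """Mean of the smaller half of the values (hypothetical stand-in)."""
    ordered = sorted(values)
    # Assumption: for odd-length input the middle element is included.
    half = math.ceil(len(ordered) / 2)
    return mean(ordered[:half])


def mean_upper_sketch(values):
    """Mean of the larger half of the values (hypothetical stand-in)."""
    ordered = sorted(values)
    half = math.ceil(len(ordered) / 2)
    return mean(ordered[-half:])


lower = mean_lower_sketch([0, 1, 5, -1])  # smaller half is [-1, 0]
upper = mean_upper_sketch([0, 1, 5, -1])  # larger half is [1, 5]
```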

String

from dutil.string import compare_companies
compare_companies("Aarons Holdings Company Inc.", "Aaron's, Inc.")  # Give match rating for two company names
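
One common way to rate how well two company names match is to strip punctuation and legal-form words, then compare what remains. The sketch below does this with the standard library's `difflib`; it is not the algorithm behind `compare_companies`, and `LEGAL_WORDS`, `normalize_company`, and `company_similarity` are hypothetical names chosen for illustration.

```python
import re
from difflib import SequenceMatcher

# Assumption: a small, hand-picked set of words that carry no identity.
LEGAL_WORDS = {"inc", "co", "corp", "company", "holdings", "ltd", "llc"}


def normalize_company(name):
    """Lowercase, drop punctuation, and remove legal-form words."""
    cleaned = re.sub(r"[^a-z0-9\s]", "", name.lower())
    return " ".join(w for w in cleaned.split() if w not in LEGAL_WORDS)


def company_similarity(a, b):
    """Similarity in [0, 1] between the two normalized names."""
    return SequenceMatcher(None, normalize_company(a), normalize_company(b)).ratio()


# Both names reduce to "aarons" after normalization.
same = company_similarity("Aarons Holdings Company Inc.", "Aaron's, Inc.")
different = company_similarity("Aarons Holdings Company Inc.", "Acme Corp")
```

A fixed word list like this is easy to read but brittle; a production matcher would likely need a longer list and some tolerance for abbreviations.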

Transform

from dutil.transform import ht
import pandas as pd
df = pd.DataFrame({'a': [0, 2, 2, 4, 6], 'b': [1, 1, 1, 1, 1]})
ht(df)  # Return first and last rows of a DataFrame, a Series, or an array

Jupyter

from dutil.jupyter import dht
import pandas as pd
df = pd.DataFrame({'a': [0, 2, 2, 4, 6], 'b': [1, 1, 1, 1, 1]})
dht(df)  # Display first and last rows of a DataFrame, a Series, or an array in a Jupyter notebook
