Scripts for processing, profiling, and publishing data
Lucid, in the Sky, with Data.
A collection of scripts for manipulating and visualizing data from databases, dataframes, and cloud providers.
Installation
Inside your repo: git clone https://github.com/liquidcarbon/lucid.git
Usage
Add lucid/* to .gitignore. If needed, amend sys.path:
import sys
sys.path.append('/path/to/lucid/')
import lucid
Viz
Bokeh is my favorite plotting library.
- lucid.viz.TrueFalsePlot for highlighting boolean relationships (see example: is COVID incidence slowing or accelerating?)
- lucid.viz.CDF for cumulative distribution function plots, with Kolmogorov–Smirnov / Kruskal–Wallis stats
import numpy as np
import pandas as pd
from bokeh.io import show
mu, sigma = 48, 20
data1 = pd.Series(np.random.normal(mu, sigma, 1000))
data2 = pd.Series(np.random.normal(52, 25, 2000))
cdf = lucid.viz.CDF('CDF distributions with optional KS metrics')
cdf.add_series(data1, 'rand normal 1', 'green')
cdf.add_series(data2, 'rand normal 2', 'red')
cdf.ks()
cdf.polish(xlabel='random') # default range: 0 to 100
show(cdf.p)
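For readers curious what a CDF plot with a KS metric boils down to, here is a minimal standard-library sketch of an empirical CDF and the two-sample Kolmogorov–Smirnov statistic. It is not how lucid.viz.CDF is implemented; the names ecdf and ks_statistic are made up for illustration, and no p-value is computed.

```python
from bisect import bisect_right
import random


def ecdf(sample):
    """Return a function F(x) = fraction of sample values <= x."""
    ordered = sorted(sample)
    n = len(ordered)
    return lambda x: bisect_right(ordered, x) / n


def ks_statistic(sample_a, sample_b):
    """Two-sample KS statistic: largest vertical gap between the two ECDFs."""
    f_a, f_b = ecdf(sample_a), ecdf(sample_b)
    # Both ECDFs are step functions that only change at observed values,
    # so checking the pooled observations is enough to find the largest gap.
    return max(abs(f_a(x) - f_b(x)) for x in set(sample_a) | set(sample_b))


random.seed(0)
data1 = [random.gauss(48, 20) for _ in range(1000)]
data2 = [random.gauss(52, 25) for _ in range(2000)]
d = ks_statistic(data1, data2)
print(f"KS statistic: {d:.3f}")
```

The statistic is 0 for identical samples and 1 for samples that do not overlap at all.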
Databases
- wrappers for pd.read_sql:
  - tell you df.shape and SQL errors without 99 lines of traceback
  - adds query itself as a dataframe attribute: df.q, so you never forget which query produced which dataframe
- wrappers for common SQL queries:
  - lucid.db.cd for COUNT(DISTINCT ...)
  - lucid.db.cgb for COUNT(*) ... GROUP BY
  - lucid.db.rcn for RacCooN counts (rows, cardinality, nulls)
- table walk: data profiling tool that walks through every column of a table and returns cardinality, count of NULL values, and top N values as a dataframe
- schema walk: table walk across all tables in a schema
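To make the rows / cardinality / nulls idea concrete, here is a rough sketch against an in-memory SQLite database using only the standard library. It is an illustration under assumptions, not the code behind lucid.db.rcn or the table walk: the functions rcn and table_walk below, the visits table, and the dict output are all hypothetical, and the top N values are left out.

```python
import sqlite3


def rcn(conn, table, column):
    """Rows, cardinality (distinct non-NULL values), and NULL count for one column."""
    # Identifiers cannot be bound as query parameters, so they are quoted
    # inline here; only pass trusted table and column names.
    query = (
        f'SELECT COUNT(*), COUNT(DISTINCT "{column}"), '
        f'SUM(CASE WHEN "{column}" IS NULL THEN 1 ELSE 0 END) '
        f'FROM "{table}"'
    )
    rows, cardinality, nulls = conn.execute(query).fetchone()
    # SUM over an empty table yields NULL, hence the fallback to 0.
    return {"column": column, "rows": rows, "cardinality": cardinality, "nulls": nulls or 0}


def table_walk(conn, table):
    """Profile every column of a table with rcn()."""
    # PRAGMA table_info is SQLite-specific; index 1 of each row is the column name.
    columns = [row[1] for row in conn.execute(f'PRAGMA table_info("{table}")')]
    return [rcn(conn, table, col) for col in columns]


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE visits (state TEXT, cases INTEGER)")
conn.executemany(
    "INSERT INTO visits VALUES (?, ?)",
    [("NY", 10), ("NY", None), ("CA", 7), (None, 7)],
)
profile = table_walk(conn, "visits")
for row in profile:
    print(row)
```

A schema walk would repeat the same loop over every table name in the schema.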
Dataframes
A bunch of functions I found myself writing more than once, including:
- lucid.df.ntop: like table walk, but for a dataframe (rows, cardinality, nulls)
- lucid.df.drop_empty_columns: drop columns that are 100% NULL
- lucid.df.gresample: combine GROUP BY and resample for time series data
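As a rough picture of what two of these helpers do, here is a plain pandas sketch. It assumes pandas is installed and does not reproduce lucid's signatures: drop_empty_columns and group_resample_sum below are stand-in functions written for this example, and the sample data is made up.

```python
import pandas as pd


def drop_empty_columns(df):
    """Return a copy without columns whose values are all missing."""
    return df.dropna(axis="columns", how="all")


def group_resample_sum(df, key, time_col, freq):
    """Sum numeric columns per group and per time bucket."""
    # pd.Grouper buckets the datetime column, so the GROUP BY and the
    # resample happen in a single groupby call.
    return df.groupby([key, pd.Grouper(key=time_col, freq=freq)]).sum(numeric_only=True)


df = pd.DataFrame({
    "state": ["NY", "NY", "NY", "CA"],
    "date": pd.to_datetime(["2021-03-01", "2021-03-02", "2021-03-09", "2021-03-01"]),
    "cases": [1, 2, 4, 8],
    "notes": [None] * 4,  # entirely missing, so it gets dropped
})
clean = drop_empty_columns(df)
weekly = group_resample_sum(clean, "state", "date", "W")
print(weekly)
```

The result is indexed by (state, week), with one row per state per week that has data.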
IO
Writing interactive jQuery web tables from pandas. Writing multi-tab Excel files from pandas. Working with streams.
Cloud Providers
Some AWS and GCP wrappers.
Logging
I practice a flavor of log-driven development. Almost every function in lucid talks to you when it succeeds:
210306@02:16:09.180 DEBUG [lucid] lucid package (re)loaded
210306@02:16:09.913 INFO [lucid] [read_data]: read 3340 x 420 columns
210306@02:16:09.975 INFO [lucid] [agg_by_state]: aggregated to 56 x 54 columns
210306@02:16:09.982 INFO [lucid] [derivative] calculated derivative 1
210306@02:16:09.984 INFO [lucid] [derivative] calculated derivative 2
210306@02:16:10.487 INFO [lucid.io] [_make_j2html_basic] published page covid_weekly.html
...and when it fails:
210224@11:30:32.188 INFO [read_xpt_batch] loading dataset RXQ_RX_H ...
210224@11:30:40.732 ERROR [read_xpt_batch] Something is wrong: 'utf-8' codec can't decode byte 0xf6 in position 18: invalid start byte
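One way to get this kind of chatter is a small decorator that logs on success and on failure. The sketch below shows that pattern only; it is not lucid's implementation. The decorator name talkative, the logger name lucid.sketch, and the example function are all hypothetical, and re-raising the exception after logging is a choice made for this sketch.

```python
import functools
import logging

logger = logging.getLogger("lucid.sketch")  # hypothetical logger name


def talkative(func):
    """Log a message when func succeeds and an error when it raises."""
    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        try:
            result = func(*args, **kwargs)
        except Exception as exc:
            logger.error("[%s] Something is wrong: %s", func.__name__, exc)
            raise  # log the failure, then let the caller decide what to do
        logger.info("[%s] returned %s", func.__name__, type(result).__name__)
        return result
    return wrapper


@talkative
def parse_int(text):
    return int(text)


logging.basicConfig(level=logging.INFO)
parse_int("42")        # emits an INFO record
try:
    parse_int("oops")  # emits an ERROR record, then re-raises
except ValueError:
    pass
```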
Boilerplate to enable logging to a file (f) or to a notebook (h); pick one or both:
import logging
import sys
# I like this log format
formatter = logging.Formatter(
    fmt='%(asctime)s.%(msecs)03d %(levelname)s [%(name)s] %(message)s',
    datefmt='%y%m%d@%H:%M:%S',
)
lulogger = logging.getLogger('lucid')
lulogger.setLevel(logging.DEBUG)  # change to INFO or higher for fewer messages
f = logging.FileHandler('lucid.log')
f.setFormatter(formatter)
h = logging.StreamHandler(stream=sys.stdout)
h.setFormatter(formatter)
if not lulogger.hasHandlers():
    lulogger.addHandler(f)  # log to file
    lulogger.addHandler(h)  # log to STDOUT or Jupyter
import lucid
You should get: 210306@02:16:09.180 DEBUG [lucid] lucid package (re)loaded
Hashes for lucid_data-0.1.4-py3-none-any.whl
Algorithm | Hash digest
---|---
SHA256 | 48a299cfed5219bff6925bade09d4b56e7e7c4df7d54f1edf429d1ae9abcab7f
MD5 | 87352b4c57a2a3a10844aa238bf4af7c
BLAKE2b-256 | dbeced229488f999d6fc091422e5b0d0c491afb9d377f8eff39067bf19b2bb8c