routine pandas method chain links wrapped with scikit-lego
Purpose
- A collection of pandas DataFrame methods and functions I commonly use, decorated with scikit-lego's @log_step to enable logging within method chains
- I wanted to reuse the logging of common pandas methods across projects
- Inspired by Vincent D. Warmerdam's PyData Eindhoven 2019 talk, titled Untitled12.ipynb
- See the scikit-lego docs on pandas pipelines for the general use case of @log_step
- See examples/demo.ipynb for a few workflows and links to other relevant SE questions
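The idea behind decorating pipeline steps for logging can be sketched with a small hand-rolled decorator. This is not scikit-lego's @log_step and not the code in pandas_pipe. The names log_shape, drop_missing, and the logger name are hypothetical, chosen only for illustration. It shows how a wrapped function can report its name, elapsed time, and output shape each time it runs inside a .pipe chain.

```python
import functools
import logging
import time

import pandas as pd

# Hypothetical logger name, used only in this sketch.
logger = logging.getLogger("log_step_sketch")


def log_shape(func):
    """Log the step name, elapsed time, and resulting shape of a DataFrame step.

    A simplified stand-in for the idea behind scikit-lego's @log_step,
    not its actual implementation.
    """

    @functools.wraps(func)
    def wrapper(df, *args, **kwargs):
        start = time.perf_counter()
        result = func(df, *args, **kwargs)
        elapsed = time.perf_counter() - start
        logger.info(
            "[%s] time=%.6fs n_obs=%d, n_col=%d",
            func.__name__,
            elapsed,
            result.shape[0],
            result.shape[1],
        )
        return result

    return wrapper


@log_shape
def start_pipeline(df):
    # Copy so later steps never mutate the caller's DataFrame.
    return df.copy()


@log_shape
def drop_missing(df, column):
    # Hypothetical step: drop rows where `column` is missing.
    return df.dropna(subset=[column])


logging.basicConfig(level=logging.INFO)
dataf = pd.DataFrame({"myid": [1, 2, 3], "num": ["1, 2", None, "3"]})
cleaned = dataf.pipe(start_pipeline).pipe(drop_missing, column="num")
```

Because each step takes a DataFrame as its first argument and returns a DataFrame, the decorated functions slot directly into .pipe, and the log shows where rows or columns appear or disappear.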
In [1]: import pandas as pd; import numpy as np; import logging
In [2]: import pandas_pipe
In [3]: stream_handler = logging.StreamHandler()
In [4]: pandas_pipe.logger.addHandler(stream_handler)
In [5]: dataf_input = pd.DataFrame(
...: {
...: "aaa": {0: 10, 1: 11, 2: 12, 3: 13},
...: "myid": {0: 1, 1: 2, 2: 3, 3: 4},
...: "num": {0: "1, 2, 3", 1: np.nan, 2: "1, 2", 3: np.nan},
...: "text": {0: "aa, bb, cc", 1: np.nan, 2: "cc, dd", 3: "ee"},
...: "states": {
...: 0: "Wyoming; Illinois; New Hampshire",
...: 1: "Pennsylvania",
...: 2: "New York",
...: 3: "Pennsylvania",
...: },
...: }
...: )
In [6]: dataf_input
Out[6]:
aaa myid num text states
0 10 1 1, 2, 3 aa, bb, cc Wyoming; Illinois; New Hampshire
1 11 2 NaN NaN Pennsylvania
2 12 3 1, 2 cc, dd New York
3 13 4 NaN ee Pennsylvania
In [7]: (
...: dataf_input.pipe(pandas_pipe.start_pipeline)
...: .pipe(pandas_pipe.explode_setup, columns=["num", "text"], delimiter=",")
...: .pipe(pandas_pipe.explode_setup, columns=["states"], delimiter=";")
...: .pipe(pandas_pipe.explode, column="num", ignore_index=False)
...: .pipe(pandas_pipe.explode, column="text", ignore_index=False)
...: .pipe(pandas_pipe.explode, column="states", ignore_index=False)
...: .pipe(pandas_pipe.drop_duplicates)
...: )
[start_pipeline(df)] time=0:00:00.000152 n_obs=4, n_col=5 names=['aaa', 'myid', 'num', 'text', 'states']
[explode_setup(df, columns = ['num', 'text'], delimiter = ',')] time=0:00:00.001107 n_obs=4, n_col=5 delta=(0, 0)
[explode_setup(df, columns = ['states'], delimiter = ';')] time=0:00:00.000466 n_obs=4, n_col=5 delta=(0, 0)
[explode(df, column = 'num', ignore_index = False)] time=0:00:00.003697 n_obs=7, n_col=5 delta=(+3, 0)
[explode(df, column = 'text', ignore_index = False)] time=0:00:00.002642 n_obs=15, n_col=5 delta=(+8, 0)
[explode(df, column = 'states', ignore_index = False)] time=0:00:00.002184 n_obs=33, n_col=5 delta=(+18, 0)
[drop_duplicates(df)] time=0:00:00.001671 n_obs=33, n_col=5 delta=(0, 0)
Out[7]:
aaa myid num text states
0 10 1 1 aa Wyoming
0 10 1 1 aa Illinois
0 10 1 1 aa New Hampshire
0 10 1 1 bb Wyoming
0 10 1 1 bb Illinois
0 10 1 1 bb New Hampshire
0 10 1 1 cc Wyoming
0 10 1 1 cc Illinois
0 10 1 1 cc New Hampshire
0 10 1 2 aa Wyoming
0 10 1 2 aa Illinois
0 10 1 2 aa New Hampshire
0 10 1 2 bb Wyoming
0 10 1 2 bb Illinois
0 10 1 2 bb New Hampshire
0 10 1 2 cc Wyoming
0 10 1 2 cc Illinois
0 10 1 2 cc New Hampshire
0 10 1 3 aa Wyoming
0 10 1 3 aa Illinois
0 10 1 3 aa New Hampshire
0 10 1 3 bb Wyoming
0 10 1 3 bb Illinois
0 10 1 3 bb New Hampshire
0 10 1 3 cc Wyoming
0 10 1 3 cc Illinois
0 10 1 3 cc New Hampshire
1 11 2 NaN NaN Pennsylvania
2 12 3 1 cc New York
2 12 3 1 dd New York
2 12 3 2 cc New York
2 12 3 2 dd New York
3 13 4 NaN ee Pennsylvania
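The row counts in the log above (4, then 7, 15, and 33) can be reproduced with plain pandas. The following is a minimal sketch under the assumption that explode_setup splits each delimited string into a list and that explode then calls DataFrame.explode; the package's actual implementation is not shown here, and split_column is a hypothetical helper.

```python
import numpy as np
import pandas as pd

dataf_input = pd.DataFrame(
    {
        "aaa": [10, 11, 12, 13],
        "myid": [1, 2, 3, 4],
        "num": ["1, 2, 3", np.nan, "1, 2", np.nan],
        "text": ["aa, bb, cc", np.nan, "cc, dd", "ee"],
        "states": [
            "Wyoming; Illinois; New Hampshire",
            "Pennsylvania",
            "New York",
            "Pennsylvania",
        ],
    }
)


def split_column(df, column, delimiter):
    """Hypothetical helper: turn delimited strings into lists (NaN stays NaN)."""
    out = df.copy()
    out[column] = out[column].str.split(delimiter)
    return out


prepared = (
    dataf_input.pipe(split_column, "num", ",")
    .pipe(split_column, "text", ",")
    .pipe(split_column, "states", ";")
)

# Each explode multiplies a row by the length of its list; NaN rows are kept as-is.
after_num = prepared.explode("num")
after_text = after_num.explode("text")
after_states = after_text.explode("states")

# Splitting on "," or ";" leaves leading spaces, so strip them after exploding.
result = after_states.assign(
    num=lambda d: d["num"].str.strip(),
    text=lambda d: d["text"].str.strip(),
    states=lambda d: d["states"].str.strip(),
)

print(len(after_num), len(after_text), len(after_states))
```

Exploding three columns in sequence yields the per-row Cartesian product of their values, which is why the row with three values in each column expands to 27 rows.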