
peperoncino: A library for easy data processing for pandas

Project description


Install

$ pip install peperoncino

How to use

Processing DataFrame

import peperoncino as pp

pipeline = pp.Pipeline(
    # query data
    pp.Query("bar <= 3"),
    # assign new feature
    pp.Assign(hoge="foo * bar"),
    # generate combination feature
    pp.Combinations(["foo", "baz"], ["*", "/"]),
    # target encoding
    pp.TargetEncoding(["baz"], "y", ref=0),
    # select features
    pp.Select(
        ["hoge", "*_foo_baz", "TARGET_ENC_baz_BY_y", "y"],
        lackable_cols=["y"],
    )
)

# execute the processing
train_df, val_df, test_df = \
    pipeline.process([train_df, val_df, test_df])
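For reference, the first two steps above (`Query` and `Assign`) correspond to the following plain-pandas operations. This is a sketch of the underlying technique on made-up data, not peperoncino's actual implementation:

```python
import pandas as pd

df = pd.DataFrame({"foo": [1, 2, 3, 4], "bar": [1, 2, 3, 4]})

# pp.Query("bar <= 3") keeps only the rows matching the condition
df = df.query("bar <= 3")

# pp.Assign(hoge="foo * bar") evaluates the formula as a new column
df = df.assign(hoge=df.eval("foo * bar"))

print(df["hoge"].tolist())  # [1, 4, 9]
```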

Predefined processings

name            description
ApplyColumn     Apply a function to a column.
AsCategory      Assign category dtype to columns.
Assign          Assign a feature by a formula.
Combinations    Create combination features.
DropColumns     Drop columns.
DropDuplicates  Drop duplicate rows.
Pipeline        Chain processings.
Query           Query rows by a given condition.
RenameColumns   Rename columns.
Select          Select columns.
StatsEncoding   Encode columns by statistical values of another column.
TargetEncoding  Target encoding with smoothing.
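As an illustration of the last entry, target encoding with smoothing can be expressed in plain pandas roughly as follows. The smoothing factor `m` and the exact blending formula are assumptions for this sketch; peperoncino's formula may differ:

```python
import pandas as pd

df = pd.DataFrame({
    "baz": ["a", "a", "b", "b", "b"],
    "y":   [1.0, 0.0, 1.0, 1.0, 0.0],
})

global_mean = df["y"].mean()  # 0.6
stats = df.groupby("baz")["y"].agg(["mean", "count"])

# Smoothed category mean: blend each category's mean with the
# global mean, weighted by the category count and a factor m.
m = 2.0
smoothed = (stats["count"] * stats["mean"] + m * global_mean) / (stats["count"] + m)

df["TARGET_ENC_baz_BY_y"] = df["baz"].map(smoothed)
```

Smoothing pulls rare categories toward the global mean, which reduces target leakage from categories with few rows.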

Define processing

All processings are subclasses of pp.BaseProcessing.
All you need to do is define a _process(self, dfs: List[pd.DataFrame]) -> List[pd.DataFrame] method.

from typing import List

import pandas as pd
import peperoncino as pp

class ExampleProcessing(pp.BaseProcessing):
    def _process(self, dfs: List[pd.DataFrame]) -> List[pd.DataFrame]:
        # Add 1 to every value in every data frame
        return [df + 1 for df in dfs]

If your processing doesn't depend on the other data frames, use pp.SeparatedProcessing instead.

class ExampleProcessing(pp.SeparatedProcessing):
    def sep_process(self, df: pd.DataFrame) -> pd.DataFrame:
        # Applied to each data frame independently
        return df * 2

If you need to merge all the data frames before applying your processing, use pp.MergedProcessing.

class ExampleProcessing(pp.MergedProcessing):
    def simul_process(self, df: pd.DataFrame) -> pd.DataFrame:
        # df is the concatenation of all data frames,
        # so col1_mean is computed over all of them at once
        return df.assign(col1_mean=df['col1'].mean())
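The merge-then-split behaviour that pp.MergedProcessing automates can be sketched in plain pandas. This is a hypothetical illustration of the idea, not the library's actual implementation:

```python
import pandas as pd

train = pd.DataFrame({"col1": [1.0, 2.0]})
test = pd.DataFrame({"col1": [3.0]})

# Concatenate all frames so the statistic is computed over the full data
merged = pd.concat([train, test], keys=range(2))

# Apply the processing once on the merged frame
merged = merged.assign(col1_mean=merged["col1"].mean())

# Split back into the original frames using the concat keys
train, test = (merged.xs(i) for i in range(2))
```

Here `col1_mean` is 2.0 everywhere, the mean over all three rows, rather than a per-frame mean.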

Project details


Download files

Download the file for your platform.

Source Distribution

peperoncino-0.0.5.tar.gz (10.1 kB view details)

Uploaded Source

Built Distribution


peperoncino-0.0.5-py3-none-any.whl (15.4 kB view details)

Uploaded Python 3

File details

Details for the file peperoncino-0.0.5.tar.gz.

File metadata

  • Download URL: peperoncino-0.0.5.tar.gz
  • Upload date:
  • Size: 10.1 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: poetry/1.0.2 CPython/3.7.5 Linux/5.0.0-1027-azure

File hashes

Hashes for peperoncino-0.0.5.tar.gz
Algorithm Hash digest
SHA256 f7cc3fb2a4e18278544dedd5590c798f89b01825d76baf1c8cf3e407bcedb1fc
MD5 42009a63501a98150de9ebd2fdffca69
BLAKE2b-256 a933ed4e3b05e0df6fa5ffccdde07b9a00113eedf3ca2492fac0f3d28ee326b1


File details

Details for the file peperoncino-0.0.5-py3-none-any.whl.

File metadata

  • Download URL: peperoncino-0.0.5-py3-none-any.whl
  • Upload date:
  • Size: 15.4 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: poetry/1.0.2 CPython/3.7.5 Linux/5.0.0-1027-azure

File hashes

Hashes for peperoncino-0.0.5-py3-none-any.whl
Algorithm Hash digest
SHA256 c019a0422a371b38de126d1de072f805a547fe676e82b38f7605e8172aa92fc4
MD5 3e4b6c587f33b561d68e35713fdc4ca6
BLAKE2b-256 170db8c96148f06d30dee8861225ea6204d112cbea45fbd880a4312e709033af

