
CaTabRa-pandas is a library with additional functionality for pandas


CaTabRa-pandas


About

CaTabRa-pandas is a Python library with a couple of useful functions for efficiently working with pandas DataFrames. In particular, many of these functions are concerned with DataFrames containing intervals, i.e., DataFrames with (at least) two columns "start" and "stop" defining the left and right endpoints of intervals.

Highlights:

  • Resample observations with respect to arbitrary (possibly irregular, possibly overlapping) windows: catabra_pandas.resample_eav and catabra_pandas.resample_interval.
  • Compute the intersection, union, difference, etc. of intervals: catabra_pandas.combine_intervals.
  • Group intervals by their distance to each other: catabra_pandas.group_intervals.
  • For each point in a given DataFrame, find the interval that contains it: catabra_pandas.find_containing_interval.
  • Find the previous/next observation for each entry in a DataFrame of timestamped observations: catabra_pandas.prev_next_values.

None of these operations has a native pandas implementation. CaTabRa-pandas implements all of them efficiently enough that DataFrames with 10M+ rows are no problem.
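To make one of the highlights concrete, here is a minimal pure-pandas sketch of grouping intervals by distance. It is not the catabra-pandas implementation. The helper name `naive_group_intervals`, its `distance` parameter, and the rule "a new group starts when the gap to all earlier intervals exceeds `distance`" are assumptions made for illustration only; consult the documentation of catabra_pandas.group_intervals for its actual semantics.

```python
import pandas as pd


def naive_group_intervals(df: pd.DataFrame, distance: float = 0.0) -> pd.Series:
    """Hypothetical helper: label intervals so that intervals separated by a
    gap of at most `distance` end up in the same group."""
    ordered = df.sort_values("start")
    # Furthest right endpoint among all intervals that start earlier.
    reach = ordered["stop"].cummax().shift()
    # A new group begins where the gap to everything before exceeds `distance`.
    # For the first row the gap is NaN, so the comparison is False (group 0).
    new_group = (ordered["start"] - reach) > distance
    # Running count of group starts, returned in the original row order.
    return new_group.cumsum().reindex(df.index)


intervals = pd.DataFrame({"start": [0, 7, 1, 20], "stop": [2, 8, 5, 21]})
groups = naive_group_intervals(intervals, distance=2)
print(groups.tolist())  # [0, 0, 0, 1]
```

With `distance=1` the gap between `[1, 5]` and `[7, 8]` is too large, so the same data splits into three groups.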

Dask DataFrames are partly supported, too.

If you are interested in CaTabRa-pandas, you might be interested in CaTabRa, too: CaTabRa is a full-fledged tabular data analysis framework that enables you to calculate statistics, generate appealing visualizations and train machine learning models with a single command.

Quickstart

CaTabRa-pandas has minimal requirements (Python >= 3.6, pandas >= 1.0) and can be easily installed using pip:

pip install catabra-pandas

Once installed, CaTabRa-pandas can be readily used.

Use-Case: Merge DataFrames based on Overlapping Intervals

import pandas as pd
import catabra_pandas

left = pd.DataFrame(data=dict(start=[0, 7, 1, 8], stop=[2, 8, 5, 9]))
right = pd.DataFrame(data=dict(start=[10, 4, 0], stop=[11, 5, 3]))

catabra_pandas.merge_intervals(
    left,
    right,
    how="inner",
    left_start="start",
    left_stop="stop",
    right_start="start",
    right_stop="stop"
)

Note: This is a special case of a conditional join. pandas does not support conditional joins out of the box, but they are available in pyjanitor and Polars. The catabra-pandas implementation of interval-overlap and interval-containment joins is extremely fast and memory-efficient, as can be seen in these benchmarks.
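For comparison, the semantics of an interval-overlap join can be sketched in plain pandas with a cross join followed by a filter. This is only an illustration of what such a join computes, not how catabra-pandas implements it. The helper name `naive_overlap_join` is hypothetical, and treating both endpoints as inclusive is an assumption. The cross join materializes every pair of rows, which is exactly the cost a dedicated implementation avoids.

```python
import pandas as pd

left = pd.DataFrame(data=dict(start=[0, 7, 1, 8], stop=[2, 8, 5, 9]))
right = pd.DataFrame(data=dict(start=[10, 4, 0], stop=[11, 5, 3]))


def naive_overlap_join(left: pd.DataFrame, right: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical helper: inner join on overlapping closed intervals."""
    # Pair every left row with every right row (requires pandas >= 1.2).
    pairs = left.reset_index().merge(
        right.reset_index(), how="cross", suffixes=("_left", "_right")
    )
    # Two closed intervals overlap iff each starts no later than the other stops.
    overlap = (pairs["start_left"] <= pairs["stop_right"]) & (
        pairs["start_right"] <= pairs["stop_left"]
    )
    return pairs[overlap].reset_index(drop=True)


result = naive_overlap_join(left, right)
print(list(zip(result["index_left"], result["index_right"])))
# [(0, 2), (2, 1), (2, 2)]
```

The intermediate table has `len(left) * len(right)` rows, so this approach is only practical for small inputs.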

Use-Case: Resample Observations with Respect to Observation Windows

import pandas as pd
import catabra_pandas

observations = pd.DataFrame(
    data={
        "subject_id": [0, 0, 0, 0, 1, 1],
        "attribute": ["HR", "Temp", "HR", "HR", "Temp", "HR"],
        "timestamp": [1, 1, 5, 7, 2, 3],
        "value": [82.7, 36.9, 79.5, 78.7, 37.2, 89.4]
    }
)
windows = pd.DataFrame(
    data={
        ("subject_id", ""): [0, 0, 1],
        ("timestamp", "start"): [0, 4, 1],
        ("timestamp", "stop"): [6, 8, 4]
    }
)
catabra_pandas.resample_eav(
    observations,
    windows,
    agg={
        "HR": ["mean", "p75", "r-1"],   # mean value, 75-th percentile, last observed value
        "Temp": ["count", "mode"]     # number of observations, mode
    },
    entity_col="subject_id",
    time_col="timestamp",
    attribute_col="attribute",
    value_col="value"
)
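As a rough mental model of window-based resampling, the same kind of aggregation can be written naively in plain pandas: pair each window with the observations of the same subject, keep the observations whose timestamp falls inside the window, and aggregate per window and attribute. This is a sketch under assumptions, not the catabra-pandas implementation: the helper `naive_window_mean` is hypothetical, the windows use flat "start"/"stop" columns instead of the two-level columns above, both window endpoints are treated as inclusive, and only the mean is computed.

```python
import pandas as pd

observations = pd.DataFrame(
    data={
        "subject_id": [0, 0, 0, 0, 1, 1],
        "attribute": ["HR", "Temp", "HR", "HR", "Temp", "HR"],
        "timestamp": [1, 1, 5, 7, 2, 3],
        "value": [82.7, 36.9, 79.5, 78.7, 37.2, 89.4],
    }
)
# Flat column names are an assumption made to keep the sketch short.
windows = pd.DataFrame(
    data={"subject_id": [0, 0, 1], "start": [0, 4, 1], "stop": [6, 8, 4]}
)


def naive_window_mean(observations: pd.DataFrame, windows: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical helper: mean value per window and attribute."""
    w = windows.rename_axis("window").reset_index()
    # Pair each window with all observations of the same subject.
    pairs = w.merge(observations, on="subject_id")
    # Keep observations inside the window (endpoints assumed inclusive).
    inside = pairs[
        (pairs["timestamp"] >= pairs["start"]) & (pairs["timestamp"] <= pairs["stop"])
    ]
    # One row per window, one column per attribute; NaN where nothing was observed.
    # Windows without any observation at all would be missing from the result.
    return inside.pivot_table(
        index="window", columns="attribute", values="value", aggfunc="mean"
    )


result = naive_window_mean(observations, windows)
print(result)
```

Overlapping windows are handled naturally here because an observation may be paired with several windows, as with the HR value at timestamp 5 for subject 0.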

Use-Case: Find Containing Intervals

import pandas as pd
import catabra_pandas

intervals = pd.DataFrame(
    data={
        "subject_id": [0, 0, 1],
        "start": [0.5, 3.0, -10.7],
        "stop": [2.3, 10., 10.7]
    }
)
points = pd.DataFrame(
    data={
        "subject_id": [0, 0, 0, 1, 1],
        "point": [1.0, 2.5, 9.9, 0.0, -8.8]
    }
)
catabra_pandas.find_containing_interval(
    points,
    intervals,
    ["point"],
    start_col="start",
    stop_col="stop",
    group_by="subject_id"
)
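The lookup itself can be sketched naively in plain pandas to show what "containing interval" means. This is not the catabra-pandas implementation, and its output format may differ. The helper `naive_containing_interval` is hypothetical; it assumes inclusive endpoints, returns the row index of the first containing interval, and uses -1 for points that lie in no interval of their group.

```python
import pandas as pd

intervals = pd.DataFrame(
    data={
        "subject_id": [0, 0, 1],
        "start": [0.5, 3.0, -10.7],
        "stop": [2.3, 10.0, 10.7],
    }
)
points = pd.DataFrame(
    data={
        "subject_id": [0, 0, 0, 1, 1],
        "point": [1.0, 2.5, 9.9, 0.0, -8.8],
    }
)


def naive_containing_interval(points: pd.DataFrame, intervals: pd.DataFrame) -> pd.Series:
    """Hypothetical helper: index of the interval containing each point, or -1."""
    iv = intervals.rename_axis("interval").reset_index()
    pts = points.rename_axis("point_id").reset_index()
    # Pair each point with all intervals of the same subject.
    pairs = pts.merge(iv, on="subject_id")
    hits = pairs[(pairs["start"] <= pairs["point"]) & (pairs["point"] <= pairs["stop"])]
    # Keep at most one hit per point in case intervals overlap.
    hits = hits.drop_duplicates("point_id")
    # Points without a hit get the sentinel -1.
    return hits.set_index("point_id")["interval"].reindex(points.index, fill_value=-1)


result = naive_containing_interval(points, intervals)
print(result.tolist())  # [0, -1, 1, 2, 2]
```

The point 2.5 of subject 0 falls into the gap between the intervals [0.5, 2.3] and [3.0, 10.0], hence the -1.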

References

If you use CaTabRa-pandas in your research, we would appreciate it if you cited the following conference paper:

  • A. Maletzky, S. Kaltenleithner, P. Moser and M. Giretzlehner. CaTabRa: Efficient Analysis and Predictive Modeling of Tabular Data. In: I. Maglogiannis, L. Iliadis, J. MacIntyre and M. Dominguez (eds), Artificial Intelligence Applications and Innovations (AIAI 2023). IFIP Advances in Information and Communication Technology, vol 676, pp 57-68, 2023. DOI:10.1007/978-3-031-34107-6_5

    @inproceedings{CaTabRa2023,
      author = {Maletzky, Alexander and Kaltenleithner, Sophie and Moser, Philipp and Giretzlehner, Michael},
      editor = {Maglogiannis, Ilias and Iliadis, Lazaros and MacIntyre, John and Dominguez, Manuel},
      title = {{CaTabRa}: Efficient Analysis and Predictive Modeling of Tabular Data},
      booktitle = {Artificial Intelligence Applications and Innovations},
      year = {2023},
      publisher = {Springer Nature Switzerland},
      address = {Cham},
      pages = {57--68},
      isbn = {978-3-031-34107-6},
      doi = {10.1007/978-3-031-34107-6_5}
    }
    

Contact

If you have any inquiries, please open a GitHub issue.

Acknowledgments

This project is financed by research subsidies granted by the government of Upper Austria. RISC Software GmbH is a member of the UAR (Upper Austrian Research) Innovation Network.
