CaTabRa-pandas is a library with additional functionality for pandas
CaTabRa-pandas
About • Quickstart • References • Contact • Acknowledgments
About
CaTabRa-pandas is a Python library with a couple of useful functions for efficiently working with pandas DataFrames. In particular, many of these functions are concerned with DataFrames containing intervals, i.e., DataFrames with (at least) two columns "start" and "stop" defining the left and right endpoints of intervals.
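As a minimal illustration of what such an interval DataFrame looks like, the sketch below builds one with plain pandas and checks that every left endpoint is at most the right endpoint. The values and the extra "subject_id" column are made up for illustration, and the validity check is an assumption about what a well-formed interval is, not a documented requirement of the library.

```python
import pandas as pd

# Hypothetical interval DataFrame: one row per interval,
# "start" and "stop" hold the left and right endpoints.
intervals = pd.DataFrame(
    {
        "subject_id": [0, 0, 1],
        "start": [0.0, 3.0, 7.5],
        "stop": [2.0, 10.0, 7.5],
    }
)

# Sanity check (an assumption of this sketch): no interval ends before it starts.
is_valid = bool((intervals["start"] <= intervals["stop"]).all())

# Length of each interval; an interval with start == stop has length 0.
lengths = intervals["stop"] - intervals["start"]
```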
Highlights:
- Resample observations with respect to arbitrary (possibly irregular, possibly overlapping) windows: catabra_pandas.resample_eav and catabra_pandas.resample_interval.
- Compute the intersection, union, difference, etc. of intervals: catabra_pandas.combine_intervals.
- Group intervals by their distance to each other: catabra_pandas.group_intervals.
- For each point in a given DataFrame, find the interval that contains it: catabra_pandas.find_containing_interval.
- Find the previous/next observation for each entry in a DataFrame of timestamped observations: catabra_pandas.prev_next_values.
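To give a feeling for the kind of operation listed above, here is a plain-pandas sketch of one of them, the union of a set of intervals. It is only an illustration of the concept and does not use or reproduce catabra_pandas.combine_intervals; the helper name union_of_intervals is hypothetical, and treating intervals as closed (so that touching intervals are merged) is an assumption of this sketch.

```python
import pandas as pd


def union_of_intervals(df: pd.DataFrame) -> pd.DataFrame:
    """Merge overlapping or touching intervals given by columns "start" and "stop"."""
    df = df.sort_values("start", kind="stable").reset_index(drop=True)
    # Largest right endpoint among all intervals that come earlier in sorted order.
    prev_max_stop = df["stop"].cummax().shift()
    # A new connected component begins where an interval starts strictly
    # after everything seen so far has ended.
    component = (df["start"] > prev_max_stop).cumsum().rename("component")
    merged = df.groupby(component).agg(start=("start", "min"), stop=("stop", "max"))
    return merged.reset_index(drop=True)


example = pd.DataFrame({"start": [1.0, 2.0, 7.0, 4.0], "stop": [3.0, 5.0, 8.0, 4.5]})
result = union_of_intervals(example)
```

In this example, the intervals [1, 3], [2, 5] and [4, 4.5] collapse into [1, 5], while [7, 8] stays separate.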
None of these operations has a native pandas implementation. CaTabRa-pandas implements them efficiently enough that DataFrames with 10M+ rows are no problem.
Dask DataFrames are partly supported, too.
If you are interested in CaTabRa-pandas, you might be interested in CaTabRa, too: CaTabRa is a full-fledged tabular data analysis framework that enables you to calculate statistics, generate appealing visualizations and train machine learning models with a single command.
Quickstart
CaTabRa-pandas has minimal requirements and can be installed in every environment with Python >= 3.6 and pandas >= 1.0.
Once installed, CaTabRa-pandas can be readily used:
import pandas as pd
import catabra_pandas
# use-case: resample observations wrt. given windows
observations = pd.DataFrame(
    data={
        "subject_id": [0, 0, 0, 0, 1, 1],
        "attribute": ["HR", "Temp", "HR", "HR", "Temp", "HR"],
        "timestamp": [1, 1, 5, 7, 2, 3],
        "value": [82.7, 36.9, 79.5, 78.7, 37.2, 89.4]
    }
)
windows = pd.DataFrame(
    data={
        ("subject_id", ""): [0, 0, 1],
        ("timestamp", "start"): [0, 4, 1],
        ("timestamp", "stop"): [6, 8, 4]
    }
)
catabra_pandas.resample_eav(
    observations,
    windows,
    agg={
        "HR": ["mean", "p75", "r-1"],   # mean value, 75-th percentile, last observed value
        "Temp": ["count", "mode"]       # number of observations, mode
    },
    entity_col="subject_id",
    time_col="timestamp",
    attribute_col="attribute",
    value_col="value"
)
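To make the meaning of window-based resampling concrete, the following naive pure-pandas sketch computes two of the aggregations from the example above (mean "HR" and count of "Temp" per window). It is not how catabra_pandas.resample_eav works internally and does not reproduce its output format. The flat window columns "start" and "stop", the helper column "window", and the choice to treat both window endpoints as inclusive are assumptions of this sketch; none of the observations in the example lies on a window boundary, so that choice does not affect the numbers here.

```python
import pandas as pd

observations = pd.DataFrame(
    {
        "subject_id": [0, 0, 0, 0, 1, 1],
        "attribute": ["HR", "Temp", "HR", "HR", "Temp", "HR"],
        "timestamp": [1, 1, 5, 7, 2, 3],
        "value": [82.7, 36.9, 79.5, 78.7, 37.2, 89.4],
    }
)
# Same windows as above, but with flat column names (an assumption of this sketch).
windows = pd.DataFrame({"subject_id": [0, 0, 1], "start": [0, 4, 1], "stop": [6, 8, 4]})

# Pair every window with every observation of the same subject.
# This needs memory proportional to windows x observations per subject,
# so it is only meant for small inputs.
pairs = (
    windows.reset_index()
    .rename(columns={"index": "window"})
    .merge(observations, on="subject_id")
)
# Keep only the observations that fall into the respective window.
inside = pairs[(pairs["timestamp"] >= pairs["start"]) & (pairs["timestamp"] <= pairs["stop"])]

# Mean heart rate per window (NaN if a window contains no HR observation).
hr_mean = (
    inside[inside["attribute"] == "HR"]
    .groupby("window")["value"]
    .mean()
    .reindex(windows.index)
)
# Number of temperature observations per window (0 if there are none).
temp_count = (
    inside[inside["attribute"] == "Temp"]
    .groupby("window")["value"]
    .count()
    .reindex(windows.index, fill_value=0)
)
```

Because the windows of subject 0 overlap, the observation at timestamp 5 contributes to both of them, which is exactly the situation a plain groupby on fixed bins cannot express.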
import pandas as pd
import catabra_pandas
# use-case: find containing intervals
# note: intervals must be pairwise disjoint (in each group)
intervals = pd.DataFrame(
    data={
        "subject_id": [0, 0, 1],
        "start": [0.5, 3.0, -10.7],
        "stop": [2.3, 10., 10.7]
    }
)
points = pd.DataFrame(
    data={
        "subject_id": [0, 0, 0, 1, 1],
        "point": [1.0, 2.5, 9.9, 0.0, -8.8]
    }
)
catabra_pandas.find_containing_interval(
    points,
    intervals,
    ["point"],
    start_col="start",
    stop_col="stop",
    group_by="subject_id"
)
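For comparison, the same lookup can be written as a naive merge-and-filter in plain pandas. This sketch only illustrates what "containing interval" means for the data above. It is not the library's implementation and makes no claim about the format of the value returned by catabra_pandas.find_containing_interval. The helper columns "point_idx" and "interval_idx", the use of -1 for points without a containing interval, and the inclusive endpoints are all assumptions of this sketch.

```python
import pandas as pd

intervals = pd.DataFrame(
    {"subject_id": [0, 0, 1], "start": [0.5, 3.0, -10.7], "stop": [2.3, 10.0, 10.7]}
)
points = pd.DataFrame(
    {"subject_id": [0, 0, 0, 1, 1], "point": [1.0, 2.5, 9.9, 0.0, -8.8]}
)

# Remember the original row labels of points and intervals.
pts = points.reset_index().rename(columns={"index": "point_idx"})
ivs = intervals.reset_index().rename(columns={"index": "interval_idx"})

# Pair every point with every interval of the same subject, then keep the matches.
pairs = pts.merge(ivs, on="subject_id")
hits = pairs[(pairs["start"] <= pairs["point"]) & (pairs["point"] <= pairs["stop"])]

# Since the intervals are pairwise disjoint within each subject, every point
# matches at most one interval; points without a match get -1.
containing = hits.set_index("point_idx")["interval_idx"].reindex(points.index, fill_value=-1)
```

The merge materializes every point-interval pair per subject, so this approach does not scale to large inputs, which is where a dedicated implementation pays off.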
References
If you use CaTabRa-pandas in your research, we would appreciate it if you cited the following conference paper:
- A. Maletzky, S. Kaltenleithner, P. Moser and M. Giretzlehner. CaTabRa: Efficient Analysis and Predictive Modeling of Tabular Data. In: I. Maglogiannis, L. Iliadis, J. MacIntyre and M. Dominguez (eds), Artificial Intelligence Applications and Innovations (AIAI 2023). IFIP Advances in Information and Communication Technology, vol 676, pp 57-68, 2023. DOI:10.1007/978-3-031-34107-6_5

@inproceedings{CaTabRa2023,
  author = {Maletzky, Alexander and Kaltenleithner, Sophie and Moser, Philipp and Giretzlehner, Michael},
  editor = {Maglogiannis, Ilias and Iliadis, Lazaros and MacIntyre, John and Dominguez, Manuel},
  title = {{CaTabRa}: Efficient Analysis and Predictive Modeling of Tabular Data},
  booktitle = {Artificial Intelligence Applications and Innovations},
  year = {2023},
  publisher = {Springer Nature Switzerland},
  address = {Cham},
  pages = {57--68},
  isbn = {978-3-031-34107-6},
  doi = {10.1007/978-3-031-34107-6_5}
}
Contact
If you have any inquiries, please open a GitHub issue.
Acknowledgments
This project is financed by research subsidies granted by the government of Upper Austria. RISC Software GmbH is a member of the UAR (Upper Austrian Research) Innovation Network.