
Typed DataFrames


Pandas DataFrame subclasses that enforce structure and can self-organize.
Because your functions can’t exactly accept any DataFrame*.
pip install typeddfs[feather]

from typeddfs import TypedDfs
MyDfType = (
    TypedDfs.typed("MyDfType")
    .require("name", index=True)        # always keep in index
    .require("value", dtype=float)      # require a column and type
    .drop("_temp")                      # auto-drop a column
    .condition(lambda df: len(df) == 12)  # require exactly 12 rows
).build()
# All normal Pandas functions work  (plus a few more, like sort_natural)
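
For instance, here is a minimal sketch of a frame this class accepts (an assumption based on the KeyValue example below, where constructing from a plain DataFrame validates it):

import pandas as pd

# Twelve rows, a "name" column destined for the index, and float values:
raw = pd.DataFrame({
    "name": [f"n{i}" for i in range(12)],
    "value": [float(i) for i in range(12)],
})
df = MyDfType(raw)  # passes: required columns, right dtype, exactly 12 rows

# Dropping "value" violates .require("value", dtype=float), so this line
# should raise (a MissingColumnError, per the example below):
# MyDfType(raw.drop(columns=["value"]))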

🎁 Features

  • Columns are turned into indices as needed, so read_csv and to_csv are inverses. MyDf.read_csv(mydf.to_csv()) is mydf.
  • DataFrames display elegantly in Jupyter notebooks.
  • Extra methods such as sort_natural and write_file.
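
As a small sketch of sort_natural (the exact signature is an assumption; see the docs), which orders strings like "x2" before "x10" rather than lexically:

from typeddfs import TypedDfs

# A hypothetical minimal class, just to demonstrate the method:
Notes = TypedDfs.typed("Notes").require("label").build()
df = Notes({"label": ["x10", "x2", "x1"]})
df = df.sort_natural("label")  # assumed: natural sort on a column, via natsort
print(df["label"].tolist())    # expected: ["x1", "x2", "x10"]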

🎨 Example

For a CSV like this:

| key | value | note |
|-----|-------|------|
| abc | 123   | ?    |

from typeddfs import TypedDfs

# Build me a Key-Value-Note class!
KeyValue = (
    TypedDfs.typed("KeyValue")              # With enforced reqs / typing
    .require("key", dtype=str, index=True)  # automagically add to index
    .require("value")                       # required
    .reserve("note")                        # permitted but not required
    .strict()                               # disallow other columns
).build()

# This will self-organize and use "key" as the index:
df = KeyValue.read_csv("example.csv")

# For fun, let"s write it and read it back:
df.to_csv("remke.csv")
df = KeyValue("remake.csv")
print(df.index_names(), df.column_names())  # ["key"], ["value", "note"]

# And now, we can type a function to require a KeyValue,
# and let it raise an `InvalidDfError` (here, a `MissingColumnError`):
def my_special_function(df: KeyValue) -> float:
    return KeyValue(df)["value"].sum()

All of the normal DataFrame methods are available. Use .untyped() or .vanilla() to make a detyped copy that doesn’t enforce requirements. See the docs 📚 for more information.
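
A quick sketch of the round trip and detyping described above (df is the KeyValue frame from the example; behavior per the docs):

# to_csv writes the index out as ordinary columns, and read_csv
# self-organizes "key" back into the index, so the trip should be lossless:
df.to_csv("roundtrip.csv")
assert KeyValue.read_csv("roundtrip.csv").equals(df)

# A detyped copy no longer enforces .strict(), so new columns are fine:
plain = df.vanilla()
plain["extra"] = "anything"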

🔌 Serialization support

Like Pandas, TypedDfs can read and write various formats. It provides the methods read_file and write_file, which guess the format from the filename extension. For example, df.write_file("myfile.snappy") writes a Parquet file, and df.write_file("myfile.tab.gz") writes a gzipped, tab-delimited file. The read_file method works the same way: MyDf.read_file("myfile.feather") reads an Apache Arrow Feather file, and MyDf.read_file("myfile.json.gzip") reads a gzipped JSON file. You can pass keyword arguments to those functions.
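For example (using the KeyValue class from the example above):

df = KeyValue.read_csv("example.csv")
df.write_file("example.feather")   # Apache Arrow Feather, via pyarrow
df.write_file("example.tab.gz")    # gzipped, tab-delimited text
same = KeyValue.read_file("example.feather")  # format guessed from the extension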

Serialization is provided through Pandas, and some formats require additional packages. Pandas does not specify compatible versions of those packages, so typed-dfs provides extras to ensure that they are installed with compatible versions.

  • To install with Feather support, use pip install typeddfs[feather].
  • To install with support for all serialization formats, use pip install typeddfs[feather] fastparquet tables.

However, hdf5 and parquet have limited compatibility, restricted to some platforms and Python versions. In particular, neither is supported in Python 3.9 on Windows as of 2021-03-02. (See the llvmlite issue and tables issue.)

Feather offers massively better performance than CSV, gzipped CSV, and HDF5 in read speed, write speed, memory overhead, and compression ratio. Parquet typically produces smaller files than Feather at some cost in speed. Feather is the preferred format for most cases.

⚠ Note: The hdf5 and parquet extras are currently disabled.

| format  | packages             | extra   | compatibility | performance |
|---------|----------------------|---------|---------------|-------------|
| pickle  | none                 | none    | ❗            |             |
| CSV     | none                 | none    |               | −−          |
| CSV.GZ  | none                 | none    |               | −−          |
| JSON    | none                 | none    | /             | −−          |
| JSON.GZ | none                 | none    | /             | −−          |
| .npy †  | none                 | none    | †             | +           |
| .npz †  | none                 | none    | †             | +           |
| Feather | pyarrow              | feather |               | ++++        |
| Parquet | pyarrow, fastparquet | parquet |               | +++         |
| HDF5    | tables               | hdf5    |               |             |

❗ == Pickle is explicitly not supported due to vulnerabilities and other issues.
/ == Mostly. JSON has inconsistent handling of None.
† == .npy and .npz only serialize numpy objects and therefore skip indices.

📝 Extra notes

A small note of caution: natsort is not pinned to a specific major version because it receives somewhat frequent major updates. This means that the result of typed-dfs's sort_natural could change. You can pin natsort to a specific major version yourself; e.g. natsort = "^7" with Poetry or natsort>=7,<8 with pip.

🍁 Contributing

Typed-Dfs is licensed under the Apache License, version 2.0. New issues and pull requests are welcome. Please refer to the contributing guide.
Generated with Tyrannosaurus.
