Pandas DataFrame subclasses that enforce structure and can self-organize.

Development Status
- 4 - Beta
Environment
- Console
Intended Audience
- Developers
License
- OSI Approved :: Apache Software License
Natural Language
- English
Operating System
- OS Independent
Programming Language
Topic
- Software Development :: Libraries :: Python Modules

Project description

Typed DataFrames

Pandas DataFrame subclasses that enforce structure and self-organize.
*Because your functions can’t exactly accept any DataFrame.*
pip install typeddfs[feather,fwf]

Stop passing index_col= and header= to read_csv, or index= and header= to to_csv. Your “typed” dataframes will remember how they’re supposed to be written and read. That means the designated columns are used for the index, string columns are always read as strings, and custom constraints are verified.

Need to read a tab-delimited file? read_file("myfile.tab"). Feather? Parquet? HDF5? .json.zip? Gzipped fixed-width? XML? Use read_file. Write a file? Use write_file.

Some useful extra functions, plus various Pandas issues fixed:

- read_csv/to_csv, read_json/to_json, etc., are inverses. So are read_file/write_file.
- You can always read and write empty DataFrames without raising odd exceptions. Typed-dfs will always read in what you wrote out.
- No more empty .feather/.snappy/.h5 files are written on error.
- You can write fixed-width files as well as read them.
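
One way to picture the "no empty files on error" behavior is an atomic write: serialize to a temporary file and move it into place only if serialization succeeds. The sketch below uses only the standard library and is an illustration of that idea, not typed-dfs' actual implementation; `atomic_write` and its `serialize` callback are hypothetical names.

```python
import os
import tempfile
from pathlib import Path
from typing import Callable


def atomic_write(path: Path, serialize: Callable[[Path], None]) -> None:
    """Write through a temp file so a failed write leaves nothing at `path`."""
    path = Path(path)
    # Put the temp file in the target directory so os.replace stays on one filesystem.
    fd, tmp_name = tempfile.mkstemp(dir=path.parent, suffix=".tmp")
    os.close(fd)
    tmp = Path(tmp_name)
    try:
        serialize(tmp)  # may raise; the real target has not been touched yet
        os.replace(tmp, path)  # move into place only after a successful write
    finally:
        # Remove the temp file if it was never moved into place.
        if tmp.exists():
            tmp.unlink()
```

If `serialize` raises, the exception propagates and the target path is never created, so a later read cannot pick up a truncated file.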

from typeddfs._entries import TypedDfs

MyDfType = (
    TypedDfs.typed("MyDfType")
    .require("name", index=True)  # always keep in index
    .require("value", dtype=float)  # require a column and type
    .drop("_temp")  # auto-drop a column
    .verify(lambda ddf: len(ddf) == 12)  # require exactly 12 rows
).build()

df = MyDfType.read_file(input("filename? [.feather/.csv.gz/.tsv.xz/etc.]"))
df.sort_natural().write_file("myfile.feather")

🎨 More complex example

For a CSV like this:

key	value	note
abc	123	?

from typeddfs._entries import TypedDfs

# Build me a Key-Value-Note class!
KeyValue = (
    TypedDfs.typed("KeyValue")  # With enforced reqs / typing
    .require("key", dtype=str, index=True)  # automagically add to index
    .require("value")  # required
    .reserve("note")  # permitted but not required
    .strict()  # disallow other columns
).build()

# This will self-organize and use "key" as the index:
df = KeyValue.read_csv("example.csv")

# For fun, let's write it and read it back:
df.to_csv("remake.csv")
df = KeyValue.read_csv("remake.csv")
print(df.index_names(), df.column_names())  # ["key"], ["value", "note"]


# And now, we can type a function to require a KeyValue,
# and let it raise an `InvalidDfError` (here, a `MissingColumnError`):
def my_special_function(df: KeyValue) -> float:
    return KeyValue(df)["value"].sum()

All of the normal DataFrame methods are available. Use .untyped() or .vanilla() to make a detyped copy that doesn’t enforce requirements.
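
To make the require/reserve/strict rules concrete, here is a minimal plain-Python sketch of the kind of column check they imply. It is not how typed-dfs is implemented: `check_columns` is a hypothetical helper, and it raises `ValueError` as a stand-in for the library's own error types.

```python
from typing import Iterable, Sequence


def check_columns(
    columns: Iterable[str],
    required: Sequence[str],
    reserved: Sequence[str] = (),
    strict: bool = False,
) -> None:
    """Raise ValueError if a required column is missing or, when strict, an extra one exists."""
    present = list(columns)
    missing = [c for c in required if c not in present]
    if missing:
        raise ValueError(f"Missing required columns: {missing}")
    if strict:
        # In strict mode only required and reserved columns are allowed.
        allowed = set(required) | set(reserved)
        extra = [c for c in present if c not in allowed]
        if extra:
            raise ValueError(f"Unexpected columns: {extra}")
```

With the KeyValue layout above, `check_columns(["key", "value", "note"], required=["key", "value"], reserved=["note"], strict=True)` passes, while dropping "value" or adding an unlisted column raises.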

🔌 Serialization support

Like Pandas, TypedDfs can read and write various formats. It provides the methods read_file and write_file, which guess the format from the filename extension. For example, df.write_file("myfile.snappy") writes a Parquet file, and df.write_file("myfile.tab.gz") writes a gzipped, tab-delimited file. The read_file method works the same way: MyDf.read_file("myfile.feather") reads an Apache Arrow Feather file, and MyDf.read_file("myfile.json.gzip") reads a gzipped JSON file. You can pass keyword arguments to those functions.
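
The extension-based guessing described above can be sketched in a few lines of standard-library Python. This is only an illustration under stated assumptions: `guess_format` and both lookup tables are hypothetical, cover only extensions mentioned on this page, and may not match the mapping typed-dfs really uses.

```python
from pathlib import Path
from typing import Optional, Tuple

# Hypothetical lookup tables for illustration; the library's real mapping may differ.
COMPRESSIONS = {".gz": "gzip", ".gzip": "gzip", ".xz": "xz", ".zip": "zip"}
FORMATS = {
    ".csv": "csv",
    ".tsv": "tsv",
    ".tab": "tsv",
    ".json": "json",
    ".xml": "xml",
    ".feather": "feather",
    ".snappy": "parquet",
    ".h5": "hdf5",
}


def guess_format(filename: str) -> Tuple[str, Optional[str]]:
    """Return (format, compression) guessed from the trailing filename suffixes."""
    suffixes = [s.lower() for s in Path(filename).suffixes]
    compression = None
    # A trailing compression suffix is peeled off first, e.g. ".tab.gz" -> ".tab".
    if suffixes and suffixes[-1] in COMPRESSIONS:
        compression = COMPRESSIONS[suffixes.pop()]
    if not suffixes or suffixes[-1] not in FORMATS:
        raise ValueError(f"Cannot guess format of {filename!r}")
    return FORMATS[suffixes[-1]], compression
```

Under these assumed tables, `guess_format("myfile.tab.gz")` gives `("tsv", "gzip")` and `guess_format("myfile.snappy")` gives `("parquet", None)`.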

Serialization is provided through Pandas, and some formats require additional packages. Pandas does not specify compatible versions, so typed-dfs provides extras to ensure that those packages are installed with compatible versions.

To install with Feather support, use pip install typeddfs[feather].
To install with support for all serialization formats, use pip install typeddfs[feather] fastparquet tables.

However, hdf5 and parquet have limited compatibility, restricted to some platforms and Python versions. In particular, neither is supported in Python 3.9 on Windows as of 2021-03-02. (See the llvmlite issue and tables issue.)

Feather offers far better performance than CSV, gzipped CSV, and HDF5 in read speed, write speed, memory overhead, and compression ratio. Parquet typically results in smaller files than Feather at some cost in speed. Feather is the preferred format for most cases.

⚠ Note: The hdf5 and parquet extras are currently disabled.

format	packages	extra	compatibility	performance
pickle	none	none	❗	−
csv	none	none	✅	−−
json	none	none	/	−−−
xml	`lxml`	`xml`	.	−−−
.npy †	none	none	†	+
.npz †	none	none	†	+
flexwf	none	`fwf`	✅	−−−
Feather	`pyarrow`	`feather`	✅	++++
Parquet	`pyarrow,fastparquet`	`parquet`	❌	+++
HDF5	`tables`	`hdf5`	❌	−

❗ == Pickle is explicitly not supported due to vulnerabilities and other issues.
/ == Mostly. JSON has inconsistent handling of None.
† == .npy and .npz only serialize numpy objects and therefore skip indices.
. == Requires Pandas 1.3+.
Note: .flexwf is fixed-width with optional delimiters; .fwf is not used to avoid a potential future conflict with pd.DataFrame.to_fwf (which does not exist yet).

📝 Extra notes

A small note of caution: natsort is not pinned to a specific major version because it receives somewhat frequent major updates. This means that the result of typed-dfs’ sort_natural could change between natsort releases. You can pin natsort to a specific major version; e.g. natsort = "^7" with Poetry or natsort>=7,<8 with pip.
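
For readers unfamiliar with natural ordering, the following standard-library sketch shows the general idea: digit runs are compared as numbers rather than character by character. It is a simplified stand-in and not natsort's algorithm; `natural_key` is a hypothetical helper, and natsort handles many cases this does not.

```python
import re
from typing import List, Union


def natural_key(text: str) -> List[Union[int, str]]:
    """Split text into alternating non-digit and digit runs for sorting."""
    parts = re.split(r"(\d+)", text)
    # With one capture group, odd positions hold the digit runs.
    return [int(p) if i % 2 else p.lower() for i, p in enumerate(parts)]


names = ["item10", "item2", "item1"]
print(sorted(names))                   # ['item1', 'item10', 'item2']
print(sorted(names, key=natural_key))  # ['item1', 'item2', 'item10']
```

Details such as case handling or signed and decimal numbers are exactly the kind of behavior that can differ between sorting libraries and versions, which is why pinning matters if you depend on a specific order.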

Fixed-width files are read through Pandas read_fwf and written via tabulate.
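
As a rough picture of what writing fixed-width text involves, the sketch below pads every cell to its column's widest value using only the standard library. It is an assumption-laden illustration, not the tabulate-based writer typed-dfs uses; `to_fixed_width` and its `gap` parameter are hypothetical.

```python
from typing import Sequence


def to_fixed_width(header: Sequence[str], rows: Sequence[Sequence[object]], gap: int = 2) -> str:
    """Render a header and rows as left-aligned, space-padded columns."""
    table = [[str(c) for c in header]] + [[str(c) for c in row] for row in rows]
    # Each column is as wide as its longest cell, header included.
    widths = [max(len(row[i]) for row in table) for i in range(len(header))]
    lines = [
        (" " * gap).join(cell.ljust(w) for cell, w in zip(row, widths)).rstrip()
        for row in table
    ]
    return "\n".join(lines) + "\n"


print(to_fixed_width(["key", "value"], [["abc", 123], ["de", 4]]), end="")
# key  value
# abc  123
# de   4
```

Because the columns are separated by runs of spaces, output of this shape is the sort of text that a whitespace-inferring fixed-width reader is designed to parse.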

🍁 Contributing

Typed-Dfs is licensed under the Apache License, version 2.0. New issues and pull requests are welcome. Please refer to the contributing guide. Generated with Tyrannosaurus.

Release history

0.17.0a0 pre-release

Oct 24, 2023

0.16.5

Mar 1, 2022

0.16.4

Nov 7, 2021

0.16.3

Nov 6, 2021

0.16.2

Nov 5, 2021

0.16.1

Nov 1, 2021

0.16.0

Oct 24, 2021

0.15.0

Oct 17, 2021

0.14.4

Oct 13, 2021

0.14.3

Oct 6, 2021

0.14.2

Oct 5, 2021

0.14.1

Sep 23, 2021

0.14.0

Sep 23, 2021

0.13.3

Sep 18, 2021

0.13.2

Sep 13, 2021

0.13.0

Sep 12, 2021

0.12.0

Sep 8, 2021

0.11.0

Aug 24, 2021

0.10.1

Aug 22, 2021

0.10.0

Aug 22, 2021

0.9.0

Aug 4, 2021

0.8.1

Aug 4, 2021

0.8.0

Aug 2, 2021

This version

0.7.1

Jul 20, 2021

0.7.0

Jun 9, 2021

0.6.1

Apr 1, 2021

0.6.0

Mar 31, 2021

0.5.0

Feb 4, 2021

0.4.0

Jan 6, 2021

0.3.0

Aug 30, 2020

0.2.2

Aug 26, 2020

0.2.1

Aug 9, 2020

0.2.0

May 20, 2020

0.1.0

May 13, 2020

Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

typeddfs-0.7.1.tar.gz (30.4 kB)

Uploaded Jul 20, 2021 Source

Built Distribution

typeddfs-0.7.1-py3-none-any.whl (31.0 kB)

Uploaded Jul 20, 2021 Python 3

Hashes for typeddfs-0.7.1.tar.gz

Algorithm	Hash digest
SHA256	`9edba0360920c5a94b1a5ab7b5d3465bdc8a48c72564134f60136d16020b8efe`
MD5	`d428e6e871df4d8ecdb94b48c343897a`
BLAKE2b-256	`5d648aefeb865f5af99ffef40e524d98756216c7c0c50e21b03ea71b75ade611`

Hashes for typeddfs-0.7.1-py3-none-any.whl

Algorithm	Hash digest
SHA256	`74c3931d55ff81086a602736fbce784e67289c676d242a3c70f55cc244dcaf55`
MD5	`cb2103595fbde70ff83eb2c6c7f7ca91`
BLAKE2b-256	`0bcc6110ecaf7b2f39476329535a6ef072b7902fe102a98fffec16f39e9edb34`
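
To check a downloaded file against the SHA256 digests listed above, a short standard-library sketch such as the following should suffice. `sha256_matches` is a hypothetical helper name, and the file path in the usage note assumes the archive sits in the current directory.

```python
import hashlib
from pathlib import Path


def sha256_matches(path: Path, expected_hex: str, chunk_size: int = 1 << 20) -> bool:
    """Return True if the file's SHA256 hex digest equals `expected_hex`."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        # Read in chunks so large files are not loaded into memory at once.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest() == expected_hex.lower()
```

For example, `sha256_matches(Path("typeddfs-0.7.1.tar.gz"), "9edba0360920c5a94b1a5ab7b5d3465bdc8a48c72564134f60136d16020b8efe")` should return True for an intact copy of the source distribution.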