
Typed DataFrames


Pandas DataFrame subclasses that self-organize and read/write correctly.

from typeddfs import TypedDfs

Film = TypedDfs.typed("Film").require("name", "studio", "year").build()
df = Film.read_csv("file.csv")
assert df.columns.tolist() == ["name", "studio", "year"]

Your types will remember how they're supposed to be read, including dtypes, columns for set_index, and custom requirements. Then you can stop passing index_col=, header=, set_index, and astype each time you read; read_csv, read_parquet, etc. will just work.
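In plain pandas, those settings must be repeated at every call site. A minimal sketch of the boilerplate a typed class absorbs (the file contents and dtypes here are illustrative):

```python
import io

import pandas as pd

csv_text = "name,studio,year\nVertigo,Paramount,1958\n"

# Plain pandas: every call site must repeat the same post-read fixes.
df = pd.read_csv(io.StringIO(csv_text))
df = df.astype({"year": "int64"}).set_index("name")
print(df.index.name, df["year"].dtype)  # name int64
```

A typed class carries this configuration itself, so every reader applies it automatically.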

You can also document your functions clearly, and read and write any format in a single file.

def hello(df: Film):
    print("read!")


df = Film.read_file(input("input file? [.csv/.tsv/.tab/.feather/.snappy/.json.gz/.h5/...]"))
hello(df)

๐Ÿ› Pandas serialization bugs fixed

Pandas has several issues with serialization. Depending on the format and columns, you may encounter:

  • columns being silently added or dropped,
  • errors on reading or writing empty DataFrames,
  • the inability to use DataFrames with indices in Feather,
  • writes to Parquet failing on half-precision floats,
  • partially written files lingering after an error,
  • the buggy xlrd being preferred by read_excel,
  • the buggy odfpy likewise being preferred,
  • a file reading back as a different DataFrame than was written,
  • no writer for fixed-width format,
  • and the platform text encoding being used rather than utf-8.
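The round-trip problem is easy to reproduce in plain pandas: write a DataFrame with a meaningful index to CSV, read it back without index_col=, and the index silently becomes an ordinary column. A minimal sketch:

```python
import io

import pandas as pd

# A DataFrame with a meaningful index...
df = pd.DataFrame({"key": ["a", "b"], "value": [1, 2]}).set_index("key")

# ...round-tripped through CSV without remembering index_col=:
buf = io.StringIO()
df.to_csv(buf)
buf.seek(0)
restored = pd.read_csv(buf)

print(df.index.name)        # key
print(restored.index.name)  # None: "key" silently became a plain column
```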

๐ŸŽ๏ธ New methods, etc.

Docs coming soon...

🎨 Simple example

from typeddfs import TypedDfs

MyDfType = (
    TypedDfs.typed("MyDfType")
    .require("name", index=True)  # always keep in index
    .require("value", dtype=float)  # require a column and type
    .drop("_temp")  # auto-drop a column
    .verify(lambda ddf: len(ddf) == 12)  # require exactly 12 rows
).build()

df = MyDfType.read_file(input("filename? [.feather/.csv.gz/.tsv.xz/etc.]"))
df.sort_natural().write_file("myfile.feather", mkdirs=True)

📉 A matrix-style DataFrame

import numpy as np
from typeddfs import TypedDfs

Symmetric64 = (
    TypedDfs.matrix("Symmetric64", doc="A symmetric float64 matrix")
    .dtype(np.float64)
    .verify(lambda df: df.values.sum() == 1.0)  # require values to sum to 1
    .add_methods(product=lambda df: df.flatten().product())
).build()

mx = Symmetric64.read_file("input.tab")
print(mx.product())  # defined above
if mx.is_symmetric():
    mx = mx.triangle()  # it's symmetric, so we only need half
    long = mx.drop_na().long_form()  # columns: "row", "column", and "value"
    long.write_file("long-form.xml")

๐Ÿ” More complex example

For a CSV like this:

key,value,note
abc,123,?

from typeddfs import TypedDfs

# Build me a Key-Value-Note class!
KeyValue = (
    TypedDfs.typed("KeyValue")  # With enforced reqs / typing
    .require("key", dtype=str, index=True)  # automagically add to index
    .require("value")  # required
    .reserve("note")  # permitted but not required
    .strict()  # disallow other columns
).build()

# This will self-organize and use "key" as the index:
df = KeyValue.read_csv("example.csv")

# For fun, let's write it and read it back:
df.to_csv("remake.csv")
df = KeyValue.read_csv("remake.csv")
print(df.index_names(), df.column_names())  # ["key"], ["value", "note"]

# And now, we can type a function to require a KeyValue,
# and let it raise an `InvalidDfError` (here, a `MissingColumnError`):
def my_special_function(df: KeyValue) -> float:
    return KeyValue(df)["value"].sum()

All of the normal DataFrame methods are available. Use .untyped() or .vanilla() to make a detyped copy that doesn't enforce requirements. Use .of(df) to convert a DataFrame to your type.

💔 Limitations

  • Multi-level columns are not yet supported.
  • Columns and index levels cannot share names.
  • Duplicate column names are not supported. (These are strange anyway.)
  • A typed DF cannot have columns "level_0", "index", or "Unnamed: 0".
  • inplace is forbidden in some functions; avoid it or use .vanilla().
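The reserved names are not arbitrary: pandas itself generates them. Resetting an unnamed index produces a column literally named "index" (or "level_0" when "index" is already taken), so those names cannot safely hold real data. A quick illustration in plain pandas:

```python
import pandas as pd

# Resetting an unnamed index creates a column called "index"...
df = pd.DataFrame({"value": [1, 2, 3]})
print(df.reset_index().columns.tolist())  # ['index', 'value']

# ...or "level_0" if a column named "index" already exists.
df2 = pd.DataFrame({"index": [0, 1], "value": [1, 2]})
print(df2.reset_index().columns.tolist())  # ['level_0', 'index', 'value']
```

("Unnamed: 0" arises similarly, when read_csv encounters a column that was written from an unnamed index.)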

🔌 Serialization support

Like Pandas, TypedDfs can read and write various formats. It provides the methods read_file and write_file, which guess the format from the filename extension. For example, df.write_file("myfile.snappy") writes a Parquet file, and df.write_file("myfile.tab.gz") writes a gzipped, tab-delimited file. The read_file method works the same way: MyDf.read_file("myfile.feather") reads an Apache Arrow Feather file, and MyDf.read_file("myfile.json.gz") reads a gzipped JSON file. You can pass keyword arguments to these functions.
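The extension-based dispatch can be imitated in a few lines. The suffix table and helper below are illustrative sketches, not typed-dfs internals (which support many more formats and compression suffixes):

```python
from pathlib import Path

# Hypothetical mapping from filename suffixes to pandas reader names.
_SUFFIXES = {
    ".feather": "read_feather",
    ".snappy": "read_parquet",
    ".parquet": "read_parquet",
    ".csv": "read_csv",
    ".tsv": "read_csv",
    ".tab": "read_csv",
    ".json": "read_json",
}

def guess_reader(filename: str) -> str:
    """Return a pandas reader name for a filename, ignoring compression suffixes."""
    suffixes = [s for s in Path(filename).suffixes if s not in {".gz", ".xz", ".bz2", ".zip"}]
    if not suffixes:
        raise ValueError(f"No recognized extension in {filename!r}")
    return _SUFFIXES[suffixes[-1]]

print(guess_reader("myfile.snappy"))   # read_parquet
print(guess_reader("myfile.json.gz"))  # read_json
```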

Serialization is provided through Pandas, and some formats require additional packages. Pandas does not specify compatible versions, so extras are provided in typed-dfs to ensure that those packages are installed with compatible versions.

  • To install with Feather support, use pip install typeddfs[feather].
  • To install with support for all formats, use pip install typeddfs[all].

Feather offers far better performance than CSV, gzipped CSV, and HDF5 in read speed, write speed, memory overhead, and compression ratio. Parquet files are typically smaller than Feather at some cost in speed. Feather is the preferred format for most cases.

📊 Serialization in-depth

⚠ Note: The hdf5 extra is currently disabled.

| format   | packages                 | extra   | sanity | speed | file sizes |
|----------|--------------------------|---------|--------|-------|------------|
| Feather  | pyarrow                  | feather | ++     | ++++  | +++        |
| Parquet  | pyarrow or fastparquet † | parquet | ++     | +++   | ++++       |
| csv/tsv  | none                     | none    | ++     | −−    | −−         |
| flexwf ‡ | none                     | none    | ++     | −−    | −−         |
| .fwf     | none                     | none    | +      | −−    | −−         |
| json     | none                     | none    | −−     | −−−   | −−−        |
| xml      | lxml                     | xml     | +      | −−−   | −−−        |
| .npy     | none                     | none    | ++     | +     | +++        |
| .npz     | none                     | none    | ++     | +     | +++        |
| .html    | html5lib, beautifulsoup4 | html    | −−     | −−−   | −−−        |
| pickle   | none                     | none    | −−     | −     | −          |
| XLSX     | openpyxl, defusedxml     | excel   | +      | −−    | +          |
| ODS      | openpyxl, defusedxml     | excel   | +      | −−    | +          |
| XLS      | openpyxl, defusedxml     | excel   | −−     | −−    | +          |
| XLSB     | pyxlsb                   | xlsb    | −−     | −−    | ++         |
| HDF5     | tables                   | hdf5    | −−     | −     | ++         |

Notes:

  • † fastparquet can be used instead. It is slower but much smaller.
  • ‡ .flexwf is fixed-width with optional delimiters.
  • JSON has inconsistent handling of None (orjson is more consistent).
  • XML requires Pandas 1.3+.
  • .npy and .npz only serialize numpy objects.
  • .html is not supported in read_file and write_file.
  • Pickle is insecure and not recommended.
  • Pandas supports odfpy for ODS and xlrd for XLS. In fact, it prefers those. However, they are very buggy; openpyxl is much better.
  • XLSM, XLTX, XLTM, XLS, and XLSB files can contain macros, which Microsoft Excel will execute.
  • XLS is a deprecated format.
  • XLSB is not fully supported in Pandas.
  • HDF may not work on all platforms yet due to a tables issue.

🔒 Security

Refer to the security policy.

๐Ÿ“ Extra notes

Dependencies in the extras are restricted only to minimum version numbers, so libraries that use them can set their own version ranges. For example, typed-dfs only requires tables >= 0.4, but Pandas can restrict it further. natsort is likewise only assigned a minimum version, because it receives frequent major version bumps. This means the result of typed-dfs's sort_natural could change between installs. To prevent that, pin natsort to a specific major version, e.g. natsort = "^7" with Poetry or natsort>=7,<8 with pip.

๐Ÿ Contributing

Typed-Dfs is licensed under the Apache License, version 2.0. New issues and pull requests are welcome. Please refer to the contributing guide. Generated with Tyrannosaurus.
