Typed DataFrames
Pandas DataFrame subclasses that self-organize and serialize robustly.
```python
Film = TypedDfs.typed("Film").require("name", "studio", "year").build()
df = Film.read_csv("file.csv")
assert df.columns.tolist() == ["name", "studio", "year"]
type(df)  # Film
```
Your types remember how to be read, including columns, dtypes, indices, and custom requirements. No index_cols=, header=, set_index, or astype needed.
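Conceptually, a typed DataFrame validates its required columns on construction, the way `Film.read_csv` checks for "name", "studio", and "year" after reading. The sketch below illustrates that idea in plain Python; it is a hypothetical simplification, not the typeddfs implementation.

```python
# Conceptual sketch (not the typeddfs internals): a "typed" table
# checks required columns on construction and drops extras.
REQUIRED = ["name", "studio", "year"]

def validate(rows):
    """Raise if any row is missing a required column; keep only required ones."""
    out = []
    for row in rows:
        missing = [c for c in REQUIRED if c not in row]
        if missing:
            raise ValueError(f"missing columns: {missing}")
        out.append({c: row[c] for c in REQUIRED})
    return out

films = validate([{"name": "Alien", "studio": "Fox", "year": 1979}])
```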
Read and write any format:
```python
path = input("input file? [.csv/.tsv/.tab/.json/.xml.bz2/.feather/.snappy.h5/...]")
df = Film.read_file(path)
```
Need dataclasses?
```python
instances = df.to_dataclass_instances()
Film.from_dataclass_instances(instances)
```
Save metadata?
```python
df = df.set_attrs(dataset="piano")
df.write_file("df.csv", attrs=True)
df = Film.read_file("df.csv", attrs=True)
print(df.attrs)  # e.g. {"dataset": "piano"}
```
Make dirs? Don't overwrite?
```python
df.write_file("df.csv", mkdirs=True, overwrite=False)
```
Write and verify checksum?
```python
df.write_file("df.csv", file_hash=True)
df = Film.read_file("df.csv", file_hash=True)  # fails if wrong
```
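The idea behind checksum verification can be sketched with the standard library: write a sha256 digest alongside the file and compare it on read. The helper names below are hypothetical, not the typeddfs API.

```python
# Sketch of write-and-verify checksums (hypothetical helpers, not
# the typeddfs API): store a sha256 sidecar, verify it on read.
import hashlib
import os
import tempfile
from pathlib import Path

def write_with_hash(path: str, data: bytes) -> None:
    Path(path).write_bytes(data)
    digest = hashlib.sha256(data).hexdigest()
    Path(path + ".sha256").write_text(digest)

def read_with_hash(path: str) -> bytes:
    data = Path(path).read_bytes()
    expected = Path(path + ".sha256").read_text().strip()
    if hashlib.sha256(data).hexdigest() != expected:
        raise ValueError("checksum mismatch")
    return data

with tempfile.TemporaryDirectory() as d:
    p = os.path.join(d, "df.csv")
    write_with_hash(p, b"name,studio,year\n")
    roundtrip = read_with_hash(p)
```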
Read the docs for more info and examples.
Pandas serialization bugs fixed
Pandas has several issues with serialization. Depending on the format and columns, these issues can occur:
- columns being silently added or dropped,
- errors on either read or write of empty DataFrames,
- the inability to use DataFrames with indices in Feather,
- writing to Parquet failing with half-precision floats,
- partially written files lingering after an error,
- the buggy xlrd being preferred by read_excel,
- the buggy odfpy likewise being preferred for ODS,
- writing a file and reading it back yielding a different DataFrame,
- the inability to write fixed-width format,
- the platform text encoding being used rather than utf-8,
- and invalid JSON being written via the built-in json library.
All standard DataFrame methods remain available. Use `.untyped()` or `.vanilla()` if needed, and `.of(df)` for the inverse.
Limitations
- Multi-level columns are not yet supported.
- Columns and index levels cannot share names.
- Duplicate column names are not supported. (These are strange anyway.)
- A typed DF cannot have columns "level_0", "index", or "Unnamed: 0".
- `inplace` is forbidden in some functions; avoid it or use `.vanilla()`.
Serialization support
Like Pandas, TypedDfs can read and write various formats. It provides the methods `read_file` and `write_file`, which guess the format from the filename extension. For example, `df.write_file("myfile.snappy")` writes a Parquet file, and `df.write_file("myfile.tab.gz")` writes a gzipped, tab-delimited file. The `read_file` method works the same way: `MyDf.read_file("myfile.feather")` reads an Apache Arrow Feather file, and `MyDf.read_file("myfile.json.gzip")` reads a gzipped JSON file. You can pass keyword arguments to these functions.
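Extension-based format guessing can be sketched as follows: strip a trailing compression suffix, then map the remaining extension to a format name. This is a hypothetical illustration of the mechanism, not the typeddfs internals, and the mapping table below covers only a few of the supported extensions.

```python
# Sketch of extension-based format guessing (hypothetical, not the
# typeddfs internals): drop a trailing compression suffix, then look
# up the remaining extension.
from pathlib import Path

COMPRESSIONS = {".gz", ".gzip", ".bz2", ".zip", ".xz"}
FORMATS = {
    ".csv": "csv", ".tsv": "tsv", ".tab": "tsv",
    ".snappy": "parquet", ".parquet": "parquet",
    ".feather": "feather", ".json": "json",
}

def guess_format(filename: str) -> str:
    suffixes = Path(filename).suffixes
    if suffixes and suffixes[-1] in COMPRESSIONS:
        suffixes = suffixes[:-1]
    if not suffixes or suffixes[-1] not in FORMATS:
        raise ValueError(f"cannot guess format of {filename}")
    return FORMATS[suffixes[-1]]
```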
Serialization is provided through Pandas, and some formats require additional packages. Pandas does not specify compatible versions, so typed-dfs provides extras to ensure that those packages are installed with compatible versions.
- To install with Feather support, use `pip install typeddfs[feather]`.
- To install with support for all formats, use `pip install typeddfs[all]`.
Feather offers dramatically better read speed, write speed, memory overhead, and compression ratios than CSV, gzipped CSV, and HDF5. Parquet typically produces smaller files than Feather at some cost in speed. Feather is the preferred format for most cases.
Serialization in-depth
Note: the `hdf5` extra is currently disabled.
format | packages | extra | sanity | speed | file sizes
---|---|---|---|---|---
Feather | pyarrow | feather | +++ | ++++ | +++
Parquet | pyarrow or fastparquet † | parquet | ++ | +++ | ++++
csv/tsv | none | none | ++ | −− | −−
flexwf ‡ | none | none | ++ | −− | −−
.fwf | none | none | + | −− | −−
json | none | none | −− | −−− | −−−
xml | lxml | xml | − | −−− | −−−
.properties | none | none | −− | −− | −−
toml | tomlkit | toml | −− | −− | −−
INI | none | none | −−− | −− | −−
.lines | none | none | ++ | −− | −−
.npy | none | none | − | + | +++
.npz | none | none | − | + | +++
.html | html5lib,beautifulsoup4 | html | −− | −−− | −−−
pickle | none | none | −− | −−− | −−−
XLSX | openpyxl,defusedxml | excel | + | −− | +
ODS | openpyxl,defusedxml | excel | + | −− | +
XLS | openpyxl,defusedxml | excel | −− | −− | +
XLSB | pyxlsb | xlsb | −− | −− | ++
HDF5 | tables | hdf5 | −− | − | ++
Notes:
- † fastparquet can be used instead. It is slower but much smaller.
- Parquet only supports str, float64, float32, int64, int32, and bool. Other numeric types are automatically converted during write.
- ‡ .flexwf is fixed-width with optional delimiters.
- JSON has inconsistent handling of None. (orjson is more consistent.)
- XML requires Pandas 1.3+.
- Not all JSON, XML, TOML, and HDF5 files can be read.
- .ini and .properties can only be written with exactly 2 columns + index levels: a key and a value. INI keys are in the form section.name.
- .lines can only be written with exactly 1 column or index level.
- .npy and .npz only serialize numpy objects. They are not supported in `read_file` and `write_file`.
- .html is not supported in `read_file` and `write_file`.
- Pickle is insecure and not recommended.
- Pandas supports odfpy for ODS and xlrd for XLS; in fact, it prefers those. However, they are very buggy; openpyxl is much better.
- XLSM, XLTX, XLTM, XLS, and XLSB files can contain macros, which Microsoft Excel will ingest.
- XLS is a deprecated format.
- XLSB is not fully supported in Pandas.
- HDF may not work on all platforms yet due to a tables issue.
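The key/value-to-INI mapping described in the notes can be sketched with the standard library's configparser: keys of the form `section.name` become `[section]` headers with `name = value` entries. The helper below is a hypothetical illustration, not the typeddfs implementation.

```python
# Sketch of the key/value-to-INI mapping (hypothetical helper, not
# the typeddfs API): "section.name" keys become [section] headers.
import configparser

def rows_to_ini(rows):
    """rows: (key, value) pairs where each key is 'section.name'."""
    config = configparser.ConfigParser()
    for key, value in rows:
        section, _, name = key.partition(".")
        if not config.has_section(section):
            config.add_section(section)
        config.set(section, name, str(value))
    return config

config = rows_to_ini([("film.name", "Alien"), ("film.year", 1979)])
```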
Security
Refer to the security policy.
Extra notes
Dependencies in the extras are only restricted to minimum version numbers;
libraries that use them can set their own version ranges.
For example, typed-dfs only requires tables >= 0.4, but Pandas can further restrict it.
natsort is likewise pinned only to a minimum version number, because it receives frequent major version bumps. This means that the result of typed-df's `sort_natural` could change. To fix this, pin natsort to a specific major version, e.g. `natsort = "^7"` with Poetry or `natsort>=7,<8` with pip.
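To see why a major version bump could change results, natural sorting can be sketched in pure Python: split strings into digit and non-digit runs so that numeric parts compare as integers. This is a simplified, stdlib-only illustration of what natsort does, not its implementation.

```python
# Simplified sketch of natural sorting (not the natsort library):
# digit runs compare numerically, so "film10" sorts after "film2".
import re

def natural_key(s: str):
    return [int(part) if part.isdigit() else part
            for part in re.split(r"(\d+)", s)]

names = sorted(["film10", "film2", "film1"], key=natural_key)
```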
Contributing
Typed-Dfs is licensed under the Apache License, version 2.0. New issues and pull requests are welcome. Please refer to the contributing guide. Generated with Tyrannosaurus.