pandas-dataclasses
pandas data creation made easy by dataclass
Overview
pandas-dataclasses makes it easy to create pandas data (Series and DataFrame) with Python's dataclasses, which let you specify their data types, attributes, and names:
from dataclasses import dataclass
from pandas_dataclasses import AsDataFrame, Data, Index


@dataclass
class Weather(AsDataFrame):
    """Weather information."""

    year: Index[int]
    month: Index[int]
    temp: Data[float]
    humid: Data[float]


df = Weather.new(
    [2020, 2020, 2021, 2021, 2022],
    [1, 7, 1, 7, 1],
    [7.1, 24.3, 5.4, 25.9, 4.9],
    [65, 89, 57, 83, 52],
)
where df will become a DataFrame object like:

            temp  humid
year month
2020 1       7.1   65.0
     7      24.3   89.0
2021 1       5.4   57.0
     7      25.9   83.0
2022 1       4.9   52.0
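For comparison, the same structure can be written out with plain pandas. This is only a sketch of what the Weather dataclass describes (a two-level index named year/month and float columns); it does not use pandas-dataclasses and is not how the library builds the object internally.

```python
import pandas as pd

# Two index fields become the levels of a hierarchical index.
index = pd.MultiIndex.from_arrays(
    [[2020, 2020, 2021, 2021, 2022], [1, 7, 1, 7, 1]],
    names=["year", "month"],
)

# Two data fields become columns; astype(float) mirrors the Data[float] hints.
df = pd.DataFrame(
    {"temp": [7.1, 24.3, 5.4, 25.9, 4.9], "humid": [65, 89, 57, 83, 52]},
    index=index,
).astype(float)

print(df)
```

The dataclass version keeps the same information (names, roles, and data types) in one declarative place instead of spreading it over several constructor calls.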
Features
- Type specification of pandas indexes and data
- Metadata storage in pandas data attributes
- Support for hierarchical index and columns
- Support for full dataclass features
- Support for static type checking by Pyright (Pylance)
Installation
pip install pandas-dataclasses
How it works
pandas-dataclasses provides the following features:
- Type hints for dataclass fields (Attr, Data, Index, Name) to specify index(es), data, attributes, and names of pandas data
- Mix-in classes for dataclasses (As, AsDataFrame, AsSeries) to create pandas data by a classmethod (new) that takes the same arguments as dataclass initialization
When you call new, it will first create a dataclass object and then create a Series or DataFrame object from the dataclass object according to the type hints and values in it.
In the example above, df = Weather.new(...) is thus equivalent to:

obj = Weather([2020, ...], [1, ...], [7.1, ...], [65, ...])
df = asdataframe(obj)

where asdataframe is a conversion function.
pandas-dataclasses does not touch the dataclass object creation itself; this allows you to fully customize your dataclass before conversion by the dataclass features (field, __post_init__, ...).
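As an illustration of such customization, the sketch below uses only the standard library. The class is a plain dataclass (the AsDataFrame mix-in and the Data hints are left out so the snippet runs without pandas-dataclasses), and the field name temp_f is hypothetical. It shows that __post_init__ runs as part of object creation, so any derived value is already in place by the time a conversion would read the object.

```python
from dataclasses import dataclass, field


@dataclass
class Weather:
    """Plain dataclass; in real use the fields would carry Data[...] hints."""

    temp_f: list[float]  # hypothetical input in degrees Fahrenheit
    temp: list[float] = field(init=False)  # derived, not an __init__ argument

    def __post_init__(self) -> None:
        # Runs right after __init__, i.e. before any later conversion step.
        self.temp = [round((f - 32) * 5 / 9, 1) for f in self.temp_f]


obj = Weather([44.8, 75.7])
print(obj.temp)
```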
Basic usage
DataFrame creation
As shown in the example above, a dataclass that has the AsDataFrame mix-in will create DataFrame objects:
from dataclasses import dataclass
from pandas_dataclasses import AsDataFrame, Data, Index


@dataclass
class Weather(AsDataFrame):
    """Weather information."""

    year: Index[int]
    month: Index[int]
    temp: Data[float]
    humid: Data[float]


df = Weather.new(...)
where fields typed by Index are "index fields", each value of which will become an index or a part of a hierarchical index of a DataFrame object.
Fields typed by Data are "data fields", each value of which will become a data column of a DataFrame object.
Fields typed by any other type are ignored in the DataFrame creation.
Each data or index will be cast to the data type specified in a type hint like Index[int].
Use Any or None (like Index[Any]) if you do not want type casting.
See Data typing rules for more examples.
By default, a field name (i.e. an argument name) is used for the name of the corresponding data or index. See Custom naming if you want customization.
Series creation
A dataclass that has the AsSeries mix-in will create Series objects:
from dataclasses import dataclass
from pandas_dataclasses import AsSeries, Data, Index


@dataclass
class Temperature(AsSeries):
    """Temperature information."""

    year: Index[int]
    month: Index[int]
    temp: Data[float]


ser = Temperature.new(...)
Unlike AsDataFrame, the second and subsequent data fields are ignored in the Series creation.
Other rules are the same as for the DataFrame creation.
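A plain-pandas sketch of the kind of Series the Temperature dataclass describes may help here. The values are made up for illustration (the example above elides them), and the snippet does not use pandas-dataclasses: the single data field supplies the values and the name, and the index fields supply the index levels.

```python
import pandas as pd

# Index fields -> levels of a hierarchical index.
index = pd.MultiIndex.from_arrays(
    [[2020, 2020, 2021], [1, 7, 1]],
    names=["year", "month"],
)

# The (first) data field -> values, name, and dtype of the Series.
ser = pd.Series([7.1, 24.3, 5.4], index=index, name="temp", dtype=float)

print(ser)
```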
Advanced usage
Metadata storing
Fields typed by Attr are "attribute fields", each value of which will become an item of attributes (attrs) of a DataFrame or a Series object:
from dataclasses import dataclass
from pandas_dataclasses import AsDataFrame, Attr, Data, Index


@dataclass
class Weather(AsDataFrame):
    """Weather information."""

    year: Index[int]
    month: Index[int]
    temp: Data[float]
    humid: Data[float]
    loc: Attr[str] = "Tokyo"
    lon: Attr[float] = 139.69167
    lat: Attr[float] = 35.68944
In this example, Weather.new(...).attrs will become like:

{"loc": "Tokyo", "lon": 139.69167, "lat": 35.68944}
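The attrs mentioned here is the ordinary metadata dictionary that pandas attaches to every Series and DataFrame (the pandas documentation flags it as experimental). A minimal sketch with plain pandas, using a made-up two-row frame, shows where the attribute fields end up:

```python
import pandas as pd

df = pd.DataFrame({"temp": [7.1, 24.3]})

# attrs is a plain dict attached to the object; it starts out empty.
df.attrs.update({"loc": "Tokyo", "lon": 139.69167, "lat": 35.68944})

print(df.attrs["loc"])
```

Because attrs is just a dictionary, the stored metadata can be read back or updated with the usual dict operations.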
Custom naming
The name of data, index, or attribute can be explicitly specified by adding a string annotation to the corresponding type:
from dataclasses import dataclass
from typing import Annotated as Ann
from pandas_dataclasses import AsDataFrame, Attr, Data, Index


@dataclass
class Weather(AsDataFrame):
    """Weather information."""

    year: Ann[Index[int], "Year"]
    month: Ann[Index[int], "Month"]
    temp: Ann[Data[float], "Temperature (deg C)"]
    humid: Ann[Data[float], "Humidity (%)"]
    loc: Ann[Attr[str], "Location"] = "Tokyo"
    lon: Ann[Attr[float], "Longitude (deg)"] = 139.69167
    lat: Ann[Attr[float], "Latitude (deg)"] = 35.68944
In this example, Weather.new(...) and its attributes will become like:

            Temperature (deg C)  Humidity (%)
Year Month
2020 1                      7.1          65.0
     7                     24.3          89.0
2021 1                      5.4          57.0
     7                     25.9          83.0
2022 1                      4.9          52.0

{"Location": "Tokyo", "Longitude (deg)": 139.69167, "Latitude (deg)": 35.68944}
Adding dictionary annotations to data fields will create DataFrame objects with hierarchical columns, where dictionary keys will become the names of column levels and dictionary values will become the names of columns:
from dataclasses import dataclass
from typing import Annotated as Ann
from pandas_dataclasses import AsDataFrame, Data, Index


def name(stat: str, cat: str) -> dict[str, str]:
    return {"Statistic": stat, "Category": cat}


@dataclass
class Weather(AsDataFrame):
    """Weather information."""

    year: Ann[Index[int], "Year"]
    month: Ann[Index[int], "Month"]
    temp_avg: Ann[Data[float], name("Temperature (degC)", "Average")]
    temp_max: Ann[Data[float], name("Temperature (degC)", "Maximum")]
    wind_avg: Ann[Data[float], name("Wind speed (m/s)", "Average")]
    wind_max: Ann[Data[float], name("Wind speed (m/s)", "Maximum")]
In this example, Weather.new(...) will become like:

Statistic  Temperature (degC)         Wind speed (m/s)
Category              Average Maximum          Average Maximum
Year Month
2020 1                    7.1    11.1              2.4     8.8
     7                   24.3    27.7              3.1    10.2
2021 1                    5.4    10.3              2.3    10.7
     7                   25.9    30.3              2.4     9.0
2022 1                    4.9     9.4              2.6     8.8
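In plain pandas terms, hierarchical columns like these are a MultiIndex on the column axis. The sketch below (two rows only, no pandas-dataclasses) shows how the dictionary keys correspond to level names and the dictionary values to the per-level column labels:

```python
import pandas as pd

# Dictionary keys ("Statistic", "Category") -> names of the column levels.
# Dictionary values -> one (stat, cat) label pair per data field.
columns = pd.MultiIndex.from_tuples(
    [
        ("Temperature (degC)", "Average"),
        ("Temperature (degC)", "Maximum"),
        ("Wind speed (m/s)", "Average"),
        ("Wind speed (m/s)", "Maximum"),
    ],
    names=["Statistic", "Category"],
)
index = pd.MultiIndex.from_arrays([[2020, 2020], [1, 7]], names=["Year", "Month"])

df = pd.DataFrame(
    [[7.1, 11.1, 2.4, 8.8], [24.3, 27.7, 3.1, 10.2]],
    index=index,
    columns=columns,
)

# Selecting a top-level label returns the sub-frame of its categories.
temp = df["Temperature (degC)"]
print(temp)
```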
For the Series creation, a field typed by Name is a "name field", whose value will become the name of a Series object.
This is useful for dynamic naming.
See also Naming rules for more details and examples.
Custom pandas factory
A custom class can be specified as a factory for the Series or DataFrame creation by As, the generic version of AsDataFrame and AsSeries.
Note that the custom class must be a subclass of either pandas.Series or pandas.DataFrame:
import pandas as pd
from dataclasses import dataclass
from pandas_dataclasses import As, Data, Index


class CustomSeries(pd.Series):
    """Custom pandas Series."""

    pass


@dataclass
class Temperature(As[CustomSeries]):
    """Temperature information."""

    year: Index[int]
    month: Index[int]
    temp: Data[float]


ser = Temperature.new(...)
isinstance(ser, CustomSeries)  # True
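One related point comes from pandas rather than from pandas-dataclasses: a bare subclass like the one above is returned by the factory, but results of later pandas operations fall back to pandas.Series unless the subclass overrides _constructor. The following sketch, with made-up values and no pandas-dataclasses involved, shows that pandas subclassing convention:

```python
import pandas as pd


class CustomSeries(pd.Series):
    """Custom pandas Series that survives pandas operations."""

    @property
    def _constructor(self):
        # Tell pandas to rebuild results as CustomSeries, not pd.Series.
        return CustomSeries


ser = CustomSeries([7.1, 24.3, 5.4], name="temp")
doubled = ser * 2

print(type(doubled).__name__)
```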
Appendix
Data typing rules
The data type (dtype) of data/index is inferred from the first Data/Index type of the corresponding field.
The following table shows how the data type is inferred:
import numpy
import pandas
from typing import Any, Annotated as Ann, Literal as L
from pandas_dataclasses import Data
Type hint | Inferred data type |
---|---|
Data[Any] | None (no type casting) |
Data[None] | None (no type casting) |
Data[int] | numpy.dtype("i8") |
Data[numpy.int32] | numpy.dtype("i4") |
Data[L["datetime64[ns]"]] | numpy.dtype("<M8[ns]") |
Data[L["category"]] | pandas.CategoricalDtype() |
Data[int] \| str | numpy.dtype("i8") |
Data[int] \| Data[float] | numpy.dtype("i8") |
Ann[Data[int], "spam"] | numpy.dtype("i8") |
Data[Ann[int, "spam"]] | numpy.dtype("i8") |
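To make the "first Data type" rule concrete, here is a rough standard-library sketch of how such a lookup could work. It is not the library's implementation: Data below is a hypothetical stand-in class, and data_args and first_data_arg are illustrative helper names. The sketch only extracts the type argument (for example int); turning that into a dtype as in the table is left out.

```python
from typing import Annotated, Any, Generic, TypeVar, Union, get_args, get_origin

T = TypeVar("T")


class Data(Generic[T]):
    """Hypothetical stand-in for pandas_dataclasses.Data (illustration only)."""


def data_args(hint: Any) -> list:
    """Collect the type arguments of every Data[...] in a hint, in order."""
    origin = get_origin(hint)
    if origin is Annotated:  # Ann[Data[int], "spam"] -> look inside
        return data_args(get_args(hint)[0])
    if origin is Union:  # Data[int] | str -> visit members left to right
        return [arg for member in get_args(hint) for arg in data_args(member)]
    if origin is Data:
        (arg,) = get_args(hint)
        # Data[Ann[int, "spam"]] -> drop the inner annotation
        return [get_args(arg)[0] if get_origin(arg) is Annotated else arg]
    return []  # any other type is not a data type hint


def first_data_arg(hint: Any) -> Any:
    """Return the type argument of the first Data[...] in the hint."""
    return data_args(hint)[0]


print(first_data_arg(Union[Data[int], Data[float]]))
```

Under this sketch, every int-based row of the table resolves to int, which matches the single numpy.dtype("i8") result listed for them.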
Naming rules
The name of data/index is determined by the following rules:
1. If a name field exists, its value will be preferentially used (Series creation only)
2. If a data/index field is annotated, the first annotation in the first Data/Index type will be used
3. Otherwise, the field name (i.e. argument name) will be used
The following table shows how the name is inferred in cases 2 and 3:
from typing import Any, Annotated as Ann
from pandas_dataclasses import Data
Type hint | Inferred name |
---|---|
Data[Any] | Field name |
Ann[Data[Any], "spam"] | "spam" |
Ann[Data[Any], "spam", "ham"] | "spam" |
Ann[Data[Any], "spam"] \| Ann[str, "ham"] | "spam" |
Ann[Data[Any], "spam"] \| Ann[Data[float], "ham"] | "spam" |
Ann[Data[Any], {"0": "spam", "1": "ham"}] | ("spam", "ham") |
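The annotation-based rows of the table can be mimicked with the standard library alone. The sketch below is an assumption-laden illustration, not the library's code: Data is a hypothetical stand-in, infer_name is an illustrative helper, and union type hints and name fields are deliberately not handled.

```python
from dataclasses import dataclass
from typing import Annotated, Any, Generic, TypeVar, get_args, get_origin, get_type_hints

T = TypeVar("T")


class Data(Generic[T]):
    """Hypothetical stand-in for pandas_dataclasses.Data (illustration only)."""


def infer_name(field_name: str, hint: Any) -> Any:
    """Pick a name from the first annotation, else use the field name."""
    if get_origin(hint) is Annotated:
        first = get_args(hint)[1]  # index 0 is the annotated type itself
        # A dictionary annotation yields a tuple of its values.
        return tuple(first.values()) if isinstance(first, dict) else first
    return field_name  # no annotation: fall back to the field name


@dataclass
class Weather:
    temp: Annotated[Data[float], "Temperature (deg C)", "unused"]
    humid: Data[float]


# include_extras=True keeps the Annotated metadata in the returned hints.
hints = get_type_hints(Weather, include_extras=True)
names = {name: infer_name(name, hint) for name, hint in hints.items()}
print(names)
```

Only the first annotation is consulted, so the second string on temp has no effect on the inferred name.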
Development roadmap
Release version | Features |
---|---|
v0.4.0 | Support for hierarchical columns |
v0.5.0 | Support for dynamic naming of indexes and data |
v1.0.0 | Initial major release (freezing public features until v2.0.0) |