A library to flatten nested data.
dataflat
A library to flatten nested keys and columns from Python Dictionaries, Pandas DataFrames, Polars DataFrames, PyArrow Tables, and PySpark DataFrames into a set of relational tables.
Installation
pip install dataflat
For Polars support:
pip install dataflat polars
For Pandas support:
pip install dataflat pandas pyarrow
For PyArrow support:
pip install dataflat pyarrow
For PySpark support:
pip install dataflat pyspark
Get started
Import handler, FlattenerOptions, and optionally CaseTranslatorOptions from dataflat:
from dataflat import handler, FlattenerOptions, CaseTranslatorOptions
FlattenerOptions selects the backend:
| Option | Input type |
|---|---|
| `FlattenerOptions.DICTIONARY` | Python dict |
| `FlattenerOptions.PANDAS_DF` | Pandas DataFrame |
| `FlattenerOptions.POLARS_DF` | Polars DataFrame |
| `FlattenerOptions.PYARROW_TABLE` | PyArrow Table |
| `FlattenerOptions.PYSPARK_DF` | PySpark DataFrame |
CaseTranslatorOptions (optional) translates key / column names after flattening:
| Option | Example output |
|---|---|
| `SNAKE` | `order_id` |
| `CAMEL` | `orderId` |
| `PASCAL` | `OrderId` |
| `KEBAB` | `order-id` |
| `HUMAN` | `Order id` |
| `LOWER` | `orderid` |
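To make the six target cases concrete, here is a minimal pure-Python sketch of this kind of name translation. The helpers `split_words` and `translate` are hypothetical and are not part of dataflat; unlike the library, which takes an explicit `from_case`, this sketch simply guesses word boundaries from separators and lower-to-upper transitions.

```python
import re


def split_words(name: str) -> list[str]:
    """Split a camelCase, PascalCase, snake_case, or kebab-case name into lowercase words."""
    # Mark lower-to-upper boundaries with a space, then split on spaces, "_" and "-".
    spaced = re.sub(r"(?<=[a-z0-9])(?=[A-Z])", " ", name)
    return [word.lower() for word in re.split(r"[\s_\-]+", spaced) if word]


def translate(name: str, to_case: str) -> str:
    """Render a name in one of the six cases listed in the table above (hypothetical helper)."""
    words = split_words(name)
    if not words:
        return name
    if to_case == "snake":
        return "_".join(words)
    if to_case == "kebab":
        return "-".join(words)
    if to_case == "camel":
        return words[0] + "".join(word.capitalize() for word in words[1:])
    if to_case == "pascal":
        return "".join(word.capitalize() for word in words)
    if to_case == "human":
        return " ".join(words).capitalize()
    if to_case == "lower":
        return "".join(words)
    raise ValueError(f"unknown case: {to_case}")


for case in ("snake", "camel", "pascal", "kebab", "human", "lower"):
    print(case, "->", translate("orderId", case))
```

The library's own translator may treat digits, acronyms, and special characters differently, so take this only as an illustration of the table.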
Instantiate a flattener
Use handler() to get a configured flattener instance:
from dataflat import handler, FlattenerOptions, CaseTranslatorOptions
flattener = handler(
custom_flattener=FlattenerOptions.DICTIONARY, # or PANDAS_DF / POLARS_DF / PYARROW_TABLE / PYSPARK_DF
from_case=CaseTranslatorOptions.CAMEL, # optional
to_case=CaseTranslatorOptions.SNAKE, # optional
remove_special_chars=False, # strip non-alphanumeric chars (default False)
)
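Or import the concrete class for your backend directly if you prefer (every module exposes the same name, CustomFlattener, so import only the one you need):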
from dataflat.dictionary import CustomFlattener # dict
from dataflat.pandas import CustomFlattener # Pandas
from dataflat.polars import CustomFlattener # Polars
from dataflat.pyarrow import CustomFlattener # PyArrow
from dataflat.pyspark import CustomFlattener # PySpark
Flatten data
All flatteners share the same flatten() signature:
flatten_data = flattener.flatten(
data=data, # dict, Pandas / Polars / PySpark DataFrame, or PyArrow Table
primary_key="id", # optional; auto-generates UUID column when omitted
entity_name="data", # root entity name, default "data"
partition_keys=["date"], # columns propagated to all children
black_list=["keys.to", "be.ignored"], # fields excluded from output
)
Parameters
- `primary_key`: identifies each root record; propagated to all child entities as `<entity_name>.<primary_key>` (e.g. `data.id`). When `None` (the default), a UUID column is auto-generated with a name that matches `to_case`: `dataflat_id_column` (SNAKE), `dataflat-id-column` (KEBAB), `dataflatIdColumn` (CAMEL), `DataflatIdColumn` (PASCAL), `Dataflat id column` (HUMAN), `dataflatidcolumn` (LOWER).
- `partition_keys`: additional root columns (e.g. `["date"]`) inherited by every child entity.
- `black_list`: dot-joined field paths excluded from all output (e.g. `["summary.totalClients"]`).
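The idea behind dot-joined `black_list` paths can be sketched in plain Python. The helper `drop_blacklisted` below is hypothetical and is not dataflat's implementation; it assumes that paths are relative to the root record and do not include the entity name, which matches the `summary.totalClients` example but is not confirmed here.

```python
def drop_blacklisted(record: dict, black_list: list[str], prefix: str = "") -> dict:
    """Return a copy of `record` without the fields whose dot-joined path is blacklisted."""
    cleaned = {}
    for key, value in record.items():
        path = f"{prefix}.{key}" if prefix else key
        if path in black_list:
            continue  # skip the field and everything nested under it
        if isinstance(value, dict):
            value = drop_blacklisted(value, black_list, path)
        elif isinstance(value, list):
            # Objects inside a list share the path of the list itself.
            value = [
                drop_blacklisted(item, black_list, path) if isinstance(item, dict) else item
                for item in value
            ]
        cleaned[key] = value
    return cleaned


record = {
    "id": 1,
    "summary": {"totalClients": 5, "totalOrders": 2},
    "orders": [{"id": "abc123", "internal": True}],
}
filtered = drop_blacklisted(record, ["summary.totalClients", "orders.internal"])
print(filtered)
```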
Return value
flatten() returns dict[str, entity] — one key per entity, named by dot-joined path.
Nested fields are handled as follows:
- Struct / object fields are expanded inline using dot-notation (`payment.method`, `client.address.city`).
- Lists of objects produce a child entity (e.g. `data.orders.products`) with a positional `index` column (0-based, per-parent).
- Lists of scalars (strings, ints, floats, booleans) are joined with `"|"` into a single string column in the parent entity; no child entity is created.
{
"data": [{"id": 1, "date": "2024-01-01", "tags": "featured|sale", "total": 1900}],
"data.orders": [
{"id": "abc123", "total": 700, "notes": "fragile|urgent", "data.id": 1, "data.date": "2024-01-01", "index": 0},
{"id": "dfg456", "total": 1200, "notes": "gift", "data.id": 1, "data.date": "2024-01-01", "index": 1}
],
"data.orders.products": [
{"id": "ab", "price": 200, "data.id": 1, "data.date": "2024-01-01", "data.orders.index": 0, "index": 0},
{"id": "cd", "price": 500, "data.id": 1, "data.date": "2024-01-01", "data.orders.index": 0, "index": 1},
{"id": "fg", "price": 1200, "data.id": 1, "data.date": "2024-01-01", "data.orders.index": 1, "index": 0}
]
}
Produced from:
{
"id": 1,
"date": "2024-01-01",
"tags": ["featured", "sale"],
"orders": [
{
"id": "abc123",
"products": [
{"id": "ab", "price": 200},
{"id": "cd", "price": 500}
],
"total": 700,
"notes": ["fragile", "urgent"]
},
{
"id": "dfg456",
"products": [
{"id": "fg", "price": 1200}
],
"total": 1200,
"notes": ["gift"]
}
],
"total": 1900
}
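The three rules for nested fields can be illustrated with a small pure-Python sketch for the dictionary case. This is not dataflat's implementation: `flatten_sketch` and `_fill` are hypothetical names, a primary key is always required, and `black_list` and case translation are left out. The sample record includes `tags` and `notes` lists of scalars so that the `"|"` joining rule is exercised.

```python
def _fill(obj, row, prefix, entity, inherited, entities):
    """Copy the fields of `obj` into `row`, creating child entities for lists of objects."""
    for key, value in obj.items():
        path = f"{prefix}.{key}" if prefix else key
        if isinstance(value, dict):
            # Struct / object: expand inline with dot-notation.
            _fill(value, row, path, entity, inherited, entities)
        elif isinstance(value, list) and value and all(isinstance(v, dict) for v in value):
            # List of objects: one child entity, one row per element.
            child = f"{entity}.{path}"
            rows = entities.setdefault(child, [])
            for index, item in enumerate(value):
                child_row = {}
                # Grandchildren also inherit this element's position.
                child_inherited = {**inherited, f"{child}.index": index}
                _fill(item, child_row, "", child, child_inherited, entities)
                rows.append({**child_row, **inherited, "index": index})
        elif isinstance(value, list):
            # List of scalars (or empty list): join into a single string column.
            row[path] = "|".join(str(v) for v in value)
        else:
            row[path] = value


def flatten_sketch(record, primary_key="id", entity_name="data", partition_keys=()):
    """Return {entity name: list of rows} for a single nested record."""
    entities = {entity_name: []}
    # Columns propagated to every child, prefixed with the root entity name.
    inherited = {f"{entity_name}.{k}": record[k] for k in (primary_key, *partition_keys)}
    root_row = {}
    _fill(record, root_row, "", entity_name, inherited, entities)
    entities[entity_name].append(root_row)
    return entities


sample = {
    "id": 1,
    "date": "2024-01-01",
    "tags": ["featured", "sale"],
    "orders": [
        {"id": "abc123", "products": [{"id": "ab", "price": 200}, {"id": "cd", "price": 500}],
         "total": 700, "notes": ["fragile", "urgent"]},
        {"id": "dfg456", "products": [{"id": "fg", "price": 1200}],
         "total": 1200, "notes": ["gift"]},
    ],
    "total": 1900,
}
result = flatten_sketch(sample, partition_keys=["date"])
for name, rows in result.items():
    print(name, rows)
```

Because every child row carries `data.id`, and grandchildren also carry `data.orders.index`, the entities can be joined back together like relational tables.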
Recommendations
- For `PYSPARK_DF` it is recommended to enable case-sensitive mode in Spark:
    spark.conf.set("spark.sql.caseSensitive", True)
- For `POLARS_DF` with nullable struct columns, load data via `pl.read_ndjson` instead of `pl.from_dicts`; the latter does not handle nullable structs correctly.
    import polars as pl
    df = pl.read_ndjson("data.ndjson")
- For `PANDAS_DF`, the flattener internally converts the DataFrame to a PyArrow Table to reliably distinguish `str`, `dict`, and `list` columns (the `numpy_nullable` dtype backend marks all three as `object`). The result is returned with the `pyarrow` dtype backend (`pd.ArrowDtype`), giving properly nullable typed columns.
- For `PYARROW_TABLE` with single JSON objects (not NDJSON), use `pa.Table.from_pylist`:
    import json
    import pyarrow as pa
    with open("data.json") as f:
        data = json.load(f)
    table = pa.Table.from_pylist([data])
  For NDJSON or multiple records use `pyarrow.json.read_json`:
    import pyarrow.json as pa_json
    table = pa_json.read_json("data.ndjson")
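Several of these recommendations depend on the difference between a single JSON document and NDJSON (one JSON object per line). As a standard-library-only sketch, the hypothetical helpers `write_ndjson` and `read_ndjson` below show how a list of records could be written as NDJSON and read back; the file name `data.ndjson` is only an example.

```python
import json
import tempfile
from pathlib import Path


def write_ndjson(records, path):
    """Write one compact JSON object per line."""
    with open(path, "w", encoding="utf-8") as f:
        for record in records:
            f.write(json.dumps(record) + "\n")


def read_ndjson(path):
    """Read an NDJSON file back into a list of dicts, skipping blank lines."""
    with open(path, encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]


records = [
    {"id": 1, "client": {"name": "Ada"}},
    {"id": 2, "client": None},  # a nullable struct field
]

with tempfile.TemporaryDirectory() as tmp:
    ndjson_path = Path(tmp) / "data.ndjson"
    write_ndjson(records, ndjson_path)
    loaded = read_ndjson(ndjson_path)

print(loaded)
```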