A library to flatten nested data.
dataflat
A library to flatten nested keys and columns from Python Dictionaries, Pandas DataFrames, Polars DataFrames, PyArrow Tables, and PySpark DataFrames into a set of relational tables.
Installation
pip install dataflat
For Polars support:
pip install dataflat polars
For Pandas support:
pip install dataflat pandas pyarrow
For PyArrow support:
pip install dataflat pyarrow
For PySpark support:
pip install dataflat pyspark
Get started
Import handler, FlattenerOptions, and optionally CaseTranslatorOptions from dataflat:
from dataflat import handler, FlattenerOptions, CaseTranslatorOptions
FlattenerOptions selects the backend:
| Option | Input type |
|---|---|
| FlattenerOptions.DICTIONARY | Python dict |
| FlattenerOptions.PANDAS_DF | Pandas DataFrame |
| FlattenerOptions.POLARS_DF | Polars DataFrame |
| FlattenerOptions.PYARROW_TABLE | PyArrow Table |
| FlattenerOptions.PYSPARK_DF | PySpark DataFrame |
CaseTranslatorOptions (optional) translates key / column names after flattening:
| Option | Example output |
|---|---|
| SNAKE | order_id |
| CAMEL | orderId |
| PASCAL | OrderId |
| KEBAB | order-id |
| HUMAN | Order id |
| LOWER | orderid |
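To make the table concrete, here is a minimal sketch of how such a translation can work: split a name into words, then rejoin them in the target style. The helpers split_words and translate are hypothetical and are not part of dataflat; the library's own tokenisation rules may differ.

```python
import re


def split_words(name: str) -> list[str]:
    """Split a snake, kebab, camel, or Pascal name into lowercase words."""
    # Insert a space at lower-to-upper boundaries (orderId -> order Id).
    spaced = re.sub(r"(?<=[a-z0-9])(?=[A-Z])", " ", name)
    # Treat underscores, hyphens, and whitespace as separators.
    return [w.lower() for w in re.split(r"[_\-\s]+", spaced) if w]


def translate(name: str, to_case: str) -> str:
    """Rejoin the words of `name` in the requested style (hypothetical helper)."""
    words = split_words(name)
    if not words:
        return name
    if to_case == "SNAKE":
        return "_".join(words)
    if to_case == "KEBAB":
        return "-".join(words)
    if to_case == "CAMEL":
        return words[0] + "".join(w.capitalize() for w in words[1:])
    if to_case == "PASCAL":
        return "".join(w.capitalize() for w in words)
    if to_case == "HUMAN":
        return " ".join(words).capitalize()
    if to_case == "LOWER":
        return "".join(words)
    raise ValueError(f"unknown case: {to_case}")


print(translate("orderId", "SNAKE"))
```

The same function maps order_id back to orderId with to_case="CAMEL", which is the direction a from_case / to_case pair would cover.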
Instantiate a flattener
Use handler() to get a configured flattener instance:
from dataflat import handler, FlattenerOptions, CaseTranslatorOptions
flattener = handler(
    custom_flattener=FlattenerOptions.DICTIONARY,  # or PANDAS_DF / POLARS_DF / PYARROW_TABLE / PYSPARK_DF
    from_case=CaseTranslatorOptions.CAMEL,         # optional
    to_case=CaseTranslatorOptions.SNAKE,           # optional
    remove_special_chars=False,                    # strip non-alphanumeric chars (default False)
)
Or import the concrete class for your backend directly. Each module exposes its own CustomFlattener, so import only the one you need:
from dataflat.dictionary import CustomFlattener # dict
from dataflat.pandas import CustomFlattener # Pandas
from dataflat.polars import CustomFlattener # Polars
from dataflat.pyarrow import CustomFlattener # PyArrow
from dataflat.pyspark import CustomFlattener # PySpark
Flatten data
All flatteners share the same flatten() signature:
flatten_data = flattener.flatten(
    data=data,                             # dict, Pandas / Polars / PySpark DataFrame, or PyArrow Table
    primary_key="id",                      # optional; auto-generates UUID column when omitted
    entity_name="data",                    # root entity name, default "data"
    partition_keys=["date"],               # columns propagated to all children
    black_list=["keys.to", "be.ignored"],  # fields excluded from output
    white_list=["orders.items", "orders.items.name"],  # entities/columns to retain (see below)
)
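The effect of black_list in the call above can be pictured on a plain dict. The sketch below is not dataflat's implementation: prune is a hypothetical helper, and it assumes paths are relative to the root record and that list elements share their parent's path (no index in the path).

```python
def prune(value, black_list, prefix=""):
    """Return a copy of `value` without the fields named in black_list."""
    if isinstance(value, dict):
        return {
            key: prune(inner, black_list, f"{prefix}{key}.")
            for key, inner in value.items()
            if f"{prefix}{key}" not in black_list  # drop blacklisted paths
        }
    if isinstance(value, list):
        # Assumption: elements of a list are addressed by the list's own path.
        return [prune(item, black_list, prefix) for item in value]
    return value


record = {
    "id": 1,
    "summary": {"totalClients": 3, "totalRevenue": 10},
    "orders": [{"id": "a", "internalNote": "x"}],
}
cleaned = prune(record, {"summary.totalClients", "orders.internalNote"})
print(cleaned)
```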
Parameters
- primary_key: identifies each root record and is propagated to all child entities as <entity_name>.<primary_key> (e.g. data.id). When None (the default), a UUID column is auto-generated with a name that matches to_case: dataflat_id_column (SNAKE), dataflat-id-column (KEBAB), dataflatIdColumn (CAMEL), DataflatIdColumn (PASCAL), Dataflat id column (HUMAN), dataflatidcolumn (LOWER).
- partition_keys: additional root columns (e.g. ["date"]) inherited by every child entity.
- black_list: dot-joined field paths excluded from all output (e.g. ["summary.totalClients"]).
- white_list: dot-joined paths that select which entities and/or columns to retain after flattening (see White list).
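As an illustration of the primary_key behaviour, the sketch below adds a UUID column to plain dict records when no key is supplied. The helper ensure_primary_key is hypothetical, and falling back to the SNAKE name when no to_case is given is an assumption of this sketch; only the column names themselves come from the description above.

```python
import uuid

# Column names as listed in the parameter description, keyed by to_case.
ID_COLUMN_NAMES = {
    "SNAKE": "dataflat_id_column",
    "KEBAB": "dataflat-id-column",
    "CAMEL": "dataflatIdColumn",
    "PASCAL": "DataflatIdColumn",
    "HUMAN": "Dataflat id column",
    "LOWER": "dataflatidcolumn",
}


def ensure_primary_key(records, primary_key=None, to_case="SNAKE"):
    """Return (records, key_name); add a UUID column when no key is given."""
    if primary_key is not None:
        return records, primary_key
    key = ID_COLUMN_NAMES[to_case]  # assumption: SNAKE when to_case is unset
    # One fresh UUID per root record, placed first in each row.
    return [{key: str(uuid.uuid4()), **record} for record in records], key


rows, key = ensure_primary_key([{"total": 700}, {"total": 1200}])
print(key, rows[0][key] != rows[1][key])
```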
Return value
flatten() returns dict[str, entity] — one key per entity, named by dot-joined path.
Nested fields are handled as follows:
- Struct / object fields are expanded inline using dot-notation (payment.method, client.address.city).
- Lists of objects produce a child entity (e.g. data.orders.products) with a positional index column (0-based, per-parent).
- Lists of scalars (strings, ints, floats, booleans) are joined with "|" into a single string column in the parent entity; no child entity is created.
{
  "data": [{"id": 1, "date": "2024-01-01", "tags": "featured|sale", "total": 1900}],
  "data.orders": [
    {"id": "abc123", "total": 700, "notes": "fragile|urgent", "data.id": 1, "data.date": "2024-01-01", "index": 0},
    {"id": "dfg456", "total": 1200, "notes": "gift", "data.id": 1, "data.date": "2024-01-01", "index": 1}
  ],
  "data.orders.products": [
    {"id": "ab", "price": 200, "data.id": 1, "data.date": "2024-01-01", "data.orders.index": 0, "index": 0},
    {"id": "cd", "price": 500, "data.id": 1, "data.date": "2024-01-01", "data.orders.index": 0, "index": 1},
    {"id": "fg", "price": 1200, "data.id": 1, "data.date": "2024-01-01", "data.orders.index": 1, "index": 0}
  ]
}
Produced from:
{
  "id": 1,
  "date": "2024-01-01",
  "tags": ["featured", "sale"],
  "orders": [
    {
      "id": "abc123",
      "products": [
        {"id": "ab", "price": 200},
        {"id": "cd", "price": 500}
      ],
      "total": 700,
      "notes": ["fragile", "urgent"]
    },
    {
      "id": "dfg456",
      "products": [
        {"id": "fg", "price": 1200}
      ],
      "total": 1200,
      "notes": ["gift"]
    }
  ],
  "total": 1900
}
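The three nesting rules can be reproduced on plain dicts in a few lines. This is a minimal sketch of the technique, not dataflat's code: the flatten function below is a stand-in written for illustration, it handles a single record, and it treats an empty list as a list of scalars. The record mirrors the example above, including the scalar lists (tags, notes) that show up joined in the output.

```python
def flatten(record, entity_name="data", primary_key="id", partition_keys=()):
    """Flatten one nested record into {entity_path: [row, ...]}."""
    tables = {}
    # Join columns that every descendant row inherits from the root record.
    root_keys = {
        f"{entity_name}.{key}": record[key]
        for key in (primary_key, *partition_keys)
    }

    def walk(obj, path, inherited, index):
        tables.setdefault(path, [])  # register parents before their children
        row = {}
        # Join columns handed down to the children of this row.
        if index is None:
            child_keys = root_keys
        else:
            child_keys = {**inherited, f"{path}.index": index}

        def visit(value, column):
            if isinstance(value, dict):
                # Struct: expand inline using dot-notation.
                for key, inner in value.items():
                    visit(inner, f"{column}.{key}")
            elif isinstance(value, list) and any(isinstance(v, dict) for v in value):
                # List of objects: one child-entity row per element.
                for i, item in enumerate(value):
                    walk(item, f"{path}.{column}", child_keys, i)
            elif isinstance(value, list):
                # List of scalars: join into a single string column.
                row[column] = "|".join(str(v) for v in value)
            else:
                row[column] = value

        for key, value in obj.items():
            visit(value, key)
        row.update(inherited)
        if index is not None:
            row["index"] = index
        tables[path].append(row)

    walk(record, entity_name, {}, None)
    return tables


record = {
    "id": 1,
    "date": "2024-01-01",
    "tags": ["featured", "sale"],
    "orders": [
        {"id": "abc123", "products": [{"id": "ab", "price": 200}, {"id": "cd", "price": 500}],
         "total": 700, "notes": ["fragile", "urgent"]},
        {"id": "dfg456", "products": [{"id": "fg", "price": 1200}],
         "total": 1200, "notes": ["gift"]},
    ],
    "total": 1900,
}
tables = flatten(record, primary_key="id", partition_keys=["date"])
for name, rows in tables.items():
    print(name, len(rows))
```

Each child row carries the root key, the partition keys, and the index of every ancestor list, which is what makes the resulting tables joinable.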
White list
white_list filters the flattened output after all unnesting and before case translation. Entries are relative paths — entity_name (default "data") is prepended automatically.
Entity-level — if the entry matches an existing entity key, that entity and all its descendants are kept with every column:
results = flattener.flatten(
    data, primary_key="id", partition_keys=["date"],
    white_list=["orders.items", "orders.client.addresses"],
)
# Keeps: data.orders.items, data.orders.items.attributes, data.orders.client.addresses
# Drops: data, data.orders, data.orders.client, ...
Column-level — if the entry does not match any entity key, the flattener finds the entity whose path is a prefix of the entry, retains that entity narrowed to the specified column plus all inherited join columns (pk, partition keys, index columns), and drops all child entities of it:
results = flattener.flatten(
    data, primary_key="id", partition_keys=["date"],
    white_list=["orders.items.name", "orders.items.price", "summary.total_revenue"],
)
# Keeps: data (narrowed to id + date + summary.total_revenue)
# data.orders.items (narrowed to data.id + data.date + data.orders.index + index + name + price)
# Drops: data.orders, data.orders.items.attributes, data.orders.client, ...
Rules:
- Multiple column entries for the same entity are additive.
- Entity-level overrides column-level for the same entity (full entity is kept).
- Inherited join columns (pk, partition keys, index columns) are always preserved, even under column-level filtering.
- An empty white_list (the default) keeps everything.
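The rules above can be expressed as a filter over an already-flattened dict of entities. The sketch below is illustrative only: apply_white_list is a hypothetical helper, it recognises join columns purely by naming convention (the root key and partition keys in the root entity; index and <entity_name>.-prefixed columns in children), and it matches column entries exactly.

```python
def apply_white_list(tables, white_list, entity_name="data",
                     primary_key="id", partition_keys=()):
    """Filter {entity_path: [rows]} according to relative white_list paths."""
    if not white_list:
        return tables  # an empty white list keeps everything

    full_entities = set()  # entity-level matches
    columns = {}           # entity path -> whitelisted column names
    for entry in white_list:
        path = f"{entity_name}.{entry}"
        if path in tables:
            full_entities.add(path)
            continue
        # Column-level: longest entity path that is a dotted prefix of the entry.
        owners = [e for e in tables if path.startswith(e + ".")]
        if owners:
            owner = max(owners, key=len)
            columns.setdefault(owner, set()).add(path[len(owner) + 1:])

    def is_join_column(entity, column):
        # Assumption: join columns are identified by name only.
        if entity == entity_name:
            return column == primary_key or column in partition_keys
        return column == "index" or column.startswith(entity_name + ".")

    result = {}
    for entity, rows in tables.items():
        if any(entity == f or entity.startswith(f + ".") for f in full_entities):
            result[entity] = rows  # kept whole, together with descendants
        elif entity in columns:
            keep = columns[entity]
            result[entity] = [
                {c: v for c, v in row.items() if c in keep or is_join_column(entity, c)}
                for row in rows
            ]
    return result


tables = {
    "data": [{"id": 1, "date": "2024-01-01", "total": 1900}],
    "data.orders": [{"id": "abc123", "total": 700, "data.id": 1,
                     "data.date": "2024-01-01", "index": 0}],
    "data.orders.products": [{"id": "ab", "price": 200, "data.id": 1,
                              "data.date": "2024-01-01",
                              "data.orders.index": 0, "index": 0}],
}
narrowed = apply_white_list(tables, ["orders.products.price", "total"],
                            partition_keys=["date"])
print(list(narrowed))
```

Checking entity-level matches before column-level ones is what makes a full-entity entry win when both kinds target the same entity.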
Recommendations
- For PYSPARK_DF, it is recommended to enable case-sensitive mode in Spark:

      spark.conf.set("spark.sql.caseSensitive", True)

- For POLARS_DF with nullable struct columns, load data via pl.read_ndjson instead of pl.from_dicts; the latter does not handle nullable structs correctly.

      import polars as pl
      df = pl.read_ndjson("data.ndjson")

- For PANDAS_DF, the flattener internally converts the DataFrame to a PyArrow Table to reliably distinguish str, dict, and list columns (the numpy_nullable dtype backend marks all three as object). The result is returned with the pyarrow dtype backend (pd.ArrowDtype), giving properly nullable typed columns.
- For PYARROW_TABLE with single JSON objects (not NDJSON), use pa.Table.from_pylist:

      import json
      import pyarrow as pa
      with open("data.json") as f:
          data = json.load(f)
      table = pa.Table.from_pylist([data])

  For NDJSON or multiple records, use pyarrow.json.read_json:

      import pyarrow.json as pa_json
      table = pa_json.read_json("data.ndjson")
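Both the Polars and the PyArrow recommendations rely on newline-delimited JSON input. If your records are Python dicts or come from a single JSON array, they can be written as NDJSON with the standard library alone. The helper name write_ndjson and the file name are illustrative, not part of dataflat.

```python
import json


def write_ndjson(records, path):
    """Write an iterable of dicts as NDJSON: one JSON object per line."""
    with open(path, "w", encoding="utf-8") as f:
        for record in records:
            # json.dumps never emits raw newlines, so each record stays on one line.
            f.write(json.dumps(record, ensure_ascii=False) + "\n")


if __name__ == "__main__":
    write_ndjson(
        [{"id": 1, "client": {"name": "Ana"}}, {"id": 2, "client": None}],
        "data.ndjson",
    )
```

The resulting file can then be passed to pl.read_ndjson or pyarrow.json.read_json as shown in the list above.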
File details
Details for the file dataflat-3.1.0.tar.gz.
File metadata
- Download URL: dataflat-3.1.0.tar.gz
- Upload date:
- Size: 137.7 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.1.0 CPython/3.13.7
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | cc3c004967313a1de2aeccfae3b7a8969d19cb157b874f18a847645e18ca393b |
| MD5 | bd46e150f486581923b41b05eb90e9b5 |
| BLAKE2b-256 | 39b471ccd34c309471d63dcd6e64b54efae78202545913b0c424f47e66014d1c |
File details
Details for the file dataflat-3.1.0-py3-none-any.whl.
File metadata
- Download URL: dataflat-3.1.0-py3-none-any.whl
- Upload date:
- Size: 38.8 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.1.0 CPython/3.13.7
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 7aa1f577f4405148602ac5ab36ae4c1c6aaeb5770396c92fcf41490086985c5a |
| MD5 | db98795f1a75b460215d0071e42c525a |
| BLAKE2b-256 | 3d52e83562dab2391a3bee864f4eccf12ae2333f0fde8a4f68ede3e687eec734 |