A library to flatten nested data.

dataflat

A library to flatten nested keys and columns from Python Dictionaries, Pandas DataFrames, Polars DataFrames, PyArrow Tables, and PySpark DataFrames into a set of relational tables.

Installation

pip install dataflat

For Polars support:

pip install dataflat polars

For Pandas support:

pip install dataflat pandas pyarrow

For PyArrow support:

pip install dataflat pyarrow

For PySpark support:

pip install dataflat pyspark

Get started

Import handler, FlattenerOptions, and optionally CaseTranslatorOptions from dataflat:

from dataflat import handler, FlattenerOptions, CaseTranslatorOptions

FlattenerOptions selects the backend:

Option                           Input type
FlattenerOptions.DICTIONARY      Python dict
FlattenerOptions.PANDAS_DF       Pandas DataFrame
FlattenerOptions.POLARS_DF       Polars DataFrame
FlattenerOptions.PYARROW_TABLE   PyArrow Table
FlattenerOptions.PYSPARK_DF      PySpark DataFrame

CaseTranslatorOptions (optional) translates key / column names after flattening:

Option   Example output
SNAKE    order_id
CAMEL    orderId
PASCAL   OrderId
KEBAB    order-id
HUMAN    Order id
LOWER    orderid
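
To make the table concrete, here is a minimal sketch of how such a case translation could work. It is not dataflat's implementation: split_words and translate are hypothetical helpers, the source case is detected from the name rather than passed as from_case, and only plain names (no dotted paths) are handled.

```python
import re


def split_words(name: str) -> list[str]:
    """Split an identifier into lowercase words (hypothetical helper)."""
    # Insert a space at lower/digit -> upper boundaries: orderId -> order Id.
    spaced = re.sub(r"(?<=[a-z0-9])(?=[A-Z])", " ", name)
    # Then split on whitespace, underscores, and hyphens.
    return [w.lower() for w in re.split(r"[\s_\-]+", spaced) if w]


def translate(name: str, to_case: str) -> str:
    """Translate a name to one of the target cases listed in the table."""
    words = split_words(name)
    if not words:
        return name
    if to_case == "SNAKE":
        return "_".join(words)
    if to_case == "KEBAB":
        return "-".join(words)
    if to_case == "CAMEL":
        return words[0] + "".join(w.capitalize() for w in words[1:])
    if to_case == "PASCAL":
        return "".join(w.capitalize() for w in words)
    if to_case == "HUMAN":
        return " ".join(words).capitalize()
    if to_case == "LOWER":
        return "".join(words)
    raise ValueError(f"unknown case: {to_case}")


for case in ("SNAKE", "CAMEL", "PASCAL", "KEBAB", "HUMAN", "LOWER"):
    print(case, translate("orderId", case))
```

Splitting into words first and joining afterwards keeps every target case a one-line rule, which is why the sketch is organized that way.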

Instantiate a flattener

Use handler() to get a configured flattener instance:

from dataflat import handler, FlattenerOptions, CaseTranslatorOptions

flattener = handler(
    custom_flattener=FlattenerOptions.DICTIONARY,  # or PANDAS_DF / POLARS_DF / PYARROW_TABLE / PYSPARK_DF
    from_case=CaseTranslatorOptions.CAMEL,         # optional
    to_case=CaseTranslatorOptions.SNAKE,           # optional
    remove_special_chars=False,                    # strip non-alphanumeric chars (default False)
)

Or import the concrete class for your backend directly. Every backend module exports a class named CustomFlattener, so import only the one you need:

from dataflat.dictionary import CustomFlattener   # dict
from dataflat.pandas import CustomFlattener       # Pandas
from dataflat.polars import CustomFlattener       # Polars
from dataflat.pyarrow import CustomFlattener      # PyArrow
from dataflat.pyspark import CustomFlattener      # PySpark

Flatten data

All flatteners share the same flatten() signature:

flatten_data = flattener.flatten(
    data=data,                                      # dict, Pandas / Polars / PySpark DataFrame, or PyArrow Table
    primary_key="id",                               # optional; auto-generates UUID column when omitted
    entity_name="data",                             # root entity name, default "data"
    partition_keys=["date"],                        # columns propagated to all children
    black_list=["keys.to", "be.ignored"],           # fields excluded from output
    white_list=["orders.items", "orders.items.name"],  # entities/columns to retain (see below)
)

Parameters

  • primary_key — identifies each root record; propagated to all child entities as <entity_name>.<primary_key> (e.g. data.id). When None (the default), a UUID column is auto-generated with a name that matches to_case: dataflat_id_column (SNAKE), dataflat-id-column (KEBAB), dataflatIdColumn (CAMEL), DataflatIdColumn (PASCAL), Dataflat id column (HUMAN), dataflatidcolumn (LOWER).
  • partition_keys — additional root columns (e.g. ["date"]) inherited by every child entity.
  • black_list — dot-joined field paths excluded from all output (e.g. ["summary.totalClients"]).
  • white_list — dot-joined paths that select which entities and/or columns to retain after flattening (see White list).

Return value

flatten() returns dict[str, entity] — one key per entity, named by dot-joined path.

Nested fields are handled as follows:

  • Struct / object fields are expanded inline using dot-notation (payment.method, client.address.city).
  • Lists of objects produce a child entity (e.g. data.orders.products) with a positional index column (0-based, per-parent).
  • Lists of scalars (strings, ints, floats, booleans) are joined with "|" into a single string column in the parent entity; no child entity is created.

Example output:

{
  "data": [{"id": 1, "date": "2024-01-01", "tags": "featured|sale", "total": 1900}],
  "data.orders": [
    {"id": "abc123", "total": 700, "notes": "fragile|urgent", "data.id": 1, "data.date": "2024-01-01", "index": 0},
    {"id": "dfg456", "total": 1200, "notes": "gift", "data.id": 1, "data.date": "2024-01-01", "index": 1}
  ],
  "data.orders.products": [
    {"id": "ab", "price": 200, "data.id": 1, "data.date": "2024-01-01", "data.orders.index": 0, "index": 0},
    {"id": "cd", "price": 500, "data.id": 1, "data.date": "2024-01-01", "data.orders.index": 0, "index": 1},
    {"id": "fg", "price": 1200, "data.id": 1, "data.date": "2024-01-01", "data.orders.index": 1, "index": 0}
  ]
}

Produced from:

{
  "id": 1,
  "date": "2024-01-01",
  "tags": ["featured", "sale"],
  "orders": [
    {
      "id": "abc123",
      "products": [
        {"id": "ab", "price": 200},
        {"id": "cd", "price": 500}
      ],
      "total": 700,
      "notes": ["fragile", "urgent"]
    },
    {
      "id": "dfg456",
      "products": [
        {"id": "fg", "price": 1200}
      ],
      "total": 1200,
      "notes": ["gift"]
    }
  ],
  "total": 1900
}
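
The three nesting rules and the join-column naming can be sketched in plain Python. This is only an illustration for a single dict record, not dataflat's implementation: flatten_sketch is a hypothetical function, and it leaves out UUID generation, black/white lists, and case translation.

```python
def flatten_sketch(record, primary_key, entity_name="data", partition_keys=()):
    """Flatten one nested dict into {entity_path: [row, ...]} (illustrative only)."""
    entities = {}
    # Join columns every child inherits, named <entity_name>.<key>.
    inherited = {f"{entity_name}.{k}": record[k] for k in (primary_key, *partition_keys)}

    def walk(obj, entity, parent_cols, index):
        row, children = {}, []

        def expand(value, column):
            if isinstance(value, dict):
                # Struct: expanded inline with dot-notation.
                for key, child in value.items():
                    expand(child, f"{column}.{key}" if column else key)
            elif isinstance(value, list) and value and all(isinstance(v, dict) for v in value):
                # List of objects: becomes a child entity.
                children.append((f"{entity}.{column}", value))
            elif isinstance(value, list):
                # List of scalars (or empty list): joined into one string column.
                row[column] = "|".join(str(v) for v in value)
            else:
                row[column] = value

        expand(obj, "")
        row.update(parent_cols)
        if index is not None:
            row["index"] = index  # 0-based position within the parent
        entities.setdefault(entity, []).append(row)

        # Columns handed down: root keys, plus this entity's index once nested.
        child_cols = dict(inherited) if index is None else dict(parent_cols)
        if index is not None:
            child_cols[f"{entity}.index"] = index
        for child_entity, items in children:
            for i, item in enumerate(items):
                walk(item, child_entity, child_cols, i)

    walk(record, entity_name, {}, None)
    return entities


record = {
    "id": 1,
    "date": "2024-01-01",
    "tags": ["featured", "sale"],
    "payment": {"method": "card"},
    "orders": [
        {"id": "abc123", "products": [{"id": "ab", "price": 200}, {"id": "cd", "price": 500}]},
        {"id": "dfg456", "products": [{"id": "fg", "price": 1200}]},
    ],
}
result = flatten_sketch(record, primary_key="id", partition_keys=["date"])
for name, rows in result.items():
    print(name, rows)
```

Each child row carries the root key and partition columns plus the index of every ancestor list, which is what makes the resulting entities joinable as relational tables.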

White list

white_list filters the flattened output after all unnesting and before case translation. Entries are relative paths — entity_name (default "data") is prepended automatically.

Entity-level — if the entry matches an existing entity key, that entity and all its descendants are kept with every column:

results = flattener.flatten(
    data, primary_key="id", partition_keys=["date"],
    white_list=["orders.items", "orders.client.addresses"],
)
# Keeps: data.orders.items, data.orders.items.attributes, data.orders.client.addresses
# Drops: data, data.orders, data.orders.client, ...

Column-level — if the entry does not match any entity key, the flattener finds the entity whose path is a prefix of the entry, retains that entity narrowed to the specified column plus all inherited join columns (pk, partition keys, index columns), and drops all child entities of it:

results = flattener.flatten(
    data, primary_key="id", partition_keys=["date"],
    white_list=["orders.items.name", "orders.items.price", "summary.total_revenue"],
)
# Keeps: data (narrowed to id + date + summary.total_revenue)
#        data.orders.items (narrowed to data.id + data.date + data.orders.index + index + name + price)
# Drops: data.orders, data.orders.items.attributes, data.orders.client, ...

Rules:

  • Multiple column entries for the same entity are additive.
  • Entity-level overrides column-level for the same entity (full entity is kept).
  • Inherited join columns (pk, partition keys, index columns) are always preserved, even under column-level filtering.
  • An empty white_list (the default) keeps everything.
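
The rules above can be sketched as a filter over the dict of entities. This is a hypothetical reimplementation for illustration, not the library's code: apply_white_list is an invented name, and it assumes that inherited join columns in child entities are exactly "index" plus any column starting with "<entity_name>.".

```python
def apply_white_list(entities, white_list, primary_key, partition_keys=(), entity_name="data"):
    """Filter {entity_path: [rows]} by entity-level and column-level entries."""
    if not white_list:
        return entities  # an empty white list keeps everything

    full = set()    # entities kept with every column
    narrowed = {}   # entity -> explicitly requested columns
    for entry in white_list:
        path = f"{entity_name}.{entry}"  # entries are relative to entity_name
        if path in entities:
            # Entity-level: keep the entity and all of its descendants.
            full.update(e for e in entities if e == path or e.startswith(path + "."))
        else:
            # Column-level: the longest entity path that is a prefix of the entry.
            owners = [e for e in entities if path.startswith(e + ".")]
            if owners:
                owner = max(owners, key=len)
                narrowed.setdefault(owner, set()).add(path[len(owner) + 1:])

    root_keys = {primary_key, *partition_keys}

    def is_join_column(entity, column):
        if entity == entity_name:
            return column in root_keys
        # Assumption: inherited columns are "index" or prefixed with the root name.
        return column == "index" or column.startswith(entity_name + ".")

    result = {}
    for entity, rows in entities.items():
        if entity in full:
            result[entity] = rows  # entity-level overrides column-level
        elif entity in narrowed:
            keep = narrowed[entity]
            result[entity] = [
                {c: v for c, v in row.items() if c in keep or is_join_column(entity, c)}
                for row in rows
            ]
    return result


base = {"data.id": 1, "data.date": "2024-01-01"}
entities = {
    "data": [{"id": 1, "date": "2024-01-01",
              "summary.total_revenue": 1900, "summary.total_clients": 2}],
    "data.orders": [{"id": "abc123", **base, "index": 0}],
    "data.orders.items": [{"name": "pen", "price": 2, "sku": "p-1", **base,
                           "data.orders.index": 0, "index": 0}],
    "data.orders.items.attributes": [{"color": "blue", **base, "data.orders.index": 0,
                                      "data.orders.items.index": 0, "index": 0}],
}
column_level = apply_white_list(
    entities,
    ["orders.items.name", "orders.items.price", "summary.total_revenue"],
    primary_key="id", partition_keys=["date"],
)
entity_level = apply_white_list(entities, ["orders.items"], primary_key="id")
print(sorted(column_level), sorted(entity_level))
```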

Recommendations

  1. For PYSPARK_DF, enable case-sensitive mode in Spark:

    spark.conf.set("spark.sql.caseSensitive", True)
    
  2. For POLARS_DF with nullable struct columns, load data via pl.read_ndjson instead of pl.from_dicts — the latter does not handle nullable structs correctly.

    import polars as pl
    df = pl.read_ndjson("data.ndjson")
    
  3. For PANDAS_DF, the flattener internally converts the DataFrame to a PyArrow Table to reliably distinguish str, dict, and list columns (the numpy_nullable dtype backend marks all three as object). The result is returned with the pyarrow dtype backend (pd.ArrowDtype), giving properly nullable typed columns.

  4. For PYARROW_TABLE with single JSON objects (not NDJSON), use pa.Table.from_pylist:

    import json
    import pyarrow as pa
    with open("data.json") as f:
        data = json.load(f)
    table = pa.Table.from_pylist([data])
    

    For NDJSON or multiple records use pyarrow.json.read_json:

    import pyarrow.json as pa_json
    table = pa_json.read_json("data.ndjson")
    
