A library to flatten nested data.

dataflat

A library that flattens nested keys and columns from Python dictionaries, Pandas DataFrames, Polars DataFrames, PyArrow Tables, and PySpark DataFrames into a set of relational tables.

Installation

pip install dataflat

For Polars support:

pip install dataflat polars

For Pandas support:

pip install dataflat pandas pyarrow

For PyArrow support:

pip install dataflat pyarrow

For PySpark support:

pip install dataflat pyspark

Get started

Import handler, FlattenerOptions, and optionally CaseTranslatorOptions from dataflat:

from dataflat import handler, FlattenerOptions, CaseTranslatorOptions

FlattenerOptions selects the backend:

Option                          Input type
FlattenerOptions.DICTIONARY     Python dict
FlattenerOptions.PANDAS_DF      Pandas DataFrame
FlattenerOptions.POLARS_DF      Polars DataFrame
FlattenerOptions.PYARROW_TABLE  PyArrow Table
FlattenerOptions.PYSPARK_DF     PySpark DataFrame

CaseTranslatorOptions (optional) translates key / column names after flattening:

Option  Example output
SNAKE   order_id
CAMEL   orderId
PASCAL  OrderId
KEBAB   order-id
HUMAN   Order id
LOWER   orderid
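
The case options in the table above can be illustrated with a small standalone sketch. It does not use dataflat and is not the library's implementation: split_words and translate are hypothetical helpers, and the word-splitting rule (break at lower-to-upper boundaries, underscores, hyphens, and spaces) is an assumption.

```python
import re


def split_words(name: str) -> list[str]:
    """Split a camelCase, PascalCase, snake_case, or kebab-case name into lowercase words."""
    # Put a space at every lower-to-upper boundary, then split on separators.
    spaced = re.sub(r"(?<=[a-z0-9])(?=[A-Z])", " ", name)
    return [word.lower() for word in re.split(r"[\s_\-]+", spaced) if word]


def translate(name: str, to_case: str) -> str:
    """Hypothetical helper: render `name` in one of the six cases from the table."""
    words = split_words(name)
    if not words:
        return name
    if to_case == "SNAKE":
        return "_".join(words)
    if to_case == "KEBAB":
        return "-".join(words)
    if to_case == "CAMEL":
        return words[0] + "".join(w.capitalize() for w in words[1:])
    if to_case == "PASCAL":
        return "".join(w.capitalize() for w in words)
    if to_case == "HUMAN":
        return " ".join(words).capitalize()
    if to_case == "LOWER":
        return "".join(words)
    raise ValueError(f"unknown case: {to_case}")


examples = {case: translate("orderId", case)
            for case in ("SNAKE", "CAMEL", "PASCAL", "KEBAB", "HUMAN", "LOWER")}
```

With these assumptions, examples maps each option name to the corresponding "Example output" value from the table.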

Instantiate a flattener

Use handler() to get a configured flattener instance:

from dataflat import handler, FlattenerOptions, CaseTranslatorOptions

flattener = handler(
    custom_flattener=FlattenerOptions.DICTIONARY,  # or PANDAS_DF / POLARS_DF / PYARROW_TABLE / PYSPARK_DF
    from_case=CaseTranslatorOptions.CAMEL,         # optional
    to_case=CaseTranslatorOptions.SNAKE,           # optional
    remove_special_chars=False,                    # strip non-alphanumeric chars (default False)
)

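Or import the concrete class directly if you prefer. Every backend module exports a class with the same name, CustomFlattener, so import only the one that matches your input type: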

from dataflat.dictionary import CustomFlattener   # dict
from dataflat.pandas import CustomFlattener       # Pandas
from dataflat.polars import CustomFlattener       # Polars
from dataflat.pyarrow import CustomFlattener      # PyArrow
from dataflat.pyspark import CustomFlattener      # PySpark

Flatten data

All flatteners share the same flatten() signature:

flatten_data = flattener.flatten(
    data=data,                                      # dict / Pandas / Polars / PySpark DataFrame / PyArrow Table
    primary_key="id",                               # optional; auto-generates UUID column when omitted
    entity_name="data",                             # root entity name, default "data"
    partition_keys=["date"],                        # columns propagated to all children
    black_list=["keys.to", "be.ignored"],           # fields excluded from output
)

Parameters

  • primary_key — identifies each root record; propagated to all child entities as <entity_name>.<primary_key> (e.g. data.id). When None (the default), a UUID column is auto-generated with a name that matches to_case: dataflat_id_column (SNAKE), dataflat-id-column (KEBAB), dataflatIdColumn (CAMEL), DataflatIdColumn (PASCAL), Dataflat id column (HUMAN), dataflatidcolumn (LOWER).
  • partition_keys — additional root columns (e.g. ["date"]) inherited by every child entity.
  • black_list — dot-joined field paths excluded from all output (e.g. ["summary.totalClients"]).
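
To make the primary_key and black_list behaviour concrete, here is a minimal standalone sketch on plain dicts. It is not dataflat code: ensure_primary_key and drop_black_listed are hypothetical helpers, the use of uuid4 is an assumption, and black_list paths are assumed to be relative to the root record, as the ["summary.totalClients"] example suggests.

```python
import uuid


def ensure_primary_key(records, primary_key=None, id_column="dataflat_id_column"):
    """Return (rows, key_name); add a UUID column when no primary key is given."""
    if primary_key is not None:
        return [dict(record) for record in records], primary_key
    # Assumption: one random UUID (uuid4) per root record, stored as a string.
    return [{**record, id_column: str(uuid.uuid4())} for record in records], id_column


def drop_black_listed(record, black_list, prefix=""):
    """Remove every field whose dot-joined path appears in black_list."""
    kept = {}
    for key, value in record.items():
        path = f"{prefix}{key}"
        if path in black_list:
            continue
        if isinstance(value, dict):
            value = drop_black_listed(value, black_list, prefix=f"{path}.")
        kept[key] = value
    return kept


rows, key_name = ensure_primary_key([{"total": 1900}])
cleaned = drop_black_listed(
    {"id": 1, "summary": {"totalClients": 3, "totalOrders": 2}},
    black_list=["summary.totalClients"],
)
```

Here key_name falls back to the SNAKE variant of the generated column name, and cleaned keeps summary.totalOrders while dropping summary.totalClients.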

Return value

flatten() returns dict[str, entity] — one key per entity, named by dot-joined path.
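
Because every child entity carries the propagated <entity_name>.<primary_key> column, child rows can be matched back to their root record. The sketch below shows one way to do that. It assumes entities are lists of dicts (the concrete entity type depends on the backend), uses a hand-written stand-in instead of calling flatten(), and group_children is a hypothetical helper.

```python
from collections import defaultdict

# Hand-written stand-in for a flatten() result; real entities depend on the backend.
result = {
    "data": [{"id": 1, "total": 1900}],
    "data.orders": [
        {"id": "abc123", "total": 700, "data.id": 1, "index": 0},
        {"id": "dfg456", "total": 1200, "data.id": 1, "index": 1},
    ],
}


def group_children(result, root, child, primary_key):
    """Group the rows of `child` by the propagated '<root>.<primary_key>' column."""
    grouped = defaultdict(list)
    for row in result[child]:
        grouped[row[f"{root}.{primary_key}"]].append(row)
    # One entry per root record, even when it has no children.
    return {parent[primary_key]: grouped.get(parent[primary_key], [])
            for parent in result[root]}


orders_by_root = group_children(result, "data", "data.orders", "id")
order_ids = [row["id"] for row in orders_by_root[1]]
```

With DataFrame backends, the same relationship would typically be expressed as a join on the propagated key column.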

Nested fields are handled as follows:

  • Struct / object fields are expanded inline using dot-notation (payment.method, client.address.city).
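  • Lists of objects produce a child entity (e.g. data.orders.products) with a positional index column (0-based, per parent). Entities nested more than one level deep also carry the index of each ancestor list as <entity path>.index (e.g. data.orders.index).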
  • Lists of scalars (strings, ints, floats, booleans) are joined with "|" into a single string column in the parent entity; no child entity is created.

Example output:

{
  "data": [{"id": 1, "date": "2024-01-01", "tags": "featured|sale", "total": 1900}],
  "data.orders": [
    {"id": "abc123", "total": 700, "notes": "fragile|urgent", "data.id": 1, "data.date": "2024-01-01", "index": 0},
    {"id": "dfg456", "total": 1200, "notes": "gift", "data.id": 1, "data.date": "2024-01-01", "index": 1}
  ],
  "data.orders.products": [
    {"id": "ab", "price": 200, "data.id": 1, "data.date": "2024-01-01", "data.orders.index": 0, "index": 0},
    {"id": "cd", "price": 500, "data.id": 1, "data.date": "2024-01-01", "data.orders.index": 0, "index": 1},
    {"id": "fg", "price": 1200, "data.id": 1, "data.date": "2024-01-01", "data.orders.index": 1, "index": 0}
  ]
}

Produced from:

{
  "id": 1,
  "date": "2024-01-01",
  "tags": ["featured", "sale"],
  "orders": [
    {
      "id": "abc123",
      "products": [
        {"id": "ab", "price": 200},
        {"id": "cd", "price": 500}
      ],
      "total": 700,
      "notes": ["fragile", "urgent"]
    },
    {
      "id": "dfg456",
      "products": [
        {"id": "fg", "price": 1200}
      ],
      "total": 1200,
      "notes": ["gift"]
    }
  ],
  "total": 1900
}
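
The three rules for nested fields can be sketched in plain Python. This is an illustration of the documented behaviour on a single dict, not dataflat's implementation: flatten_sketch and its inner helpers are hypothetical, and details such as black_list handling, case translation, and key collisions are left out.

```python
def flatten_sketch(record, primary_key="id", entity_name="data", partition_keys=()):
    """Flatten one nested record into {entity_path: [rows]}."""
    entities = {}
    # Root columns copied into every child entity as '<entity_name>.<column>'.
    inherited = {f"{entity_name}.{key}": record[key]
                 for key in (primary_key, *partition_keys)}

    def expand(obj, prefix=""):
        """Split obj into inline columns and lists of objects (future child entities)."""
        row, children = {}, {}
        for key, value in obj.items():
            name = f"{prefix}{key}"
            if isinstance(value, dict):
                # Rule 1: structs are expanded inline with dot-notation.
                sub_row, sub_children = expand(value, prefix=f"{name}.")
                row.update(sub_row)
                children.update(sub_children)
            elif isinstance(value, list) and any(isinstance(v, dict) for v in value):
                # Rule 2: lists of objects become a child entity.
                children[name] = value
            elif isinstance(value, list):
                # Rule 3: lists of scalars are joined with "|".
                row[name] = "|".join(str(v) for v in value)
            else:
                row[name] = value
        return row, children

    def walk(obj, path, context, index=None):
        row, children = expand(obj)
        if index is not None:
            row = {**row, **context, "index": index}
        entities.setdefault(path, []).append(row)
        # Children inherit the root keys plus the index of every list ancestor.
        if index is None:
            child_context = dict(inherited)
        else:
            child_context = {**context, f"{path}.index": index}
        for name, items in children.items():
            for position, item in enumerate(items):
                walk(item, f"{path}.{name}", child_context, position)

    walk(record, entity_name, {})
    return entities


record = {
    "id": 1,
    "date": "2024-01-01",
    "tags": ["featured", "sale"],
    "orders": [
        {"id": "abc123", "total": 700, "notes": ["fragile", "urgent"],
         "products": [{"id": "ab", "price": 200}, {"id": "cd", "price": 500}]},
        {"id": "dfg456", "total": 1200, "notes": ["gift"],
         "products": [{"id": "fg", "price": 1200}]},
    ],
    "total": 1900,
}
flat = flatten_sketch(record, partition_keys=["date"])
```

Under these assumptions, flat contains the entities data, data.orders, and data.orders.products with the same rows as the example output above.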

Recommendations

  1. For PYSPARK_DF it is recommended to enable case-sensitive mode in Spark:

    spark.conf.set("spark.sql.caseSensitive", True)
    
  2. For POLARS_DF with nullable struct columns, load data via pl.read_ndjson instead of pl.from_dicts — the latter does not handle nullable structs correctly.

    import polars as pl
    df = pl.read_ndjson("data.ndjson")
    
  3. For PANDAS_DF, the flattener internally converts the DataFrame to a PyArrow Table to reliably distinguish str, dict, and list columns (pandas' default NumPy-backed dtypes store all three as object). The result is returned with the pyarrow dtype backend (pd.ArrowDtype), giving properly nullable typed columns.

  4. For PYARROW_TABLE with single JSON objects (not NDJSON), use pa.Table.from_pylist:

    import json
    import pyarrow as pa
    with open("data.json") as f:
        data = json.load(f)
    table = pa.Table.from_pylist([data])
    

    For NDJSON or multiple records use pyarrow.json.read_json:

    import pyarrow.json as pa_json
    table = pa_json.read_json("data.ndjson")
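
Recommendations 2 and 4 both rely on NDJSON input (one JSON object per line). If the source file holds a single object or a JSON array instead, it can be rewritten as NDJSON first. The sketch below uses only the standard library; json_array_to_ndjson is a hypothetical helper, not part of dataflat, and the file names are placeholders.

```python
import json
import tempfile
from pathlib import Path


def json_array_to_ndjson(src, dst):
    """Rewrite a JSON file holding one object or an array of objects as NDJSON."""
    data = json.loads(Path(src).read_text(encoding="utf-8"))
    records = data if isinstance(data, list) else [data]
    with open(dst, "w", encoding="utf-8") as out:
        for rec in records:
            # One compact JSON document per line.
            out.write(json.dumps(rec, ensure_ascii=False) + "\n")
    return len(records)


# Small demonstration in a temporary directory.
with tempfile.TemporaryDirectory() as tmp:
    src = Path(tmp) / "data.json"
    dst = Path(tmp) / "data.ndjson"
    src.write_text(json.dumps([{"id": 1}, {"id": 2, "client": None}]), encoding="utf-8")
    count = json_array_to_ndjson(src, dst)
    lines = dst.read_text(encoding="utf-8").splitlines()
```

The resulting file can then be passed to pl.read_ndjson or pyarrow.json.read_json as shown in the recommendations.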
    
