Databricks PySpark module that flattens nested Spark DataFrames, specifically struct and array-of-struct columns, down to a specified level

flatten_spark_dataframe

A lightweight PySpark utility to recursively flatten deeply nested Spark DataFrames — automatically expanding StructType and ArrayType(StructType) columns into clean, top-level columns.

Why use this library?

Working with nested JSON data in PySpark is painful. A single deeply nested struct can require dozens of lines of manual col("a.b.c").alias(...) expressions, and arrays of structs need explicit explode() calls — all of which must be written by hand for every schema.

flatten_spark_dataframe solves this in one line:

Without this library	With this library
Manually write `.select()` / `.withColumn()` for every nested field	One function call: `flatten(df)`
Must know the full schema upfront	Automatically discovers all nested columns
Exploding arrays requires separate steps	Arrays of structs are exploded + flattened automatically
Renaming nested fields is tedious	Clean `parent_child` naming convention applied automatically
No control over flattening depth	Control exactly how many levels to flatten
Keeping selected columns nested requires manual filtering	Pass an `exclude_list` to skip specific columns
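To make the left-hand column concrete, the sketch below lists the col(...).alias(...) expressions that would otherwise be written by hand for the sample schema used in the Detailed Example. It runs without Spark: the schema is described with plain nested dicts as a stand-in for StructType, and leaf_aliases is a hypothetical helper written for this illustration, not part of the library.

```python
def leaf_aliases(schema, prefix=()):
    """Yield (dotted_path, parent_child_alias) for every leaf field."""
    for field, dtype in schema.items():
        path = prefix + (field,)
        if isinstance(dtype, dict):
            # A nested dict stands in for a struct: recurse into it.
            yield from leaf_aliases(dtype, path)
        else:
            yield ".".join(path), "_".join(path)


# Plain-dict description of the sample schema (illustrative stand-in).
schema = {
    "name": {
        "firstname": {"initial": "string", "actualname": "string"},
        "middlename": "string",
        "lastname": "string",
    },
    "state": "string",
    "gender": "string",
    "country": {"city": "string", "street": "string"},
}

pairs = list(leaf_aliases(schema))
for path, alias in pairs:
    # Each line is one expression a manual .select() would need.
    print(f'col("{path}").alias("{alias}")')
```

Even this small schema needs eight such expressions, and they have to be rewritten whenever the schema changes.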

Installation

pip install flatten-spark-dataframe

Quick Start

import flatten_spark_dataframe

# Flatten everything (all levels)
flat_df = flatten_spark_dataframe.flatten(df)

# Flatten only 1 level deep
flat_df = flatten_spark_dataframe.flatten(df, flatten_till_level=1)

# Exclude specific columns from flattening
flat_df = flatten_spark_dataframe.flatten(df, exclude_list=["address", "metadata"])

Parameters

Parameter	Type	Default	Description
`df`	DataFrame	(required)	The input PySpark DataFrame with nested columns
`flatten_till_level`	`'complete'` or `int`	`'complete'`	`'complete'` flattens all levels; an integer limits the depth (e.g., `1` = one level only)
`exclude_list`	`list[str]`	`[]`	Column names (lowercase) to skip — these are kept nested in the output
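Because `exclude_list` expects lowercase names, a caller holding mixed-case column names may want to normalise them before the call. A minimal sketch; to_exclude_list is a hypothetical helper for illustration, and whether the library lowercases names itself is not stated here.

```python
def to_exclude_list(names):
    """Lowercase and de-duplicate column names, keeping first-seen order."""
    result = []
    for name in names:
        lowered = name.strip().lower()
        if lowered and lowered not in result:
            result.append(lowered)
    return result


exclude = to_exclude_list(["Address", " metadata ", "ADDRESS"])
print(exclude)
# The result could then be passed as exclude_list=exclude.
```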

Detailed Example

Sample Data (3 levels of nesting)

The schema below has a struct nested inside another struct (name.firstname.initial) and a second top-level struct (country):

root
 |-- name: struct
 |    |-- firstname: struct        ← Level 1
 |    |    |-- initial: string     ← Level 2
 |    |    |-- actualname: string  ← Level 2
 |    |-- middlename: string       ← Level 1
 |    |-- lastname: string         ← Level 1
 |-- state: string
 |-- gender: string
 |-- country: struct
 |    |-- city: string             ← Level 1
 |    |-- street: string           ← Level 1

from pyspark.sql.types import StructType, StructField, StringType

data = [
    ((("A", "James"), None, "Smith"), "OH", "M", ("F", "Mike")),
    ((("B", "Anna"), "Rose", ""), "NY", "F", ("E", "Jen")),
    ((("C", "Julia"), "", "Williams"), "OH", "F", ("D", "Maria")),
    ((("D", "Maria"), "Anne", "Jones"), "NY", "M", ("C", "Julia")),
    ((("E", "Jen"), "Mary", "Brown"), "NY", "M", ("B", "Anna")),
    ((("F", "Mike"), "Mary", "Williams"), "OH", "M", ("A", "James")),
]

schema = StructType([
    StructField('name', StructType([
        StructField('firstname', StructType([
            StructField('initial', StringType(), True),
            StructField('actualname', StringType(), True),
        ])),
        StructField('middlename', StringType(), True),
        StructField('lastname', StringType(), True),
    ])),
    StructField('state', StringType(), True),
    StructField('gender', StringType(), True),
    StructField('country', StructType([
        StructField('city', StringType(), True),
        StructField('street', StringType(), True),
    ])),
])

# `spark` is the active SparkSession (predefined in Databricks notebooks)
df = spark.createDataFrame(data=data, schema=schema)

Example 1: Flatten completely (all levels)

import flatten_spark_dataframe

flat_df = flatten_spark_dataframe.flatten(df)
flat_df.show()

What happens internally:

Level 1 — name is expanded to name_firstname (still a struct), name_middlename, name_lastname. country is expanded to country_city, country_street.
Level 2 — name_firstname is expanded to name_firstname_initial, name_firstname_actualname.
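The two passes described above can be modelled on column names alone. The sketch below is a pure-Python approximation (nested dicts stand in for structs, and expand_once is a hypothetical name); it is not the library's implementation and ignores arrays and name collisions.

```python
def expand_once(columns):
    """One pass: keep non-struct columns, then append each struct's children."""
    kept, expanded = {}, {}
    for name, dtype in columns.items():
        if isinstance(dtype, dict):
            # Struct: replace it with its direct children, named parent_child.
            for child, child_type in dtype.items():
                expanded[f"{name}_{child}"] = child_type
        else:
            kept[name] = dtype
    return {**kept, **expanded}


columns = {
    "name": {
        "firstname": {"initial": "string", "actualname": "string"},
        "middlename": "string",
        "lastname": "string",
    },
    "state": "string",
    "gender": "string",
    "country": {"city": "string", "street": "string"},
}

level_1 = expand_once(columns)   # name_firstname is still a struct here
level_2 = expand_once(level_1)   # nothing nested remains
print(list(level_1))
print(list(level_2))
```

Under this model, already-flat columns stay in front and newly expanded fields are appended on each pass.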

Output:

+-----+------+---------------+-------------+------------+--------------+----------------------+-------------------------+
|state|gender|name_middlename|name_lastname|country_city|country_street|name_firstname_initial|name_firstname_actualname|
+-----+------+---------------+-------------+------------+--------------+----------------------+-------------------------+
|   OH|     M|           null|        Smith|           F|          Mike|                     A|                    James|
|   NY|     F|           Rose|             |           E|           Jen|                     B|                     Anna|
|   OH|     F|               |     Williams|           D|         Maria|                     C|                    Julia|
|   NY|     M|           Anne|        Jones|           C|         Julia|                     D|                    Maria|
|   NY|     M|           Mary|        Brown|           B|          Anna|                     E|                      Jen|
|   OH|     M|           Mary|     Williams|           A|         James|                     F|                     Mike|
+-----+------+---------------+-------------+------------+--------------+----------------------+-------------------------+

All nested structs have been fully flattened into 8 top-level columns.

Example 2: Flatten only 1 level deep

flat_df_l1 = flatten_spark_dataframe.flatten(df, flatten_till_level=1)
flat_df_l1.printSchema()

Output schema:

root
 |-- state: string
 |-- gender: string
 |-- name_firstname: struct       ← Still nested (would need level 2 to expand)
 |    |-- initial: string
 |    |-- actualname: string
 |-- name_middlename: string      ← Flattened from name.middlename
 |-- name_lastname: string        ← Flattened from name.lastname
 |-- country_city: string         ← Flattened from country.city
 |-- country_street: string       ← Flattened from country.street

Only the first level of structs is expanded. name_firstname remains a struct because its fields (initial, actualname) sit at level 2.

Example 3: Flatten with exclusions

flat_df_excl = flatten_spark_dataframe.flatten(df, exclude_list=["country"])
flat_df_excl.printSchema()

Output schema:

root
 |-- country: struct              ← Kept nested (excluded)
 |    |-- city: string
 |    |-- street: string
 |-- state: string
 |-- gender: string
 |-- name_middlename: string
 |-- name_lastname: string
 |-- name_firstname_initial: string
 |-- name_firstname_actualname: string

The country struct is preserved as-is while everything else is fully flattened.

Example 4: Combine level control + exclusions

flat_df_combo = flatten_spark_dataframe.flatten(df, flatten_till_level=1, exclude_list=["country"])
flat_df_combo.printSchema()

Output schema:

root
 |-- country: struct              ← Excluded — kept nested
 |    |-- city: string
 |    |-- street: string
 |-- state: string
 |-- gender: string
 |-- name_firstname: struct       ← Level 2 — not flattened (limit = 1)
 |    |-- initial: string
 |    |-- actualname: string
 |-- name_middlename: string
 |-- name_lastname: string
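The way `flatten_till_level` and `exclude_list` interact in Examples 1 to 4 can be approximated with a small pure-Python model of the column names. This is a sketch under assumptions: flatten_names is a hypothetical function, nested dicts stand in for structs, exclusions are matched against top-level names only, and excluded columns are placed first simply because that is the order the example schemas show.

```python
def flatten_names(columns, flatten_till_level="complete", exclude_list=()):
    """Return the column names expected after flattening (names only)."""
    # Excluded columns are set aside untouched (assumption: top-level match).
    excluded = {n: t for n, t in columns.items() if n.lower() in exclude_list}
    current = {n: t for n, t in columns.items() if n not in excluded}
    level = 0
    while any(isinstance(t, dict) for t in current.values()):
        if flatten_till_level != "complete" and level >= flatten_till_level:
            break  # depth limit reached; remaining structs stay nested
        nxt = {n: t for n, t in current.items() if not isinstance(t, dict)}
        for name, dtype in current.items():
            if isinstance(dtype, dict):
                for child, child_type in dtype.items():
                    nxt[f"{name}_{child}"] = child_type
        current = nxt
        level += 1
    return list(excluded) + list(current)


columns = {
    "name": {
        "firstname": {"initial": "string", "actualname": "string"},
        "middlename": "string",
        "lastname": "string",
    },
    "state": "string",
    "gender": "string",
    "country": {"city": "string", "street": "string"},
}

print(flatten_names(columns, flatten_till_level=1, exclude_list=["country"]))
```

For the sample schema, the model yields the same column names, in the same order, as the four example outputs.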

How it works

1. Classifies columns into flat (primitives), struct, and array-of-struct categories
2. Expands structs into sub-fields using parent_child naming (special characters are cleaned)
3. Explodes arrays of structs using explode_outer() (preserves rows even when the array is null/empty)
4. Recurses until all levels are flattened or the depth limit is reached
5. Handles duplicates: if a flattened field name collides with an existing column, a suffix is appended
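Three of the steps above (name cleaning, explode_outer semantics, duplicate handling) can be illustrated in plain Python. The exact cleaning rule and suffix format are not specified here, so the choices below are assumptions: any character other than a letter, digit, or underscore becomes "_", and collisions get a numeric suffix. All function names and the sample rows are made up for the illustration.

```python
import re


def clean_name(name):
    # Assumed rule: replace anything that is not a letter, digit, or "_".
    return re.sub(r"[^0-9A-Za-z_]", "_", name)


def child_name(parent, child, taken):
    """Build parent_child and append a numeric suffix on collision (assumed format)."""
    candidate = clean_name(f"{parent}_{child}")
    if candidate not in taken:
        return candidate
    n = 1
    while f"{candidate}_{n}" in taken:
        n += 1
    return f"{candidate}_{n}"


def explode_outer_rows(rows, column):
    """Model of explode_outer: one output row per element, and a single
    row with None when the array is null or empty (the row is not dropped)."""
    out = []
    for row in rows:
        items = row.get(column)
        if not items:
            out.append({**row, column: None})
        else:
            out.extend({**row, column: item} for item in items)
    return out


rows = [
    {"id": 1, "orders": [{"sku": "a"}, {"sku": "b"}]},
    {"id": 2, "orders": []},
    {"id": 3, "orders": None},
]
exploded = explode_outer_rows(rows, "orders")
print([r["id"] for r in exploded])
print(child_name("name", "first name", {"name_first_name"}))
```

A plain explode() would drop ids 2 and 3 entirely, which is why the outer variant matters when rows must be preserved.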

License

MIT

Release history

0.0.2 (Mar 29, 2026, this version)
0.0.1 (Aug 26, 2022)

Download files

Source Distribution

flatten_spark_dataframe-0.0.2.tar.gz (7.9 kB)

Uploaded Mar 29, 2026 Source

Built Distribution

flatten_spark_dataframe-0.0.2-py3-none-any.whl (7.8 kB)

Uploaded Mar 29, 2026 Python 3

File details

Details for the file flatten_spark_dataframe-0.0.2.tar.gz.

File metadata

Download URL: flatten_spark_dataframe-0.0.2.tar.gz
Upload date: Mar 29, 2026
Size: 7.9 kB
Tags: Source
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.2.0 CPython/3.12.9

File hashes

Hashes for flatten_spark_dataframe-0.0.2.tar.gz
Algorithm	Hash digest
SHA256	`fcc95a1772516e0157f8cfa50a09d252c094d932b05904f33e16c1646c8a2aab`
MD5	`983ada36b257b429b75e62238f35cf26`
BLAKE2b-256	`60c8fd8d3b8e342e4d7db2c78ed46bd6e1273b9eebe31aa19d746b443c05ec42`
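A downloaded file can be checked against the SHA256 digest listed above with Python's standard hashlib module. A minimal sketch; sha256_of_stream and verify_file are hypothetical helper names, and the local path is whatever location the file was saved to.

```python
import hashlib
import io


def sha256_of_stream(stream, chunk_size=65536):
    """Hash a binary stream in chunks so large files are not loaded at once."""
    digest = hashlib.sha256()
    for chunk in iter(lambda: stream.read(chunk_size), b""):
        digest.update(chunk)
    return digest.hexdigest()


def verify_file(path, expected_hex):
    """Return True when the file's SHA256 matches the published digest."""
    with open(path, "rb") as fh:
        return sha256_of_stream(fh) == expected_hex.lower()


# Digest published for flatten_spark_dataframe-0.0.2.tar.gz
EXPECTED_SDIST_SHA256 = (
    "fcc95a1772516e0157f8cfa50a09d252c094d932b05904f33e16c1646c8a2aab"
)

# Usage, assuming the sdist was saved to the current directory:
# verify_file("flatten_spark_dataframe-0.0.2.tar.gz", EXPECTED_SDIST_SHA256)

# Self-check on in-memory bytes, so the snippet runs without the download.
sample = sha256_of_stream(io.BytesIO(b"abc"))
```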

File details

Details for the file flatten_spark_dataframe-0.0.2-py3-none-any.whl.

File metadata

Download URL: flatten_spark_dataframe-0.0.2-py3-none-any.whl
Upload date: Mar 29, 2026
Size: 7.8 kB
Tags: Python 3
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.2.0 CPython/3.12.9

File hashes

Hashes for flatten_spark_dataframe-0.0.2-py3-none-any.whl
Algorithm	Hash digest
SHA256	`f9f69627f798f81bbfacdcfb2236846e20b657ce9b4fc8e276a4b2bfb395ef59`
MD5	`fcd2202f6bba03af8cdbd3af57aafa1a`
BLAKE2b-256	`ff05d9b862be4e21dc948ddf3faaacf1fb842f748555818a7767bf86bf4e1014`
