Databricks PySpark module to flatten nested Spark DataFrames: it expands struct and array-of-struct columns down to a specified level.
flatten_spark_dataframe
A lightweight PySpark utility to recursively flatten deeply nested Spark DataFrames — automatically expanding StructType and ArrayType(StructType) columns into clean, top-level columns.
Why use this library?
Working with nested JSON data in PySpark is painful. A single deeply nested struct can require dozens of lines of manual col("a.b.c").alias(...) expressions, and arrays of structs need explicit explode() calls — all of which must be written by hand for every schema.
flatten_spark_dataframe solves this in one line:
| Without this library | With this library |
|---|---|
| Manually write .select() / .withColumn() for every nested field | One function call: flatten(df) |
| Must know the full schema upfront | Automatically discovers all nested columns |
| Exploding arrays requires separate steps | Arrays of structs are exploded + flattened automatically |
| Renaming nested fields is tedious | Clean parent_child naming convention applied automatically |
| No control over flattening depth | Control exactly how many levels to flatten |
| Keeping selected columns nested requires manual filtering | Pass an exclude_list to skip specific columns |
Installation
pip install flatten-spark-dataframe
Quick Start
import flatten_spark_dataframe
# Flatten everything (all levels)
flat_df = flatten_spark_dataframe.flatten(df)
# Flatten only 1 level deep
flat_df = flatten_spark_dataframe.flatten(df, flatten_till_level=1)
# Exclude specific columns from flattening
flat_df = flatten_spark_dataframe.flatten(df, exclude_list=["address", "metadata"])
Parameters
| Parameter | Type | Default | Description |
|---|---|---|---|
| df | DataFrame | (required) | The input PySpark DataFrame with nested columns |
| flatten_till_level | 'complete' or int | 'complete' | 'complete' flattens all levels; an integer limits the depth (e.g., 1 = one level only) |
| exclude_list | list[str] | [] | Column names (lowercase) to skip; these are kept nested in the output |
Detailed Example
Sample Data (3 levels of nesting)
The schema below has struct inside struct (name.firstname.initial) and a top-level struct (country):
root
|-- name: struct
| |-- firstname: struct ← Level 1
| | |-- initial: string ← Level 2
| | |-- actualname: string ← Level 2
| |-- middlename: string ← Level 1
| |-- lastname: string ← Level 1
|-- state: string
|-- gender: string
|-- country: struct
| |-- city: string ← Level 1
| |-- street: string ← Level 1
from pyspark.sql.types import StructType, StructField, StringType
data = [
    ((("A", "James"), None, "Smith"), "OH", "M", ("F", "Mike")),
    ((("B", "Anna"), "Rose", ""), "NY", "F", ("E", "Jen")),
    ((("C", "Julia"), "", "Williams"), "OH", "F", ("D", "Maria")),
    ((("D", "Maria"), "Anne", "Jones"), "NY", "M", ("C", "Julia")),
    ((("E", "Jen"), "Mary", "Brown"), "NY", "M", ("B", "Anna")),
    ((("F", "Mike"), "Mary", "Williams"), "OH", "M", ("A", "James")),
]
schema = StructType([
    StructField('name', StructType([
        StructField('firstname', StructType([
            StructField('initial', StringType(), True),
            StructField('actualname', StringType(), True),
        ])),
        StructField('middlename', StringType(), True),
        StructField('lastname', StringType(), True),
    ])),
    StructField('state', StringType(), True),
    StructField('gender', StringType(), True),
    StructField('country', StructType([
        StructField('city', StringType(), True),
        StructField('street', StringType(), True),
    ])),
])
# `spark` is the active SparkSession (predefined in Databricks notebooks)
df = spark.createDataFrame(data=data, schema=schema)
Example 1: Flatten completely (all levels)
import flatten_spark_dataframe
flat_df = flatten_spark_dataframe.flatten(df)
flat_df.show()
What happens internally:
- Level 1: name is expanded to name_firstname (still a struct), name_middlename, name_lastname. country is expanded to country_city, country_street.
- Level 2: name_firstname is expanded to name_firstname_initial, name_firstname_actualname.
Output:
+-----+------+---------------+-------------+------------+--------------+----------------------+-------------------------+
|state|gender|name_middlename|name_lastname|country_city|country_street|name_firstname_initial|name_firstname_actualname|
+-----+------+---------------+-------------+------------+--------------+----------------------+-------------------------+
|   OH|     M|           null|        Smith|           F|          Mike|                     A|                    James|
|   NY|     F|           Rose|             |           E|           Jen|                     B|                     Anna|
|   OH|     F|               |     Williams|           D|         Maria|                     C|                    Julia|
|   NY|     M|           Anne|        Jones|           C|         Julia|                     D|                    Maria|
|   NY|     M|           Mary|        Brown|           B|          Anna|                     E|                      Jen|
|   OH|     M|           Mary|     Williams|           A|         James|                     F|                     Mike|
+-----+------+---------------+-------------+------------+--------------+----------------------+-------------------------+
All nested structs have been fully flattened into 8 top-level columns.
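The level-by-level expansion in Example 1 can be modelled without Spark. The sketch below is not the library's code: it represents the sample schema as a nested dict (a dict value stands for a struct, a string for a primitive type), and `expand_one_level` is a hypothetical helper that applies the parent_child naming one level at a time. Column order in the real output may differ from this model, so the names are printed sorted.

```python
# Illustrative model only: a dict value stands for a struct, a str for a primitive type.
schema = {
    "name": {
        "firstname": {"initial": "string", "actualname": "string"},
        "middlename": "string",
        "lastname": "string",
    },
    "state": "string",
    "gender": "string",
    "country": {"city": "string", "street": "string"},
}


def expand_one_level(columns):
    """Expand every top-level struct by one level using parent_child names."""
    expanded = {}
    for name, dtype in columns.items():
        if isinstance(dtype, dict):
            # Struct: promote each child to a top-level column.
            for child, child_type in dtype.items():
                expanded[f"{name}_{child}"] = child_type
        else:
            # Primitive: keep as-is.
            expanded[name] = dtype
    return expanded


level_1 = expand_one_level(schema)   # name_firstname is still a struct here
level_2 = expand_one_level(level_1)  # nothing nested remains

print(sorted(level_1))
print(sorted(level_2))
```

Stopping after the first call corresponds to flatten_till_level=1; repeating until no dict values remain corresponds to the default 'complete' behaviour.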
Example 2: Flatten only 1 level deep
flat_df_l1 = flatten_spark_dataframe.flatten(df, flatten_till_level=1)
flat_df_l1.printSchema()
Output schema:
root
|-- state: string
|-- gender: string
|-- name_firstname: struct ← Still nested (would need level 2 to expand)
| |-- initial: string
| |-- actualname: string
|-- name_middlename: string ← Flattened from name.middlename
|-- name_lastname: string ← Flattened from name.lastname
|-- country_city: string ← Flattened from country.city
|-- country_street: string ← Flattened from country.street
Only the first level of structs is expanded. name_firstname remains a struct because it was at level 2.
Example 3: Flatten with exclusions
flat_df_excl = flatten_spark_dataframe.flatten(df, exclude_list=["country"])
flat_df_excl.printSchema()
Output schema:
root
|-- country: struct ← Kept nested (excluded)
| |-- city: string
| |-- street: string
|-- state: string
|-- gender: string
|-- name_middlename: string
|-- name_lastname: string
|-- name_firstname_initial: string
|-- name_firstname_actualname: string
The country struct is preserved as-is while everything else is fully flattened.
Example 4: Combine level control + exclusions
flat_df_combo = flatten_spark_dataframe.flatten(df, flatten_till_level=1, exclude_list=["country"])
flat_df_combo.printSchema()
Output schema:
root
|-- country: struct ← Excluded — kept nested
| |-- city: string
| |-- street: string
|-- state: string
|-- gender: string
|-- name_firstname: struct ← Level 2 — not flattened (limit = 1)
| |-- initial: string
| |-- actualname: string
|-- name_middlename: string
|-- name_lastname: string
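When a depth limit and exclusions are combined, it can help to check which columns are still nested afterwards. The sketch below assumes only that `DataFrame.dtypes` returns a list of (name, type string) pairs, where struct columns read like `struct<...>` and arrays of structs like `array<struct<...>>`. The helper `nested_columns` is hypothetical and not part of this library, and the list passed to it is a hand-written stand-in for `flat_df_combo.dtypes`, so the snippet runs without a Spark session.

```python
def nested_columns(dtypes):
    """Return names whose type string marks a struct or an array of structs.

    `dtypes` is a list of (name, type_string) pairs, the shape returned by
    DataFrame.dtypes in PySpark.
    """
    prefixes = ("struct<", "array<struct<")
    return [name for name, type_string in dtypes if type_string.startswith(prefixes)]


# Hand-written stand-in for flat_df_combo.dtypes, matching the schema shown above.
combo_dtypes = [
    ("country", "struct<city:string,street:string>"),
    ("state", "string"),
    ("gender", "string"),
    ("name_firstname", "struct<initial:string,actualname:string>"),
    ("name_middlename", "string"),
    ("name_lastname", "string"),
]

print(nested_columns(combo_dtypes))  # ['country', 'name_firstname']
```

With a live DataFrame, the call would be `nested_columns(flat_df_combo.dtypes)`; an empty result means nothing is left to flatten.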
How it works
- Classifies columns into flat (primitives), struct, and array-of-struct categories
- Expands structs into sub-fields using parent_child naming (special characters are cleaned)
- Explodes arrays of structs using explode_outer() (preserves rows even when the array is null/empty)
- Recurses until all levels are flattened or the depth limit is reached
- Handles duplicates: if a flattened field name collides with an existing column, a suffix is appended
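The naming and duplicate-handling steps above are not specified in detail here, so the following is only a sketch under stated assumptions: that "cleaning" means replacing anything other than letters, digits, and underscores with an underscore, and that the collision suffix is a running number. Both rules, and the helper names `clean_name` and `unique_name`, are hypothetical; the library's actual behaviour may differ.

```python
import re


def clean_name(parent, child):
    """Join parent and child with "_" and replace other characters with "_" (assumed rule)."""
    return re.sub(r"[^0-9A-Za-z_]", "_", f"{parent}_{child}")


def unique_name(candidate, taken):
    """Append a numeric suffix until the name is unused (assumed suffix format)."""
    name, counter = candidate, 1
    while name in taken:
        name = f"{candidate}_{counter}"
        counter += 1
    taken.add(name)
    return name


# Columns that already exist at the top level.
taken = {"state", "name_lastname"}

print(unique_name(clean_name("name", "last-name"), taken))  # name_last_name
print(unique_name(clean_name("name", "lastname"), taken))   # name_lastname_1
```

Tracking used names in a set keeps each lookup cheap and guarantees that every generated column name is distinct, which Spark needs for unambiguous column references.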
License