Databricks PySpark module that flattens nested Spark DataFrames, expanding structs and arrays of structs down to a specified level

Project description

flatten_spark_dataframe

A lightweight PySpark utility to recursively flatten deeply nested Spark DataFrames — automatically expanding StructType and ArrayType(StructType) columns into clean, top-level columns.


Why use this library?

Working with nested JSON data in PySpark is painful. A single deeply nested struct can require dozens of lines of manual col("a.b.c").alias(...) expressions, and arrays of structs need explicit explode() calls — all of which must be written by hand for every schema.

flatten_spark_dataframe solves this in one line:

| Without this library | With this library |
|---|---|
| Manually write .select() / .withColumn() for every nested field | One function call: flatten(df) |
| Must know the full schema upfront | Automatically discovers all nested columns |
| Exploding arrays requires separate steps | Arrays of structs are exploded + flattened automatically |
| Renaming nested fields is tedious | Clean parent_child naming convention applied automatically |
| No control over flattening depth | Control exactly how many levels to flatten |
| Keeping selected columns nested requires manual filtering | Pass an exclude_list to skip specific columns |

Installation

pip install flatten-spark-dataframe

Quick Start

import flatten_spark_dataframe

# Flatten everything (all levels)
flat_df = flatten_spark_dataframe.flatten(df)

# Flatten only 1 level deep
flat_df = flatten_spark_dataframe.flatten(df, flatten_till_level=1)

# Exclude specific columns from flattening
flat_df = flatten_spark_dataframe.flatten(df, exclude_list=["address", "metadata"])

Parameters

| Parameter | Type | Default | Description |
|---|---|---|---|
| df | DataFrame | (required) | The input PySpark DataFrame with nested columns |
| flatten_till_level | 'complete' or int | 'complete' | 'complete' flattens all levels; an integer limits the depth (e.g., 1 = one level only) |
| exclude_list | list[str] | [] | Column names (lowercase) to skip; these are kept nested in the output |
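
One way to read the two optional parameters is sketched below in plain Python. The helper `normalize_args` is hypothetical and is not part of the library's API; it only shows how 'complete' versus an integer could map to a depth limit, and it assumes that the "(lowercase)" note means names are matched in lowercase.

```python
def normalize_args(flatten_till_level="complete", exclude_list=None):
    """Hypothetical helper, not part of the library's API."""
    if flatten_till_level == "complete":
        max_depth = None  # None stands for "no depth limit"
    elif (isinstance(flatten_till_level, int)
          and not isinstance(flatten_till_level, bool)
          and flatten_till_level >= 0):
        max_depth = flatten_till_level
    else:
        raise ValueError("flatten_till_level must be 'complete' or a non-negative int")
    # Assumption: names are matched in lowercase, as the "(lowercase)" note suggests.
    excluded = {name.lower() for name in (exclude_list or [])}
    return max_depth, excluded


print(normalize_args())  # (None, set())
print(normalize_args(1, ["Address", "metadata"]))
```

If your column names are mixed-case, lowercasing them before building exclude_list, as in the last line of the sketch, is probably the safest choice.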

Detailed Example

Sample Data (3 levels of nesting)

The schema below has a struct inside a struct (name.firstname.initial) and a second top-level struct (country):

root
 |-- name: struct
 |    |-- firstname: struct        ← Level 1
 |    |    |-- initial: string     ← Level 2
 |    |    |-- actualname: string  ← Level 2
 |    |-- middlename: string       ← Level 1
 |    |-- lastname: string         ← Level 1
 |-- state: string
 |-- gender: string
 |-- country: struct
 |    |-- city: string             ← Level 1
 |    |-- street: string           ← Level 1

from pyspark.sql.types import StructType, StructField, StringType

data = [
    ((("A", "James"), None, "Smith"), "OH", "M", ("F", "Mike")),
    ((("B", "Anna"), "Rose", ""), "NY", "F", ("E", "Jen")),
    ((("C", "Julia"), "", "Williams"), "OH", "F", ("D", "Maria")),
    ((("D", "Maria"), "Anne", "Jones"), "NY", "M", ("C", "Julia")),
    ((("E", "Jen"), "Mary", "Brown"), "NY", "M", ("B", "Anna")),
    ((("F", "Mike"), "Mary", "Williams"), "OH", "M", ("A", "James")),
]

schema = StructType([
    StructField('name', StructType([
        StructField('firstname', StructType([
            StructField('initial', StringType(), True),
            StructField('actualname', StringType(), True),
        ])),
        StructField('middlename', StringType(), True),
        StructField('lastname', StringType(), True),
    ])),
    StructField('state', StringType(), True),
    StructField('gender', StringType(), True),
    StructField('country', StructType([
        StructField('city', StringType(), True),
        StructField('street', StringType(), True),
    ])),
])

# `spark` is the active SparkSession (predefined in Databricks notebooks)
df = spark.createDataFrame(data=data, schema=schema)

Example 1: Flatten completely (all levels)

import flatten_spark_dataframe

flat_df = flatten_spark_dataframe.flatten(df)
flat_df.show()

What happens internally:

  • Level 1: name is expanded to name_firstname (still a struct), name_middlename, name_lastname. country is expanded to country_city, country_street.
  • Level 2: name_firstname is expanded to name_firstname_initial, name_firstname_actualname.
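
To make the two passes concrete, here is a minimal plain-Python sketch that models the sample schema as nested dicts (a dict stands for a struct, a string for a primitive type) and expands one level at a time. The names `expand_one_level` and `flatten_schema` are hypothetical and do not come from the library; the sketch mirrors only the parent_child naming described above, does not touch Spark, and does not model the library's column order.

```python
# Illustrative model only: a dict stands for a struct, a string for a primitive type.
schema = {
    "name": {
        "firstname": {"initial": "string", "actualname": "string"},
        "middlename": "string",
        "lastname": "string",
    },
    "state": "string",
    "gender": "string",
    "country": {"city": "string", "street": "string"},
}


def expand_one_level(schema):
    """Replace every top-level struct with its children, named parent_child."""
    expanded = {}
    for column, dtype in schema.items():
        if isinstance(dtype, dict):
            for child, child_type in dtype.items():
                expanded[f"{column}_{child}"] = child_type
        else:
            expanded[column] = dtype
    return expanded


def flatten_schema(schema, levels=None):
    """Expand repeatedly; levels=None means continue until no struct is left."""
    done = 0
    while any(isinstance(t, dict) for t in schema.values()):
        if levels is not None and done >= levels:
            break
        schema = expand_one_level(schema)
        done += 1
    return schema


level_1 = flatten_schema(schema, levels=1)  # name_firstname is still a struct here
complete = flatten_schema(schema)           # a second pass expands name_firstname
print(sorted(level_1))
print(sorted(complete))
```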

Output:

+-----+------+---------------+-------------+------------+--------------+----------------------+-------------------------+
|state|gender|name_middlename|name_lastname|country_city|country_street|name_firstname_initial|name_firstname_actualname|
+-----+------+---------------+-------------+------------+--------------+----------------------+-------------------------+
|   OH|     M|           null|        Smith|           F|          Mike|                     A|                    James|
|   NY|     F|           Rose|             |           E|           Jen|                     B|                     Anna|
|   OH|     F|               |     Williams|           D|         Maria|                     C|                    Julia|
|   NY|     M|           Anne|        Jones|           C|         Julia|                     D|                    Maria|
|   NY|     M|           Mary|        Brown|           B|          Anna|                     E|                      Jen|
|   OH|     M|           Mary|     Williams|           A|         James|                     F|                     Mike|
+-----+------+---------------+-------------+------------+--------------+----------------------+-------------------------+

All nested structs have been fully flattened into 8 top-level columns.


Example 2: Flatten only 1 level deep

flat_df_l1 = flatten_spark_dataframe.flatten(df, flatten_till_level=1)
flat_df_l1.printSchema()

Output schema:

root
 |-- state: string
 |-- gender: string
 |-- name_firstname: struct       ← Still nested (would need level 2 to expand)
 |    |-- initial: string
 |    |-- actualname: string
 |-- name_middlename: string      ← Flattened from name.middlename
 |-- name_lastname: string        ← Flattened from name.lastname
 |-- country_city: string         ← Flattened from country.city
 |-- country_street: string       ← Flattened from country.street

Only the first level of structs is expanded. name_firstname remains a struct because it sits at level 2, beyond the requested depth.


Example 3: Flatten with exclusions

flat_df_excl = flatten_spark_dataframe.flatten(df, exclude_list=["country"])
flat_df_excl.printSchema()

Output schema:

root
 |-- country: struct              ← Kept nested (excluded)
 |    |-- city: string
 |    |-- street: string
 |-- state: string
 |-- gender: string
 |-- name_middlename: string
 |-- name_lastname: string
 |-- name_firstname_initial: string
 |-- name_firstname_actualname: string

The country struct is preserved as-is while everything else is fully flattened.


Example 4: Combine level control + exclusions

flat_df_combo = flatten_spark_dataframe.flatten(df, flatten_till_level=1, exclude_list=["country"])
flat_df_combo.printSchema()

Output schema:

root
 |-- country: struct              ← Excluded — kept nested
 |    |-- city: string
 |    |-- street: string
 |-- state: string
 |-- gender: string
 |-- name_firstname: struct       ← Level 2 — not flattened (limit = 1)
 |    |-- initial: string
 |    |-- actualname: string
 |-- name_middlename: string
 |-- name_lastname: string

How it works

  1. Classifies columns into flat (primitives), struct, and array-of-struct categories
  2. Expands structs into sub-fields using parent_child naming (special characters are cleaned)
  3. Explodes arrays of structs using explode_outer() (preserves rows even when the array is null/empty)
  4. Recurses until all levels are flattened or the depth limit is reached
  5. Handles duplicates — if a flattened field name collides with an existing column, a suffix is appended
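
The steps above can be sketched in plain Python on dict records, where a dict stands for a struct and a list of dicts for an array of structs. This is an illustration under stated assumptions, not the library's implementation: all function names are hypothetical, the cleaning rule (non-alphanumeric characters become "_") and the collision suffix (_1, _2, ...) are guesses, and an exploded array is counted as one pass toward the depth limit.

```python
import re


def clean(name):
    # Assumed cleaning rule: every character outside [0-9A-Za-z_] becomes "_".
    return re.sub(r"[^0-9A-Za-z_]", "_", name)


def unique(name, taken):
    # Assumed collision rule: append _1, _2, ... until the name is free.
    candidate, n = name, 0
    while candidate in taken:
        n += 1
        candidate = f"{name}_{n}"
    return candidate


def is_struct_array(value):
    return isinstance(value, list) and all(isinstance(v, dict) for v in value)


def is_nested(key, value, exclude):
    return key not in exclude and (isinstance(value, dict) or is_struct_array(value))


def flatten_once(row, exclude=()):
    """One pass over one row: expand structs, explode arrays of structs."""
    # Names that survive this pass unchanged; new parent_child names must avoid them.
    taken = {k for k, v in row.items() if k in exclude or not isinstance(v, dict)}
    rows = [{}]
    for key, value in row.items():
        if not is_nested(key, value, exclude):
            pieces = [{key: value}]  # flat or excluded: keep as-is
        elif isinstance(value, dict):  # struct: emit parent_child columns
            piece = {}
            for child, child_value in value.items():
                name = unique(clean(f"{key}_{child}"), taken)
                taken.add(name)
                piece[name] = child_value
            pieces = [piece]
        else:  # array of structs
            # Like explode_outer: an empty array still yields one row, holding None.
            pieces = [{key: element} for element in (value or [None])]
        rows = [{**r, **p} for r in rows for p in pieces]
    return rows


def flatten_rows(rows, max_depth=None, exclude=()):
    """Repeat passes until nothing nested is left or the depth limit is reached."""
    depth = 0
    while any(is_nested(k, v, exclude) for r in rows for k, v in r.items()):
        if max_depth is not None and depth >= max_depth:
            break
        rows = [out for r in rows for out in flatten_once(r, exclude)]
        depth += 1
    return rows


order = {
    "id": 1,
    "customer": {"name": "Ann", "e-mail": "ann@example.com"},
    "customer_name": "duplicate",
    "items": [{"sku": "A", "qty": 2}, {"sku": "B", "qty": 1}],
}
empty_order = {"id": 2, "customer": None, "customer_name": None, "items": []}

result = flatten_rows([order, empty_order])
for r in result:
    print(r)
```

Because this model inspects values rather than a schema, a row whose struct is null keeps its original column name. Spark works from the DataFrame schema, so there every row ends up with the same columns.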

License

MIT
