
dataflat

A library to flatten all those annoying nested keys and columns in dictionaries, Pandas DataFrames and Spark (PySpark) DataFrames.

Installation

pip install dataflat

Get started

How to instantiate a Flattener:

  1. Import CaseTranslatorOptions and FlattenerOptions

In this step we define the case translator used to convert keys or column names after the flattening process.

First, select the required Flattener, for example DICTIONARY or PYSPARK_DF (more coming...). Optionally, select a from_case and a to_case, for example SNAKE and CAMEL respectively. Then set a replace_string: this is the string used to indicate nested dependency, for example client.id or item.price. Finally, you can specify whether special characters like '@' or '|' should be removed from keys or column names.

from dataflat.flattener_handler import CaseTranslatorOptions, FlattenerOptions

# Default values:
#   from_case = None
#   to_case = None
#   replace_string = "."
#   remove_special_chars = False
custom_flattener = FlattenerOptions.DICTIONARY
from_case = CaseTranslatorOptions.SNAKE
to_case = CaseTranslatorOptions.CAMEL
replace_string = "."
remove_special_chars = False
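
With from_case = SNAKE and to_case = CAMEL as above, a nested snake_case key such as client_data.order_id would come out roughly as clientData.orderId in the flattened output (an illustrative assumption; the exact renaming depends on the library's case translator).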

After that we can instantiate a flattener using the handler, passing it the variables defined above, and flatten some data. Every CustomFlattener receives the same parameters on its flatten function.

from dataflat.flattener_handler import handler

flattener = handler(
    custom_flattener=custom_flattener,
    from_case=from_case,
    to_case=to_case,
    replace_string=replace_string,
    remove_special_chars=remove_special_chars
)


# Default values:
#   entity_name = "data"
#   primary_key = "id
#   partition_keys = []
#   black_list = []
data={}
entity_name = "data"
primary_key = "id"
partition_keys = ["date"]
black_list = ['keys.or', 'columns', 'to.be.ignored']
flatten_data = flattener.flatten(
    data=data,
    primary_key=primary_key, 
    entity_name=entity_name,
    partition_keys=partition_keys,
    black_list=black_list
)
  • primary_key: Used to connect nested dictionaries or dataframes with their "parent", for example id. The dictionary or dataframe must contain this primary key so it can be propagated to its children.
  • partition_keys: List of keys that you want to propagate to nested data, for example a "date" key/column.
  • black_list: List of keys that must be ignored/skipped during the flattening process; these keys will not be present in flatten_data.
  • flatten_data: A dictionary with one or multiple keys: one for the "parent" and one for each "child". Each list or array inside the original data results in a key in this dictionary. An example of flatten_data could be:
{
  "data": [{"id": 1, "date": "2024-01-01", "total": 1900}],
  "data.orders": [
    {"id": "abc123", "total": 700, "data.id": 1, "data.date": "2024-01-01", "index": 0},
    {"id": "dfg456", "total": 1200, "data.id": 1, "data.date": "2024-01-01", "index": 1}
  ],
  "data.orders.products": [
    {"id": "ab", "price": 200, "data.id": 1, "data.date": "2024-01-01", "data.orders.index": 0, "index": 0},
    {"id": "cd", "price": 500, "data.id": 1, "data.date": "2024-01-01", "data.orders.index": 0, "index": 1},
    {"id": "fg", "price": 1200, "data.id": 1, "data.date": "2024-01-01", "data.orders.index": 1, "index": 0}
  ]
}
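
Note the index and data.orders.index fields in the example above: each child row records its position in the original list, and nested children also carry their parent's position, so the original nesting can be reconstructed from the flattened output.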

The original JSON that resulted in this data was:

{
  "id": 1,
  "date": "2024-01-01",
  "orders": [
    {
      "id": "abc123",
      "products": [
        {"id": "ab", "price": 200},
        {"id": "cd", "price": 500}
      ],
      "total": 700
    },
    {
      "id": "dfg456",
      "products": [
        {"id": "fg", "price": 1200}
      ],
      "total": 1200
    }
  ],
  "total": 1900
}
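
Putting the pieces together, here is a minimal end-to-end sketch for the DICTIONARY flattener, using only the handler and flatten parameters shown above (the printed output should match the flatten_data example):

from dataflat.flattener_handler import handler, FlattenerOptions

# Dictionary flattener with no case translation, '.' as the nesting separator
flattener = handler(
    custom_flattener=FlattenerOptions.DICTIONARY,
    from_case=None,
    to_case=None,
    replace_string=".",
    remove_special_chars=False
)

# The nested input shown above
data = {
    "id": 1,
    "date": "2024-01-01",
    "orders": [
        {"id": "abc123", "products": [{"id": "ab", "price": 200}, {"id": "cd", "price": 500}], "total": 700},
        {"id": "dfg456", "products": [{"id": "fg", "price": 1200}], "total": 1200}
    ],
    "total": 1900
}

flatten_data = flattener.flatten(
    data=data,
    entity_name="data",
    primary_key="id",
    partition_keys=["date"],
    black_list=[]
)

# Expect the keys "data", "data.orders" and "data.orders.products"
for key, rows in flatten_data.items():
    print(key, rows)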

Recommendations

  1. For the PYSPARK_DF flattener, it's recommended to set Spark's 'caseSensitive' configuration to True (see the sketch below):
        spark.conf.set('spark.sql.caseSensitive', True)
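
A hedged sketch of how the PYSPARK_DF flattener might be wired up, assuming flatten accepts a Spark DataFrame as data with the same parameters as above (the input path orders.json is hypothetical):

from pyspark.sql import SparkSession
from dataflat.flattener_handler import handler, FlattenerOptions

spark = SparkSession.builder.appName("dataflat-example").getOrCreate()
spark.conf.set('spark.sql.caseSensitive', True)

# Hypothetical nested input; any DataFrame with array/struct columns would do
df = spark.read.json("orders.json")

flattener = handler(
    custom_flattener=FlattenerOptions.PYSPARK_DF,
    from_case=None,
    to_case=None,
    replace_string=".",
    remove_special_chars=False
)
flatten_data = flattener.flatten(
    data=df,
    entity_name="data",
    primary_key="id",
    partition_keys=["date"],
    black_list=[]
)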
    

