Skip to main content

A utility for converting/transporting arbitrary JSON data collections into a relational database

Project description

Relationalize

Relationalize is a Python library for transforming collections of JSON objects, into a relational-friendly format. It draws inspiration from the AWS Glue Relationalize transform.

JSON Object Collections

When working with JSON often there are collections of objects with the same or similar structure. For example, in a NoSQL database there may be a collection describing users with the following two documents/objects:

//Document 1
{
    "username": "jsmith123",
    "created_at": "2022-12-15T20:24:26.637Z",
    "contact": {
        "email_address": "jsmith123@gmail.com",
        "phone_number": 1234567890
    },
    "connections": [
        "jdoe456",
        "elowry789"
    ]
}
// Document 2
{
    "username": "jdoe456",
    "created_at": 1671135896468,
    "contact": {
        "email_address": "jdoe899@yahoo.com",
        "address": {
            "address_1": "77 Middlesex Avenue",
            "address_2": "Suite A",
            "city": "Somerville",
            "state": "MA",
            "zip_code": "02145"
        }
    },
    "connections": [
        "jsmith123",
        "hjones99"
    ]
}

There are a number of challenges that must be overcome to move this data into a relational-database friendly format:

  1. Nested Objects (ex: "contact" field)
  2. Different data types in the same column (ex: "created_at" field)
  3. Sparse columns (ex: "contact.phone_number" & "contact.address" field)
  4. Sub-Arrays (ex: "connections" field)

This package provides a solutution to all of these challenges with more portability and flexibility, and less limitations than AWS Glue relationalize.

How Relationalize works

The relationalize function recursively navigates the JSON object and splits out new ojects/collections whenever an array is encountered and provides a connection/relation between the objects. You provide the Relationalize class a function which will determine where to write the transformed content. This could be a local file object, a remote (s3) file object, or an in memory buffer. Additionally any nested objects are flattened. Each object that is output by relationalize is a flat JSON object.

This package also provides a Schema class which can generate a schema for a collection of flat JSON objects. This schema can be used to handle type ambigouity and generate SQL DDL.

For example, the schemas generated by relationalizing and schema generating the above collection's objects would be:

// users
{
    "username": "str",
    "created_at": "c-int-str",
    "contact_email_address": "str",
    "contact_phone_number": "int",
    "contact_address_address_1": "str",
    "contact_address_address_2": "str",
    "contact_address_city": "str",
    "contact_address_state": "str",
    "contact_address_zip_code": "str",
    "connections": "str"

}
//users_connections
{
    "connections__rid_": "str",
    "connections__index_": "int",
    "connection__val_": "str"
}

When processing a collection of JSON objects, the schema is not known, so we must provide a way for the relationalize class to store the new collections it will potentially create. This could be a local or remote file, an in memory buffer, etc...

The relationalize class constructor takes in a function with the signature (identifier: str) -> TextIO as an argument (create_output). This function is used to create the outputs.

The relationalize class constructor also takes in an optional function that will be called whenever an object is written to a file that was created via the create_output function. This method can be utilized to generate the schemas as the objects are encountered, reducing the number of iterations needed over the objects.

For example:

schemas: Dict[str, Schema] = {}

def on_object_write(schema: str, object: dict):
  if schema not in schemas:
      schemas[schema] = Schema()
  schemas[schema].read_object(object)

with Relationalize('object_name', on_object_write=on_object_write) as r:
    r.relationalize([{...}, {...}])

Once the collection has been relationalized and the schemas have been generated, you can utilize the convert_object method to create the final json object, which could be loaded into a database. The convert_object method will break out any ambigously typed columns into seperate columns.

For example the first document in the users collection would output the following three documents after being processed by relationalize and convert_object:

// users
{
    "username": "jsmith123",
    "created_at_str": "2022-12-15T20:24:26.637Z",
    "contact_email_address": "jsmith123@gmail.com",
    "contact_phone_number": 1234567890,
    "connections": "R_969c799a3177437d98074d985861242b"
}
// users_connections
{
    "connections__rid_": "R_969c799a3177437d98074d985861242b",
    "connections__index_": 0,
    "connection__val_": "jdoe456"
}
{
    "connections__rid_": "R_969c799a3177437d98074d985861242b",
    "connections__index_": 1,
    "connection__val_": "elowry789"
}

Installation

Use the package manager pip to install relationalize.

pip install relationalize

Examples

Examples are placed in the examples/ folder. These examples are intended to be run from the working directory of examples.

We recommend starting with the local_fs_example.py and then moving to the memory_example.py.

For a complete API to database pipeline check out the full_pokemon_s3_redshift_pipeline.py example.

Contributing

Pull requests are welcome. For major changes, please open an issue first to discuss what you would like to change.

Please make sure to update tests as appropriate.

License

MIT

Project details


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

relationalize-0.1.4.tar.gz (10.2 kB view details)

Uploaded Source

Built Distribution

relationalize-0.1.4-py3-none-any.whl (9.0 kB view details)

Uploaded Python 3

File details

Details for the file relationalize-0.1.4.tar.gz.

File metadata

  • Download URL: relationalize-0.1.4.tar.gz
  • Upload date:
  • Size: 10.2 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/4.0.2 CPython/3.8.10

File hashes

Hashes for relationalize-0.1.4.tar.gz
Algorithm Hash digest
SHA256 7c6e15bbbee28e316e4d95ed11b55455df0395551ae474452216dd5bd88abd79
MD5 a2573c351761ad64cf438724fc4b1b55
BLAKE2b-256 6f731907806fac7582641cb116b094527d7e46ee4bb6eb302f09781f2045ec07

See more details on using hashes here.

File details

Details for the file relationalize-0.1.4-py3-none-any.whl.

File metadata

  • Download URL: relationalize-0.1.4-py3-none-any.whl
  • Upload date:
  • Size: 9.0 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/4.0.2 CPython/3.8.10

File hashes

Hashes for relationalize-0.1.4-py3-none-any.whl
Algorithm Hash digest
SHA256 8b13c7705885fc658c357e1ea6f75a13e7915963ac50893f51753f866be6edc8
MD5 9c855145c887839e991697a0381da740
BLAKE2b-256 8c27e67bf830982d9c46d18c64a899716fd850a4e101100d78cdefa834def51a

See more details on using hashes here.

Supported by

AWS Cloud computing and Security Sponsor Datadog Monitoring Fastly CDN Google Download Analytics Pingdom Monitoring Sentry Error logging StatusPage Status page