Document-oriented to relational data conversion

relatable

relatable is a Python package for converting a collection of documents, such as a MongoDB collection, into an interrelated set of tables, such as a schema in a relational database.

Installation

pip3 install relatable

Example of use

This example walks through using relatable on the sample dataset data/example_input.json, found in the data folder of this package's repository.

Each document in this dataset has a complex structure with nested objects and lists.
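To make the nesting concrete, here is a rough sketch of what one such document might look like. It is reconstructed from the output tables shown later in this README and abbreviated to a single experience entry, so the real file may differ; in particular, representing responsibilities as objects with a "name" key is an assumption. The helper list_paths is hypothetical and not part of relatable; it only lists where the nested lists sit.

```python
# Hypothetical reconstruction of one input document, inferred from the
# output tables in this README; the real data/example_input.json may differ.
example_doc = {
    "name": "Alice",
    "age": 34,
    "experience": [
        {
            "company": "Google",
            "role": "Software Engineer",
            "from": 2020,
            "to": 2022,
            "technologies": [
                {"name": "C++", "primary": True},
                {"name": "LolCode", "primary": False},
            ],
            # Assumption: objects with a "name" key (could be plain strings).
            "responsibilities": [{"name": "Google stuff"}],
        }
    ],
}


def list_paths(obj, prefix=""):
    """Return the dotted path of every list nested inside a dictionary."""
    paths = []
    for key, value in obj.items():
        path = f"{prefix}.{key}" if prefix else key
        if isinstance(value, list):
            paths.append(path)
            # Look for further lists inside each dictionary item.
            for item in value:
                if isinstance(item, dict):
                    for nested in list_paths(item, path):
                        if nested not in paths:
                            paths.append(nested)
        elif isinstance(value, dict):
            paths.extend(list_paths(value, path))
    return paths


nested_lists = list_paths(example_doc)
print(nested_lists)
```

Each nested list found this way corresponds to one of the child tables in the schema shown further down.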

To generate a relational schema for this dataset, let's make an instance of RelationalSchema with the list of documents as input:

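import json

from relatable import RelationalSchema

# Load the sample documents: a list of nested dictionaries.
with open("data/example_input.json", "r") as fp:
    docs = json.load(fp)

# "person" is the name given to the root table (see the metadata below).
rs = RelationalSchema(docs, "person")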

Once the RelationalSchema is instantiated, we can check its metadata. The metadata is a list of flat dictionaries, so we can use pandas to load it into a DataFrame:

import pandas as pd

pd.DataFrame(rs.generate_metadata())

    table                        column                              type     nullable  unique
0   person                       person.__id__                       number   False     True
1   person                       name                                string   False     True
2   person                       age                                 number   False     True
3   experience                   experience.__id__                   number   False     True
4   experience                   person.__id__                       number   False     False
5   experience                   experience.company                  string   False     True
6   experience                   experience.role                     string   False     True
7   experience                   experience.from                     number   False     True
8   experience                   experience.to                       number   False     False
9   experience.technologies      experience.technologies.__id__      number   False     True
10  experience.technologies      experience.__id__                   number   False     False
11  experience.technologies      person.__id__                       number   False     False
12  experience.technologies      experience.technologies.name        string   False     True
13  experience.technologies      experience.technologies.primary     boolean  False     False
14  experience.responsibilities  experience.responsibilities.__id__  number   False     True
15  experience.responsibilities  experience.__id__                   number   False     False
16  experience.responsibilities  person.__id__                       number   False     False
17  experience.responsibilities  experience.responsibilities.name    string   False     True

We can see that RelationalSchema has inferred a relational schema consisting of four tables with primary keys and foreign keys interrelating the tables.
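The keys can be read off the metadata by naming convention. The sketch below assumes, based only on the output above, that a table's primary key is the column named <table>.__id__ and that any other column ending in .__id__ is a foreign key. The helper keys_by_table is hypothetical and not part of relatable, and only the table and column fields of the metadata are copied here.

```python
# (table, column) pairs copied from the metadata shown above.
rows = [
    ("person", "person.__id__"),
    ("person", "name"),
    ("person", "age"),
    ("experience", "experience.__id__"),
    ("experience", "person.__id__"),
    ("experience", "experience.company"),
    ("experience.technologies", "experience.technologies.__id__"),
    ("experience.technologies", "experience.__id__"),
    ("experience.technologies", "person.__id__"),
    ("experience.technologies", "experience.technologies.name"),
    ("experience.responsibilities", "experience.responsibilities.__id__"),
    ("experience.responsibilities", "experience.__id__"),
    ("experience.responsibilities", "person.__id__"),
]
metadata = [{"table": table, "column": column} for table, column in rows]


def keys_by_table(metadata):
    """Group the id columns of each table into primary and foreign keys."""
    keys = {}
    for row in metadata:
        entry = keys.setdefault(
            row["table"], {"primary_key": None, "foreign_keys": []}
        )
        column = row["column"]
        if column == f"{row['table']}.__id__":
            # Assumed convention: the table's own id column is its primary key.
            entry["primary_key"] = column
        elif column.endswith(".__id__"):
            # Any other id column is assumed to reference another table.
            entry["foreign_keys"].append(column)
    return keys


keys = keys_by_table(metadata)
for table, entry in keys.items():
    print(table, entry["primary_key"], entry["foreign_keys"])
```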

The relationships between the tables are the following:

  • The table person represents the main entity of the dataset, with a row for each person.
  • The table experience references the table person.
  • The tables experience.technologies and experience.responsibilities reference the table experience and inherit its reference to person.
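As an illustration of these relationships in an actual relational database, the following sketch declares person and experience in an in-memory SQLite database and joins them through the foreign key. The DDL is written by hand for this example and is not generated by relatable; the column types and the omission of the from/to columns are choices made here for brevity. The rows are copied from the tables in this README.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite enforces FKs only on request

# Hand-written DDL; column names contain dots, so they are double-quoted.
conn.executescript('''
CREATE TABLE person (
    "person.__id__" INTEGER PRIMARY KEY,
    name TEXT NOT NULL,
    age INTEGER NOT NULL
);
CREATE TABLE experience (
    "experience.__id__" INTEGER PRIMARY KEY,
    "person.__id__" INTEGER NOT NULL REFERENCES person ("person.__id__"),
    "experience.company" TEXT NOT NULL,
    "experience.role" TEXT NOT NULL
);
''')

conn.executemany(
    "INSERT INTO person VALUES (?, ?, ?)",
    [(0, "Alice", 34), (1, "Bob", 27)],
)
conn.executemany(
    "INSERT INTO experience VALUES (?, ?, ?, ?)",
    [
        (0, 0, "Google", "Software Engineer"),
        (1, 0, "Facebook", "Senior Data Scientist"),
        (2, 1, "OpenAI", "NLP Engineer"),
    ],
)

# Follow the foreign key from experience back to person.
joined = conn.execute('''
SELECT p.name, e."experience.company", e."experience.role"
FROM experience AS e
JOIN person AS p ON p."person.__id__" = e."person.__id__"
ORDER BY e."experience.__id__"
''').fetchall()

for row in joined:
    print(row)
```

The same pattern would extend to experience.technologies and experience.responsibilities, each carrying both experience.__id__ and person.__id__ as foreign keys.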

Finally, let's look at each of the tables:

dfs = [pd.DataFrame(t.data).set_index(t.primary_key) for t in rs.tables]

Table person:

person.__id__  name   age
0              Alice  34
1              Bob    27

Table experience:

experience.__id__  person.__id__  experience.company  experience.role        experience.from  experience.to
0                  0              Google              Software Engineer      2020             2022
1                  0              Facebook            Senior Data Scientist  2017             2020
2                  1              OpenAI              NLP Engineer           2019             2022

Table experience.technologies:

experience.technologies.__id__  experience.__id__  person.__id__  experience.technologies.name  experience.technologies.primary
0                               0                  0              C++                           True
1                               0                  0              LolCode                       False
2                               1                  0              Python                        True
3                               1                  0              Excel                         False
4                               2                  1              Triton                        True
5                               2                  1              LaTeX                         False

Table experience.responsibilities:

experience.responsibilities.__id__  experience.__id__  person.__id__  experience.responsibilities.name
0                                   0                  0              Google stuff
1                                   0                  0              Mark TensorFlow issues as "Won't Do"
2                                   1                  0              Censor media
3                                   1                  0              Learn the foundations of ML
4                                   1                  0              Do Kaggle competitions
5                                   2                  1              Assert that GPT-2 is racist
6                                   2                  1              Assert that GPT-3 is racist
7                                   2                  1              Develop a prototype of a premium non-racist language model
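One way to check that the flat tables carry the same information as the nested documents is to follow the foreign keys in the other direction and regroup child rows under their parents. The sketch below does this in plain Python for a subset of the columns above. The helper attach is hypothetical and not part of relatable, and the nested field names "experience" and "technologies" are chosen here for illustration.

```python
from collections import defaultdict

# Rows copied from the tables above (a subset of the columns).
person = [
    {"person.__id__": 0, "name": "Alice"},
    {"person.__id__": 1, "name": "Bob"},
]
experience = [
    {"experience.__id__": 0, "person.__id__": 0, "experience.company": "Google"},
    {"experience.__id__": 1, "person.__id__": 0, "experience.company": "Facebook"},
    {"experience.__id__": 2, "person.__id__": 1, "experience.company": "OpenAI"},
]
technologies = [
    {"experience.__id__": 0, "experience.technologies.name": "C++"},
    {"experience.__id__": 0, "experience.technologies.name": "LolCode"},
    {"experience.__id__": 1, "experience.technologies.name": "Python"},
    {"experience.__id__": 1, "experience.technologies.name": "Excel"},
    {"experience.__id__": 2, "experience.technologies.name": "Triton"},
    {"experience.__id__": 2, "experience.technologies.name": "LaTeX"},
]


def attach(parents, children, parent_key, field):
    """Attach to each parent row the list of child rows that reference it."""
    grouped = defaultdict(list)
    for child in children:
        grouped[child[parent_key]].append(child)
    return [{**parent, field: grouped[parent[parent_key]]} for parent in parents]


# Rebuild the nesting from the innermost level outwards.
experience_nested = attach(
    experience, technologies, "experience.__id__", "technologies"
)
people = attach(person, experience_nested, "person.__id__", "experience")

for p in people:
    for e in p["experience"]:
        names = [t["experience.technologies.name"] for t in e["technologies"]]
        print(p["name"], e["experience.company"], names)
```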

Example of use with the Airbnb MongoDB sample dataset

Another example, using the downloadable Airbnb MongoDB sample dataset, can be found in the repository of this package in the script examples/airbnb_example.py.

Download files

Source Distribution

relatable-0.4.1.tar.gz (7.7 kB)

Uploaded Source

Built Distribution

relatable-0.4.1-py3-none-any.whl (6.8 kB)

Uploaded Python 3
