Skip to main content

Document-oriented to relational conversion

Project description

relatable

relatable is a Python package for converting a collection of documents, such as a MongoDB collection, into an interrelated set of tables, such as a schema in a relational database.

Installation

pip3 install relatable

Example of use

In this example we will walk through a use case of relatable for the sample dataset found in the repository of this package in the data folder.

To generate a relational schema for this dataset, let's instantiate a RelationalSchema with the list of documents as input:

from relatable import RelationalSchema

import json

with open("data/example_input.json", "r") as fp:
    docs = json.load(fp)

rs = RelationalSchema(docs)

Once the RelationalSchema is instantiated, we can check its metadata. This metadata is a list of flat dictionaries, so we can make use of Pandas to load it into a DataFrame:

import pandas as pd

pd.DataFrame(rs.generate_metadata())
table field type nullable unique
0 t0 t0_id Integer False True
1 t0 name String False True
2 t0 age Integer False True
3 t1 t1_id Integer False True
4 t1 t0_id Integer False False
5 t1 experience.company String False True
6 t1 experience.role String False True
7 t1 experience.from Integer False True
8 t1 experience.to Integer False False
9 t2 t2_id Integer False True
10 t2 t1_id Integer False False
11 t2 t0_id Integer False False
12 t2 experience.technologies String False True
13 t3 t3_id Integer False True
14 t3 t1_id Integer False False
15 t3 t0_id Integer False False
16 t3 experience.responsibilities String False True

We can see that RelationalSchema has inferred a relational schema consisting of four tables with primary keys and foreign keys interrelating the tables.

It would be nice to rename these tables with a more descriptive name, and also rename some columns. We can do so with the rename and rename_column methods:

rs.rename_table("t0", "person")
rs.rename_table("t1", "job")
rs.rename_table("t2", "technology")
rs.rename_table("t3", "responsibility")

for name in ["company", "role", "from", "to"]:
    rs.rename_column("job", f"experience.{name}", name)

rs.rename_column("technology", "experience.technologies", "technology")
rs.rename_column("responsibility", "experience.responsibilities", "responsibility")

pd.DataFrame(rs.generate_metadata())
table field type nullable unique
0 person person_id Integer False True
1 person name String False True
2 person age Integer False True
3 job job_id Integer False True
4 job person_id Integer False False
5 job company String False True
6 job role String False True
7 job from Integer False True
8 job to Integer False False
9 technology technology_id Integer False True
10 technology job_id Integer False False
11 technology person_id Integer False False
12 technology technology String False True
13 responsibility responsibility_id Integer False True
14 responsibility job_id Integer False False
15 responsibility person_id Integer False False
16 responsibility responsibility String False True

The relationships between the tables are the following:

  • The table person represents the main entity of the dataset, with a row for each person.
  • The table job references the table person.
  • The tables technology and responsibility reference the table job, and inherits the reference of person from job.

Finally, let's look at each of the tables:

dfs = [pd.DataFrame(t.data).set_index(t.primary_key) for t in rs.tables]

Table person:

person_id name age
0 Alice 34
1 Bob 27

Table job:

job_id person_id company role from to
0 0 Google Software Engineer 2020 2022
1 0 Facebook Senior Data Scientist 2017 2020
2 1 OpenAI NLP Engineer 2019 2022

Table technology:

technology_id job_id person_id technology
0 0 0 C++
1 0 0 LolCode
2 1 0 Python
3 1 0 Excel
4 2 1 Triton
5 2 1 LaTeX

Table responsibility:

responsibility_id job_id person_id responsibility
0 0 0 Google stuff
1 0 0 Mark TensorFlow issues as "Won't Do"
2 1 0 Censor media
3 1 0 Learn the foundations of ML
4 1 0 Do Kaggle competitions
5 2 1 Assert that GPT-2 is racist
6 2 1 Assert that GPT-3 is racist
7 2 1 Develop a prototype of a premium non-racist language model

Example of use with the Airbnb MongoDB sample dataset

Another example of use with the Airbnb MongoDB sample dataset, downloadable here can be found in the repository of this package in the script airbnb_example.py

Project details


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

relatable-0.2.1.tar.gz (8.1 kB view hashes)

Uploaded Source

Built Distribution

relatable-0.2.1-py3-none-any.whl (7.1 kB view hashes)

Uploaded Python 3

Supported by

AWS AWS Cloud computing and Security Sponsor Datadog Datadog Monitoring Fastly Fastly CDN Google Google Download Analytics Microsoft Microsoft PSF Sponsor Pingdom Pingdom Monitoring Sentry Sentry Error logging StatusPage StatusPage Status page