Document-oriented to relational conversion
Project description
relatable
relatable is a Python package for converting a collection of documents, such as a MongoDB collection, into an interrelated set of tables, such as a schema in a relational database.
Installation
pip3 install relatable
Example of use
In this example we will walk through a use case of relatable
for the sample dataset found in the repository of this
package in the data folder.
To generate a relational schema for this dataset, let's instantiate a RelationalSchema
with the list of documents as
input:
from relatable import RelationalSchema
import json
with open("data/example_input.json", "r") as fp:
docs = json.load(fp)
rs = RelationalSchema(docs)
Once the RelationalSchema
is instantiated, we can check its metadata. This metadata is a list of flat dictionaries, so
we can make use of Pandas to load it into a DataFrame:
import pandas as pd
pd.DataFrame(rs.generate_metadata())
table | field | type | nullable | unique | |
---|---|---|---|---|---|
0 | t0 | t0_id | Integer | False | True |
1 | t0 | name | String | False | True |
2 | t0 | age | Integer | False | True |
3 | t1 | t1_id | Integer | False | True |
4 | t1 | t0_id | Integer | False | False |
5 | t1 | experience.company | String | False | True |
6 | t1 | experience.role | String | False | True |
7 | t1 | experience.from | Integer | False | True |
8 | t1 | experience.to | Integer | False | False |
9 | t2 | t2_id | Integer | False | True |
10 | t2 | t1_id | Integer | False | False |
11 | t2 | t0_id | Integer | False | False |
12 | t2 | experience.technologies | String | False | True |
13 | t3 | t3_id | Integer | False | True |
14 | t3 | t1_id | Integer | False | False |
15 | t3 | t0_id | Integer | False | False |
16 | t3 | experience.responsibilities | String | False | True |
We can see that RelationalSchema
has inferred a relational schema consisting of four tables with primary keys and
foreign keys interrelating the tables.
It would be nice to rename these tables with a more descriptive name, and also rename some columns. We can do so with
the rename
and rename_column
methods:
rs.rename_table("t0", "person")
rs.rename_table("t1", "job")
rs.rename_table("t2", "technology")
rs.rename_table("t3", "responsibility")
for name in ["company", "role", "from", "to"]:
rs.rename_column("job", f"experience.{name}", name)
rs.rename_column("technology", "experience.technologies", "technology")
rs.rename_column("responsibility", "experience.responsibilities", "responsibility")
pd.DataFrame(rs.generate_metadata())
table | field | type | nullable | unique | |
---|---|---|---|---|---|
0 | person | person_id | Integer | False | True |
1 | person | name | String | False | True |
2 | person | age | Integer | False | True |
3 | job | job_id | Integer | False | True |
4 | job | person_id | Integer | False | False |
5 | job | company | String | False | True |
6 | job | role | String | False | True |
7 | job | from | Integer | False | True |
8 | job | to | Integer | False | False |
9 | technology | technology_id | Integer | False | True |
10 | technology | job_id | Integer | False | False |
11 | technology | person_id | Integer | False | False |
12 | technology | technology | String | False | True |
13 | responsibility | responsibility_id | Integer | False | True |
14 | responsibility | job_id | Integer | False | False |
15 | responsibility | person_id | Integer | False | False |
16 | responsibility | responsibility | String | False | True |
The relationships between the tables are the following:
- The table
person
represents the main entity of the dataset, with a row for each person. - The table
job
references the tableperson
. - The tables
technology
andresponsibility
reference the tablejob
, and inherits the reference ofperson
fromjob
.
Finally, let's look at each of the tables:
dfs = [pd.DataFrame(t.data).set_index(t.primary_key) for t in rs.tables]
Table person
:
person_id | name | age |
---|---|---|
0 | Alice | 34 |
1 | Bob | 27 |
Table job
:
job_id | person_id | company | role | from | to |
---|---|---|---|---|---|
0 | 0 | Software Engineer | 2020 | 2022 | |
1 | 0 | Senior Data Scientist | 2017 | 2020 | |
2 | 1 | OpenAI | NLP Engineer | 2019 | 2022 |
Table technology
:
technology_id | job_id | person_id | technology |
---|---|---|---|
0 | 0 | 0 | C++ |
1 | 0 | 0 | LolCode |
2 | 1 | 0 | Python |
3 | 1 | 0 | Excel |
4 | 2 | 1 | Triton |
5 | 2 | 1 | LaTeX |
Table responsibility
:
responsibility_id | job_id | person_id | responsibility |
---|---|---|---|
0 | 0 | 0 | Google stuff |
1 | 0 | 0 | Mark TensorFlow issues as "Won't Do" |
2 | 1 | 0 | Censor media |
3 | 1 | 0 | Learn the foundations of ML |
4 | 1 | 0 | Do Kaggle competitions |
5 | 2 | 1 | Assert that GPT-2 is racist |
6 | 2 | 1 | Assert that GPT-3 is racist |
7 | 2 | 1 | Develop a prototype of a premium non-racist language model |
Example of use with the Airbnb MongoDB sample dataset
Another example of use with the Airbnb MongoDB sample dataset, downloadable
here can be
found in the repository of this package in the script airbnb_example.py
Project details
Download files
Download the file for your platform. If you're not sure which to choose, learn more about installing packages.
Source Distribution
Built Distribution
Hashes for relatable-0.2.1-py3-none-any.whl
Algorithm | Hash digest | |
---|---|---|
SHA256 | 44e5de9c96e7d4576bd98fb4530ad7d3ea2a8dad1693224c4d3b42cbaa4037cf |
|
MD5 | 99ea23d3e6ee51d63fd7dcf8e5a80be3 |
|
BLAKE2b-256 | 50fa9be1adfc9fc79ae44f94dbb40a59a659556d1c24541cfa6eff662c275138 |