Document-oriented to relational data conversion
relatable
relatable is a Python package for converting a collection of documents, such as a MongoDB collection, into an interrelated set of tables, such as a schema in a relational database.
Installation
pip3 install relatable
Example of use
This example walks through using relatable on the sample dataset found in the repository of this package, in the data folder: data/example_input.json.
Each document in this dataset has a complex structure with nested objects and lists.
To generate a relational schema for this dataset, let's make an instance of RelationalSchema with the list of documents as input:
from relatable import RelationalSchema
import json
with open("data/example_input.json", "r") as fp:
    docs = json.load(fp)
rs = RelationalSchema(docs, "person")
Once the RelationalSchema is instantiated, we can check its metadata. This metadata is a list of flat dictionaries, so we can make use of Pandas to load it into a DataFrame:
import pandas as pd
pd.DataFrame(rs.generate_metadata())
| | table | column | type | nullable | unique |
---|---|---|---|---|---|
0 | person | person.__id__ | number | False | True |
1 | person | name | string | False | True |
2 | person | age | number | False | True |
3 | experience | experience.__id__ | number | False | True |
4 | experience | person.__id__ | number | False | False |
5 | experience | experience.company | string | False | True |
6 | experience | experience.role | string | False | True |
7 | experience | experience.from | number | False | True |
8 | experience | experience.to | number | False | False |
9 | experience.technologies | experience.technologies.__id__ | number | False | True |
10 | experience.technologies | experience.__id__ | number | False | False |
11 | experience.technologies | person.__id__ | number | False | False |
12 | experience.technologies | experience.technologies.name | string | False | True |
13 | experience.technologies | experience.technologies.primary | boolean | False | False |
14 | experience.responsibilities | experience.responsibilities.__id__ | number | False | True |
15 | experience.responsibilities | experience.__id__ | number | False | False |
16 | experience.responsibilities | person.__id__ | number | False | False |
17 | experience.responsibilities | experience.responsibilities.name | string | False | True |
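Because the metadata is a list of flat dictionaries, it can be post-processed with plain Python. The following is a minimal sketch, not part of relatable, that turns metadata rows into CREATE TABLE statements and runs them in an in-memory SQLite database. It assumes the dictionaries use the keys shown as column headers above, and the mapping in SQL_TYPES and the helper create_table_statements are hypothetical names introduced for illustration. The rows are copied from the person and experience tables above.

```python
import sqlite3

# Metadata rows copied from the table above (person and experience only).
# Assumption: generate_metadata() returns dictionaries with these keys.
metadata = [
    {"table": "person", "column": "person.__id__", "type": "number", "nullable": False, "unique": True},
    {"table": "person", "column": "name", "type": "string", "nullable": False, "unique": True},
    {"table": "person", "column": "age", "type": "number", "nullable": False, "unique": True},
    {"table": "experience", "column": "experience.__id__", "type": "number", "nullable": False, "unique": True},
    {"table": "experience", "column": "person.__id__", "type": "number", "nullable": False, "unique": False},
    {"table": "experience", "column": "experience.company", "type": "string", "nullable": False, "unique": True},
    {"table": "experience", "column": "experience.role", "type": "string", "nullable": False, "unique": True},
    {"table": "experience", "column": "experience.from", "type": "number", "nullable": False, "unique": True},
    {"table": "experience", "column": "experience.to", "type": "number", "nullable": False, "unique": False},
]

# Hypothetical mapping from the metadata type names to SQLite column types.
SQL_TYPES = {"number": "NUMERIC", "string": "TEXT", "boolean": "BOOLEAN"}


def create_table_statements(rows):
    """Group metadata rows by table and build one CREATE TABLE statement per table."""
    columns_by_table = {}
    for row in rows:
        # Quote identifiers, since table and column names contain dots.
        parts = [f'"{row["column"]}"', SQL_TYPES[row["type"]]]
        if not row["nullable"]:
            parts.append("NOT NULL")
        if row["unique"]:
            parts.append("UNIQUE")
        columns_by_table.setdefault(row["table"], []).append(" ".join(parts))
    return [
        f'CREATE TABLE "{table}" ({", ".join(cols)})'
        for table, cols in columns_by_table.items()
    ]


statements = create_table_statements(metadata)

# Run the statements to check that they are valid SQL.
conn = sqlite3.connect(":memory:")
for stmt in statements:
    conn.execute(stmt)
```

The unique flag presumably reflects the sample documents, so a constraint such as UNIQUE on age is likely too strict for real data. The sketch also declares no primary or foreign keys, because the metadata rows shown here do not mark them explicitly.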
We can see that RelationalSchema has inferred a relational schema consisting of four tables with primary keys and foreign keys interrelating the tables.
The relationships between the tables are the following:
- The table person represents the main entity of the dataset, with a row for each person.
- The table experience references the table person.
- The tables experience.technologies and experience.responsibilities reference the table experience, and inherit the reference to person from experience.
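To make these relationships concrete, here is a minimal sketch of how nested lists can be split into child tables with surrogate ids and inherited foreign keys. It is an illustration only and not relatable's actual implementation: the function flatten is hypothetical, it handles only scalar fields and lists of objects, and the input document is an assumed shape inferred from the column names above.

```python
def flatten(docs, root):
    """Split nested documents into flat tables keyed by dotted path.

    Each list of objects becomes a child table. Every row gets a surrogate
    "<table>.__id__" and inherits the ids of all of its ancestors.
    """
    tables = {}

    def visit(obj, table, parent_keys):
        rows = tables.setdefault(table, [])
        own_id = {f"{table}.__id__": len(rows)}
        row = {**own_id, **parent_keys}
        rows.append(row)
        # Keys handed down to child rows: this row's id plus inherited ids.
        keys = {**own_id, **parent_keys}
        for field, value in obj.items():
            # The root table keeps bare names; deeper levels use dotted paths.
            path = field if table == root else f"{table}.{field}"
            if isinstance(value, list):
                for item in value:
                    visit(item, path, keys)
            else:
                row[path] = value

    for doc in docs:
        visit(doc, root, {})
    return tables


# Hypothetical input document; the real data/example_input.json may differ.
docs = [
    {
        "name": "Bob",
        "age": 27,
        "experience": [
            {
                "company": "OpenAI",
                "role": "NLP Engineer",
                "from": 2019,
                "to": 2022,
                "technologies": [
                    {"name": "Triton", "primary": True},
                    {"name": "LaTeX", "primary": False},
                ],
            }
        ],
    }
]

tables = flatten(docs, "person")
for name, rows in tables.items():
    print(name, rows)
```

Passing the accumulated keys down the recursion is what gives each experience.technologies row both an experience.__id__ and a person.__id__, matching the inherited reference described in the list above.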
Finally, let's look at each of the tables:
dfs = [pd.DataFrame(t.data).set_index(t.primary_key) for t in rs.tables]
Table person:
person.__id__ | name | age |
---|---|---|
0 | Alice | 34 |
1 | Bob | 27 |
Table experience:
experience.__id__ | person.__id__ | experience.company | experience.role | experience.from | experience.to |
---|---|---|---|---|---|
0 | 0 | | Software Engineer | 2020 | 2022 |
1 | 0 | | Senior Data Scientist | 2017 | 2020 |
2 | 1 | OpenAI | NLP Engineer | 2019 | 2022 |
Table experience.technologies:
experience.technologies.__id__ | experience.__id__ | person.__id__ | experience.technologies.name | experience.technologies.primary |
---|---|---|---|---|
0 | 0 | 0 | C++ | True |
1 | 0 | 0 | LolCode | False |
2 | 1 | 0 | Python | True |
3 | 1 | 0 | Excel | False |
4 | 2 | 1 | Triton | True |
5 | 2 | 1 | LaTeX | False |
Table experience.responsibilities:
experience.responsibilities.__id__ | experience.__id__ | person.__id__ | experience.responsibilities.name |
---|---|---|---|
0 | 0 | 0 | Google stuff |
1 | 0 | 0 | Mark TensorFlow issues as "Won't Do" |
2 | 1 | 0 | Censor media |
3 | 1 | 0 | Learn the foundations of ML |
4 | 1 | 0 | Do Kaggle competitions |
5 | 2 | 1 | Assert that GPT-2 is racist |
6 | 2 | 1 | Assert that GPT-3 is racist |
7 | 2 | 1 | Develop a prototype of a premium non-racist language model |
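Since the child tables carry the person.__id__ of their ancestors, they can be joined directly to person without going through experience. The sketch below shows this with plain Python; the join helper is a hypothetical name introduced for illustration, and the rows are copied from the person and experience.technologies tables above, with the experience.__id__ column left out for brevity.

```python
# Rows copied from the person table above.
person = [
    {"person.__id__": 0, "name": "Alice", "age": 34},
    {"person.__id__": 1, "name": "Bob", "age": 27},
]

# Rows copied from the experience.technologies table above
# (experience.__id__ omitted for brevity).
technologies = [
    {"experience.technologies.__id__": 0, "person.__id__": 0, "experience.technologies.name": "C++", "experience.technologies.primary": True},
    {"experience.technologies.__id__": 1, "person.__id__": 0, "experience.technologies.name": "LolCode", "experience.technologies.primary": False},
    {"experience.technologies.__id__": 2, "person.__id__": 0, "experience.technologies.name": "Python", "experience.technologies.primary": True},
    {"experience.technologies.__id__": 3, "person.__id__": 0, "experience.technologies.name": "Excel", "experience.technologies.primary": False},
    {"experience.technologies.__id__": 4, "person.__id__": 1, "experience.technologies.name": "Triton", "experience.technologies.primary": True},
    {"experience.technologies.__id__": 5, "person.__id__": 1, "experience.technologies.name": "LaTeX", "experience.technologies.primary": False},
]


def join(child_rows, parent_rows, key):
    """Inner join: attach the matching parent row to each child row via key."""
    parents = {row[key]: row for row in parent_rows}
    return [
        {**parents[row[key]], **row}
        for row in child_rows
        if row[key] in parents
    ]


joined = join(technologies, person, "person.__id__")

# Collect each person's primary technologies.
primary_by_name = {}
for row in joined:
    if row["experience.technologies.primary"]:
        primary_by_name.setdefault(row["name"], []).append(
            row["experience.technologies.name"]
        )

print(primary_by_name)
```

The same join could be expressed with pandas on the DataFrames built earlier; the dictionary version is used here only to keep the sketch dependency-free.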
Example of use with the Airbnb MongoDB sample dataset
Another example, using the Airbnb MongoDB sample dataset, can be found in the repository of this package in the script examples/airbnb_example.py.