A python package for extracting and formatting table metadata from databases
Project description
Extract metadata for data engineering pipelines
This repo lets you extract metadata from a database and shape it into a folder of json files that etl_manager can read.
The create_all_metadata
function will do most of the process in one go. See 'quickstart' below for a summary, or more detailed documentation below that. The end result will be:
- a json file containing a filtered list of tables
- a subfolder for the metadata
- in that subfolder, a database.json file with overall metadata for the database
- also in that subfolder, another .json file listing the columns and other metadata for each table
Requirements
This runs in Python 3.6+. You'll need to install cx_Oracle 8.0.0+.
Installing cx_Oracle also involves installing its client libraries. You can download it from Oracle or, if you're on Mac, install by Homebrew:
brew tap InstantClientTap/instantclient
brew install instantclient-basic
cx_Oracle problems 1: client host name
If you're on a Mac, you might get errors like "cx_Oracle.DatabaseError: ORA-24454: client host name is not set". If you do, you'll need to adjust your hosts file. To do this:
- go to system preferences, then sharing, and note the computer name at the top
- go to your hard disk, then the
etc
folder and find thehosts
file - back up your hosts file in case anything weird happens
- in hosts, find the line that says
127.0.0.1 localhost
- under it, add a new line that says
127.0.0.01 computer_name
, where computer_name is the one you got from system preferences/sharing - save the hosts file
cx_Oracle problems 2: client library location
Depending on how you installed the Oracle client libraries, the create_oracle_connection function might not work with default parameters.
If it won't connect, try specifying the location of your client libraries using the oracle_client_lib parameter.
Quick start
Here's an example of creating a full metadata folder by using create_all_metadata and a custom function to filter the table list:
from pathlib import Path
from extract_metadata.connect import create_oracle_connection
from extract_metadata.metadata import create_all_metadata
def table_filter(table_list):
"""Takes a list of (table, tablespace) tuples and filters it down to tables that have 'REPLICATION_TEST' in their name"""
return [t[0] for t in table_list if "REPLICATION_TEST" in t[0]]
settings_location = Path.cwd().parent / "settings_folder"
connection = create_oracle_connection("delius_sandpit", settings_location)
create_all_metadata(
connection,
save_folder="delius",
title="delius_sandpit_test",
description="Here's a description",
schema="DELIUS_ANALYTICS_PLATFORM",
source_bucket="mojap-raw-hist-dev",
source_folder="hmpps/delius/DELIUS_ANALYTICS_PLATFORM",
filter_function=table_filter,
)
connection.close()
If you save this in a script called get_metadata.py
in /metadata/ folder, you'll end up with this folder structure:
metadata
|-- get_metadata.py
|-- delius
| |-- delius_sandpit_test.json
| |-- delius_sandpit_test
| | |-- database.json
| | |-- table1.json
| | |-- table2.json
Step by step
There are 3 steps to getting metadata from a database:
- Connect using connect.create_oracle_connection
- Make a list of all the tables and filter it to the ones you want - you can do both with table_list.get_table_list
- Get metadata from the filtered list of tables using metadata.create_metadata_folder
1. Connecting
You'll need some database connection settings. These should be in a json file structured like this:
{
"host": "HOST",
"password": "PASSWORD",
"port": 1234,
"service_name": "SERVICE_NAME",
"user": "database_username"
}
Then pass this file and its location to create_oracle_connection in the connect module. You might also need to use the oracle_client_lib parameter to specify where your Oracle client libraries are. See the cx_Oracle connection problems, above.
2. Making a list of tables
get_table_list in the table_list module will use your connection to get a list of tables from the database.
It will create a timestamped json file listing the tables. Specify the file's save_folder and title when you call the function.
Also specify the database schema to get the tables from.
You might not want the file to list all the tables. In this case, pass a filter_function as well. This should be a function that takes a single argument - a list, in this format:
[("TABLE_NAME", "TABLESPACE"), ("ANOTHER_TABLE_NAME", "TABLESPACE")]
Return the tables you want as a list of table names. So if you only wanted the first one from this list, you'd want to return:
["TABLE_NAME"]
3. Get the metadata for the tables
To read a folder with etl_manager, it must contain a database.json file specifying overall features of the database, plus a json file for each table.
You can create these in one go using metadata.create_metadata_folder, or separately with metadata.create_json_for_database and metadata.create_json_for_tables.
Tests
Unit tests are written for pytest. Run pytest
from the root folder to start them.
Where functions involve SQL queries, the unit tests don't check these queries - only the Python surrounding them.
Githooks
This repo comes with some githooks to make standard checks before you commit files to Github. The checks are:
- if you're using git-crypt, run
git-crypt status
and check for unencrypted file warnings - run Black on Python files
- run Flake8 on Python files
- run yamllint on yaml files
If you want to use these, run this command from the repo's root directory:
git config core.hooksPath githooks
See the data engineering template repo for details.
Licence
Extract metadata for data engineering pipelines
This repo lets you extract metadata from a database and shape it into a folder of json files that etl_manager can read.
The create_all_metadata
function will do most of the process in one go. See 'quickstart' below for a summary, or more detailed documentation below that. The end result will be:
- a json file containing a filtered list of tables
- a subfolder for the metadata
- in that subfolder, a database.json file with overall metadata for the database
- also in that subfolder, another .json file listing the columns and other metadata for each table
Requirements
This runs in Python 3.6+. You'll need to install cx_Oracle 8.0.0+.
Installing cx_Oracle also involves installing its client libraries. You can download it from Oracle or, if you're on Mac, install by Homebrew:
brew tap InstantClientTap/instantclient
brew install instantclient-basic
cx_Oracle problems 1: client host name
If you're on a Mac, you might get errors like "cx_Oracle.DatabaseError: ORA-24454: client host name is not set". If you do, you'll need to adjust your hosts file. To do this:
- go to system preferences, then sharing, and note the computer name at the top
- go to your hard disk, then the
etc
folder and find thehosts
file - back up your hosts file in case anything weird happens
- in hosts, find the line that says
127.0.0.1 localhost
- under it, add a new line that says
127.0.0.01 computer_name
, where computer_name is the one you got from system preferences/sharing - save the hosts file
cx_Oracle problems 2: client library location
Depending on how you installed the Oracle client libraries, the create_oracle_connection function might not work with default parameters.
If it won't connect, try specifying the location of your client libraries using the oracle_client_lib parameter.
Quick start
Here's an example of creating a full metadata folder by using create_all_metadata and a custom function to filter the table list:
from pathlib import Path
from extract_metadata.connect import create_oracle_connection
from extract_metadata.metadata import create_all_metadata
def table_filter(table_list):
"""Takes a list of (table, tablespace) tuples and filters it down to tables that have 'REPLICATION_TEST' in their name"""
return [t[0] for t in table_list if "REPLICATION_TEST" in t[0]]
settings_location = Path.cwd().parent / "settings_folder"
connection = create_oracle_connection("delius_sandpit", settings_location)
create_all_metadata(
connection,
save_folder="delius",
title="delius_sandpit_test",
description="Here's a description",
schema="DELIUS_ANALYTICS_PLATFORM",
source_bucket="mojap-raw-hist-dev",
source_folder="hmpps/delius/DELIUS_ANALYTICS_PLATFORM",
filter_function=table_filter,
)
connection.close()
If you save this in a script called get_metadata.py
in /metadata/ folder, you'll end up with this folder structure:
metadata
|-- get_metadata.py
|-- delius
| |-- delius_sandpit_test.json
| |-- delius_sandpit_test
| | |-- database.json
| | |-- table1.json
| | |-- table2.json
Step by step
There are 3 steps to getting metadata from a database:
- Connect using connect.create_oracle_connection
- Make a list of all the tables and filter it to the ones you want - you can do both with table_list.get_table_list
- Get metadata from the filtered list of tables using metadata.create_metadata_folder
1. Connecting
You'll need some database connection settings. These should be in a json file structured like this:
{
"host": "HOST",
"password": "PASSWORD",
"port": 1234,
"service_name": "SERVICE_NAME",
"user": "database_username"
}
Then pass this file and its location to create_oracle_connection in the connect module. You might also need to use the oracle_client_lib parameter to specify where your Oracle client libraries are. See the cx_Oracle connection problems, above.
2. Making a list of tables
get_table_list in the table_list module will use your connection to get a list of tables from the database.
It will create a timestamped json file listing the tables. Specify the file's save_folder and title when you call the function.
Also specify the database schema to get the tables from.
You might not want the file to list all the tables. In this case, pass a filter_function as well. This should be a function that takes a single argument - a list, in this format:
[("TABLE_NAME", "TABLESPACE"), ("ANOTHER_TABLE_NAME", "TABLESPACE")]
Return the tables you want as a list of table names. So if you only wanted the first one from this list, you'd want to return:
["TABLE_NAME"]
3. Get the metadata for the tables
To read a folder with etl_manager, it must contain a database.json file specifying overall features of the database, plus a json file for each table.
You can create these in one go using metadata.create_metadata_folder, or separately with metadata.create_json_for_database and metadata.create_json_for_tables.
Tests
Unit tests are written for pytest. Run pytest
from the root folder to start them.
Where functions involve SQL queries, the unit tests don't check these queries - only the Python surrounding them.
How to update
Update and release new versions using Poetry. Make sure to change the version number in pyproject.toml
and describe the change in CHANGELOG.md
.
If you've changed any dependencies in pyproject.yaml
, run poetry update
to update poetry.lock
.
Once you've created a release in GitHub, to publish the latest version to PyPI, run:
poetry build
poetry publish -u <username>
Here, you should replace <username>
with your PyPI username. To publish to PyPI, you must be an owner of the project.
Licence
Project details
Release history Release notifications | RSS feed
Download files
Download the file for your platform. If you're not sure which to choose, learn more about installing packages.
Source Distribution
Built Distribution
Hashes for data-engineering-extract-metadata-1.2.0.tar.gz
Algorithm | Hash digest | |
---|---|---|
SHA256 | 4d10beb687901335bb272f2edcee67f0ede0ed3dec8f196aedad3358524cd73d |
|
MD5 | 2484034322cc5e5d4c16d2a9413bebb7 |
|
BLAKE2b-256 | eb0b367653bfd7e1bd79098b70aa83b82aace7d31f4f20b1b730bec92e5e0b03 |
Hashes for data_engineering_extract_metadata-1.2.0-py3-none-any.whl
Algorithm | Hash digest | |
---|---|---|
SHA256 | 9c4ec9e14e0648e761b27da9467c36ee241cd6790ef22f544d3f4586990e246b |
|
MD5 | f94b4c7d600e656a8fe8a9d16940b0c3 |
|
BLAKE2b-256 | 050ab7b4c421070edfe94fb0f73e20dfcabf17496019bbfbec2a07148b8756dc |