Skip to main content

A python package for extracting and formatting table metadata from databases

Project description

Extract metadata for data engineering pipelines

This repo lets you extract metadata from a database and shape it into a folder of json files that etl_manager can read.

The create_all_metadata function will do most of the process in one go. See 'quickstart' below for a summary, or more detailed documentation below that. The end result will be:

  • a json file containing a filtered list of tables
  • a subfolder for the metadata
  • in that subfolder, a database.json file with overall metadata for the database
  • also in that subfolder, another .json file listing the columns and other metadata for each table

Requirements

This runs in Python 3.6+. You'll need to install cx_Oracle 8.0.0+.

Installing cx_Oracle also involves installing its client libraries. You can download it from Oracle or, if you're on Mac, install by Homebrew:

brew tap InstantClientTap/instantclient brew install instantclient-basic

cx_Oracle problems 1: client host name

If you're on a Mac, you might get errors like "cx_Oracle.DatabaseError: ORA-24454: client host name is not set". If you do, you'll need to adjust your hosts file. To do this:

  • go to system preferences, then sharing, and note the computer name at the top
  • go to your hard disk, then the etc folder and find the hosts file
  • back up your hosts file in case anything weird happens
  • in hosts, find the line that says 127.0.0.1 localhost
  • under it, add a new line that says 127.0.0.01 computer_name, where computer_name is the one you got from system preferences/sharing
  • save the hosts file

cx_Oracle problems 2: client library location

Depending on how you installed the Oracle client libraries, the create_oracle_connection function might not work with default parameters.

If it won't connect, try specifying the location of your client libraries using the oracle_client_lib parameter.

Quick start

Here's an example of creating a full metadata folder by using create_all_metadata and a custom function to filter the table list:

from pathlib import Path
from extract_metadata.connect import create_oracle_connection
from extract_metadata.metadata import create_all_metadata


def table_filter(table_list):
    """Takes a list of (table, tablespace) tuples and filters it down to tables that have 'REPLICATION_TEST' in their name"""
    return [t[0] for t in table_list if "REPLICATION_TEST" in t[0]]


settings_location = Path.cwd().parent / "settings_folder"
connection = create_oracle_connection("delius_sandpit", settings_location)

create_all_metadata(
    connection,
    save_folder="delius",
    title="delius_sandpit_test",
    description="Here's a description",
    schema="DELIUS_ANALYTICS_PLATFORM",
    source_bucket="mojap-raw-hist-dev",
    source_folder="hmpps/delius/DELIUS_ANALYTICS_PLATFORM",
    filter_function=table_filter,
)

connection.close()

If you save this in a script called get_metadata.py in /metadata/ folder, you'll end up with this folder structure:

metadata
|-- get_metadata.py
|-- delius
|   |-- delius_sandpit_test.json
|   |-- delius_sandpit_test
|   |   |-- database.json
|   |   |-- table1.json
|   |   |-- table2.json

Step by step

There are 3 steps to getting metadata from a database:

  1. Connect using connect.create_oracle_connection
  2. Make a list of all the tables and filter it to the ones you want - you can do both with table_list.get_table_list
  3. Get metadata from the filtered list of tables using metadata.create_metadata_folder

1. Connecting

You'll need some database connection settings. These should be in a json file structured like this:

{
    "host": "HOST",
    "password": "PASSWORD",
    "port": 1234,
    "service_name": "SERVICE_NAME",
    "user": "database_username"
}

Then pass this file and its location to create_oracle_connection in the connect module. You might also need to use the oracle_client_lib parameter to specify where your Oracle client libraries are. See the cx_Oracle connection problems, above.

2. Making a list of tables

get_table_list in the table_list module will use your connection to get a list of tables from the database.

It will create a timestamped json file listing the tables. Specify the file's save_folder and title when you call the function.

Also specify the database schema to get the tables from.

You might not want the file to list all the tables. In this case, pass a filter_function as well. This should be a function that takes a single argument - a list, in this format:

[("TABLE_NAME", "TABLESPACE"), ("ANOTHER_TABLE_NAME", "TABLESPACE")]

Return the tables you want as a list of table names. So if you only wanted the first one from this list, you'd want to return:

["TABLE_NAME"]

3. Get the metadata for the tables

To read a folder with etl_manager, it must contain a database.json file specifying overall features of the database, plus a json file for each table.

You can create these in one go using metadata.create_metadata_folder, or separately with metadata.create_json_for_database and metadata.create_json_for_tables.

Tests

Unit tests are written for pytest. Run pytest from the root folder to start them.

Where functions involve SQL queries, the unit tests don't check these queries - only the Python surrounding them.

Githooks

This repo comes with some githooks to make standard checks before you commit files to Github. The checks are:

  • if you're using git-crypt, run git-crypt status and check for unencrypted file warnings
  • run Black on Python files
  • run Flake8 on Python files
  • run yamllint on yaml files

If you want to use these, run this command from the repo's root directory:

git config core.hooksPath githooks

See the data engineering template repo for details.

Licence

MIT Licence

Extract metadata for data engineering pipelines

This repo lets you extract metadata from a database and shape it into a folder of json files that etl_manager can read.

The create_all_metadata function will do most of the process in one go. See 'quickstart' below for a summary, or more detailed documentation below that. The end result will be:

  • a json file containing a filtered list of tables
  • a subfolder for the metadata
  • in that subfolder, a database.json file with overall metadata for the database
  • also in that subfolder, another .json file listing the columns and other metadata for each table

Requirements

This runs in Python 3.6+. You'll need to install cx_Oracle 8.0.0+.

Installing cx_Oracle also involves installing its client libraries. You can download it from Oracle or, if you're on Mac, install by Homebrew:

brew tap InstantClientTap/instantclient brew install instantclient-basic

cx_Oracle problems 1: client host name

If you're on a Mac, you might get errors like "cx_Oracle.DatabaseError: ORA-24454: client host name is not set". If you do, you'll need to adjust your hosts file. To do this:

  • go to system preferences, then sharing, and note the computer name at the top
  • go to your hard disk, then the etc folder and find the hosts file
  • back up your hosts file in case anything weird happens
  • in hosts, find the line that says 127.0.0.1 localhost
  • under it, add a new line that says 127.0.0.01 computer_name, where computer_name is the one you got from system preferences/sharing
  • save the hosts file

cx_Oracle problems 2: client library location

Depending on how you installed the Oracle client libraries, the create_oracle_connection function might not work with default parameters.

If it won't connect, try specifying the location of your client libraries using the oracle_client_lib parameter.

Quick start

Here's an example of creating a full metadata folder by using create_all_metadata and a custom function to filter the table list:

from pathlib import Path
from extract_metadata.connect import create_oracle_connection
from extract_metadata.metadata import create_all_metadata


def table_filter(table_list):
    """Takes a list of (table, tablespace) tuples and filters it down to tables that have 'REPLICATION_TEST' in their name"""
    return [t[0] for t in table_list if "REPLICATION_TEST" in t[0]]


settings_location = Path.cwd().parent / "settings_folder"
connection = create_oracle_connection("delius_sandpit", settings_location)

create_all_metadata(
    connection,
    save_folder="delius",
    title="delius_sandpit_test",
    description="Here's a description",
    schema="DELIUS_ANALYTICS_PLATFORM",
    source_bucket="mojap-raw-hist-dev",
    source_folder="hmpps/delius/DELIUS_ANALYTICS_PLATFORM",
    filter_function=table_filter,
)

connection.close()

If you save this in a script called get_metadata.py in /metadata/ folder, you'll end up with this folder structure:

metadata
|-- get_metadata.py
|-- delius
|   |-- delius_sandpit_test.json
|   |-- delius_sandpit_test
|   |   |-- database.json
|   |   |-- table1.json
|   |   |-- table2.json

Step by step

There are 3 steps to getting metadata from a database:

  1. Connect using connect.create_oracle_connection
  2. Make a list of all the tables and filter it to the ones you want - you can do both with table_list.get_table_list
  3. Get metadata from the filtered list of tables using metadata.create_metadata_folder

1. Connecting

You'll need some database connection settings. These should be in a json file structured like this:

{
    "host": "HOST",
    "password": "PASSWORD",
    "port": 1234,
    "service_name": "SERVICE_NAME",
    "user": "database_username"
}

Then pass this file and its location to create_oracle_connection in the connect module. You might also need to use the oracle_client_lib parameter to specify where your Oracle client libraries are. See the cx_Oracle connection problems, above.

2. Making a list of tables

get_table_list in the table_list module will use your connection to get a list of tables from the database.

It will create a timestamped json file listing the tables. Specify the file's save_folder and title when you call the function.

Also specify the database schema to get the tables from.

You might not want the file to list all the tables. In this case, pass a filter_function as well. This should be a function that takes a single argument - a list, in this format:

[("TABLE_NAME", "TABLESPACE"), ("ANOTHER_TABLE_NAME", "TABLESPACE")]

Return the tables you want as a list of table names. So if you only wanted the first one from this list, you'd want to return:

["TABLE_NAME"]

3. Get the metadata for the tables

To read a folder with etl_manager, it must contain a database.json file specifying overall features of the database, plus a json file for each table.

You can create these in one go using metadata.create_metadata_folder, or separately with metadata.create_json_for_database and metadata.create_json_for_tables.

Tests

Unit tests are written for pytest. Run pytest from the root folder to start them.

Where functions involve SQL queries, the unit tests don't check these queries - only the Python surrounding them.

How to update

Update and release new versions using Poetry. Make sure to change the version number in pyproject.toml and describe the change in CHANGELOG.md.

If you've changed any dependencies in pyproject.yaml, run poetry update to update poetry.lock.

Once you've created a release in GitHub, to publish the latest version to PyPI, run:

poetry build
poetry publish -u <username>

Here, you should replace <username> with your PyPI username. To publish to PyPI, you must be an owner of the project.

Licence

MIT Licence

Project details


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

data_engineering_extract_metadata-1.1.2.tar.gz (13.5 kB view hashes)

Uploaded Source

Built Distribution

Supported by

AWS AWS Cloud computing and Security Sponsor Datadog Datadog Monitoring Fastly Fastly CDN Google Google Download Analytics Microsoft Microsoft PSF Sponsor Pingdom Pingdom Monitoring Sentry Sentry Error logging StatusPage StatusPage Status page