A python package for extracting and formatting table metadata from databases

These details have been verified by PyPI

Maintainers

AlecJohnson calumabarnett darius.nicholson dfuller jhpyke

These details have not been verified by PyPI

Project links

Project description

Extract metadata for data engineering pipelines

This repo lets you extract metadata from a database and shape it into a folder of json files that etl_manager can read.

The create_all_metadata function will do most of the process in one go. See 'quickstart' below for a summary, or more detailed documentation below that. The end result will be:

a json file containing a filtered list of tables
a subfolder for the metadata
in that subfolder, a database.json file with overall metadata for the database
also in that subfolder, another .json file listing the columns and other metadata for each table

Requirements

This runs in Python 3.6+. You'll need to install cx_Oracle 8.0.0+.

Installing cx_Oracle also involves installing its client libraries. You can download it from Oracle or, if you're on Mac, install by Homebrew:

brew tap InstantClientTap/instantclient brew install instantclient-basic

cx_Oracle problems 1: client host name

If you're on a Mac, you might get errors like "cx_Oracle.DatabaseError: ORA-24454: client host name is not set". If you do, you'll need to adjust your hosts file. To do this:

go to system preferences, then sharing, and note the computer name at the top
go to your hard disk, then the etc folder and find the hosts file
back up your hosts file in case anything weird happens
in hosts, find the line that says 127.0.0.1 localhost
under it, add a new line that says 127.0.0.01 computer_name, where computer_name is the one you got from system preferences/sharing
save the hosts file

cx_Oracle problems 2: client library location

Depending on how you installed the Oracle client libraries, the create_oracle_connection function might not work with default parameters.

If it won't connect, try specifying the location of your client libraries using the oracle_client_lib parameter.

Quick start

Here's an example of creating a full metadata folder by using create_all_metadata and a custom function to filter the table list:

from pathlib import Path
from extract_metadata.connect import create_oracle_connection
from extract_metadata.metadata import create_all_metadata


def table_filter(table_list):
    """Takes a list of (table, tablespace) tuples and filters it down to tables that have 'REPLICATION_TEST' in their name"""
    return [t[0] for t in table_list if "REPLICATION_TEST" in t[0]]


settings_location = Path.cwd().parent / "settings_folder"
connection = create_oracle_connection("delius_sandpit", settings_location)

create_all_metadata(
    connection,
    save_folder="delius",
    title="delius_sandpit_test",
    description="Here's a description",
    schema="DELIUS_ANALYTICS_PLATFORM",
    source_bucket="mojap-raw-hist-dev",
    source_folder="hmpps/delius/DELIUS_ANALYTICS_PLATFORM",
    filter_function=table_filter,
)

connection.close()

If you save this in a script called get_metadata.py in /metadata/ folder, you'll end up with this folder structure:

metadata
|-- get_metadata.py
|-- delius
|   |-- delius_sandpit_test.json
|   |-- delius_sandpit_test
|   |   |-- database.json
|   |   |-- table1.json
|   |   |-- table2.json

Step by step

There are 3 steps to getting metadata from a database:

Connect using connect.create_oracle_connection
Make a list of all the tables and filter it to the ones you want - you can do both with table_list.get_table_list
Get metadata from the filtered list of tables using metadata.create_metadata_folder

1. Connecting

You'll need some database connection settings. These should be in a json file structured like this:

{
    "host": "HOST",
    "password": "PASSWORD",
    "port": 1234,
    "service_name": "SERVICE_NAME",
    "user": "database_username"
}

Then pass this file and its location to create_oracle_connection in the connect module. You might also need to use the oracle_client_lib parameter to specify where your Oracle client libraries are. See the cx_Oracle connection problems, above.

2. Making a list of tables

get_table_list in the table_list module will use your connection to get a list of tables from the database.

It will create a timestamped json file listing the tables. Specify the file's save_folder and title when you call the function.

Also specify the database schema to get the tables from.

You might not want the file to list all the tables. In this case, pass a filter_function as well. This should be a function that takes a single argument - a list, in this format:

[("TABLE_NAME", "TABLESPACE"), ("ANOTHER_TABLE_NAME", "TABLESPACE")]

Return the tables you want as a list of table names. So if you only wanted the first one from this list, you'd want to return:

["TABLE_NAME"]

3. Get the metadata for the tables

To read a folder with etl_manager, it must contain a database.json file specifying overall features of the database, plus a json file for each table.

You can create these in one go using metadata.create_metadata_folder, or separately with metadata.create_json_for_database and metadata.create_json_for_tables.

Tests

Unit tests are written for pytest. Run pytest from the root folder to start them.

Where functions involve SQL queries, the unit tests don't check these queries - only the Python surrounding them.

Githooks

This repo comes with some githooks to make standard checks before you commit files to Github. The checks are:

if you're using git-crypt, run git-crypt status and check for unencrypted file warnings
run Black on Python files
run Flake8 on Python files
run yamllint on yaml files

If you want to use these, run this command from the repo's root directory:

git config core.hooksPath githooks

See the data engineering template repo for details.

Licence

MIT Licence

Extract metadata for data engineering pipelines

This repo lets you extract metadata from a database and shape it into a folder of json files that etl_manager can read.

The create_all_metadata function will do most of the process in one go. See 'quickstart' below for a summary, or more detailed documentation below that. The end result will be:

a json file containing a filtered list of tables
a subfolder for the metadata
in that subfolder, a database.json file with overall metadata for the database
also in that subfolder, another .json file listing the columns and other metadata for each table

Requirements

This runs in Python 3.6+. You'll need to install cx_Oracle 8.0.0+.

Installing cx_Oracle also involves installing its client libraries. You can download it from Oracle or, if you're on Mac, install by Homebrew:

brew tap InstantClientTap/instantclient brew install instantclient-basic

cx_Oracle problems 1: client host name

If you're on a Mac, you might get errors like "cx_Oracle.DatabaseError: ORA-24454: client host name is not set". If you do, you'll need to adjust your hosts file. To do this:

go to system preferences, then sharing, and note the computer name at the top
go to your hard disk, then the etc folder and find the hosts file
back up your hosts file in case anything weird happens
in hosts, find the line that says 127.0.0.1 localhost
under it, add a new line that says 127.0.0.01 computer_name, where computer_name is the one you got from system preferences/sharing
save the hosts file

cx_Oracle problems 2: client library location

Depending on how you installed the Oracle client libraries, the create_oracle_connection function might not work with default parameters.

If it won't connect, try specifying the location of your client libraries using the oracle_client_lib parameter.

Quick start

Here's an example of creating a full metadata folder by using create_all_metadata and a custom function to filter the table list:

from pathlib import Path
from extract_metadata.connect import create_oracle_connection
from extract_metadata.metadata import create_all_metadata


def table_filter(table_list):
    """Takes a list of (table, tablespace) tuples and filters it down to tables that have 'REPLICATION_TEST' in their name"""
    return [t[0] for t in table_list if "REPLICATION_TEST" in t[0]]


settings_location = Path.cwd().parent / "settings_folder"
connection = create_oracle_connection("delius_sandpit", settings_location)

create_all_metadata(
    connection,
    save_folder="delius",
    title="delius_sandpit_test",
    description="Here's a description",
    schema="DELIUS_ANALYTICS_PLATFORM",
    source_bucket="mojap-raw-hist-dev",
    source_folder="hmpps/delius/DELIUS_ANALYTICS_PLATFORM",
    filter_function=table_filter,
)

connection.close()

If you save this in a script called get_metadata.py in /metadata/ folder, you'll end up with this folder structure:

metadata
|-- get_metadata.py
|-- delius
|   |-- delius_sandpit_test.json
|   |-- delius_sandpit_test
|   |   |-- database.json
|   |   |-- table1.json
|   |   |-- table2.json

Step by step

There are 3 steps to getting metadata from a database:

Connect using connect.create_oracle_connection
Make a list of all the tables and filter it to the ones you want - you can do both with table_list.get_table_list
Get metadata from the filtered list of tables using metadata.create_metadata_folder

1. Connecting

You'll need some database connection settings. These should be in a json file structured like this:

{
    "host": "HOST",
    "password": "PASSWORD",
    "port": 1234,
    "service_name": "SERVICE_NAME",
    "user": "database_username"
}

2. Making a list of tables

get_table_list in the table_list module will use your connection to get a list of tables from the database.

It will create a timestamped json file listing the tables. Specify the file's save_folder and title when you call the function.

Also specify the database schema to get the tables from.

You might not want the file to list all the tables. In this case, pass a filter_function as well. This should be a function that takes a single argument - a list, in this format:

[("TABLE_NAME", "TABLESPACE"), ("ANOTHER_TABLE_NAME", "TABLESPACE")]

Return the tables you want as a list of table names. So if you only wanted the first one from this list, you'd want to return:

["TABLE_NAME"]

3. Get the metadata for the tables

To read a folder with etl_manager, it must contain a database.json file specifying overall features of the database, plus a json file for each table.

You can create these in one go using metadata.create_metadata_folder, or separately with metadata.create_json_for_database and metadata.create_json_for_tables.

Tests

Unit tests are written for pytest. Run pytest from the root folder to start them.

Where functions involve SQL queries, the unit tests don't check these queries - only the Python surrounding them.

How to update

Update and release new versions using Poetry. Make sure to change the version number in pyproject.toml and describe the change in CHANGELOG.md.

If you've changed any dependencies in pyproject.yaml, run poetry update to update poetry.lock.

Once you've created a release in GitHub, to publish the latest version to PyPI, run:

poetry build
poetry publish -u <username>

Here, you should replace <username> with your PyPI username. To publish to PyPI, you must be an owner of the project.

Licence

MIT Licence

Project details

These details have been verified by PyPI

Maintainers

AlecJohnson calumabarnett darius.nicholson dfuller jhpyke

These details have not been verified by PyPI

Project links

Release history Release notifications | RSS feed

2.0.0

Mar 5, 2021

1.4.0

Feb 23, 2021

1.3.2

Jan 15, 2021

1.3.1

Jan 13, 2021

1.2.0

Dec 3, 2020

This version

1.1.2

Nov 10, 2020

1.1.1

Nov 10, 2020

1.1.0

Nov 4, 2020

1.0.2

Oct 29, 2020

1.0.1

Oct 27, 2020

1.0.0

Oct 15, 2020

0.2.0

Oct 14, 2020

0.1.2

Sep 22, 2020

0.1.1

Sep 14, 2020

Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

data_engineering_extract_metadata-1.1.2.tar.gz (13.5 kB view hashes)

Uploaded Nov 10, 2020 Source

Built Distribution

data_engineering_extract_metadata-1.1.2-py3-none-any.whl (12.7 kB view hashes)

Uploaded Nov 10, 2020 Python 3

Hashes for data_engineering_extract_metadata-1.1.2.tar.gz

Hashes for data_engineering_extract_metadata-1.1.2.tar.gz
Algorithm	Hash digest
SHA256	`f2ccc3b62bbbffeefdf1017a85382d8f1b45606f33f0661e3826c9d306365e4c`
MD5	`0bddf51a54e25b108d35f61c120b8614`
BLAKE2b-256	`7002844d144af37cc002f6e04260a990e7c0eab7f02d00808666f3b7d4f3eae3`

Hashes for data_engineering_extract_metadata-1.1.2-py3-none-any.whl

Hashes for data_engineering_extract_metadata-1.1.2-py3-none-any.whl
Algorithm	Hash digest
SHA256	`06411b553223302f88419df05dac6221f37e5bbf6b082998fbfa9d38a2454c82`
MD5	`e40190be260e72c5d23f00296680b57d`
BLAKE2b-256	`857ff88000802154e432239de2e1c84586f3e6a7e872c64ba91168ac641e6370`

data-engineering-extract-metadata 1.1.2

Navigation

Verified details

Maintainers

Unverified details

Project links

Meta

Classifiers

Project description

Extract metadata for data engineering pipelines

Requirements

cx_Oracle problems 1: client host name

cx_Oracle problems 2: client library location

Quick start

Step by step

1. Connecting

2. Making a list of tables

3. Get the metadata for the tables

Tests

Githooks

Licence

Extract metadata for data engineering pipelines

Requirements

cx_Oracle problems 1: client host name

cx_Oracle problems 2: client library location

Quick start

Step by step

1. Connecting

2. Making a list of tables

3. Get the metadata for the tables

Tests

How to update

Licence

Project details

Verified details

Maintainers

Unverified details

Project links

Meta

Classifiers

Release history Release notifications | RSS feed

Download files

Source Distribution

Built Distribution