thoth-storages·PyPI

Storage and database adapters available in project Thoth

This project has been archived.

The maintainers of this project have marked this project as archived. No new releases are expected.

Project description

thoth-storages is a library used in project Thoth. It exposes core queries and methods for the PostgreSQL database, as well as adapters for working with Ceph via its S3-compatible API.

Quick Start

Pre-requisites:

Make sure you have podman and podman-compose installed. You can install them by running dnf install -y podman podman-compose
Make sure you are in an environment created with pipenv install --dev

To develop locally the first time:

Have a pg dump at hand; you can retrieve one from AWS S3
Get the latest PostgreSQL container image from: https://catalog.redhat.com/software/containers/rhel8/postgresql-13/5ffdbdef73a65398111b8362?container-tabs=gti&gti-tabs=red-hat-login
Run podman-compose up to start the pods for the database and pgweb. For more detail, refer to the Running PostgreSQL locally section
Run this command to sync the pg dump into the local database:
```
psql -h localhost -p 5432 --username=postgres < pg_dump.sql
```

Now you are ready to test new queries or create new migrations.

If you already have a local database, make sure it is not outdated and remember to follow the Generating migrations and schema adjustment in deployment section before testing any changes.

Installation and Usage

The library can be installed via pip or Pipenv from PyPI:

pipenv install thoth-storages

The library provides a CLI that can assist you with exploring the schema and storing data:

thoth-storages --help
# In a cloned repo, run:
PYTHONPATH=. pipenv run python3 thoth-storages --help

You can run the prepared test suite via the following commands:

pipenv install --dev
pipenv run python3 setup.py test

Running PostgreSQL locally

You can use the docker-compose.yaml present in this repository to run a local PostgreSQL instance (make sure you have podman-compose installed):

$ dnf install -y podman podman-compose
$ # Also available from PyPI: pip install podman-compose
$ podman-compose up

After running the commands above, you should be able to access a local PostgreSQL instance at localhost:5432. This is also the default configuration of the PostgreSQL adapter, which connects to localhost unless KNOWLEDGE_GRAPH_HOST is supplied explicitly (see the other environment variables in the adapter constructor for more info on configuring the connection). The default configuration uses a database named postgres, which can be accessed with the user postgres and the password postgres (SSL is disabled).
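The defaults described above can be sketched as a small helper that builds a PostgreSQL URL. Only KNOWLEDGE_GRAPH_HOST is named in this document; the other variable names below are assumptions made for illustration, so check the adapter constructor for the real ones.

```python
import os
from typing import Mapping, Optional


def build_postgres_url(env: Optional[Mapping[str, str]] = None) -> str:
    """Build a PostgreSQL URL, falling back to the local defaults."""
    env = os.environ if env is None else env
    # KNOWLEDGE_GRAPH_HOST comes from the text above; the remaining
    # variable names are hypothetical placeholders.
    host = env.get("KNOWLEDGE_GRAPH_HOST", "localhost")
    port = env.get("KNOWLEDGE_GRAPH_PORT", "5432")
    user = env.get("KNOWLEDGE_GRAPH_USER", "postgres")
    password = env.get("KNOWLEDGE_GRAPH_PASSWORD", "postgres")
    database = env.get("KNOWLEDGE_GRAPH_DATABASE", "postgres")
    return f"postgresql://{user}:{password}@{host}:{port}/{database}"


if __name__ == "__main__":
    # With no overrides this prints the URL of the local instance.
    print(build_postgres_url({}))
```

Passing an explicit mapping instead of reading os.environ directly keeps the helper easy to try out without touching the real environment.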

The provided docker-compose.yaml also enables PGweb for exploring data through a UI. To access it, visit localhost:8081.

The provided docker-compose.yaml does not use any volume, so the database content is lost when the containers restart.

You can sync your local instance using psql:

$ psql -h localhost -p 5432 --username=postgres < pg_dump.sql

If you would like to experiment with PostgreSQL programmatically, you can use the following code snippet as a starting point:

from thoth.storages import GraphDatabase

graph = GraphDatabase()
graph.connect()
# To clear database:
# graph.drop_all()
# To initialize schema in the graph database:
# graph.initialize_schema()

Generating migrations and schema adjustment in deployment

If you make any changes to the data model of the main PostgreSQL database, you need to generate migrations. These migrations state how to adjust an already existing database with data in deployments. Alembic migrations are used for this purpose. Alembic can (partially) automatically detect what has changed and how to adjust an already existing database in a deployment.

Alembic uses incremental version control, where each migration is versioned and states how to migrate from the previous state of the database to the desired next state. These versions are present in the alembic/versions directory and are automatically generated with the procedure described below.
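The idea of incremental versions can be illustrated with a minimal sketch. Alembic version files carry a revision identifier and a down_revision pointing at the previous one; the code below is not Alembic's implementation, and the revision names are hypothetical. It only shows how a chain of such links determines which migrations still have to be applied.

```python
from typing import Dict, List, Optional

# Hypothetical revisions: each maps its identifier to the revision it
# builds on (None marks the very first migration).
DOWN_REVISIONS: Dict[str, Optional[str]] = {
    "a1": None,
    "b2": "a1",
    "c3": "b2",
}


def pending_revisions(current: Optional[str], head: str) -> List[str]:
    """Return revisions to apply, oldest first, to get from current to head."""
    pending: List[str] = []
    revision: Optional[str] = head
    # Walk backwards from head until the current revision is reached.
    while revision is not None and revision != current:
        pending.append(revision)
        revision = DOWN_REVISIONS[revision]
    if revision != current:
        raise ValueError(f"{current!r} is not an ancestor of {head!r}")
    return list(reversed(pending))


if __name__ == "__main__":
    print(pending_revisions("a1", "c3"))
```

A database with no migrations applied is represented here by current=None, in which case every revision up to head is returned.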

If you make any changes, follow these steps to generate a version:

Make sure your local PostgreSQL instance is running (follow Running PostgreSQL locally instructions above):
```
$ podman-compose up
```

Run Alembic CLI to generate versions for you:

# Make sure you have your environment setup:
# pipenv install --dev
# Make sure you are running the most recent version of schema:
$ PYTHONPATH=. pipenv run alembic upgrade head
# Actually generate a new version:
$ PYTHONPATH=. pipenv run alembic revision --autogenerate -m "Added row to calculate sum of sums which will be divided by 42"

Review the migrations generated by Alembic. Note that not all changes are automatically detected by Alembic.
Make sure the generated migrations are part of your pull request so that changes are propagated to deployments:
```
$ git add thoth/storages/data/alembic/versions/
```
In a deployment, use the Management API and its /graph/initialize endpoint to propagate database schema changes (the Management API must include the recent schema changes, which are shipped with new thoth-storages releases).
If running locally and you would like to propagate changes, run the following Alembic command to update migrations to the latest version:
```
$ PYTHONPATH=. pipenv run alembic upgrade head
```
If you would like to update the schema programmatically, run the following Python code:
```
from thoth.storages import GraphDatabase

graph = GraphDatabase()
graph.connect()
graph.initialize_schema()
```

When updating a deployment, make sure all the components use the same database schema. Metrics exposed from a deployment should state schema version of all the components in a deployment.

Generate schema images

You can use the shipped thoth-storages CLI to automatically generate schema images out of the current models:

# First, make sure you have dev packages installed:
$ pipenv install --dev
$ PYTHONPATH=. pipenv run python3 ./thoth-storages generate-schema

The command above will produce an image named schema.png. Check --help to get more info on available options.

If the command above fails with the following exception:

FileNotFoundError: [Errno 2] "dot" not found in path.

make sure you have graphviz package installed:

dnf install -y graphviz
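If you prefer to check for Graphviz before generating the schema image, a minimal sketch using only the standard library could look like this. The helper name is hypothetical and is not part of thoth-storages.

```python
import shutil


def graphviz_available(executable: str = "dot") -> bool:
    """Return True if the given Graphviz executable is found on PATH."""
    return shutil.which(executable) is not None


if __name__ == "__main__":
    if not graphviz_available():
        print("'dot' not found in PATH; install the graphviz package first")
```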

Creating own performance indicators

Performance indicators report a performance aspect of a library on Amun. Their results can be synced automatically if the following procedure is respected.

To create your own performance indicator, create a script which tests the desired functionality of a library. An example is the matrix multiplication script present in the thoth-station/performance repository. This script can be supplied to Dependency Monkey to validate a certain combination of libraries in the desired runtime and buildtime environment. Please follow the instructions on how to create a performance script shown in the README of the performance repo.

To create the relevant models, adjust the thoth/storages/graph/models_performance.py file and add your model. Describe the parameters (reported in the @parameters section of the performance indicator result) and the result (reported in @result). The name of the class should match the name reported by the performance indicator run.

class PiMatmul(Base, BaseExtension, PerformanceIndicatorBase):
    """A class for representing a matrix multiplication micro-performance test."""

    # Device used during performance indicator run - CPU/GPU/TPU/...
    device = Column(String(128), nullable=False)
    matrix_size = Column(Integer, nullable=False)
    dtype = Column(String(128), nullable=False)
    reps = Column(Integer, nullable=False)
    elapsed = Column(Float, nullable=False)
    rate = Column(Float, nullable=False)

All the models use SQLAlchemy. See docs for more info.
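To make the relation between a performance indicator result and the model columns more concrete, here is a minimal sketch. Only the @parameters and @result section names come from the text above. The document layout, the example values, and the split of the PiMatmul fields between the two sections are assumptions made for illustration.

```python
from typing import Any, Dict


def pi_columns(document: Dict[str, Any]) -> Dict[str, Any]:
    """Merge the @parameters and @result sections into one column mapping."""
    parameters = document.get("@parameters", {})
    result = document.get("@result", {})
    overlap = set(parameters) & set(result)
    if overlap:
        raise ValueError(f"keys present in both sections: {sorted(overlap)}")
    return {**parameters, **result}


# Hypothetical PiMatmul output; the values are made up for illustration.
example = {
    "@parameters": {
        "device": "cpu",
        "matrix_size": 512,
        "dtype": "float32",
        "reps": 100,
    },
    "@result": {"elapsed": 1.5, "rate": 2.0},
}

if __name__ == "__main__":
    print(pi_columns(example))
```

The keys of the merged mapping line up with the column names declared on the model class, which is why the class has to describe both sections.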

Online debugging of queries

You can log all the queries that are performed against a PostgreSQL instance. To do so, set the following environment variable:

export THOTH_STORAGES_DEBUG_QUERIES=1

Memory usage statistics

You can log information about the PostgreSQL adapter together with statistics on the adapter's in-memory cache usage (the logger has to have at least level INFO set). To do so, set the following environment variable:

export THOTH_STORAGES_LOG_STATS=1

These statistics will be printed once the database adapter is destructed.
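The same switches can be set from Python instead of the shell. The sketch below is a hypothetical helper, not part of thoth-storages. It assumes that setting the variables before importing and using thoth.storages is sufficient, and that configuring the root logger at INFO is enough for the statistics to show up; the level needed to see the debug query output is not stated here.

```python
import logging
import os


def enable_diagnostics(debug_queries: bool = True, log_stats: bool = True) -> None:
    """Set the environment variables described above for this process."""
    if debug_queries:
        os.environ["THOTH_STORAGES_DEBUG_QUERIES"] = "1"
    if log_stats:
        os.environ["THOTH_STORAGES_LOG_STATS"] = "1"
    # Statistics are logged at INFO, so that level must not be filtered out.
    logging.basicConfig(level=logging.INFO)


# Assumption: call this before importing and using thoth.storages,
# in case the variables are read early.
enable_diagnostics()
```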

Automatic backups of Thoth deployment

In each deployment, an automatic knowledge graph backup cronjob is run, usually once a day. Results of automatic backups are stored on Ceph - you can find them in s3://<bucket-name>/<prefix>/<deployment-name>/graph-backup/pg_dump-<timestamp>.sql. Refer to deployment configuration for expansion of parameters in the path.
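The backup location above follows a fixed pattern, which can be sketched as a small helper. The function name is hypothetical, and the timestamp is passed through as an opaque string because its format is not specified here.

```python
def backup_object_path(bucket: str, prefix: str, deployment: str, timestamp: str) -> str:
    """Compose the S3 path of a knowledge graph backup."""
    return f"s3://{bucket}/{prefix}/{deployment}/graph-backup/pg_dump-{timestamp}.sql"


if __name__ == "__main__":
    # Placeholder values; take the real ones from the deployment configuration.
    print(backup_object_path("my-bucket", "my-prefix", "my-deployment", "1569491024"))
```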

To create a database instance out of this backup file, run a fresh local PostgreSQL instance and fill it from the backup file:

$ cd thoth-station/storages
$ aws s3 --endpoint <ceph-s3-endpoint> cp s3://<bucket-name>/<prefix>/<deployment-name>/graph-backup/pg_dump-<timestamp>.sql pg_dump-<timestamp>.sql
$ podman-compose up
$ psql -h localhost -p 5432 --username=postgres < pg_dump-<timestamp>.sql
password: <type password "postgres" here>
<logs will show up>

Manual backups of Thoth deployment

You can use the pg_dump and psql utilities to create dumps and restore the database content from them. Both tools are pre-installed in the container image that runs PostgreSQL, so you only need to execute pg_dump in the PostgreSQL container of Thoth’s deployment to create a dump, retrieve it with oc cp (or use oc exec to create the dump directly from the cluster), and then restore the database content with psql. This requires access to the running container (edit rights).

# Execute the following commands from the root of this Git repo:
# List PostgreSQL pods running:
$ oc get pod -l name=postgresql
NAME                 READY     STATUS    RESTARTS   AGE
postgresql-1-glwnr   1/1       Running   0          3d
# Open remote shell to the running container in the PostgreSQL pod:
$ oc rsh -t postgresql-1-glwnr bash
# Perform dump of the database:
(cluster-postgres) $ pg_dump > pg_dump-$(date +"%s").sql
(cluster-postgres) $ ls pg_dump-*.sql   # Remember the current dump name
(cluster-postgres) pg_dump-1569491024.sql
(cluster-postgres) $ exit
# Copy the dump to the current dir:
$ oc cp thoth-test-core/postgresql-1-glwnr:/opt/app-root/src/pg_dump-1569491024.sql  .
# Start local PostgreSQL instance:
$ podman-compose up --detach
<logs will show up>
$ psql -h localhost -p 5432 --username=postgres < pg_dump-1569491024.sql
password: <type password "postgres" here>
<logs will show up>

You can ignore error messages related to an owner error like this:

STATEMENT:  ALTER TABLE public.python_software_stack OWNER TO thoth;
ERROR:  role "thoth" does not exist

The PostgreSQL container uses user “postgres” by default which is different from the one run in the cluster (“thoth”). The role assignment will simply not be created but data will be available.
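If you capture the restore output to a file and want to separate the ignorable owner errors from anything else, a minimal sketch could look like the following. It assumes error lines have the shape shown above; the helper name is hypothetical.

```python
import re
from typing import Iterable, List

# Matches errors such as: ERROR:  role "thoth" does not exist
MISSING_ROLE = re.compile(r'^ERROR:\s+role "[^"]+" does not exist$')


def unexpected_errors(lines: Iterable[str]) -> List[str]:
    """Return ERROR lines that are not about a missing role."""
    errors = []
    for line in lines:
        stripped = line.strip()
        if stripped.startswith("ERROR:") and not MISSING_ROLE.match(stripped):
            errors.append(stripped)
    return errors


if __name__ == "__main__":
    log = [
        "STATEMENT:  ALTER TABLE public.python_software_stack OWNER TO thoth;",
        'ERROR:  role "thoth" does not exist',
    ]
    # An empty list means only ignorable owner errors were found.
    print(unexpected_errors(log))
```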

Syncing results of a workflow run in the cluster

Each workflow task in the cluster reports a JSON document that states the necessary information about the task run (metadata) and the actual results. These results are stored on Ceph object storage via its S3-compatible API and later synced to the knowledge graph by graph syncs. The component responsible for graph syncs is graph-sync-job. It is written to be generic enough to sync any data and report metrics about the synced data, so you don’t need to provide such logic for each new workload registered in the system. To sync the results of your own workload run in the cluster, implement the related syncing logic in sync.py and register a handler in HANDLERS_MAPPING in the same file. The mapping maps a prefix of the document id to the handler (function) responsible for syncing data into the knowledge base. Mind the signatures of the existing syncing functions so that your handler integrates automatically with the sync_documents function, which is called from graph-sync-job.
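The prefix-to-handler lookup described above can be sketched as follows. This is an illustration of the dispatch idea only: the handler functions, their signatures, and the prefixes are hypothetical and do not mirror the real entries in sync.py.

```python
from typing import Callable, Dict


def sync_solver_example(document_id: str) -> str:
    """Hypothetical handler for documents whose id starts with 'solver'."""
    return f"synced solver document {document_id}"


def sync_adviser_example(document_id: str) -> str:
    """Hypothetical handler for documents whose id starts with 'adviser'."""
    return f"synced adviser document {document_id}"


# Maps a document id prefix to the function responsible for syncing it.
EXAMPLE_HANDLERS: Dict[str, Callable[[str], str]] = {
    "solver": sync_solver_example,
    "adviser": sync_adviser_example,
}


def find_handler(document_id: str) -> Callable[[str], str]:
    """Pick the first handler whose prefix matches the document id."""
    for prefix, handler in EXAMPLE_HANDLERS.items():
        if document_id.startswith(prefix):
            return handler
    raise KeyError(f"no handler registered for document id {document_id!r}")


if __name__ == "__main__":
    print(find_handler("adviser-1234")("adviser-1234"))
```

Registering a new workload then amounts to adding one more prefix and function pair to the mapping.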

Query Naming conventions in Thoth

For query naming conventions, please read all the docs in conventions for query name.

Accessing data on Ceph

To access data on Ceph, you need to know the aws_access_key_id and aws_secret_access_key credentials for the endpoint you are connecting to.

The absolute path of the data you are accessing is constructed as: s3://<bucket_name>/<prefix_name>/<file_path>

There are two ways to initialize the data handler:

Configure environment variables

Variable name	Content
S3_ENDPOINT_URL	Ceph Host name
CEPH_BUCKET	Ceph Bucket name
CEPH_BUCKET_PREFIX	Ceph Prefix
CEPH_KEY_ID	Ceph Key ID
CEPH_SECRET_KEY	Ceph Secret Key

```
from thoth.storages.ceph import CephStore
ceph = CephStore()
```
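Before relying on the environment-based initialization, it can help to confirm that the variables are actually set. The sketch below uses only the standard library. The helper name is hypothetical, and treating all five variables as required is an assumption.

```python
import os
from typing import List, Mapping, Optional

# Variable names taken from the table above.
CEPH_VARIABLES = (
    "S3_ENDPOINT_URL",
    "CEPH_BUCKET",
    "CEPH_BUCKET_PREFIX",
    "CEPH_KEY_ID",
    "CEPH_SECRET_KEY",
)


def missing_ceph_variables(env: Optional[Mapping[str, str]] = None) -> List[str]:
    """Return the names of Ceph variables that are unset or empty."""
    env = os.environ if env is None else env
    return [name for name in CEPH_VARIABLES if not env.get(name)]


if __name__ == "__main__":
    missing = missing_ceph_variables()
    if missing:
        print("Set these variables first:", ", ".join(missing))
```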

Initialize the object directly with parameters

from thoth.storages.ceph import CephStore

# Replace the placeholder strings with your own values.
ceph = CephStore(
    key_id="<aws_access_key_id>",
    secret_key="<aws_secret_access_key>",
    prefix="<prefix_name>",
    host="<endpoint_url>",
    bucket="<bucket_name>",
)

After initialization, you are ready to retrieve data:

from thoth.storages.exceptions import NotFoundError

ceph.connect()

try:
    # For a dictionary stored as JSON
    json_data = ceph.retrieve_document("<file_path>")

    # For a general blob
    blob = ceph.retrieve_blob("<file_path>")

except NotFoundError:
    # The file does not exist
    pass

Accessing Thoth Data on the Operate-First Public Bucket

A public instance of Thoth’s database is available on the Operate-First Public Bucket for external contributors to start developing components of Thoth.

Instructions for accessing the bucket are available in the documentation of the thoth/datasets repository.

Be careful not to store any confidential or valuable information in this bucket as its content can be wiped out at any time.

Project details

Release history

This version

0.74.2

May 4, 2023

0.74.1

Jan 18, 2023

0.74.0

Jan 17, 2023

0.73.6

Dec 4, 2022

0.73.5

Oct 5, 2022

0.73.4

Sep 26, 2022

0.73.3

Sep 19, 2022

0.73.2

Sep 16, 2022

0.73.1

Aug 29, 2022

0.73.0

Aug 19, 2022

0.72.2

Aug 19, 2022

0.72.1

May 10, 2022

0.72.0

May 5, 2022

0.71.2

Apr 11, 2022

0.71.1

Mar 11, 2022

0.71.0

Mar 10, 2022

0.70.0

Feb 14, 2022

0.69.0

Feb 7, 2022

0.68.3

Feb 3, 2022

0.68.2

Feb 2, 2022

0.68.1

Jan 27, 2022

0.66.0

Jan 11, 2022

0.65.0

Jan 5, 2022

0.64.0

Jan 4, 2022

0.63.0

Dec 22, 2021

0.62.1

Dec 21, 2021

0.62.0

Dec 21, 2021

0.61.0

Dec 2, 2021

0.60.0

Nov 29, 2021

0.59.0

Nov 19, 2021

0.58.0

Nov 8, 2021

0.57.3

Oct 21, 2021

0.57.2

Oct 18, 2021

0.57.1

Oct 15, 2021

0.57.0

Sep 22, 2021

0.56.0

Sep 14, 2021

0.55.0

Aug 20, 2021

0.54.2

Aug 18, 2021

0.54.1

Aug 4, 2021

0.54.0

Jul 27, 2021

0.53.0

Jul 14, 2021

0.52.1

Jul 12, 2021

0.52.0

Jul 8, 2021

0.51.0

Jul 1, 2021

0.50.0

Jun 29, 2021

0.49.0

Jun 21, 2021

0.48.0

Jun 18, 2021

0.47.0

Jun 17, 2021

0.46.0

Jun 15, 2021

0.45.1

Jun 9, 2021

0.45.0

Jun 7, 2021

0.44.1

Jun 3, 2021

0.44.0

Jun 3, 2021

0.43.0

Jun 1, 2021

0.42.0

May 5, 2021

0.41.0

Apr 29, 2021

0.40.0

Apr 9, 2021

0.39.2

Mar 16, 2021

0.39.1

Mar 12, 2021

0.39.0

Mar 10, 2021

0.38.0

Mar 4, 2021

0.37.0

Feb 15, 2021

0.36.0

Feb 3, 2021

0.35.1

Feb 3, 2021

0.35.0

Feb 2, 2021

0.34.0

Feb 1, 2021

0.33.0

Jan 19, 2021

0.32.0

Jan 14, 2021

0.31.0

Jan 12, 2021

0.30.1

Jan 8, 2021

0.30.0

Jan 4, 2021

0.29.4

Dec 8, 2020

0.29.3

Dec 4, 2020

0.29.2

Dec 1, 2020

0.29.1

Nov 23, 2020

0.29.0

Nov 23, 2020

0.28.0

Nov 23, 2020

0.27.1

Nov 18, 2020

0.27.0

Nov 18, 2020

0.26.1

Nov 10, 2020

0.26.0

Nov 5, 2020

0.25.17

Nov 4, 2020

0.25.16

Oct 30, 2020

0.25.15

Oct 3, 2020

0.25.14

Sep 30, 2020

0.25.13

Sep 29, 2020

0.25.12

Sep 29, 2020

0.25.11

Sep 21, 2020

0.25.10

Sep 16, 2020

0.25.9

Sep 15, 2020

0.25.8

Sep 11, 2020

0.25.7

Sep 10, 2020

0.25.6

Sep 9, 2020

0.25.5

Aug 21, 2020

0.25.4

Aug 21, 2020

0.25.3

Aug 20, 2020

0.25.2

Aug 19, 2020

0.25.1

Aug 17, 2020

0.25.0

Jul 30, 2020

0.24.5

Jul 23, 2020

0.24.4

Jul 17, 2020

0.24.3

Jul 9, 2020

0.24.2

Jul 8, 2020

0.24.1

Jul 8, 2020

0.24.0

Jun 24, 2020

0.23.2

Jun 18, 2020

0.23.1

Jun 18, 2020

0.23.0

Jun 10, 2020

0.22.12

May 29, 2020

0.22.11

May 22, 2020

0.22.10

May 13, 2020

0.22.9

Apr 28, 2020

0.22.8

Apr 27, 2020

0.22.7

Mar 30, 2020

0.22.6

Mar 27, 2020

0.22.5

Mar 20, 2020

0.22.4

Mar 19, 2020

0.22.3

Feb 26, 2020

0.22.2

Feb 13, 2020

0.22.1

Feb 12, 2020

0.22.0

Feb 10, 2020

0.21.11

Jan 27, 2020

0.21.10

Jan 21, 2020

0.21.9

Jan 21, 2020

0.21.8

Jan 20, 2020

0.21.7

Jan 15, 2020

0.21.6

Jan 13, 2020

0.21.5

Jan 13, 2020

0.21.4

Jan 13, 2020

0.21.3

Jan 10, 2020

0.21.2

Jan 10, 2020

0.21.1

Jan 10, 2020

0.21.0

Jan 9, 2020

0.20.6

Jan 7, 2020

0.20.5

Jan 7, 2020

0.20.4

Jan 6, 2020

0.20.3

Jan 6, 2020

0.20.2

Jan 3, 2020

0.20.1

Jan 3, 2020

0.20.0

Jan 2, 2020

0.19.30

Dec 17, 2019

0.19.27

Dec 6, 2019

0.19.26

Dec 5, 2019

0.19.25

Nov 29, 2019

0.19.24

Nov 22, 2019

0.19.23

Nov 21, 2019

0.19.22

Nov 18, 2019

0.19.21

Nov 18, 2019

0.19.19

Nov 13, 2019

0.19.18

Nov 8, 2019

0.19.17

Nov 7, 2019

0.19.15

Nov 4, 2019

0.19.14

Oct 29, 2019

0.19.13

Oct 29, 2019

0.19.12

Oct 25, 2019

0.19.11

Oct 25, 2019

0.19.10

Oct 21, 2019

0.19.9

Sep 30, 2019

0.19.8

Sep 27, 2019

0.19.7

Sep 24, 2019

0.19.6

Sep 23, 2019

0.19.5

Sep 18, 2019

0.19.4

Sep 18, 2019

0.19.3

Sep 17, 2019

0.19.2

Sep 17, 2019

0.19.1

Sep 17, 2019

0.19.0

Sep 17, 2019

0.18.6

Aug 14, 2019

0.18.5

Aug 12, 2019

0.18.4

Aug 8, 2019

0.18.3

Aug 1, 2019

0.18.1

Aug 1, 2019

0.18.0

Jul 31, 2019

0.17.0

Jul 30, 2019

0.16.0

Jul 29, 2019

0.15.2

Jul 23, 2019

0.15.1

Jul 22, 2019

0.15.0

Jul 19, 2019

0.14.8

Jul 16, 2019

0.14.7

Jul 10, 2019

0.14.6

Jul 8, 2019

0.14.5

Jul 8, 2019

0.14.4

Jul 8, 2019

0.14.3

Jun 25, 2019

0.14.2

Jun 24, 2019

0.14.1

Jun 6, 2019

0.14.0

May 28, 2019

0.11.4

May 11, 2019

0.11.3

May 9, 2019

0.11.2

May 8, 2019

0.11.1

May 3, 2019

0.11.0

Apr 24, 2019

0.10.0

Apr 17, 2019

0.9.7

Mar 20, 2019

0.9.6

Feb 14, 2019

0.9.5

Dec 17, 2018

0.9.4

Dec 12, 2018

0.9.3

Dec 3, 2018

0.9.2

Dec 3, 2018

0.9.1

Dec 3, 2018

0.9.0

Nov 28, 2018

0.8.0

Nov 15, 2018

0.7.6

Nov 8, 2018

0.7.5

Nov 8, 2018

0.7.4

Nov 7, 2018

0.7.3

Nov 7, 2018

0.7.2

Oct 31, 2018

0.7.1

Oct 31, 2018

0.7.0

Oct 30, 2018

0.6.0

Oct 22, 2018

0.5.4

Oct 12, 2018

0.5.3

Oct 12, 2018

0.5.2

Sep 3, 2018

0.5.1

Aug 28, 2018

0.5.0

Aug 8, 2018

0.4.0

Aug 8, 2018

0.3.0

Aug 8, 2018

0.2.0

Aug 8, 2018

0.1.1

Jul 27, 2018

0.1.0

Jul 17, 2018

0.0.33

Jul 1, 2018

0.0.32

Jun 30, 2018

0.0.29

May 24, 2018

0.0.28

May 24, 2018

0.0.27

May 17, 2018

0.0.26

May 17, 2018

0.0.25

Apr 26, 2018

0.0.24

Apr 25, 2018

0.0.23

Apr 25, 2018

0.0.22

Apr 25, 2018

0.0.21

Apr 25, 2018

0.0.20

Apr 25, 2018

0.0.19

Apr 20, 2018

0.0.18

Apr 17, 2018

0.0.17

Apr 17, 2018

0.0.16

Apr 16, 2018

0.0.15

Apr 16, 2018

0.0.14

Apr 11, 2018

0.0.13

Mar 29, 2018

0.0.12

Mar 26, 2018

0.0.11

Mar 19, 2018

0.0.10

Mar 15, 2018

0.0.9

Mar 14, 2018

0.0.8

Mar 14, 2018

0.0.7

Mar 14, 2018

0.0.6

Mar 14, 2018

0.0.5

Mar 13, 2018

0.0.4

Mar 2, 2018

0.0.3

Feb 27, 2018

0.0.2

Feb 27, 2018

0.0.1

Feb 26, 2018

0.0.0

Feb 21, 2018

Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distributions

No source distribution files available for this release. See tutorial on generating distribution archives.

Built Distribution

thoth_storages-0.74.2-py3-none-any.whl (198.9 kB)

Uploaded May 4, 2023 Python 3

File details

Details for the file thoth_storages-0.74.2-py3-none-any.whl.

File metadata

Download URL: thoth_storages-0.74.2-py3-none-any.whl
Upload date: May 4, 2023
Size: 198.9 kB
Tags: Python 3
Uploaded using Trusted Publishing? No
Uploaded via: twine/4.0.2 CPython/3.10.6

File hashes

Hashes for thoth_storages-0.74.2-py3-none-any.whl
Algorithm	Hash digest
SHA256	`f80541ed8d4019bd77e6f0b9a4364fb9bf7df5c16c4f2afc843a5a18eda76d0f`
MD5	`2ba99ff56c11b74f57f47e7fac727d75`
BLAKE2b-256	`0e073941d693d75969897501746045670153ff6051c5395852a3c1ff8e57ea1b`

See more details on using hashes here.
