
DharitrI ETL

ETL (extract, transform and load) tools for publishing DharitrI blockchain data (extracted from a standard DharitrI Elasticsearch instance) on Google BigQuery.

Published data

Mainnet

Setup virtual environment

Create a virtual environment and install the dependencies:

python3 -m venv ./venv
source ./venv/bin/activate

pip install -r ./requirements.txt --upgrade
pip install -r ./requirements-dev.txt --upgrade

Run the tests

export PYTHONPATH=.
pytest -m "not integration"

Quickstart

This implementation copies the data from Elasticsearch in two parallel flows.

One flow copies the append-only indices (e.g. blocks, transactions, logs, receipts) into a staging BQ dataset. This process is incremental: it only copies data added since the last run, and it runs more often than the second flow (every hour, by default). Once the staging dataset is loaded, the data is transferred to the main BQ dataset using the BigQuery Data Transfer Service.

The second flow copies the mutable indices (e.g. tokens, accounts) into a staging BQ dataset. This process is not incremental: tables are truncated and reloaded on each run. Once the staging dataset is loaded, the data is transferred to the main BQ dataset using the BigQuery Data Transfer Service. This flow runs less often than the first (every 4 hours, by default).
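The difference between the two loading strategies can be sketched roughly as follows (an illustrative sketch only; the function names and in-memory reads are assumptions, not the actual dharitrietl internals):

```python
from typing import Callable

def load_append_only(index: str, read_since: Callable[[int], list], last_checkpoint: int) -> list:
    """Incremental load: fetch only the records added after the last checkpoint."""
    return read_since(last_checkpoint)

def load_mutable(index: str, read_all: Callable[[], list]) -> list:
    """Full reload: the staging table is truncated, then repopulated from scratch."""
    return read_all()

# In-memory stand-ins for the Elasticsearch reads:
records = [{"timestamp": t} for t in range(10)]
new_rows = load_append_only("blocks", lambda cp: [r for r in records if r["timestamp"] > cp], 6)
all_rows = load_mutable("tokens", lambda: records)
print(len(new_rows), len(all_rows))  # prints: 3 10
```

The incremental flow touches only the tail of the data, which is why it can afford to run more often.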

To run the two flows, either use the Docker setup (see the section below) or invoke the following commands directly:

# First, set the following environment variables:
export GCP_PROJECT_ID=dharitri-blockchain-etl
export WORKSPACE=${HOME}/dharitri-etl/mainnet

# The first flow (for append-only indices):
python3 -m dharitrietl.app process-append-only-indices --workspace=${WORKSPACE} --sleep-between-iterations=3600

# The second flow (for mutable indices):
python3 -m dharitrietl.app process-mutable-indices --workspace=${WORKSPACE} --sleep-between-iterations=86400

Rewinding

Sometimes, errors occur during the ETL process. For the append-only flow, it's recommended to rewind the BQ tables to the latest checkpoint (the last known good state) before re-running the process. Rewinding de-duplicates the data upfront through a simple removal of rows past the checkpoint. Otherwise, the full de-duplication step (performed automatically after each bulk of tasks, whenever the row counts in BQ and Elasticsearch do not match) would kick in, which is more expensive.
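The logic described above can be sketched as follows (hypothetical helper functions for illustration, not the actual dharitrietl internals):

```python
def needs_deduplication(bq_count: int, es_count: int) -> bool:
    """After each bulk of tasks, row counts are compared; a mismatch means
    duplicates (or gaps) slipped into BigQuery, triggering the expensive path."""
    return bq_count != es_count

def rewind_rows(rows: list, checkpoint: int) -> list:
    """Rewinding simply drops rows newer than the last good checkpoint,
    which is much cheaper than a full de-duplication pass."""
    return [r for r in rows if r["timestamp"] <= checkpoint]

rows = [{"timestamp": t} for t in [1, 2, 3, 3, 4]]  # a duplicate slipped in at t=3
print(needs_deduplication(bq_count=len(rows), es_count=4))  # prints: True
print(len(rewind_rows(rows, checkpoint=2)))  # prints: 2
```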

To rewind the BQ tables corresponding to the append-only indices to the latest checkpoint, run the following command:

python3 -m dharitrietl.app rewind --workspace=${WORKSPACE}

If the checkpoint is not available or is suspected to be corrupted, you can find the latest good checkpoint by running the following command:

python3 -m dharitrietl.app find-latest-good-checkpoint --workspace=${WORKSPACE}
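Conceptually, finding the latest good checkpoint amounts to scanning candidates from newest to oldest until a consistency check passes. A minimal sketch (the function and check below are hypothetical, not the dharitrietl implementation):

```python
from typing import Callable, Optional

def find_latest_good_checkpoint(candidates: list, is_good: Callable[[int], bool]) -> Optional[int]:
    """Scan candidate checkpoints from newest to oldest and return the first
    one that passes a consistency check (e.g. BQ vs. Elasticsearch counts match)."""
    for checkpoint in sorted(candidates, reverse=True):
        if is_good(checkpoint):
            return checkpoint
    return None

# Suppose checkpoints above 200 turn out to be inconsistent:
print(find_latest_good_checkpoint([100, 200, 300], lambda cp: cp <= 200))  # prints: 200
```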

Docker setup

Build the Docker image:

docker build --network host -f ./docker/Dockerfile -t dharitri-etl:latest .

Set up the containers:

# mainnet
docker compose --file ./docker/docker-compose.yml \
    --env-file ./docker/env/mainnet.env \
    --project-name dharitri-etl-mainnet up --detach

# devnet
docker compose --file ./docker/docker-compose.yml \
    --env-file ./docker/env/devnet.env \
    --project-name dharitri-etl-devnet up --detach

# testnet
docker compose --file ./docker/docker-compose.yml \
    --env-file ./docker/env/testnet.env \
    --project-name dharitri-etl-testnet up --detach

Generate schema files

Maintainers of this repository should trigger a re-generation of the BigQuery schema files whenever the Elasticsearch schema is updated. This is done by running the following command (make sure to check out drt-go-chain-tools in advance):

python3 -m dharitrietl.app regenerate-schema --input-folder=~/drt-go-chain-tools/elasticreindexer/cmd/indices-creator/config/noKibana/ --output-folder=./schema

The resulting files should be committed to this repository.

Occasionally, the load step may fail for some tables, e.g. when new fields are added to the Elasticsearch indices that the BigQuery schema does not yet know about. If so, re-generate the schema files (see above), update the BigQuery table schema with the bq command (the example below is for the tokens table), and restart the ETL flow:

bq update ${GCP_PROJECT_ID}:${BQ_DATASET}.tokens schema/tokens.json
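Before running bq update, it can be useful to see which fields a regenerated schema actually added. A small sketch (the helper below is hypothetical, not part of dharitrietl; BigQuery schema files are JSON arrays of field objects with at least a "name" key):

```python
def new_fields(old_schema: list, new_schema: list) -> list:
    """Return names of fields present in the regenerated schema but missing
    from the schema currently applied in BigQuery."""
    old_names = {field["name"] for field in old_schema}
    return [field["name"] for field in new_schema if field["name"] not in old_names]

old = [{"name": "identifier", "type": "STRING"}]
new = [{"name": "identifier", "type": "STRING"}, {"name": "numDecimals", "type": "NUMERIC"}]
print(new_fields(old, new))  # prints: ['numDecimals']
```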

Running integration tests

Generally speaking, the current integration tests should be run locally (in the future, they might be added to the CI pipeline).

First, connect to the Google Cloud Platform as follows:

gcloud auth application-default login
gcloud config set project dharitri-blockchain-etl
gcloud auth application-default set-quota-project dharitri-blockchain-etl

Then, run the integration tests:

pytest -m "integration"

Management (Google Cloud Console)

Below are a few links useful for managing the ETL process. They are only accessible to the DharitrI team.

Project details

