
DharitrI ETL

ETL (extract, transform and load) tools for publishing DharitrI blockchain data (extracted from a standard DharitrI Elasticsearch instance) on Google BigQuery.

Published data

Mainnet

Setup virtual environment

Create a virtual environment and install the dependencies:

python3 -m venv ./venv
source ./venv/bin/activate

pip install -r ./requirements.txt --upgrade
pip install -r ./requirements-dev.txt --upgrade

Run the tests

export PYTHONPATH=.
pytest -m "not integration"

Quickstart

This implementation copies the data from Elasticsearch in two parallel flows.

One flow copies the append-only indices (e.g. blocks, transactions, logs, receipts) into a staging BQ dataset. This process is incremental: it only copies data added since the last run, and it runs more often than the second flow (every hour, by default). Once the staging dataset is loaded, the data is transferred to the main BQ dataset using the BigQuery Data Transfer Service.

The second flow copies the mutable indices (e.g. tokens, accounts) into a staging BQ dataset. This process is not incremental: tables are truncated and reloaded on each run. Once the staging dataset is loaded, the data is transferred to the main BQ dataset using the BigQuery Data Transfer Service. This flow runs less often than the first (every 4 hours, by default).
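The difference between the two loading strategies can be sketched roughly as follows (an illustrative sketch only; the function names and in-memory reads are assumptions, not the actual dharitrietl internals):

```python
from typing import Callable

def load_append_only(index: str, read_since: Callable[[int], list], last_checkpoint: int) -> list:
    """Incremental load: fetch only the records added after the last checkpoint."""
    return read_since(last_checkpoint)

def load_mutable(index: str, read_all: Callable[[], list]) -> list:
    """Full reload: the staging table is truncated, then repopulated from scratch."""
    return read_all()

# In-memory stand-ins for the Elasticsearch reads:
records = [{"timestamp": t} for t in range(10)]
new_rows = load_append_only("blocks", lambda cp: [r for r in records if r["timestamp"] > cp], 6)
all_rows = load_mutable("tokens", lambda: records)
print(len(new_rows), len(all_rows))  # prints: 3 10
```

The incremental flow touches only the tail of the data, which is why it can afford to run more often.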

To run the two flows, either use the Docker setup (see the section below) or invoke the following commands directly:

# First, set the following environment variables:
export GCP_PROJECT_ID=dharitri-blockchain-etl
export WORKSPACE=${HOME}/dharitri-etl/mainnet

# The first flow (for append-only indices):
python3 -m dharitrietl.app process-append-only-indices --workspace=${WORKSPACE} --sleep-between-iterations=3600

# The second flow (for mutable indices):
python3 -m dharitrietl.app process-mutable-indices --workspace=${WORKSPACE} --sleep-between-iterations=86400

Rewinding

Sometimes, errors occur during the ETL process. For the append-only flow, it's recommended to rewind the BQ tables to the latest checkpoint (the last known good state) before re-running the process. Rewinding de-duplicates the data upfront through a simple removal of rows past the checkpoint. Otherwise, the full de-duplication step (performed automatically after each bulk of tasks, whenever the row counts in BQ and Elasticsearch do not match) would kick in, which is more expensive.
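The logic described above can be sketched as follows (hypothetical helper functions for illustration, not the actual dharitrietl internals):

```python
def needs_deduplication(bq_count: int, es_count: int) -> bool:
    """After each bulk of tasks, row counts are compared; a mismatch means
    duplicates (or gaps) slipped into BigQuery, triggering the expensive path."""
    return bq_count != es_count

def rewind_rows(rows: list, checkpoint: int) -> list:
    """Rewinding simply drops rows newer than the last good checkpoint,
    which is much cheaper than a full de-duplication pass."""
    return [r for r in rows if r["timestamp"] <= checkpoint]

rows = [{"timestamp": t} for t in [1, 2, 3, 3, 4]]  # a duplicate slipped in at t=3
print(needs_deduplication(bq_count=len(rows), es_count=4))  # prints: True
print(len(rewind_rows(rows, checkpoint=2)))  # prints: 2
```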

To rewind the BQ tables corresponding to the append-only indices to the latest checkpoint, run the following command:

python3 -m dharitrietl.app rewind --workspace=${WORKSPACE}

If the checkpoint is not available or is suspected to be corrupted, you can find the latest good checkpoint by running the following command:

python3 -m dharitrietl.app find-latest-good-checkpoint --workspace=${WORKSPACE}
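Conceptually, finding the latest good checkpoint amounts to scanning candidates from newest to oldest until a consistency check passes. A minimal sketch (the function and check below are hypothetical, not the dharitrietl implementation):

```python
from typing import Callable, Optional

def find_latest_good_checkpoint(candidates: list, is_good: Callable[[int], bool]) -> Optional[int]:
    """Scan candidate checkpoints from newest to oldest and return the first
    one that passes a consistency check (e.g. BQ vs. Elasticsearch counts match)."""
    for checkpoint in sorted(candidates, reverse=True):
        if is_good(checkpoint):
            return checkpoint
    return None

# Suppose checkpoints above 200 turn out to be inconsistent:
print(find_latest_good_checkpoint([100, 200, 300], lambda cp: cp <= 200))  # prints: 200
```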

Docker setup

Build the Docker image:

docker build --network host -f ./docker/Dockerfile -t dharitri-etl:latest .

Set up the containers:

# mainnet
docker compose --file ./docker/docker-compose.yml \
    --env-file ./docker/env/mainnet.env \
    --project-name dharitri-etl-mainnet up --detach

# devnet
docker compose --file ./docker/docker-compose.yml \
    --env-file ./docker/env/devnet.env \
    --project-name dharitri-etl-devnet up --detach

# testnet
docker compose --file ./docker/docker-compose.yml \
    --env-file ./docker/env/testnet.env \
    --project-name dharitri-etl-testnet up --detach

Generate schema files

Maintainers of this repository should trigger a re-generation of the BigQuery schema files whenever the Elasticsearch schema is updated. This is done by running the following command (make sure to check out drt-go-chain-tools in advance):

python3 -m dharitrietl.app regenerate-schema --input-folder=~/drt-go-chain-tools/elasticreindexer/cmd/indices-creator/config/noKibana/ --output-folder=./schema

The resulting files should be committed to this repository.

Occasionally, the load step may fail for some tables, e.g. when new fields are added to the Elasticsearch indices that the BigQuery schema does not yet know about. If so, re-generate the schema files (see above), update the BigQuery table schema with the bq command (the example below is for the tokens table), and restart the ETL flow:

bq update ${GCP_PROJECT_ID}:${BQ_DATASET}.tokens schema/tokens.json
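Before running bq update, it can be useful to see which fields a regenerated schema actually added. A small sketch (the helper below is hypothetical, not part of dharitrietl; BigQuery schema files are JSON arrays of field objects with at least a "name" key):

```python
def new_fields(old_schema: list, new_schema: list) -> list:
    """Return names of fields present in the regenerated schema but missing
    from the schema currently applied in BigQuery."""
    old_names = {field["name"] for field in old_schema}
    return [field["name"] for field in new_schema if field["name"] not in old_names]

old = [{"name": "identifier", "type": "STRING"}]
new = [{"name": "identifier", "type": "STRING"}, {"name": "numDecimals", "type": "NUMERIC"}]
print(new_fields(old, new))  # prints: ['numDecimals']
```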

Running integration tests

Generally speaking, the current integration tests should be run locally (in the future, they might be added to the CI pipeline).

First, connect to the Google Cloud Platform as follows:

gcloud auth application-default login
gcloud config set project dharitri-blockchain-etl
gcloud auth application-default set-quota-project dharitri-blockchain-etl

Then, run the integration tests:

pytest -m "integration"

Management (Google Cloud Console)

Below are a few links useful for managing the ETL process. They are only accessible to the DharitrI team.

Project details

