A serverless embedded streaming OLAP data pipeline

Project description

hypersync-lancedb-pipe

A serverless, embedded, streaming OLAP data pipeline that combines historical blockchain data from Hypersync with Lance, a mutable columnar storage format.

Because Lance is designed to be mutable, it is possible to build an embedded streaming pipeline on the same data source. The main advantage of this streaming approach is that it requires no management of Parquet file globs, which reduces the complexity of streaming to that of batch processing. The other main benefit is that LanceDB integrates tightly with both polars and duckdb. LanceDB accepts polars dataframes as input, so polars can serve as the preprocessing tool in a more flexible ETL pipeline.
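As a rough illustration of the mutable-table idea, the sketch below creates a LanceDB table from the first polars dataframe and appends every later batch to it, so there are no per-batch files to track. The function name append_batch, the table name, and the column names are hypothetical and are not taken from this package; the code assumes lancedb and polars are installed.

```python
def append_batch(db_uri: str, table_name: str, batch) -> int:
    """Write one polars DataFrame to a LanceDB table and return the row count.

    Hypothetical helper, not part of hypersync-lancedb-pipe: the first call
    creates the table, every later call appends to the same table.
    """
    import lancedb  # third-party dependency, imported lazily

    db = lancedb.connect(db_uri)  # a local directory acts as the database
    if table_name in db.table_names():
        table = db.open_table(table_name)
        table.add(batch)  # append the new rows in place
    else:
        table = db.create_table(table_name, data=batch)
    return table.count_rows()
```

Calling append_batch("./data", "blocks", df) once per incoming batch is the same code path for a one-off batch load and for a continuous stream.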

Since LanceDB builds on the Apache Arrow standard, there is a lot of flexibility in how this database can be queried: larger-than-memory datasets can be queried with polars LazyFrames and a dataframe API, or an embedded OLAP engine such as duckdb can be used for faster queries and a SQL API.
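The two query paths might look roughly like the sketch below, which counts rows at or above a block number once through a polars LazyFrame and once through duckdb SQL. The function name, the view name blocks, and the block_number column are assumptions for illustration only; the code assumes a local LanceDB table and that lancedb, polars, duckdb, and pylance (needed by to_lance) are installed.

```python
def count_rows_from_block(db_uri: str, table_name: str, min_block: int):
    """Count rows with block_number >= min_block via polars and via duckdb.

    Hypothetical helper; the block_number column is an assumed schema.
    Returns a (polars_count, duckdb_count) tuple.
    """
    import duckdb
    import lancedb
    import polars as pl

    table = lancedb.connect(db_uri).open_table(table_name)

    # polars path: to_polars() returns a LazyFrame, so nothing is read
    # until collect() is called.
    lazy_frame = table.to_polars()
    polars_count = (
        lazy_frame.filter(pl.col("block_number") >= min_block).collect().height
    )

    # duckdb path: expose the underlying Lance dataset as a view through
    # Arrow and query it with SQL.
    con = duckdb.connect()
    con.register("blocks", table.to_lance())
    duckdb_count = con.execute(
        "SELECT count(*) FROM blocks WHERE block_number >= ?", [min_block]
    ).fetchone()[0]
    con.close()

    return polars_count, duckdb_count
```

Both paths read the same table on disk, so the choice between them is mostly about whether a dataframe API or SQL suits the query better.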

Getting Started

  1. This repository uses rye to manage dependencies and the virtual environment. To install rye, refer to its installation instructions.
  2. Once rye is installed, run rye sync to install the dependencies and set up the virtual environment, which is named .venv by default.
  3. Activate the virtual environment with the command source .venv/bin/activate.

Running the Pipeline

There are some script examples in the scripts folder. These examples demonstrate the versatility of the lancedb writer.

  • Run historical_sync.py to backfill data starting from a historical block number. Assumes the table does not exist yet.
  • Run head_sync.py to sync the database to the head of the chain. Assumes the table already exists.
  • Run backfill_sync.py to perform a backfill sync from the earliest block number. Assumes the table already exists.
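One way to picture how the three modes differ is by the block range each one would request. The sketch below is an interpretation rather than the scripts' actual logic: plan_block_range and all of its parameters are hypothetical, and it assumes that a backfill fetches the blocks before the earliest block already stored.

```python
from typing import Optional, Tuple


def plan_block_range(
    mode: str,
    *,
    chain_head: int,
    start_block: int = 0,
    table_min: Optional[int] = None,
    table_max: Optional[int] = None,
) -> Tuple[int, int]:
    """Return an inclusive (from_block, to_block) range for one sync mode.

    table_min and table_max are the lowest and highest block numbers already
    stored, or None when the table does not exist. A result whose first
    value exceeds its second means there is nothing to fetch.
    """
    has_table = table_min is not None and table_max is not None

    if mode == "historical":
        # Fresh load: no table may exist yet.
        if has_table:
            raise ValueError("historical sync expects no existing table")
        return start_block, chain_head

    if not has_table:
        raise ValueError(f"{mode} sync expects an existing table")

    if mode == "head":
        # Continue from the newest stored block up to the chain head.
        return table_max + 1, chain_head
    if mode == "backfill":
        # Assumed meaning: fill in the blocks before the earliest stored one.
        return start_block, table_min - 1

    raise ValueError(f"unknown mode: {mode}")
```

For example, with blocks 100 to 150 already stored and the chain head at 200, the head mode yields (151, 200) and the backfill mode yields (0, 99).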

Download files

Source Distribution

hypersync_lancedb_pipe-0.1.0.tar.gz (7.6 kB)

Uploaded Source

Built Distribution

hypersync_lancedb_pipe-0.1.0-py3-none-any.whl (3.2 kB)

Uploaded Python 3

File details

Details for the file hypersync_lancedb_pipe-0.1.0.tar.gz.

File hashes

Hashes for hypersync_lancedb_pipe-0.1.0.tar.gz
Algorithm    Hash digest
SHA256       9aa2a753611b44a338ae66e83b574e7a2991f17f043d8b32903151b981b10aec
MD5          4ebf2b16f47a4c87e3384182b6828313
BLAKE2b-256  a33d0f31beb04fd2a2e866258b23fea1a3344a3ee660ff7e79fdafb2febc79dc

File details

Details for the file hypersync_lancedb_pipe-0.1.0-py3-none-any.whl.

File hashes

Hashes for hypersync_lancedb_pipe-0.1.0-py3-none-any.whl
Algorithm    Hash digest
SHA256       40e9d6493d5041c644f8d7cba5d686ab8e22457bdab787eaaea5874ea95bccdf
MD5          53eb4e591eaf76e306c962777191b2dd
BLAKE2b-256  f84c8ee7fe8b52dcc1f794322b7e0316030bf9d17185df5e6bd5a27f899fa136
