
A Python-based Distributed Database

Sidewinder

Sidewinder is a Python-based (with asyncio) Proof-of-Concept Distributed Database that distributes shards of data from the server to a number of workers to "divide and conquer" OLAP database workloads.

It consists of a server, workers, and a client (where you can run interactive SQL commands).

Sidewinder will NOT distribute queries which do not contain aggregates - it will run those on the server side.
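The routing rule above can be illustrated with a minimal sketch. This is not Sidewinder's own logic: it uses a naive regular expression as a stand-in for real SQL parsing, so it can misfire on string literals or comments that happen to contain text like "count(". The names AGGREGATE_FUNCTIONS, should_distribute, and route are hypothetical and exist only for this illustration.

```python
import re

# Hypothetical list of aggregate function names (an assumption for this sketch);
# the real project may recognize a different set.
AGGREGATE_FUNCTIONS = ("count", "sum", "avg", "min", "max")

# Matches an aggregate function name followed by an opening parenthesis.
_AGGREGATE_CALL = re.compile(
    r"\b(" + "|".join(AGGREGATE_FUNCTIONS) + r")\s*\(",
    re.IGNORECASE,
)


def should_distribute(sql: str) -> bool:
    """Return True if the query text appears to call an aggregate function."""
    return _AGGREGATE_CALL.search(sql) is not None


def route(sql: str) -> str:
    """Return where the query would run under the rule described above."""
    return "workers" if should_distribute(sql) else "server"


if __name__ == "__main__":
    for query in (
        "SELECT COUNT(*) FROM lineitem;",
        "SELECT * FROM region;",
        "SELECT * FROM lineitem LIMIT 5;",
    ):
        print(f"{route(query):8} <- {query}")
```

A production implementation would inspect a parsed syntax tree rather than raw text; the regular expression is used here only to keep the example dependency-free.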

Sidewinder uses Apache Arrow with Websockets for communication between the server, worker(s), and client(s).

It uses DuckDB as its SQL execution engine - and the PostgreSQL parser to understand how to combine results from distributed workers.
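To make the "divide and conquer" idea concrete, here is a rough sketch of how partial aggregates from several workers could be combined into one answer. It is plain Python rather than DuckDB, and it is an assumption about the general technique, not a description of Sidewinder's code. PartialAggregate, summarize_shard, and combine are hypothetical names, and every shard is assumed to be non-empty.

```python
from dataclasses import dataclass
from typing import Dict, Iterable, List


@dataclass(frozen=True)
class PartialAggregate:
    """What one worker would report for its shard (hypothetical structure)."""
    row_count: int
    total: float
    minimum: float
    maximum: float


def summarize_shard(values: Iterable[float]) -> PartialAggregate:
    """Worker side: aggregate one non-empty shard."""
    data: List[float] = list(values)
    return PartialAggregate(len(data), sum(data), min(data), max(data))


def combine(partials: Iterable[PartialAggregate]) -> Dict[str, float]:
    """Server side: merge the workers' partial results."""
    parts = list(partials)
    row_count = sum(p.row_count for p in parts)   # COUNT -> sum of counts
    total = sum(p.total for p in parts)           # SUM   -> sum of sums
    return {
        "count": row_count,
        "sum": total,
        "avg": total / row_count,                 # AVG   -> total sum / total count
        "min": min(p.minimum for p in parts),     # MIN   -> min of mins
        "max": max(p.maximum for p in parts),     # MAX   -> max of maxes
    }


if __name__ == "__main__":
    shards = [[1, 2, 3], [4, 5], [6]]
    print(combine(summarize_shard(s) for s in shards))
```

Note that AVG cannot be merged by averaging the workers' averages; the sketch carries the count and the sum separately and divides once at the end.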

Setup (to run locally)

Install package

Clone the repo

git clone https://github.com/prmoore77/sidewinder

Python

Create a new Python 3.8+ virtual environment and install sidewinder-db with:

cd sidewinder
# Create the virtual environment
python3 -m venv ./venv
# Activate the virtual environment
. ./venv/bin/activate
# Install Sidewinder-DB
pip install .

Alternative installation from PyPI

pip install sidewinder-db

DuckDB CLI

Install DuckDB CLI version 0.7.1 - and make sure the executable is on your PATH.

Platform Downloads:
Linux x86-64
Linux arm64 (aarch64)
MacOS Universal

Generate sample TPC-H source data (Scale Factor 1); currently only possible from the repo, not from the PyPI package

Note: If running on macOS, you'll need to have Homebrew installed, then install coreutils with:
brew install coreutils

After that, create sample TPC-H source data for Scale Factor 1 (in Parquet format) by running:

scripts/generate_tpch_data.sh 1

Next, create a DuckDB database for the server (the server needs it to run queries that cannot be distributed) by running:

scripts/create_duckdb_database.sh 1

Next, generate some shards. In this case we'll generate 11 shards (an odd number is needed for even distribution, due to DuckDB's hash function):

pushd shard_generation
python -m build_shard_duckdb --shard-count=11 --source-data-path="../data/tpch/1" --output-data-path="../data/shards/tpch/1"
popd
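As a rough illustration of what hash-based sharding means, the sketch below assigns each key to one of 11 shards by hashing it and taking the remainder. It is an assumption-laden stand-in: it uses zlib.crc32 from the Python standard library, not DuckDB's hash function, so it does not reproduce the odd-shard-count behavior mentioned above. The names shard_for and shard_sizes are hypothetical and are not part of build_shard_duckdb.

```python
import zlib
from collections import Counter
from typing import Iterable


def shard_for(key: int, shard_count: int) -> int:
    """Map a key to a shard index in [0, shard_count) using a stand-in hash."""
    digest = zlib.crc32(str(key).encode("utf-8"))
    return digest % shard_count


def shard_sizes(keys: Iterable[int], shard_count: int) -> Counter:
    """Count how many keys land in each shard."""
    return Counter(shard_for(key, shard_count) for key in keys)


if __name__ == "__main__":
    sizes = shard_sizes(range(1000), shard_count=11)
    for shard_id in sorted(sizes):
        print(f"shard {shard_id:2d}: {sizes[shard_id]} keys")
```

Because the mapping depends only on the key, every row with the same key ends up in the same shard, which is what lets each worker aggregate its shard independently.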

Run sidewinder locally - from root of repo (use --help option on the executables below for option details)

Setup

Be sure to activate the virtual environment before running the executables

. ./venv/bin/activate

1) Server:

Open a terminal, then:

sidewinder-server

2) Worker:

Open another terminal, then start a single worker with command:

sidewinder-worker

Note: you can run up to 11 workers for this example configuration. To do that, run the following instead of starting a single worker:
for x in {1..11}
do
  sidewinder-worker &
done

To kill the workers later - run:

kill $(jobs -p)

3) Client:

Open another terminal, then:

sidewinder-client

Then, while in the client, you can run a sample query that will be distributed to the worker(s), provided at least one is running. For example:

SELECT COUNT(*) FROM lineitem;

Note: if you are running fewer than 11 workers, your answer will only reflect n/11 of the data (where n is the worker count). We will add delta processing at a later point...
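
The n/11 arithmetic in the note can be worked through with a small sketch. It assumes that each worker holds exactly one shard and that the shards are roughly equal in size; both are simplifications for illustration. The function names visible_fraction and approximate_partial_count are hypothetical.

```python
from fractions import Fraction

SHARD_COUNT = 11  # matches --shard-count=11 used earlier in this walkthrough


def visible_fraction(worker_count: int, shard_count: int = SHARD_COUNT) -> Fraction:
    """Fraction of the data covered when each worker holds one shard (assumed)."""
    if worker_count < 0 or shard_count <= 0:
        raise ValueError("worker_count must be >= 0 and shard_count must be > 0")
    # Extra workers beyond the shard count cannot cover more than all the data.
    return Fraction(min(worker_count, shard_count), shard_count)


def approximate_partial_count(
    total_rows: int, worker_count: int, shard_count: int = SHARD_COUNT
) -> int:
    """Approximate COUNT(*) result, assuming equally sized shards."""
    return round(total_rows * visible_fraction(worker_count, shard_count))


if __name__ == "__main__":
    for workers in (1, 4, 11):
        print(workers, "worker(s) ->", visible_fraction(workers), "of the data")
```
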
A query that won't be distributed (because it does not contain aggregates) would be:

SELECT * FROM region;

or:

SELECT * FROM lineitem LIMIT 5;

Note: there are TPC-H queries in the tpc-h_queries folder you can run...

To turn distributed mode OFF in the client:

.set distributed = false;

To turn summarization mode OFF in the client (so that sidewinder does NOT summarize the workers' results - this only applies to distributed mode):

.set summarize = false;

