A Python-based Distributed Database
Sidewinder
Python-based Distributed Database
Sidewinder is a proof-of-concept distributed database written in Python (with asyncio). It distributes shards of data from the server to a number of workers to "divide and conquer" OLAP database workloads.
It consists of a server, workers, and a client (where you can run interactive SQL commands).
Sidewinder will NOT distribute queries that do not contain aggregates - it runs those on the server side instead.
Sidewinder uses Apache Arrow with Websockets for communication between the server, worker(s), and client(s).
It uses DuckDB as its SQL execution engine - and the PostgreSQL parser to understand how to combine results from distributed workers.
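To make the "combine results from distributed workers" step concrete, here is a minimal sketch of how per-worker partial aggregates could be merged into one final row. It is not Sidewinder's actual code: the names `MERGE_RULES`, `combine_partials`, and `combine_avg` are hypothetical, and the sample rows are made up for illustration.

```python
from typing import Dict, List

# Hypothetical merge rules (an assumption, not Sidewinder's API): how the
# partial value each worker returns for an aggregate folds into a final value.
MERGE_RULES = {
    "count": sum,  # the total count is the sum of per-shard counts
    "sum": sum,    # the total sum is the sum of per-shard sums
    "min": min,    # the global minimum is the minimum of per-shard minimums
    "max": max,    # the global maximum is the maximum of per-shard maximums
}


def combine_partials(partials: List[Dict[str, float]],
                     kinds: Dict[str, str]) -> Dict[str, float]:
    """Combine one result row per worker into a single final row.

    `kinds` maps each column name to the aggregate that produced it.
    """
    combined = {}
    for column, kind in kinds.items():
        values = [row[column] for row in partials]
        combined[column] = MERGE_RULES[kind](values)
    return combined


def combine_avg(partials: List[Dict[str, float]]) -> float:
    """AVG cannot be merged by averaging averages, so each worker is
    assumed to return a (sum, count) pair that is merged instead."""
    total = sum(p["sum"] for p in partials)
    count = sum(p["count"] for p in partials)
    return total / count


# Made-up partial results from three workers.
worker_rows = [
    {"row_count": 100, "max_price": 9.5},
    {"row_count": 120, "max_price": 12.0},
    {"row_count": 80, "max_price": 7.25},
]
final = combine_partials(worker_rows, {"row_count": "count", "max_price": "max"})
print(final)
```

The point of the sketch is that each aggregate needs its own merge rule, which is why a SQL parser is useful for deciding how to rewrite the final combining step.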
Setup (to run locally)
Install requirements
Python
Create a new Python 3.8+ virtual environment, then, from the root of the repo, install the requirements with:
pip install -r requirements.txt
DuckDB CLI
Install DuckDB CLI version 0.6.1 - and make sure the executable is on your PATH.
Platform Downloads:
Linux x86-64
Linux arm64 (aarch64)
MacOS Universal
Generate source sample TPC-H (Scale Factor 1) data
Note: If running on macOS - you'll need to have Homebrew installed, then install coreutils with:
brew install coreutils
After that - create sample TPC-H source data for Scale Factor 1 (in Parquet format) by running:
scripts/generate_tpch_data.sh 1
Next - you'll need to create a DuckDB database for the server (the server needs it to run queries that can't be distributed) - run:
scripts/create_duckdb_database.sh 1
Next - you need to generate some shards - in this case we'll just generate 11 shards (we need an odd number for even distribution due to DuckDB's hash function):
pushd shard_generation
python -m build_shard_duckdb --shard-count=11 --source-data-path="../data/tpch/1" --output-data-path="../data/shards/tpch/1"
popd
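The general idea behind hash-based sharding can be sketched as follows. This is an illustration only: it uses SHA-256 as a stand-in hash rather than DuckDB's hash function, so it does not reproduce the odd-shard-count behavior mentioned above, and `shard_for_key` and `shard_sizes` are hypothetical names, not part of `build_shard_duckdb`.

```python
import hashlib
from collections import Counter
from typing import Iterable, List


def shard_for_key(key: str, shard_count: int) -> int:
    """Map a key to a shard id in [0, shard_count) using a stable hash.

    SHA-256 is a stand-in here (an assumption); the real shards are built
    with DuckDB, whose hash function differs.
    """
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % shard_count


def shard_sizes(keys: Iterable[str], shard_count: int) -> List[int]:
    """Count how many keys land in each shard."""
    counts = Counter(shard_for_key(k, shard_count) for k in keys)
    return [counts.get(i, 0) for i in range(shard_count)]


# Distribute 10,000 made-up keys over 11 shards and inspect the spread.
sizes = shard_sizes((str(k) for k in range(10_000)), 11)
print(sizes)
```

Because the same key always maps to the same shard, each row ends up in exactly one shard, and the shard sizes show how evenly a given hash and shard count spread the data.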
Run sidewinder locally (from root of repo)
1) Server:
Open a terminal, then:
python -m server
2) Worker:
Open another terminal, then:
python -m worker
Note: you can run up to 11 workers for this example configuration...
3) Client:
Open another terminal, then:
python -m client
Then - while in the client - you can run a sample query that will distribute to the worker(s) (if you have at least one running) - example:
SELECT COUNT(*) FROM lineitem;
Note: if you are running fewer than 11 workers - your answer will only reflect n/11 of the data (where n is the worker count). We will add delta processing at a later point...
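The n/11 arithmetic can be written out as a small sketch. It assumes each worker holds exactly one shard and that shards are roughly equal in size; the function names and the row total in the example are hypothetical.

```python
from fractions import Fraction


def covered_fraction(worker_count: int, shard_count: int = 11) -> Fraction:
    """Fraction of the data visible to a query, assuming one shard per worker."""
    if worker_count < 0 or shard_count <= 0:
        raise ValueError("worker_count must be >= 0 and shard_count > 0")
    return Fraction(min(worker_count, shard_count), shard_count)


def expected_partial_count(total_rows: int, worker_count: int,
                           shard_count: int = 11) -> float:
    """Approximate COUNT(*) result when only some shards are being served,
    assuming evenly sized shards (an assumption, not a guarantee)."""
    return float(total_rows * covered_fraction(worker_count, shard_count))


# Made-up table of 1,100 rows, 3 of 11 workers running.
print(covered_fraction(3))              # 3/11
print(expected_partial_count(1100, 3))  # 300.0
```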
A query that won't distribute (because it does not contain aggregates) - would be:
SELECT * FROM region;
or:
SELECT * FROM lineitem LIMIT 5;
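The routing rule described above (distribute only queries that contain aggregates) can be approximated with a short sketch. Sidewinder itself relies on the PostgreSQL parser; the regular expression below is a deliberately crude stand-in, and `should_distribute` is a hypothetical name.

```python
import re

# Crude stand-in for real SQL parsing (an assumption): look for a call to a
# common aggregate function. A regex like this can be fooled by string
# literals or similarly named user functions, which a real parser avoids.
AGGREGATE_CALL = re.compile(r"\b(count|sum|avg|min|max)\s*\(", re.IGNORECASE)


def should_distribute(sql: str, distributed: bool = True) -> bool:
    """Return True if the query would be sent to workers in this sketch."""
    return distributed and bool(AGGREGATE_CALL.search(sql))


for query in (
    "SELECT COUNT(*) FROM lineitem;",
    "SELECT * FROM region;",
    "SELECT * FROM lineitem LIMIT 5;",
):
    target = "workers" if should_distribute(query) else "server"
    print(f"{target}: {query}")
```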
Note: there are TPC-H queries in the tpc-h_queries folder you can run...
To turn distributed mode OFF in the client:
.set distributed = false;
To turn summarization mode OFF in the client (so that sidewinder does NOT summarize the workers' results - this only applies to distributed mode):
.set summarize = false;
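For readers curious how dot-commands like these might be handled, here is a minimal sketch of parsing a `.set` line into a settings dictionary. It is not taken from the Sidewinder client: `apply_set_command` is a hypothetical helper, and the assumption that both settings default to on is inferred only from the wording "turn ... OFF".

```python
import re
from typing import Dict

# Matches lines such as ".set distributed = false;" (hypothetical grammar).
SET_COMMAND = re.compile(
    r"^\.set\s+(\w+)\s*=\s*(true|false)\s*;?\s*$", re.IGNORECASE
)


def apply_set_command(line: str, settings: Dict[str, bool]) -> bool:
    """Update `settings` in place if `line` is a .set command.

    Returns False when the line is not a .set command (for example, SQL),
    and raises KeyError for an unknown setting name.
    """
    match = SET_COMMAND.match(line.strip())
    if match is None:
        return False
    name, value = match.group(1).lower(), match.group(2).lower()
    if name not in settings:
        raise KeyError(f"unknown setting: {name}")
    settings[name] = value == "true"
    return True


# Assumed defaults: both modes on until switched off.
settings = {"distributed": True, "summarize": True}
apply_set_command(".set distributed = false;", settings)
print(settings)
```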