A Python-based Distributed Database
Sidewinder
Note: Sidewinder is experimental and is not intended for production workloads.
Sidewinder is a Python-based (with asyncio) Proof-of-Concept Distributed Database that distributes shards of data from the server to a number of workers to "divide and conquer" OLAP database workloads.
It consists of a server, workers, and a client (where you can run interactive SQL commands).
Sidewinder will NOT distribute queries which do not contain aggregates - it will run those on the server side.
Sidewinder uses Apache Arrow with Websockets for communication between the server, worker(s), and client(s).
It uses DuckDB as its SQL execution engine - and the PostgreSQL parser to understand how to combine results from distributed workers.
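The "divide and conquer" idea can be illustrated with a small, self-contained sketch. It is not Sidewinder's actual code: it uses Python's built-in sqlite3 module as a stand-in for DuckDB, keeps every shard in one process instead of sending Arrow data over Websockets, and the names make_shard and run_distributed_avg, the round-robin sharding, and the toy lineitem columns are all assumptions made for illustration. Each shard computes a partial COUNT and SUM; the combining step adds the partials and derives the average from them, because per-shard averages cannot simply be averaged.

```python
import sqlite3


def make_shard(rows):
    """Create an in-memory database holding one shard of a toy lineitem table."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE lineitem (l_orderkey INTEGER, l_quantity REAL)")
    conn.executemany("INSERT INTO lineitem VALUES (?, ?)", rows)
    return conn


def run_distributed_avg(shards):
    """Run a partial aggregate on every shard, then combine the partials."""
    partials = [
        conn.execute(
            "SELECT COUNT(*), COALESCE(SUM(l_quantity), 0) FROM lineitem"
        ).fetchone()
        for conn in shards
    ]
    # Combine step: counts and sums add up; the average is derived from both.
    total_count = sum(count for count, _ in partials)
    total_sum = sum(qty_sum for _, qty_sum in partials)
    avg = total_sum / total_count if total_count else None
    return total_count, total_sum, avg


# Hypothetical data: 100 rows split round-robin across 4 shards.
rows = [(i, float(i % 7 + 1)) for i in range(1, 101)]
SHARD_COUNT = 4
shards = [make_shard(rows[i::SHARD_COUNT]) for i in range(SHARD_COUNT)]

total_count, total_sum, avg = run_distributed_avg(shards)
print(total_count, total_sum, avg)
```

The combined values match what a single database holding all 100 rows would return for COUNT(*), SUM(l_quantity), and AVG(l_quantity).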
Setup (to run locally)
Install package
Clone the repo
git clone https://github.com/prmoore77/sidewinder
Python
Create a new Python 3.8+ virtual environment and install sidewinder-db with:
cd sidewinder
# Create the virtual environment
python3 -m venv ./venv
# Activate the virtual environment
. ./venv/bin/activate
# Install Sidewinder-DB
pip install .
Alternative installation from PyPI
pip install sidewinder-db
Bootstrap the environment by creating a sample TPC-H dataset with 11 shards
. ./venv/bin/activate
sidewinder-bootstrap --tpch-scale-factor=1 --shard-count=11
Run sidewinder locally - from root of repo (use --help option on the executables below for option details)
1) Server:
Open a terminal, then:
. ./venv/bin/activate
sidewinder-server
2) Worker:
Open another terminal, then start a single worker with the command:
. ./venv/bin/activate
sidewinder-worker
Note: you can run up to 11 workers for this example configuration. To do that, run the following instead of starting a single worker:
. ./venv/bin/activate
for x in {1..11}
do
sidewinder-worker &
done
To kill the workers later - run:
kill $(jobs -p)
3) Client:
Open another terminal, then:
. ./venv/bin/activate
sidewinder-client
Then - while in the client - you can run a sample query that will distribute to the worker(s) (if you have at least one running) - example:
SELECT COUNT(*) FROM lineitem;
Note: if you are running fewer than 11 workers - your answer will only reflect n/11 of the data (where n is the worker count). We will add delta processing at a later point...
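The n/11 arithmetic can be worked through with a short sketch. It assumes one shard per worker and 11 equally sized shards of 1000 rows each; both are illustrative assumptions (real shards need not be the same size), and visible_count is a hypothetical helper, not part of Sidewinder.

```python
def visible_count(shard_row_counts, worker_count):
    """COUNT(*) a client would see if only `worker_count` shards are served.

    Assumes one shard per worker, served in list order (illustrative only).
    """
    served = shard_row_counts[:worker_count]
    return sum(served)


shard_row_counts = [1000] * 11  # hypothetical: 11 equally sized shards

full = visible_count(shard_row_counts, 11)    # all 11 workers running
partial = visible_count(shard_row_counts, 3)  # only 3 workers running
fraction = partial / full

print(f"{partial} of {full} rows counted ({fraction:.1%})")
```

With 3 of 11 workers running, the count covers 3/11 of the rows under these assumptions.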
A query that won't distribute (because it does not contain aggregates) - would be:
SELECT * FROM region;
or:
SELECT * FROM lineitem LIMIT 5;
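The routing rule (aggregate queries go to the workers, everything else runs on the server) can be sketched roughly as follows. Sidewinder uses the PostgreSQL parser for query analysis; the regular expression here is a deliberately crude stand-in that only looks for calls to a few common aggregate functions, and should_distribute and its distributed parameter (mirroring the client's distributed setting) are hypothetical names.

```python
import re

# Crude stand-in for real SQL parsing: detect a call to a common aggregate.
AGGREGATE_CALL = re.compile(r"\b(count|sum|avg|min|max)\s*\(", re.IGNORECASE)


def should_distribute(sql, distributed=True):
    """Return True if this toy router would send the query to the workers."""
    if not distributed:
        return False  # distributed mode switched off: always run on the server
    return AGGREGATE_CALL.search(sql) is not None


queries = (
    "SELECT COUNT(*) FROM lineitem;",
    "SELECT * FROM region;",
    "SELECT * FROM lineitem LIMIT 5;",
)
for query in queries:
    target = "workers" if should_distribute(query) else "server"
    print(f"{target:7} <- {query}")
```

A keyword match like this can misclassify queries (for example, an aggregate name inside a string literal), which is one reason a real parser is the better tool for this job.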
Note: there are TPC-H queries in the tpc-h_queries folder you can run...
To turn distributed mode OFF in the client:
.set distributed = false;
To turn summarization mode OFF in the client (so that sidewinder does NOT summarize the workers' results - this only applies to distributed mode):
.set summarize = false;
Optional DuckDB CLI (use for data QA purposes, etc.)
Install DuckDB CLI version 0.7.1 - and make sure the executable is on your PATH.
Platform Downloads:
Linux x86-64
Linux arm64 (aarch64)
MacOS Universal
Built Distribution
Hashes for sidewinder_db-0.0.14-py3-none-any.whl
Algorithm | Hash digest
---|---
SHA256 | 154173f69358e4480fd670e6661c25de7ae70b27a4d1bea6639c23c46b5058a5
MD5 | 65326724c262a3b1e9828f90dfa8dd1f
BLAKE2b-256 | a52b395ba499b100b2ecebf13a42c29e18ae43da7c2cd7523e328b1a693593f9