Vinum is a SQL query processor for Python, designed for data analysis workflows and in-memory analytics.
Project description
Vinum is a SQL query processor for Python, designed for data analysis workflows and in-memory analytics.
When should I use Vinum?
Vinum is running inside of the host Python process and allows to execute any functions available to the interpreter as UDFs. If you are doing data analysis or running ETL in Python, Vinum allows to execute efficient SQL queries with an ability to call native Python UDFs.
Key Features:
Vinum is running inside of the host Python process and has a hybrid query execution model - whenever possible it would prefer native compiled version of operators and only executes Python interpreted code where strictly necessary (ie. for native Python UDFs).
Allows to use functions available within the host Python interpreter as UDFs, including native Python, NumPy, etc.
Vinum’s execution model doesn’t require input datasets to fit into memory, as it operates on the stream batches. However, final result is fully materialized in memory.
Written in the mix of C++ and Python and built from ground up on top of Apache Arrow, which provides the foundation for moving data and enables minimal overhead for transferring data to and from Numpy and Pandas.
Architecture
Vinum uses PostgresSQL parser provided by pglast project.
Query planner and executor are implemented in Python, while all the physical operators are either implemented in C++ or use compiled vectorized kernels from Arrow or NumPy. The only exception to this is native python UDFs, which are running within interpreted Python.
Query execution model is based on the vectorized model described in the prolific paper by P. A. Boncz, M. Zukowski, and N. Nes. Monetdb/x100: Hyper-pipelining query execution. In CIDR, 2005.
Example of a query plan:
Install
pip install vinum
Examples
Query python dict
Create a Table from a python dict and return result of the query as a Pandas DataFrame.
>>> import vinum as vn >>> data = {'value': [300.1, 2.8, 880], 'mode': ['air', 'bus', 'air']} >>> tbl = vn.Table.from_pydict(data) >>> tbl.sql_pd("SELECT value, np.log(value) FROM t WHERE mode='air'") value np.log 0 300.1 5.704116 1 880.0 6.779922
Query pandas dataframe
>>> import pandas as pd >>> import vinum as vn >>> data = {'col1': [1, 2, 3], 'col2': [7, 13, 17]} >>> pdf = pd.DataFrame(data=data) >>> tbl = vn.Table.from_pandas(pdf) >>> tbl.sql_pd('SELECT * FROM t WHERE col2 > 10 ORDER BY col1 DESC') col1 col2 0 3 17 1 2 13
Run query on a csv stream
For larger datasets or datasets that won’t fit into memory - stream_csv() is the recommended way to execute a query. Compressed files are also supported and can be streamed without prior extraction.
>>> import vinum as vn >>> query = 'select passenger_count pc, count(*) from t group by pc' >>> vn.stream_csv('taxi.csv.bz2').sql(query).to_pandas() pc count 0 0 165 1 5 3453 ...
Read and query csv
>>> import vinum as vn >>> tbl = vn.read_csv('taxi.csv') >>> res_tbl = tbl.sql('SELECT key, fare_amount, passenger_count FROM t ' ... 'WHERE fare_amount > 5 LIMIT 3') >>> res_tbl.to_pandas() key fare_amount passenger_count 0 2010-01-05 16:52:16.0000002 16.9 1 1 2011-08-18 00:35:00.00000049 5.7 2 2 2012-04-21 04:30:42.0000001 7.7 1
Compute Euclidean distance with numpy functions
Use any numpy functions via the ‘np.*’ namespace.
>>> import vinum as vn >>> tbl = vn.Table.from_pydict({'x': [1, 2, 3], 'y': [7, 13, 17]}) >>> tbl.sql_pd('SELECT *, np.sqrt(np.square(x) + np.square(y)) dist ' ... 'FROM t ORDER BY dist DESC') x y dist 0 3 17 17.262677 1 2 13 13.152946 2 1 7 7.071068
Compute Euclidean distance with vectorized UDF
Register UDF performing vectorized operations on Numpy arrays.
>>> import vinum as vn >>> vn.register_numpy('distance', ... lambda x, y: np.sqrt(np.square(x) + np.square(y))) >>> tbl = vn.Table.from_pydict({'x': [1, 2, 3], 'y': [7, 13, 17]}) >>> tbl.sql_pd('SELECT *, distance(x, y) AS dist ' ... 'FROM t ORDER BY dist DESC') x y dist 0 3 17 17.262677 1 2 13 13.152946 2 1 7 7.071068
Compute Euclidean distance with python UDF
Register Python lambda function as UDF.
>>> import math >>> import vinum as vn >>> vn.register_python('distance', lambda x, y: math.sqrt(x**2 + y**2)) >>> tbl = vn.Table.from_pydict({'x': [1, 2, 3], 'y': [7, 13, 17]}) >>> tbl.sql_pd('SELECT x, y, distance(x, y) AS dist FROM t') x y dist 0 1 7 7.071068 1 2 13 13.152946 2 3 17 17.262677
Group by z-score
>>> import numpy as np >>> import vinum as vn >>> def z_score(x: np.ndarray): ... "Compute Standard Score" ... mean = np.mean(x) ... std = np.std(x) ... return (x - mean) / std ... >>> vn.register_numpy('score', z_score) >>> tbl = vn.read_csv('taxi.csv') >>> tbl.sql_pd('select to_int(score(fare_amount)) AS bucket, avg(fare_amount), count(*) ' ... 'FROM t GROUP BY bucket ORDER BY bucket limit 3') bucket avg count_star 0 -1 -1.839000 10 1 0 8.817733 45158 2 1 25.155522 2376
Documentation
What Vinum is not
Vinum is not a Database Management System, there are no plans to support DML and transactions. If you need a DBMS designed for data analytics and OLAP, or don’t need Python UDFs, consider using excellent DuckDB - it is based on a solid scientific foundation and is extremely fast.
Dependencies
Inspiration
Future plans
Support joins and sub-queries.
Consider Gandiva for expression evaluation.
Parallel execution.
Project details
Download files
Download the file for your platform. If you're not sure which to choose, learn more about installing packages.
Source Distribution
Built Distributions
File details
Details for the file vinum-0.3.2.tar.gz
.
File metadata
- Download URL: vinum-0.3.2.tar.gz
- Upload date:
- Size: 92.0 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/3.4.1 importlib_metadata/3.7.3 pkginfo/1.7.0 requests/2.25.1 requests-toolbelt/0.9.1 tqdm/4.59.0 CPython/3.9.2
File hashes
Algorithm | Hash digest | |
---|---|---|
SHA256 | 5b3fd9297999fe8e4feb695e23460110d6b372763fdbf860e1110c7936e619bb |
|
MD5 | 3d122ebae591211f68af88b4b0654150 |
|
BLAKE2b-256 | 2857d567208dba863cbb3ef6d9f050f4015a4daec614e126d025c95f2197150c |
File details
Details for the file vinum-0.3.2-cp39-cp39-manylinux2014_x86_64.whl
.
File metadata
- Download URL: vinum-0.3.2-cp39-cp39-manylinux2014_x86_64.whl
- Upload date:
- Size: 11.4 MB
- Tags: CPython 3.9
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/3.4.1 importlib_metadata/3.7.3 pkginfo/1.7.0 requests/2.25.1 requests-toolbelt/0.9.1 tqdm/4.59.0 CPython/3.9.2
File hashes
Algorithm | Hash digest | |
---|---|---|
SHA256 | 2a07f2c09293f8ab2dc1f4352aa65b0632a11390d23536c3bc817305a9d2a1ab |
|
MD5 | bda69508130407dc805243885186f581 |
|
BLAKE2b-256 | 4c3803affa8383b3cf58fa30a0289b4d46895cfbb92fb179bee75e7e7eddb6f6 |
File details
Details for the file vinum-0.3.2-cp39-cp39-macosx_10_9_x86_64.whl
.
File metadata
- Download URL: vinum-0.3.2-cp39-cp39-macosx_10_9_x86_64.whl
- Upload date:
- Size: 505.9 kB
- Tags: CPython 3.9, macOS 10.9+ x86-64
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/3.4.1 importlib_metadata/3.7.3 pkginfo/1.7.0 requests/2.25.1 requests-toolbelt/0.9.1 tqdm/4.59.0 CPython/3.9.2
File hashes
Algorithm | Hash digest | |
---|---|---|
SHA256 | 70ccd1ee2bc1f992bb5219e9d01e355b5be58f6796f225dd186306a53aa27663 |
|
MD5 | b4e77e6da403e838139ca449b0b98be3 |
|
BLAKE2b-256 | 678d70103ec11141c9d890191d1bdbc6f986ecce0da3f8df4d62e055a8e5c6d7 |
File details
Details for the file vinum-0.3.2-cp38-cp38-manylinux2014_x86_64.whl
.
File metadata
- Download URL: vinum-0.3.2-cp38-cp38-manylinux2014_x86_64.whl
- Upload date:
- Size: 11.4 MB
- Tags: CPython 3.8
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/3.4.1 importlib_metadata/3.7.3 pkginfo/1.7.0 requests/2.25.1 requests-toolbelt/0.9.1 tqdm/4.59.0 CPython/3.9.2
File hashes
Algorithm | Hash digest | |
---|---|---|
SHA256 | c1fe9ceabebe6935fff4e62340487c278839442f47c7529fce1c5e45d0562d87 |
|
MD5 | 64024e9ff2cbd98f73ed6941c0c14ac4 |
|
BLAKE2b-256 | 0162286709f40e330afdef401dfe39945322bfb3ea7f81016f111cde548f975a |
File details
Details for the file vinum-0.3.2-cp38-cp38-macosx_10_9_x86_64.whl
.
File metadata
- Download URL: vinum-0.3.2-cp38-cp38-macosx_10_9_x86_64.whl
- Upload date:
- Size: 505.6 kB
- Tags: CPython 3.8, macOS 10.9+ x86-64
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/3.4.1 importlib_metadata/3.7.3 pkginfo/1.7.0 requests/2.25.1 requests-toolbelt/0.9.1 tqdm/4.59.0 CPython/3.9.2
File hashes
Algorithm | Hash digest | |
---|---|---|
SHA256 | 32ef832d457c1efb5a75fd99ec196583916f8dbf59d9a0a72f04278690b69ba4 |
|
MD5 | 9eb7dead7e44c060050c7556cb594fa4 |
|
BLAKE2b-256 | 53f4be6f8253f9ba6dfe5ce544bb0be14accf6884c53e5872342e8412a56a6dd |
File details
Details for the file vinum-0.3.2-cp37-cp37m-manylinux2014_x86_64.whl
.
File metadata
- Download URL: vinum-0.3.2-cp37-cp37m-manylinux2014_x86_64.whl
- Upload date:
- Size: 11.4 MB
- Tags: CPython 3.7m
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/3.4.1 importlib_metadata/3.7.3 pkginfo/1.7.0 requests/2.25.1 requests-toolbelt/0.9.1 tqdm/4.59.0 CPython/3.9.2
File hashes
Algorithm | Hash digest | |
---|---|---|
SHA256 | 0d9f93e951a64f675750d20d29984b05c0c90678807f6efc81a29a4d02bfedc8 |
|
MD5 | 4574267a3af1272a38ea7f01e74c835d |
|
BLAKE2b-256 | 79206bfb0e5e4b907467d917999f552abf656af86bd77f28d3342d977893343d |
File details
Details for the file vinum-0.3.2-cp37-cp37m-macosx_10_9_x86_64.whl
.
File metadata
- Download URL: vinum-0.3.2-cp37-cp37m-macosx_10_9_x86_64.whl
- Upload date:
- Size: 500.3 kB
- Tags: CPython 3.7m, macOS 10.9+ x86-64
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/3.4.1 importlib_metadata/3.7.3 pkginfo/1.7.0 requests/2.25.1 requests-toolbelt/0.9.1 tqdm/4.59.0 CPython/3.9.2
File hashes
Algorithm | Hash digest | |
---|---|---|
SHA256 | efe742f590dec4226751943e147282f90bfba1ee601ab91357784b669bf26b80 |
|
MD5 | 416f477aa63e9a7e040006ae9c6428f6 |
|
BLAKE2b-256 | 14f723e80c336dfbd558f008eb255f12403432c1940a84bbe28a8eea045244c1 |