Skip to main content

Vinum is a SQL query processor for Python, designed for data analysis workflows and in-memory analytics.

Project description

PyPi CI Grade_Python Codecov

Vinum is a SQL query processor for Python, designed for data analysis workflows and in-memory analytics.

When should I use Vinum?

Vinum is running inside of the host Python process and allows to execute any functions available to the interpreter as UDFs. If you are doing data analysis or running ETL in Python, Vinum allows to execute efficient SQL queries with an ability to call native Python UDFs.

Key Features:

  • Vinum is running inside of the host Python process and has a hybrid query execution model - whenever possible it would prefer native compiled version of operators and only executes Python interpreted code where strictly necessary (ie. for native Python UDFs).

  • Allows to use functions available within the host Python interpreter as UDFs, including native Python, NumPy, etc.

  • Vinum’s execution model doesn’t require input datasets to fit into memory, as it operates on the stream batches. However, final result is fully materialized in memory.

  • Written in the mix of C++ and Python and built from ground up on top of Apache Arrow, which provides the foundation for moving data and enables minimal overhead for transferring data to and from Numpy and Pandas.

Architecture

https://github.com/dmitrykoval/vinum/raw/main/doc/source/_static/architecture.png

Vinum uses PostgresSQL parser provided by pglast project.

Query planner and executor are implemented in Python, while all the physical operators are either implemented in C++ or use compiled vectorized kernels from Arrow or NumPy. The only exception to this is native python UDFs, which are running within interpreted Python.

Query execution model is based on the vectorized model described in the prolific paper by P. A. Boncz, M. Zukowski, and N. Nes. Monetdb/x100: Hyper-pipelining query execution. In CIDR, 2005.

Example of a query plan:

https://github.com/dmitrykoval/vinum/raw/main/doc/source/_static/query.png

Install

pip install vinum

Examples

Query python dict

Create a Table from a python dict and return result of the query as a Pandas DataFrame.

>>> import vinum as vn
>>> data = {'value': [300.1, 2.8, 880], 'mode': ['air', 'bus', 'air']}
>>> tbl = vn.Table.from_pydict(data)
>>> tbl.sql_pd("SELECT value, np.log(value) FROM t WHERE mode='air'")
   value    np.log
0  300.1  5.704116
1  880.0  6.779922

Query pandas dataframe

>>> import pandas as pd
>>> import vinum as vn
>>> data = {'col1': [1, 2, 3], 'col2': [7, 13, 17]}
>>> pdf = pd.DataFrame(data=data)
>>> tbl = vn.Table.from_pandas(pdf)
>>> tbl.sql_pd('SELECT * FROM t WHERE col2 > 10 ORDER BY col1 DESC')
   col1  col2
0     3    17
1     2    13

Run query on a csv stream

For larger datasets or datasets that won’t fit into memory - stream_csv() is the recommended way to execute a query. Compressed files are also supported and can be streamed without prior extraction.

>>> import vinum as vn
>>> query = 'select passenger_count pc, count(*) from t group by pc'
>>> vn.stream_csv('taxi.csv.bz2').sql(query).to_pandas()
   pc  count
0   0    165
1   5   3453
...

Read and query csv

>>> import vinum as vn
>>> tbl = vn.read_csv('taxi.csv')
>>> res_tbl = tbl.sql('SELECT key, fare_amount, passenger_count FROM t '
...                   'WHERE fare_amount > 5 LIMIT 3')
>>> res_tbl.to_pandas()
                            key  fare_amount  passenger_count
0   2010-01-05 16:52:16.0000002         16.9                1
1  2011-08-18 00:35:00.00000049          5.7                2
2   2012-04-21 04:30:42.0000001          7.7                1

Compute Euclidean distance with numpy functions

Use any numpy functions via the ‘np.*’ namespace.

>>> import vinum as vn
>>> tbl = vn.Table.from_pydict({'x': [1, 2, 3], 'y': [7, 13, 17]})
>>> tbl.sql_pd('SELECT *, np.sqrt(np.square(x) + np.square(y)) dist '
...            'FROM t ORDER BY dist DESC')
   x   y       dist
0  3  17  17.262677
1  2  13  13.152946
2  1   7   7.071068

Compute Euclidean distance with vectorized UDF

Register UDF performing vectorized operations on Numpy arrays.

>>> import vinum as vn
>>> vn.register_numpy('distance',
...                   lambda x, y: np.sqrt(np.square(x) + np.square(y)))
>>> tbl = vn.Table.from_pydict({'x': [1, 2, 3], 'y': [7, 13, 17]})
>>> tbl.sql_pd('SELECT *, distance(x, y) AS dist '
...            'FROM t ORDER BY dist DESC')
   x   y       dist
0  3  17  17.262677
1  2  13  13.152946
2  1   7   7.071068

Compute Euclidean distance with python UDF

Register Python lambda function as UDF.

>>> import math
>>> import vinum as vn
>>> vn.register_python('distance', lambda x, y: math.sqrt(x**2 + y**2))
>>> tbl = vn.Table.from_pydict({'x': [1, 2, 3], 'y': [7, 13, 17]})
>>> tbl.sql_pd('SELECT x, y, distance(x, y) AS dist FROM t')
   x   y       dist
0  1   7   7.071068
1  2  13  13.152946
2  3  17  17.262677

Group by z-score

>>> import numpy as np
>>> import vinum as vn
>>> def z_score(x: np.ndarray):
...     "Compute Standard Score"
...     mean = np.mean(x)
...     std = np.std(x)
...     return (x - mean) / std
...
>>> vn.register_numpy('score', z_score)
>>> tbl = vn.read_csv('taxi.csv')
>>> tbl.sql_pd('select to_int(score(fare_amount)) AS bucket, avg(fare_amount), count(*) '
...            'FROM t GROUP BY bucket ORDER BY bucket limit 3')
   bucket        avg  count_star
0      -1  -1.839000          10
1       0   8.817733       45158
2       1  25.155522        2376

Documentation

What Vinum is not

Vinum is not a Database Management System, there are no plans to support DML and transactions. If you need a DBMS designed for data analytics and OLAP, or don’t need Python UDFs, consider using excellent DuckDB - it is based on a solid scientific foundation and is extremely fast.

Dependencies

Inspiration

Future plans

  • Support joins and sub-queries.

  • Consider Gandiva for expression evaluation.

  • Parallel execution.

Project details


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

vinum-0.3.2.tar.gz (92.0 kB view details)

Uploaded Source

Built Distributions

vinum-0.3.2-cp39-cp39-manylinux2014_x86_64.whl (11.4 MB view details)

Uploaded CPython 3.9

vinum-0.3.2-cp39-cp39-macosx_10_9_x86_64.whl (505.9 kB view details)

Uploaded CPython 3.9 macOS 10.9+ x86-64

vinum-0.3.2-cp38-cp38-manylinux2014_x86_64.whl (11.4 MB view details)

Uploaded CPython 3.8

vinum-0.3.2-cp38-cp38-macosx_10_9_x86_64.whl (505.6 kB view details)

Uploaded CPython 3.8 macOS 10.9+ x86-64

vinum-0.3.2-cp37-cp37m-manylinux2014_x86_64.whl (11.4 MB view details)

Uploaded CPython 3.7m

vinum-0.3.2-cp37-cp37m-macosx_10_9_x86_64.whl (500.3 kB view details)

Uploaded CPython 3.7m macOS 10.9+ x86-64

File details

Details for the file vinum-0.3.2.tar.gz.

File metadata

  • Download URL: vinum-0.3.2.tar.gz
  • Upload date:
  • Size: 92.0 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/3.4.1 importlib_metadata/3.7.3 pkginfo/1.7.0 requests/2.25.1 requests-toolbelt/0.9.1 tqdm/4.59.0 CPython/3.9.2

File hashes

Hashes for vinum-0.3.2.tar.gz
Algorithm Hash digest
SHA256 5b3fd9297999fe8e4feb695e23460110d6b372763fdbf860e1110c7936e619bb
MD5 3d122ebae591211f68af88b4b0654150
BLAKE2b-256 2857d567208dba863cbb3ef6d9f050f4015a4daec614e126d025c95f2197150c

See more details on using hashes here.

File details

Details for the file vinum-0.3.2-cp39-cp39-manylinux2014_x86_64.whl.

File metadata

  • Download URL: vinum-0.3.2-cp39-cp39-manylinux2014_x86_64.whl
  • Upload date:
  • Size: 11.4 MB
  • Tags: CPython 3.9
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/3.4.1 importlib_metadata/3.7.3 pkginfo/1.7.0 requests/2.25.1 requests-toolbelt/0.9.1 tqdm/4.59.0 CPython/3.9.2

File hashes

Hashes for vinum-0.3.2-cp39-cp39-manylinux2014_x86_64.whl
Algorithm Hash digest
SHA256 2a07f2c09293f8ab2dc1f4352aa65b0632a11390d23536c3bc817305a9d2a1ab
MD5 bda69508130407dc805243885186f581
BLAKE2b-256 4c3803affa8383b3cf58fa30a0289b4d46895cfbb92fb179bee75e7e7eddb6f6

See more details on using hashes here.

File details

Details for the file vinum-0.3.2-cp39-cp39-macosx_10_9_x86_64.whl.

File metadata

  • Download URL: vinum-0.3.2-cp39-cp39-macosx_10_9_x86_64.whl
  • Upload date:
  • Size: 505.9 kB
  • Tags: CPython 3.9, macOS 10.9+ x86-64
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/3.4.1 importlib_metadata/3.7.3 pkginfo/1.7.0 requests/2.25.1 requests-toolbelt/0.9.1 tqdm/4.59.0 CPython/3.9.2

File hashes

Hashes for vinum-0.3.2-cp39-cp39-macosx_10_9_x86_64.whl
Algorithm Hash digest
SHA256 70ccd1ee2bc1f992bb5219e9d01e355b5be58f6796f225dd186306a53aa27663
MD5 b4e77e6da403e838139ca449b0b98be3
BLAKE2b-256 678d70103ec11141c9d890191d1bdbc6f986ecce0da3f8df4d62e055a8e5c6d7

See more details on using hashes here.

File details

Details for the file vinum-0.3.2-cp38-cp38-manylinux2014_x86_64.whl.

File metadata

  • Download URL: vinum-0.3.2-cp38-cp38-manylinux2014_x86_64.whl
  • Upload date:
  • Size: 11.4 MB
  • Tags: CPython 3.8
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/3.4.1 importlib_metadata/3.7.3 pkginfo/1.7.0 requests/2.25.1 requests-toolbelt/0.9.1 tqdm/4.59.0 CPython/3.9.2

File hashes

Hashes for vinum-0.3.2-cp38-cp38-manylinux2014_x86_64.whl
Algorithm Hash digest
SHA256 c1fe9ceabebe6935fff4e62340487c278839442f47c7529fce1c5e45d0562d87
MD5 64024e9ff2cbd98f73ed6941c0c14ac4
BLAKE2b-256 0162286709f40e330afdef401dfe39945322bfb3ea7f81016f111cde548f975a

See more details on using hashes here.

File details

Details for the file vinum-0.3.2-cp38-cp38-macosx_10_9_x86_64.whl.

File metadata

  • Download URL: vinum-0.3.2-cp38-cp38-macosx_10_9_x86_64.whl
  • Upload date:
  • Size: 505.6 kB
  • Tags: CPython 3.8, macOS 10.9+ x86-64
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/3.4.1 importlib_metadata/3.7.3 pkginfo/1.7.0 requests/2.25.1 requests-toolbelt/0.9.1 tqdm/4.59.0 CPython/3.9.2

File hashes

Hashes for vinum-0.3.2-cp38-cp38-macosx_10_9_x86_64.whl
Algorithm Hash digest
SHA256 32ef832d457c1efb5a75fd99ec196583916f8dbf59d9a0a72f04278690b69ba4
MD5 9eb7dead7e44c060050c7556cb594fa4
BLAKE2b-256 53f4be6f8253f9ba6dfe5ce544bb0be14accf6884c53e5872342e8412a56a6dd

See more details on using hashes here.

File details

Details for the file vinum-0.3.2-cp37-cp37m-manylinux2014_x86_64.whl.

File metadata

  • Download URL: vinum-0.3.2-cp37-cp37m-manylinux2014_x86_64.whl
  • Upload date:
  • Size: 11.4 MB
  • Tags: CPython 3.7m
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/3.4.1 importlib_metadata/3.7.3 pkginfo/1.7.0 requests/2.25.1 requests-toolbelt/0.9.1 tqdm/4.59.0 CPython/3.9.2

File hashes

Hashes for vinum-0.3.2-cp37-cp37m-manylinux2014_x86_64.whl
Algorithm Hash digest
SHA256 0d9f93e951a64f675750d20d29984b05c0c90678807f6efc81a29a4d02bfedc8
MD5 4574267a3af1272a38ea7f01e74c835d
BLAKE2b-256 79206bfb0e5e4b907467d917999f552abf656af86bd77f28d3342d977893343d

See more details on using hashes here.

File details

Details for the file vinum-0.3.2-cp37-cp37m-macosx_10_9_x86_64.whl.

File metadata

  • Download URL: vinum-0.3.2-cp37-cp37m-macosx_10_9_x86_64.whl
  • Upload date:
  • Size: 500.3 kB
  • Tags: CPython 3.7m, macOS 10.9+ x86-64
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/3.4.1 importlib_metadata/3.7.3 pkginfo/1.7.0 requests/2.25.1 requests-toolbelt/0.9.1 tqdm/4.59.0 CPython/3.9.2

File hashes

Hashes for vinum-0.3.2-cp37-cp37m-macosx_10_9_x86_64.whl
Algorithm Hash digest
SHA256 efe742f590dec4226751943e147282f90bfba1ee601ab91357784b669bf26b80
MD5 416f477aa63e9a7e040006ae9c6428f6
BLAKE2b-256 14f723e80c336dfbd558f008eb255f12403432c1940a84bbe28a8eea045244c1

See more details on using hashes here.

Supported by

AWS AWS Cloud computing and Security Sponsor Datadog Datadog Monitoring Fastly Fastly CDN Google Google Download Analytics Microsoft Microsoft PSF Sponsor Pingdom Pingdom Monitoring Sentry Sentry Error logging StatusPage StatusPage Status page