
A Python-to-SQL transpiler to work with relational databases


Grizzly


Grizzly is a transpiler from a Python-API to SQL to move computations from the client into a database system.

Grizzly implements its own DataFrame structure that tracks operations such as projections, filters, and joins. Only when the result of this sequence of operations is needed is a SQL string produced that represents all those operations and sent to a DBMS. This way, you don't have to worry about out-of-memory problems, unoptimized queries, and high CPU load on the client.

Installation

Grizzly is available on PyPI: https://pypi.org/project/grizzly-sql

pip3 install --user grizzly-sql

Dependencies

Grizzly uses

Getting started

Import

As with any Python module, just import it:

import grizzly

Connection

Connect to your database using an appropriate connection string. In order to load the shipped test database containing events from the GDELT project:

import sqlite3
con = sqlite3.connect("grizzly.db")

Grizzly uses different classes for code generation and executing the produced query. Currently, Grizzly includes a SQL code generator and execution wrapper for relational DBMS (more will follow). In order to activate them, set:

from grizzly.relationaldbexecutor import RelationalExecutor
grizzly.use(RelationalExecutor(con))

The RelationalExecutor constructor has a parameter for the code generator to use. By default this is a grizzly.sqlgenerator.SQLGenerator, but it can be set to a custom implementation.

Now, reference the table(s) you want to work with:

df = grizzly.read_table("events")

Here, df is just a reference; it contains no data from your table yet. To show its complete contents, use the show method:

df.show(pretty=True)

This will print the table's content on the screen. Alternatively, you can convert the dataframe into a string using str(df).

In order to collect the result of a query/program into a local list, use df.collect(includeHeader=True).
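As a sketch of what collect presumably does under the hood (execute the generated query and fetch everything into a local list), here is the plain sqlite3 equivalent; the toy table and its contents are made up for illustration:

```python
import sqlite3

# A tiny in-memory table standing in for the GDELT events table.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE events (globaleventid INTEGER, actor1name TEXT)")
con.execute("INSERT INTO events VALUES (1, 'GOV'), (2, 'POLICE')")

# What df.collect(includeHeader=True) conceptually boils down to:
# run the generated SQL, prepend the column names, fetch all rows.
cur = con.execute("SELECT * FROM events")
header = [d[0] for d in cur.description]
result = [header] + cur.fetchall()
print(result)
```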

Filter & Projection

Operations are similar to Pandas:

df[df["globaleventid"] == 470747760]  # filter
df = df[["actor1name", "actor2name"]]  # projection
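For comparison, here is the plain SQL that combining both steps amounts to, run directly with sqlite3 (a sketch: the query text Grizzly actually generates may differ, e.g. in table aliasing, and the sample rows are invented):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute(
    "CREATE TABLE events (globaleventid INTEGER, actor1name TEXT, actor2name TEXT)")
con.executemany(
    "INSERT INTO events VALUES (?, ?, ?)",
    [(470747760, "GOV", "POLICE"), (1, "CIVILIAN", "ARMY")])

# Filter on globaleventid, then project to two columns:
rows = con.execute(
    "SELECT actor1name, actor2name FROM events WHERE globaleventid = 470747760"
).fetchall()
print(rows)
```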

Joins

A DataFrame can be joined with another DataFrame:

df1 = grizzly.read_table("t1")
df2 = grizzly.read_table("t2")

joined = df1.join(df2, on=["actor1name", "actor2name"], how="inner", comp='=')

In the on parameter, you specify the join columns. The first one is for the left input (df1), the second one for the right input (df2). The how parameter selects the join type: inner, left outer, etc. This value is placed directly into the generated query and thus depends on the SQL dialect of the underlying DBMS. An additional comp parameter lets you choose the comparison operator.

You sometimes want to join on multiple columns with different comparisons. For this, in Grizzly you define the expression as if it were a filter:

df1 = grizzly.read_table("t1")
df2 = grizzly.read_table("t2")

j = df1.join(df2, on = (df1.actor1name == df2.actor2name) | (df1["actor1countrycode"] <= df2["actor2countrycode"]), how="left outer")

This results in the following SQL code:

SELECT * 
FROM t1 _t0 
    left outer JOIN t2 _t1 ON _t0.actor1name = _t1.actor2name or _t0.actor1countrycode <= _t1.actor2countrycode
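Running this query pattern directly against SQLite shows the outer-join semantics; the tables and rows below are made up for illustration:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE t1 (actor1name TEXT, actor1countrycode TEXT)")
con.execute("CREATE TABLE t2 (actor2name TEXT, actor2countrycode TEXT)")
con.execute("INSERT INTO t1 VALUES ('GOV', 'USA'), ('ARMY', 'ZWE')")
con.execute("INSERT INTO t2 VALUES ('GOV', 'DEU'), ('REBEL', 'AAA')")

# The disjunctive join condition from the generated SQL above:
rows = con.execute("""
    SELECT *
    FROM t1 _t0
        LEFT OUTER JOIN t2 _t1
        ON _t0.actor1name = _t1.actor2name
        OR _t0.actor1countrycode <= _t1.actor2countrycode
""").fetchall()
# The ARMY row matches nothing, so the outer join pads it with NULLs.
print(rows)
```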

Grouping & Aggregation

You can also group the data on multiple columns and compute an aggregate over the groups using agg:

from grizzly.aggregates import AggregateType
df = grizzly.read_table("events")
g = df.groupby(["year","actor1name"])

a = g.agg(col="actor2name", aggType=AggregateType.COUNT)

Here, a represents a DataFrame with three columns: year, actor1name, and the count value. In the above example, a.generate() will give

SELECT _t0.year, _t0.actor1name, count(_t0.actor2name)
FROM events _t0 
GROUP BY _t0.year, _t0.actor1name
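The generated grouping query can be checked on a small hand-made table (the rows here are invented for illustration):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE events (year INTEGER, actor1name TEXT, actor2name TEXT)")
con.executemany(
    "INSERT INTO events VALUES (?, ?, ?)",
    [(2019, "GOV", "POLICE"), (2019, "GOV", "ARMY"), (2020, "GOV", "POLICE")],
)

# The GROUP BY query from above, executed directly:
rows = con.execute("""
    SELECT _t0.year, _t0.actor1name, count(_t0.actor2name)
    FROM events _t0
    GROUP BY _t0.year, _t0.actor1name
""").fetchall()
print(sorted(rows))
```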

If neither an aggregation function nor a projection is used, only the grouping columns are selected when the query is generated.

You can of course also apply aggregation functions to non-grouped DataFrames; in this case the aggregate is computed over the whole content. For example, g.count() immediately runs the following query and returns the scalar value:

SELECT count(*) FROM (
    SELECT _t0.year, _t0.actor1name
    FROM events _t0 
    GROUP BY _t0.year, _t0.actor1name
    ) _t1

A df.count() (i.e., before the grouping) for the above piece of code returns a single scalar value with the number of records in df (22225). The query executed for this is:

SELECT count(*)
FROM events

Note that Grizzly currently supports predefined aggregations only. They are defined as constants in the AggregateType class: MIN, MAX, MEAN, SUM, COUNT. We are working on adding support for (arbitrary) user-defined functions (see below).

SQL

You can inspect the produced query string (in this case SQL) with generate():

print(df.generate())

Supported operations

  • filter/selection
  • projection
  • join
  • group by
  • aggregation functions: min, max, mean (avg), count, sum

Limitations

  • Currently, only the few operations listed above are supported -- more are to come
  • Grizzly is under active development and things might change
  • There are certainly some bugs, probably with complex queries

Vision

Grizzly is a research project. We aim to bring data-intensive operations back into the database system. Our plan is to extend Grizzly in the following ways; some of them are inspired by our other projects:

  • Support for heterogeneous data sources:
    • Combine data from different sources (relational DB, file, HDFS, NoSQL) in one program/query (i.e. Polystores, federated query processing)
    • automatically import external data when necessary
  • Add spatial operations
  • Stream processing operations
  • Code generation
    • Produce native code from the Python API
