Grizzly

A Python-to-SQL transpiler to work with relational databases

Grizzly is a transpiler from a Python API to SQL that moves computations from the client into a database system.

Grizzly implements its own DataFrame structure that records operations such as projections, filters, and joins. Only when the result of the operation sequence is needed is a single SQL string produced that expresses all those operations and sent to a DBMS. This way, you don't have to worry about out-of-memory problems, unoptimized queries, or high client CPU load.
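The snippet below sketches this lazy behavior using the API described in the sections that follow; the table name events and the column id are placeholders:

import grizzly
import sqlite3

con = sqlite3.connect("grizzly.db")

df = grizzly.read_table("events", con)  # just a reference, no data is read
df = df[df["id"] == 42]                 # recorded, but no query is sent yet
print(df.sql())                         # only now is the SQL string produced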

Installation

Grizzly is available on PyPI: https://pypi.org/project/grizzly-sql

pip3 install --user grizzly-sql

Dependencies

Grizzly uses a standard Python DB-API connection object, such as one created with sqlite3, to talk to the underlying database (see the Connection section below).

Getting started

Import

As with any Python module, just import it:

import grizzly

Connection

Connect to your database using an appropriate connection string:

import sqlite3
con = sqlite3.connect("grizzly.db")

Now, reference the table(s) you want to work with:

df = grizzly.read_table("events", con)

Here, df is just a reference; it contains no data from your table yet. To show its contents, use the show method:

df.show()

This will print the table's content on the screen.

Filter & Projection

Operations are similar to Pandas:

df[df["id" == 42]] # filter
df = df[["actor1","actor2"]]

Joins

A DataFrame can be joined with another DataFrame:

df1 = grizzly.read_table("table1", con)
df2 = grizzly.read_table("table2", con)

joined = df1.join(df2, on=["joinCol1", "joinCol2"], how="inner", comp='=')

In the on parameter you specify the join columns: the first one refers to the left input (df1), the second one to the right input (df2). The how parameter selects the join type: inner, left outer, etc. Its value is placed verbatim into the generated query and therefore depends on the SQL dialect of the underlying DBMS. The additional comp parameter lets you choose the comparison operator.
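Given the alias scheme visible in the expression-join example below, the call above should yield SQL of roughly this shape; the alias name (written ALIAS here) is auto-generated and the exact formatting may differ:

SELECT *
FROM table1
    INNER JOIN (SELECT * FROM table2) ALIAS
    ON table1.joinCol1 = ALIAS.joinCol2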

Sometimes you want to join on multiple columns with different comparison operators. For this, Grizzly lets you define the join condition as an expression, just as you would for a filter:

df1 = grizzly.read_table("t1", con)
df2 = grizzly.read_table("t2", con)

j = df1.join(df2, on = (df1['a'] == df2['b']) & (df1['c'] <= df2['d']), how="left outer")

This results in the following SQL code:

SELECT  *
FROM t1  
    LEFT OUTER JOIN (SELECT  * FROM t2   ) IOBRD
    ON (t1.a = IOBRD.b) AND (t1.c <= IOBRD.d)

Here, IOBRD is an auto-generated tuple variable (table alias) needed to avoid ambiguity.

Grouping & Aggregation

You can also group the data on multiple columns and compute an aggregate over the groups:

df = grizzly.read_table("events", con)
df = df[df['id'] == 42]
g = df.groupby(["year","actor1"])

a = g.count("actor2")

If no aggregation function is applied and show() is called, only the grouping columns are selected. You can of course also apply aggregation functions to non-grouped DataFrames; in that case, the aggregates are computed over the whole content.

Thus, a.sql() will give

SELECT year, actor1, count(actor2)
FROM events
WHERE id = 42
GROUP BY year, actor1

In contrast, df.count() (i.e., counting before the grouping) for the above piece of code returns a single scalar value, the number of records in df:

SELECT count(*)
FROM events
WHERE id = 42

SQL

You can inspect the produced SQL string with sql():

print(df.sql())

For the filter and projection example above, the output will be

SELECT actor1, actor2
FROM events
WHERE id = 42

Supported operations

  • filter/selection
  • projection
  • join
  • group by
  • aggregation functions: min, max, mean (avg), count, sum
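
Putting these operations together, a complete session might look like the following sketch; the table and column names are placeholders and only the operations listed above are used:

import grizzly
import sqlite3

con = sqlite3.connect("grizzly.db")

df = grizzly.read_table("events", con)  # reference the table
df = df[df["id"] == 42]                 # filter
g = df.groupby(["year", "actor1"])      # group by
a = g.count("actor2")                   # aggregation

print(a.sql())  # inspect the generated query
a.show()        # execute it in the DBMS and print the result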

