# csql - Composeable SQL
csql is a Python library to help you write more manageable SQL queries. You can write your queries as small, self-contained chunks, preview the results without pulling a whole result-set down from your database, then refer to them in future queries.
There are also useful features for handling database parameters properly.
The intended use-case is for data analysis and exploration. There's a particular hacky part of the current implementation that leads me to discourage using this in a productionized application; for that scenario, you could still use this to generate your SQL in a build step or something.
## Example:
```python
from csql import Q, Parameters
import pandas as pd
from datetime import date

con = your_database_connection()
```
Start with a straightforward query:
```python
p = Parameters(
    created_on = date(2020,1,1)
)

q1 = Q(lambda: f"""
select
    customers.id,
    first(customers.name) as name,
    first(created_on) as created_on,
    sum(sales.value) as sales
from
    customers
    join sales on customers.id = sales.customer_id
where created_on > {p['created_on']}
group by customers.id
""", p)

print(q1.preview_pd(con))
```
|   | id  | name       | created_on | sales    |
|---|-----|------------|------------|----------|
| 1 | 111 | John Smith | 2020-02-05 | 32.0     |
| 2 | 112 | Amy Zhang  | 2020-05-01 | 101.5    |
| 3 | 115 | Tran Thanh | 2020-03-02 | 100000.0 |
The preview will pull down 10 rows to a) sanity-check the result of what you've just written, and b) validate your SQL.
Now, try writing some new queries that build on your previous ones:
```python
q2 = Q(lambda: f"""
select
    ntile(100) over (order by sales)
        as ntile_100,
    name,
    sales
from {q1}
""")

print(q2.preview_pd(con))
```
|   | ntile_100 | name       | sales    |
|---|-----------|------------|----------|
| 1 | 29        | John Smith | 32.0     |
| 2 | 50        | Amy Zhang  | 101.5    |
| 3 | 99        | Tran Thanh | 100000.0 |
```python
q3 = Q(lambda: f"""
select
    ntile_100,
    min(sales),
    max(sales)
from {q2}
group by ntile_100
order by ntile_100
""")

# this time, we'll pull the whole result instead of just previewing:
result = pd.read_sql(**q3.pd(), con=con)
print(result)
```
|    | ntile_100 | min(sales) | max(sales) |
|----|-----------|------------|------------|
| 28 | 29        | 25         | 33.3       |
| 49 | 50        | 98         | 120        |
| 98 | 99        | 5004       | 100000.0   |
## Cool! But, how does it work?
The basic idea is to turn your queries into a CTE by keeping track of what builds on top of what. For example, for the last query shown, `q3`, what actually gets sent to the database is:
```sql
with _subQuery0 as (
    select
        customers.id,
        first(customers.name) as name,
        first(created_on) as created_on,
        sum(sales.value) as sales
    from
        customers
        join sales on customers.id = sales.customer_id
    where created_on > :1
    group by customers.id
),
_subQuery1 as (
    select
        ntile(100) over (order by sales)
            as ntile_100,
        name,
        sales
    from _subQuery0
)
select
    ntile_100,
    min(sales),
    max(sales)
from _subQuery1
group by ntile_100
order by ntile_100
```
which is exactly the sort of unmaintainable and undebuggable monstrosity that this library is designed to help you avoid.
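The real implementation has more moving parts, but the core bookkeeping can be sketched in a handful of lines. This is a toy illustration only, not csql's actual code (`Node`, `flatten`, and `render` are invented names): each query records the queries it references, and rendering walks that chain, emitting every dependency as a named CTE and splicing its name into the queries that use it.

```python
from dataclasses import dataclass, field

@dataclass
class Node:
    """Toy stand-in for a query: raw SQL plus the queries it builds on."""
    sql: str                           # SQL with {} placeholders for each dependency
    deps: list = field(default_factory=list)

def flatten(node, seen=None, clauses=None):
    """Depth-first walk: name every dependency before the query that uses it."""
    if seen is None:
        seen, clauses = {}, []
    for dep in node.deps:
        if id(dep) not in seen:
            flatten(dep, seen, clauses)
    # splice the CTE names of this node's dependencies into its SQL
    sql = node.sql.format(*(seen[id(d)] for d in node.deps))
    seen[id(node)] = f"_subQuery{len(clauses)}"
    clauses.append((seen[id(node)], sql))
    return clauses

def render(node):
    *ctes, (_, final) = flatten(node)
    if not ctes:
        return final
    with_block = ",\n".join(f"{name} as (\n{sql}\n)" for name, sql in ctes)
    return f"with {with_block}\n{final}"

q1 = Node("select id, sum(value) as sales from sales group by id")
q2 = Node("select ntile(100) over (order by sales) as n, sales from {}", [q1])
print(render(q2))
# with _subQuery0 as (
# select id, sum(value) as sales from sales group by id
# )
# select ntile(100) over (order by sales) as n, sales from _subQuery0
```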
## Design Notes:
I am perhaps overly optimistic about this, but currently I think this should work with most SQL dialects. It doesn't attempt to parse your SQL, it uses CTEs, which are widely supported, and it passes numeric-style parameters.
It's also not actually tied to `pandas` at all - `.pd()` is just a convenience method to build a dict you can splat into `pd.read_sql`.
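For instance, because that dict splats cleanly into `pd.read_sql`, its keys presumably line up with `pd.read_sql`'s `sql` and `params` arguments - an assumption worth checking against the tests. Under that assumption, a plain DB-API cursor can run the same query:

```python
# Sketch: reuse the built query without pandas, via a plain DB-API cursor.
# ASSUMPTION: the dict from .pd() has 'sql' and 'params' keys (check the
# csql tests for the real shape), and your driver accepts the numeric
# paramstyle csql emits.
built = q3.pd()
cur = con.cursor()
cur.execute(built["sql"], built["params"])
print(cur.fetchmany(10))
```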
## TODO / Future Experiments:
- Document the API (for now, just read the tests)
- Implement other preview systems than `pandas` (input wanted! what would be useful for you?)
- Finalize API to be as ergonomic as possible for interactive use (input wanted!)
- Implement some way of actually storing previous results, e.g. into temp tables. (uh oh, this would need DB-specific awareness :( )
- Consider other ways of nicely specifying dependent queries that don't rely on AST-parsing magic (e.g. maybe overloading `__add__` and `__radd__`, as in `"select 1 from " + q1 + " where ..."`? A rough sketch of that idea follows this list.)
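For what it's worth, that last idea might look something like the following toy sketch (nothing here exists in csql; `QueryFragment` is an invented name). The point is that `+` can keep raw SQL strings and query objects distinct, so a renderer could still lift dependencies out into CTEs:

```python
class QueryFragment:
    """Toy sketch of a query object whose dependencies are tracked via +."""
    def __init__(self, parts):
        # parts is a mix of raw SQL strings and QueryFragment dependencies
        self.parts = parts

    def __add__(self, other):
        # fragment + "more sql" (or fragment + fragment)
        return QueryFragment([self, other])

    def __radd__(self, other):
        # "some sql" + fragment: str.__add__ fails first, then Python tries this
        return QueryFragment([other, self])

q1 = QueryFragment(["select id, sales from customer_sales"])
q2 = "select 1 from " + q1 + " where sales > 10"
# q2.parts == [QueryFragment(['select 1 from ', q1]), ' where sales > 10'];
# raw strings and the q1 dependency stay distinct, so a renderer could
# still lift q1 out into a CTE.
```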
## What was that 'particular hacky part of the current implementation'?
Glad you asked. Consider what goes on here:
```python
Q(lambda: f"""select 1 from {otherQuery}""")
```
We need to:

- get the SQL in a way that we can inject dependencies into it
- get the dependent queries
- build the dependent queries into a CTE, injecting the names of earlier clauses into later ones as appropriate
- not go down the rabbit hole of actually parsing SQL - that would certainly end in a headache and unsupported SQL dialects, which would be a big deal for the intended use case of interactive analytical queries against analytics DBs.
String interpolation is nearly adequate for this. Unfortunately, Python lacks the ability to "hook in" to string interpolation to the extent that would be required. For example, in JavaScript or Julia, it would be possible to define a string-interpolation function or string macro like

```javascript
Q`select 1 from ${otherQuery}`
```

and it would satisfy all of the above.
However, Python lacks this ability, and the PEP to add it doesn't look likely to land any time soon, if ever (see PEP 501: https://www.python.org/dev/peps/pep-0501/).
As a workaround, my current strategy is to require `Q` to be called with the f-string wrapped in a lambda, run `getsource()` on the lambda, parse out the f-string AST node, and execute it myself, so I can pull out dependent queries and parameters properly. (I think e.g. pytest does similar hackery to get its magic `assert` diffs.)
This seems to work in most simple cases, and the implementation is not complex. However, it's still brittle and a bit shit. (In practice, the biggest drawback I've found is that it won't work in the REPL.)
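The rough shape of the trick looks something like this (a simplified sketch, not csql's actual implementation; real code also has to cope with closures, format specs, and multi-line sources):

```python
import ast, inspect

def interpolated_exprs(fn):
    """Toy version of the trick: recover the f-string inside a lambda and
    list the expressions it interpolates, without ever evaluating them."""
    src = inspect.getsource(fn)   # this is the call that fails in a REPL
    joined = next(n for n in ast.walk(ast.parse(src.strip()))
                  if isinstance(n, ast.JoinedStr))
    return [ast.unparse(p.value) for p in joined.values
            if isinstance(p, ast.FormattedValue)]

make_sql = lambda: f"select 1 from {otherQuery} where x > {p['x']}"
print(interpolated_exprs(make_sql))   # ['otherQuery', "p['x']"]
```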