Skip to main content

A package to help parameterise and macrofy sql queries

Project description

PYDQT - Python Data Query Tool

A project which aides in querying of both local and cloud based relational, tabular data. Some things it can do:

  • run paramterised SQL commands; both strings and templates
  • create and reuse jinja based templates to simplify long, complex queries
  • easily create and store jinja macros
  • run queries remotely on snowflake or locally via duckdb
  • cache results locally for further processing and analysis
  • seamlessly return results as a pandas dataframe
  • test data on single columns or combinations of columns

Installation

You can install from PyPI:

pip install pydqt

or

poetry add pydqt

Or you can fork the github repo

PYDQT works with local data straight out of the box. PYDQT also has support to work with remote Snowflake servers. To enable this, one final step is required - you need to provide your credentials so PYDQT can connect to the remote servers.

When you first import pydqt, it will create a .env file in the project root and you have to fill in the blanks.

To do this use the env_edit utility:

from pydqt import env_edit

env_edit()

A text editor will then open, allowing you to fill in the blanks:

SNOWFLAKE_LOGIN = ''
SNOWFLAKE_ROLE = ''
SNOWFLAKE_DEFAULT_DATABASE = ''
SNOWFLAKE_DEFAULT_SCHEMA = ''
WORKSPACE_ROOT = ''
WORKSPACE_NAME = ''

When done, save and close the file then reload to ensure your changes are loaded into the appropriate environment variables:

from pydqt import env_reload

env_reload()

The SNOWFLAKE credentials are only necessary if you want to query snowflake and the .env should not be committed to a repo. Without the .env variables, PYDQT will still work fine with local data.

Testing

PYDQT comes with some tests, which can be run from within vscode or the command line. They use pytest so if running from vscode, ensure that you configure the pytest framework. From the command line (ensure you're in the root of the project) type:

python -m pytest

If you installed PYDQT from pypi then you will have to change directory to the DQT package directory (probably within site-packages) and run the tests from there.

You should, hopefully see lots of green (ie tests passing). If you do not have the snowflake or looker variables defined (see above) then you will see tests related to those failing but all other tests should pass.

PYDQT Main Class; Query

PYDQT has one main class - Query.

Query requires at least one input - a query, can be sql filename or a direct select query. Upon instantiation, Query creates an object with various fields and methods relating to this query. Note that Query does not automatically run a query but it does automatically compile the query. To run the query you use the .run() method (see examples below). In addition to querying data, Query can also write data to SQL. Using the write_sql() method, users can write the .df property to a table of their choosing (and one is created if it doesn't already exist). See example below.

Once a query has been run, you can also test the data using pre-specifed tests declared in a json file in the json/data_tests subfolder of your current workspace. See example below.

All the examples assume you are using PYDQT from within an interactive python session, such as jupyter or hex. There is a notebooks folder within the project where you can also find examples. You can create your own notebooks here and they will be ignored by git (so long as they don't end in "_example.ipynb")

PYDQT Workspaces

PYDQT encourages the use of using workspaces, which are self-contained directories containing everything that the Query class needs to compile a valid SQL statement. Workspaces consist of two directories:

  • templates: this contains SQL templates, includes and macros
  • cache: this contains cached data and their associated compiled SQL statements

Determining and setting your workspace

Your workspace is determined by the values of the WORKSPACE_ROOT and WORKSPACE_NAME environment variables. WORKSPACE_NAME is a sub-directory of WORKSPACE_ROOT. The default values are a folder named "workspaces" in the folder where ydqt is installed (probably site-packages) and "main" for ROOT and NAME respectively.

This default is probably not what you want. In many cases it is preferable to have your workspaces seperate from the site-packages where pydqt was installed, often it is more convenient to have your workspaces somewhere under $HOME. You can use the env_edit and env_reload utilities to change your workspace but PYDQT provides two shortcut functions to facilitate this; set_workdir and workspace:

from pydqt import set_workspace

set_workspace(root='/tmp', name='research')

If the workspace does not already exist, PYDQT will create the necessary folders for a valid workspace. The workspace function then points PYDQT to the correct workspace. The above commands will result in the following directory in /tmp:

── research
    ├── cache
    │   └── snowflake
    ├── json
    │   ├── data_tests
    │   └── tables
    └── templates
        ├── compiled
        ├── includes
        └── macros
          └── mymacros.jinja

As you can see there are no templates in this workspace as it's just been created. Your own custom templates go in the templates folder. For example, here's the structure for the default "main" workspace that ships with PYDQT:

.
└── main
    ├── cache
    │   └── snowflake
    ├── json
    │   ├── data_tests
    │   └── tables
    └── templates
        ├── compiled
        ├── includes
        └── macros
            └── mymacros.jinja

This is how your workspace should look (though you will probably have many more template .sql files in there).

Query Examples

Example 1: simple parameterized sql query string

# import the Query class
from pydqt import Query, test_data_file_full_path

query = Query(query="select * from {{table}} limit 10;",table=test_data_file_full_path())
query
Class: Query (Data query tool)

sql: sql from string (len: 101)       
template: None
cache: False
df: None
is_cached: False
params: 
  table: <test_data_location>

To see the compiled sql, look at the sql property of the Query object,q:

q.sql

SELECT *
FROM read_csv_auto(<test_data_location>, header=TRUE)
LIMIT 10;

To run the query use .run() or .load()

q.run()  # always runs the query on snowflake
q.load() # will load from cache if it can, otherwise from snowflake 

More specifically:

  • load() will return data from a locally cached .csv file, if present, if not then it will call run()
  • run() will run the query on snowflake and then run() will cache the result in a local csv file.

Both run() and load() also populate the .csv property of the Query object:

q.csv

<location of your workspace>/cache/snowflake/<template_name__args>/data.csv'

and they also populate the .df field of the Query object, which is pandas dataframe of the query result

Example 2: parameterized sql query template

There are example templates in workspaces/main/templates. You can create your own templates in your desired workspace workspaces/main/templates. Feel free to copy the examples into here and hack away. PYDQT searches for templates and any includes in your workspace. Let's use the test.sql template. Here's a preview:

{% extends "base.sql" %}
{% block main %}
{% set this = ({
        "min_query_date": "2022-09-30",
        "max_query_date": "2023-09-30",
    })
%}

with stage as (
select dates, orders, gmv, region, source,
{{ macros.ma('gmv',4,order='dates',partition='region,source') }} as gmv_ma4
from '{{table}}'
where dates>'{{min_query_date |default(this.min_query_date) }}'
...
...

A few things here, relating to jinja. Firstly this template extends a base templates, base.sql which means it inherits some values from there.

There are also some default parameters set by the {% set ... %} clause at the top of template, which are then used by the default filter. This ensures that the sql will compile even if not all parameters are set by the Query object, Although not used by the test.sql, you can also include SQL snippets via {% include ... %}. For more on jinja templating see the jinja docs.

To run the template:

q=Query('test.sql', min_query_date='2023-01-01', table=test_data_file_full_path())
q.run()

will run the query on the test.cs data from '2023-01-01' to '2024-12-31'.

Because PYDQT uses jinja (the same templating technology behind DBT) then the sky is the limit in terms of complexity - for example, you can use for loops, conditionals and macros to concoct sql queries. However, you will always be able to see the compiled SQL, by looking at the .sql property of the Query object:

q.sql

There is also a convenience method, .open(), to open the query in an editor (useful for long queries):

q.sql.open()

Example 3: Macros

Macros are jinja constructs that allow you to write custom functions inside SQL. PYDQT macros has some global macros: sql/templates/macros/macros.jinja and you can create your own local macros in mymacros.jinja (in the macros folder of your workspace, see above). To use macros in your template, you first have to import them at the top of your jinja template, like so:

{% import 'macros.jinja' as macros %}
{% import 'mymacros.jinja' as mymacros %}

base.sql imports these macros so if you extend a template from base.sql then you automatically import them. If importing by hand, PYDQT finds the macro files automatically so you do not need full path names. To actually use them you reference them like so:

select 
    *,
    {{ macros.ma('gmv',4,order='dates',partition='region,source') }} as ma_4
from {{table}}

See the test.sql template for an example of using macros within a template.

Example 4: writing results to a SQL table

Get some data:

query = Query(query="select * from '{{table}}' limit 10;",table=test_data_file_full_path())
query.run()

Now we call the write_sql() method:

query.write_sql("my_table", schema="SCHEMA_NAME", append=False, timestamp=False)

See write_sql help for more details

Example 5: testing data

DQT can test data for simple tests such as non null as well as testing combinations of columns

Get some data:

query = Query(query="select * from '{{table}}';",table=test_data_file_full_path())
query.run()

the test data looks like this:

      dates	  orders	gmv	  region	source
0	2022-01-31	94	584.37	US	css
1	2022-02-28	63	5212.05	US	css
2	2022-03-31	70	2304.68	US	css
3	2022-04-30	60	2604.20	US	css
4	2022-05-31	13	737.94	US	css
...	...	...	...	...	...

Now let's add a contrived example that gives you a flavour for how to combine columns in tests:

query.df['gmv_per_order'] = query.df['gmv']/query.df['orders']

We can use query.test() to test data. First we need a json file which contains our tests. Tests go in the json/data_tests sub-folder of your current workspace:

.
└── main
    ├── cache
    │   └── snowflake
    ├── json
    │   ├── data_tests <-- JSON TEST FILES GO IN HERE
    │   └── tables
    └── templates
        ├── compiled
        ├── includes
        └── macros
            └── mymacros.jinja

For our example, we are going to use the json below, which should result in two passes and one fail (the last one). Also, note in the last test how css is quoted using the backtick quote. All strings should be quoted this way, as the single-quote is reserved for dataframe column names.

{
    "tests":[
        {
            "name": "gmv_per_order_check",
            "assert": "'gmv_per_order'=='gmv'/'orders'"
        },
        {
            "name": "orders_are_non_negative",
            "assert": "'orders'>=0"
        },
        {
          "name": "all_orders_come_from_css",
          "assert": "'source'==`css`"
        }
    ]
}

save this as in the above sub-folder as 'example.json' and run:

query.test('example.json')

The above tests will run and the resulting test report can be found in query.tests:

query.tests

{'example': {'gmv_per_order_check': 'All Passed!',
  'orders_are_non_negative': 'All Passed!',
  'all_orders_come_from_css': {'fails': 378,
   'percentage_fails': 75.0,
   'failed_records':          dates  orders      gmv region  source  gmv_per_order
   21  2022-01-31      98  7306.74     US  direct      74.558571
   22  2022-02-28      82  8150.29     US  direct      99.393780
   23  2022-03-31      37  3478.29     US  direct      94.007838
   24  2022-04-30      50  4803.71     US  direct      96.074200
   25  2022-05-31      23  1773.99     US  direct      77.130000
   ..         ...     ...      ...    ...     ...            ...
   499 2023-05-31      91  2774.43     FR     app      30.488242
   500 2023-06-30      82  8066.01     FR     app      98.365976
   501 2023-07-31      80  6615.54     FR     app      82.694250
   502 2023-08-31      82  1030.11     FR     app      12.562317
   503 2023-09-30      57  3772.19     FR     app      66.178772
   
   [378 rows x 6 columns]}}}

As we can see the last test had many records which failed as they're not from css.

Quality of cached data

Finally, PYDQT has an inbuilt check before loading cached data, that the current compiled SQL matches the SQL that actually produced the cached data. The only data that are cached are produced by templates that reference remote data. Any results produced by queries involving local data will not be cached (users can always save the .df property of the Query object using .to_csv() or some other pandas dataframe "to_" method).

Acknowledgements and contributions

PYDQT was inspired in part by dbt and it's paramterized approach to SQL. PYDQT aims to help bring more rigour to analysis by making calls to the database more DRY and organised, with a clear audit trail of how data was selected.

If anyone wants to contribute then please feel free to fork and PR!

Project details


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

pydqt-1.2.8.tar.gz (26.4 kB view details)

Uploaded Source

Built Distribution

If you're not sure about the file name format, learn more about wheel file names.

pydqt-1.2.8-py3-none-any.whl (22.4 kB view details)

Uploaded Python 3

File details

Details for the file pydqt-1.2.8.tar.gz.

File metadata

  • Download URL: pydqt-1.2.8.tar.gz
  • Upload date:
  • Size: 26.4 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: poetry/1.5.1 CPython/3.11.4 Darwin/21.6.0

File hashes

Hashes for pydqt-1.2.8.tar.gz
Algorithm Hash digest
SHA256 f273bc849ddc80dd17b1415012e9df28388792d6f969a62612d9dc308fd900e1
MD5 3f3febbc76f5bbc1706f03a268f2dfa2
BLAKE2b-256 a51680accd8800909085d021054b4a8dbd9dc00b58965a9447f92f959008a1da

See more details on using hashes here.

File details

Details for the file pydqt-1.2.8-py3-none-any.whl.

File metadata

  • Download URL: pydqt-1.2.8-py3-none-any.whl
  • Upload date:
  • Size: 22.4 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: poetry/1.5.1 CPython/3.11.4 Darwin/21.6.0

File hashes

Hashes for pydqt-1.2.8-py3-none-any.whl
Algorithm Hash digest
SHA256 4d2f36611cfd153fa71759dfff908d5a6ab3d5bb126be45e62f69620eb1b6cf0
MD5 c464e378150a6a13f6d409293153c06a
BLAKE2b-256 673fb57788e85f42fdf56df0822fd480abd65312766fb2c9fe9b8e449fb6d301

See more details on using hashes here.

Supported by

AWS Cloud computing and Security Sponsor Datadog Monitoring Depot Continuous Integration Fastly CDN Google Download Analytics Pingdom Monitoring Sentry Error logging StatusPage Status page