A package to help parameterise and macrofy sql queries

These details have not been verified by PyPI

Project links

Project description

PYDQT - Python Data Query Tool

A project which aides in querying of both local and cloud based relational, tabular data. Some things it can do:

run paramterised SQL commands; both strings and templates
create and reuse jinja based templates to simplify long, complex queries
easily create and store jinja macros
run queries remotely on snowflake or locally via duckdb
cache results locally for further processing and analysis
seamlessly return results as a pandas dataframe
test data on single columns or combinations of columns

Installation

You can install from PyPI:

pip install pydqt

poetry add pydqt

Or you can fork the github repo

PYDQT works with local data straight out of the box. PYDQT also has support to work with remote Snowflake servers. To enable this, one final step is required - you need to provide your credentials so PYDQT can connect to the remote servers.

When you first import pydqt, it will create a .env file in the project root and you have to fill in the blanks.

To do this use the env_edit utility:

from pydqt import env_edit

env_edit()

A text editor will then open, allowing you to fill in the blanks:

SNOWFLAKE_LOGIN = ''
SNOWFLAKE_ROLE = ''
SNOWFLAKE_DEFAULT_DATABASE = ''
SNOWFLAKE_DEFAULT_SCHEMA = ''
WORKSPACE_ROOT = ''
WORKSPACE_NAME = ''

When done, save and close the file then reload to ensure your changes are loaded into the appropriate environment variables:

from pydqt import env_reload

env_reload()

The SNOWFLAKE credentials are only necessary if you want to query snowflake and the .env should not be committed to a repo. Without the .env variables, PYDQT will still work fine with local data.

Testing

PYDQT comes with some tests, which can be run from within vscode or the command line. They use pytest so if running from vscode, ensure that you configure the pytest framework. From the command line (ensure you're in the root of the project) type:

python -m pytest

If you installed PYDQT from pypi then you will have to change directory to the DQT package directory (probably within site-packages) and run the tests from there.

You should, hopefully see lots of green (ie tests passing). If you do not have the snowflake or looker variables defined (see above) then you will see tests related to those failing but all other tests should pass.

PYDQT Main Class; Query

PYDQT has one main class - Query.

Query requires at least one input - a query, can be sql filename or a direct select query. Upon instantiation, Query creates an object with various fields and methods relating to this query. Note that Query does not automatically run a query but it does automatically compile the query. To run the query you use the .run() method (see examples below). In addition to querying data, Query can also write data to SQL. Using the write_sql() method, users can write the .df property to a table of their choosing (and one is created if it doesn't already exist). See example below.

Once a query has been run, you can also test the data using pre-specifed tests declared in a json file in the json/data_tests subfolder of your current workspace. See example below.

All the examples assume you are using PYDQT from within an interactive python session, such as jupyter or hex. There is a notebooks folder within the project where you can also find examples. You can create your own notebooks here and they will be ignored by git (so long as they don't end in "_example.ipynb")

PYDQT Workspaces

PYDQT encourages the use of using workspaces, which are self-contained directories containing everything that the Query class needs to compile a valid SQL statement. Workspaces consist of two directories:

templates: this contains SQL templates, includes and macros
cache: this contains cached data and their associated compiled SQL statements

Determining and setting your workspace

Your workspace is determined by the values of the WORKSPACE_ROOT and WORKSPACE_NAME environment variables. WORKSPACE_NAME is a sub-directory of WORKSPACE_ROOT. The default values are a folder named "workspaces" in the folder where ydqt is installed (probably site-packages) and "main" for ROOT and NAME respectively.

This default is probably not what you want. In many cases it is preferable to have your workspaces seperate from the site-packages where pydqt was installed, often it is more convenient to have your workspaces somewhere under $HOME. You can use the env_edit and env_reload utilities to change your workspace but PYDQT provides two shortcut functions to facilitate this; set_workdir and workspace:

from pydqt import set_workspace

set_workspace(root='/tmp', name='research')

If the workspace does not already exist, PYDQT will create the necessary folders for a valid workspace. The workspace function then points PYDQT to the correct workspace. The above commands will result in the following directory in /tmp:

── research
    ├── cache
    │   └── snowflake
    ├── json
    │   ├── data_tests
    │   └── tables
    └── templates
        ├── compiled
        ├── includes
        └── macros
          └── mymacros.jinja

As you can see there are no templates in this workspace as it's just been created. Your own custom templates go in the templates folder. For example, here's the structure for the default "main" workspace that ships with PYDQT:

.
└── main
    ├── cache
    │   └── snowflake
    ├── json
    │   ├── data_tests
    │   └── tables
    └── templates
        ├── compiled
        ├── includes
        └── macros
            └── mymacros.jinja

This is how your workspace should look (though you will probably have many more template .sql files in there).

Query Examples

Example 1: simple parameterized sql query string

# import the Query class
from pydqt import Query, test_data_file_full_path

query = Query(query="select * from {{table}} limit 10;",table=test_data_file_full_path())
query

Class: Query (Data query tool)

sql: sql from string (len: 101)       
template: None
cache: False
df: None
is_cached: False
params: 
  table: <test_data_location>

To see the compiled sql, look at the sql property of the Query object,q:

q.sql

SELECT *
FROM read_csv_auto(<test_data_location>, header=TRUE)
LIMIT 10;

To run the query use .run() or .load()

q.run()  # always runs the query on snowflake
q.load() # will load from cache if it can, otherwise from snowflake

More specifically:

load() will return data from a locally cached .csv file, if present, if not then it will call run()
run() will run the query on snowflake and then run() will cache the result in a local csv file.

Both run() and load() also populate the .csv property of the Query object:

q.csv

<location of your workspace>/cache/snowflake/<template_name__args>/data.csv'

and they also populate the .df field of the Query object, which is pandas dataframe of the query result

Example 2: parameterized sql query template

There are example templates in workspaces/main/templates. You can create your own templates in your desired workspace workspaces/main/templates. Feel free to copy the examples into here and hack away. PYDQT searches for templates and any includes in your workspace. Let's use the test.sql template. Here's a preview:

{% extends "base.sql" %}
{% block main %}
{% set this = ({
        "min_query_date": "2022-09-30",
        "max_query_date": "2023-09-30",
    })
%}

with stage as (
select dates, orders, gmv, region, source,
{{ macros.ma('gmv',4,order='dates',partition='region,source') }} as gmv_ma4
from '{{table}}'
where dates>'{{min_query_date |default(this.min_query_date) }}'
...
...

A few things here, relating to jinja. Firstly this template extends a base templates, base.sql which means it inherits some values from there.

There are also some default parameters set by the {% set ... %} clause at the top of template, which are then used by the default filter. This ensures that the sql will compile even if not all parameters are set by the Query object, Although not used by the test.sql, you can also include SQL snippets via {% include ... %}. For more on jinja templating see the jinja docs.

To run the template:

q=Query('test.sql', min_query_date='2023-01-01', table=test_data_file_full_path())
q.run()

will run the query on the test.cs data from '2023-01-01' to '2024-12-31'.

Because PYDQT uses jinja (the same templating technology behind DBT) then the sky is the limit in terms of complexity - for example, you can use for loops, conditionals and macros to concoct sql queries. However, you will always be able to see the compiled SQL, by looking at the .sql property of the Query object:

q.sql

There is also a convenience method, .open(), to open the query in an editor (useful for long queries):

q.sql.open()

Example 3: Macros

Macros are jinja constructs that allow you to write custom functions inside SQL. PYDQT macros has some global macros: sql/templates/macros/macros.jinja and you can create your own local macros in mymacros.jinja (in the macros folder of your workspace, see above). To use macros in your template, you first have to import them at the top of your jinja template, like so:

{% import 'macros.jinja' as macros %}
{% import 'mymacros.jinja' as mymacros %}

base.sql imports these macros so if you extend a template from base.sql then you automatically import them. If importing by hand, PYDQT finds the macro files automatically so you do not need full path names. To actually use them you reference them like so:

select 
    *,
    {{ macros.ma('gmv',4,order='dates',partition='region,source') }} as ma_4
from {{table}}

See the test.sql template for an example of using macros within a template.

Example 4: writing results to a SQL table

Get some data:

query = Query(query="select * from '{{table}}' limit 10;",table=test_data_file_full_path())
query.run()

Now we call the write_sql() method:

query.write_sql("my_table", schema="SCHEMA_NAME", append=False, timestamp=False)

See write_sql help for more details

Example 5: testing data

DQT can test data for simple tests such as non null as well as testing combinations of columns

Get some data:

query = Query(query="select * from '{{table}}';",table=test_data_file_full_path())
query.run()

the test data looks like this:

      dates	  orders	gmv	  region	source
0	2022-01-31	94	584.37	US	css
1	2022-02-28	63	5212.05	US	css
2	2022-03-31	70	2304.68	US	css
3	2022-04-30	60	2604.20	US	css
4	2022-05-31	13	737.94	US	css
...	...	...	...	...	...

Now let's add a contrived example that gives you a flavour for how to combine columns in tests:

query.df['gmv_per_order'] = query.df['gmv']/query.df['orders']

We can use query.test() to test data. First we need a json file which contains our tests. Tests go in the json/data_tests sub-folder of your current workspace:

.
└── main
    ├── cache
    │   └── snowflake
    ├── json
    │   ├── data_tests <-- JSON TEST FILES GO IN HERE
    │   └── tables
    └── templates
        ├── compiled
        ├── includes
        └── macros
            └── mymacros.jinja

For our example, we are going to use the json below, which should result in two passes and one fail (the last one). Also, note in the last test how css is quoted using the backtick quote. All strings should be quoted this way, as the single-quote is reserved for dataframe column names.

{
    "tests":[
        {
            "name": "gmv_per_order_check",
            "assert": "'gmv_per_order'=='gmv'/'orders'"
        },
        {
            "name": "orders_are_non_negative",
            "assert": "'orders'>=0"
        },
        {
          "name": "all_orders_come_from_css",
          "assert": "'source'==`css`"
        }
    ]
}

save this as in the above sub-folder as 'example.json' and run:

query.test('example.json')

The above tests will run and the resulting test report can be found in query.tests:

query.tests

{'example': {'gmv_per_order_check': 'All Passed!',
  'orders_are_non_negative': 'All Passed!',
  'all_orders_come_from_css': {'fails': 378,
   'percentage_fails': 75.0,
   'failed_records':          dates  orders      gmv region  source  gmv_per_order
   21  2022-01-31      98  7306.74     US  direct      74.558571
   22  2022-02-28      82  8150.29     US  direct      99.393780
   23  2022-03-31      37  3478.29     US  direct      94.007838
   24  2022-04-30      50  4803.71     US  direct      96.074200
   25  2022-05-31      23  1773.99     US  direct      77.130000
   ..         ...     ...      ...    ...     ...            ...
   499 2023-05-31      91  2774.43     FR     app      30.488242
   500 2023-06-30      82  8066.01     FR     app      98.365976
   501 2023-07-31      80  6615.54     FR     app      82.694250
   502 2023-08-31      82  1030.11     FR     app      12.562317
   503 2023-09-30      57  3772.19     FR     app      66.178772
   
   [378 rows x 6 columns]}}}

As we can see the last test had many records which failed as they're not from css.

Quality of cached data

Finally, PYDQT has an inbuilt check before loading cached data, that the current compiled SQL matches the SQL that actually produced the cached data. The only data that are cached are produced by templates that reference remote data. Any results produced by queries involving local data will not be cached (users can always save the .df property of the Query object using .to_csv() or some other pandas dataframe "to_" method).

Acknowledgements and contributions

PYDQT was inspired in part by dbt and it's paramterized approach to SQL. PYDQT aims to help bring more rigour to analysis by making calls to the database more DRY and organised, with a clear audit trail of how data was selected.

If anyone wants to contribute then please feel free to fork and PR!

Project details

These details have not been verified by PyPI

Project links

Release history Release notifications | RSS feed

1.4.1

Apr 18, 2024

1.4.0

Apr 8, 2024

1.3.9

Apr 5, 2024

1.3.8

Apr 5, 2024

1.3.7

Apr 5, 2024

1.3.6

Apr 4, 2024

1.3.5

Feb 20, 2024

1.3.4

Feb 5, 2024

1.3.3

Feb 2, 2024

1.3.1

Jan 21, 2024

1.3.0

Jan 19, 2024

1.2.12

Jan 18, 2024

1.2.11

Jan 18, 2024

1.2.10

Jan 10, 2024

1.2.9

Jan 10, 2024

This version

1.2.8

Jan 9, 2024

1.2.7

Jan 8, 2024

1.2.6

Jan 8, 2024

1.2.5

Jan 8, 2024

1.2.4

Jan 8, 2024

1.2.3

Jan 8, 2024

1.2.2

Jan 8, 2024

1.2.1

Jan 8, 2024

1.1.11

Jan 8, 2024

1.1.10

Jan 8, 2024

1.1.9

Jan 8, 2024

1.1.8

Jan 8, 2024

1.1.7

Jan 8, 2024

1.1.6

Jan 8, 2024

1.1.5

Jan 8, 2024

1.1.4

Jan 5, 2024

1.1.3

Jan 5, 2024

1.1.2

Jan 5, 2024

1.1.1

Jan 5, 2024

1.1.0

Jan 5, 2024

1.0.9

Jan 4, 2024

1.0.8

Dec 14, 2023

1.0.7

Dec 14, 2023

1.0.6

Dec 14, 2023

1.0.5

Dec 12, 2023

1.0.4

Dec 5, 2023

1.0.3

Dec 5, 2023

1.0.2

Dec 5, 2023

1.0.1

Nov 1, 2023

1.0.0

Oct 30, 2023

Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

pydqt-1.2.8.tar.gz (26.4 kB view details)

Uploaded Jan 9, 2024 Source

Built Distribution

If you're not sure about the file name format, learn more about wheel file names.

The dropdown lists show the available interpreters, ABIs, and platforms. Enable javascript to be able to filter the list of wheel files.

pydqt-1.2.8-py3-none-any.whl (22.4 kB view details)

Uploaded Jan 9, 2024 Python 3

File details

Details for the file pydqt-1.2.8.tar.gz.

File metadata

Download URL: pydqt-1.2.8.tar.gz
Upload date: Jan 9, 2024
Size: 26.4 kB
Tags: Source
Uploaded using Trusted Publishing? No
Uploaded via: poetry/1.5.1 CPython/3.11.4 Darwin/21.6.0

File hashes

Hashes for pydqt-1.2.8.tar.gz
Algorithm	Hash digest
SHA256	`f273bc849ddc80dd17b1415012e9df28388792d6f969a62612d9dc308fd900e1`
MD5	`3f3febbc76f5bbc1706f03a268f2dfa2`
BLAKE2b-256	`a51680accd8800909085d021054b4a8dbd9dc00b58965a9447f92f959008a1da`

See more details on using hashes here.

File details

Details for the file pydqt-1.2.8-py3-none-any.whl.

File metadata

Download URL: pydqt-1.2.8-py3-none-any.whl
Upload date: Jan 9, 2024
Size: 22.4 kB
Tags: Python 3
Uploaded using Trusted Publishing? No
Uploaded via: poetry/1.5.1 CPython/3.11.4 Darwin/21.6.0

File hashes

Hashes for pydqt-1.2.8-py3-none-any.whl
Algorithm	Hash digest
SHA256	`4d2f36611cfd153fa71759dfff908d5a6ab3d5bb126be45e62f69620eb1b6cf0`
MD5	`c464e378150a6a13f6d409293153c06a`
BLAKE2b-256	`673fb57788e85f42fdf56df0822fd480abd65312766fb2c9fe9b8e449fb6d301`

See more details on using hashes here.

pydqt 1.2.8

Navigation

Verified details

Maintainers

Unverified details

Project links

Meta

Classifiers

Project description

PYDQT - Python Data Query Tool

Installation

Testing

PYDQT Main Class; Query

PYDQT Workspaces

Determining and setting your workspace

Query Examples

Example 1: simple parameterized sql query string

Example 2: parameterized sql query template

Example 3: Macros

Example 4: writing results to a SQL table

Example 5: testing data

Quality of cached data

Acknowledgements and contributions

Project details

Verified details

Maintainers

Unverified details

Project links

Meta

Classifiers

Release history Release notifications | RSS feed

Download files

Source Distribution

Built Distribution

File details

File metadata

File hashes

File details

File metadata

File hashes