SQL query abstraction library
Project description
Getting summary statistics out of database tables is often an unstreamlined process. Does one read in millions of rows before doing any work on Python, or run SQL elsewhere and use intermediate CSVs, or write sql strings in python scripts?
The Python package dwopt (Datawarehouse Operator) uses classes with a collection of sql templates to dynamically build and run queries, under a flexible summary query building framework.
Specifically, it features Excel-pivot-table-like API, a collection of dataframe-summary-methods-like API, and a collection of DDL/DML statements, metadata query wrappers.
Supports:
Python 3.10, 3.11.
Windows 10: Sqlite, Postgres, Oracle.
Linux: Sqlite, Postgres.
See the Features and the Walk Through section for examples.
Installation
pip install dwopt
Install the database drivers for the database engines you want to use.
pip install psycopg2 # postgres
pip install psycopg2-binary # in case can't build psycopg2
pip install oracledb # oracle
Features
Walk Through
Run sql frictionlessly using saved credentials
On import, the package gives ready-to-be-used database operator objects with default credentials, which could be saved prior by user to the system keyring using the dwopt.save_url function.
from dwopt import lt
lt.iris()
lt.run('select count(1) from iris')
count
0 150
This enable quick analysis from any Python/Console window:
from dwopt import pg
pg.iris()
pg.qry('iris').valc('species', 'avg(petal_length)')
species n avg(petal_length)
0 sicolor 50 4.260
1 setosa 50 1.462
2 rginica 50 5.552
Alternatively, use the database operator object factory function dwopt.db and the database engine url to access database.
from dwopt import db
d = db("postgresql://dwopt_tester:1234@localhost/dwopt_test")
d.mtcars()
d.run('select count(1) n from mtcars')
n
0 32
Allows for thick mode connection to Oracle database:
from dwopt import db
url = """oracle+oracledb://dwopt_test:1234@localhost:1521/?service_name=XEPDB1
&encoding=UTF-8&nencoding=UTF-8"""
lib_dir = "C:/app/{user_name}/product/21c/dbhomeXE/bin"
o = db(url, thick_mode={"lib_dir": lib_dir})
o.run("select * from dual")
dummy
0 X
See Testing section for package version tested.
Run sql script with text replacement
Use the database operator object’s run method to run sql script file. One could then replace : marked parameters via mappings supplied to the method.
Colon syntax is to be deprecated. A future version will use jinja2 syntax across the board.
from dwopt import pg, make_test_tbl
_ = make_test_tbl(pg)
pg.run(pth = "E:/projects/my_sql_script.sql",
my_run_dte = '2022-03-03',
my_label = '20220303',
threshold = 5)
count
0 137
Above runs the sql stored on E:/projects/my_sql_script.sql as below:
drop table if exists monthly_extract_:my_label;
create table monthly_extract_:my_label as
select * from test
where
dte = to_date(':my_run_dte','YYYY-MM-DD')
and score > :threshold;
select count(1) from monthly_extract_:my_label;
Programatically make simple sql query
The database operator object’s qry method returns the query object. Use it’s list of clause methods to make a simple sql query.
This is not faster than just writing the sql, main usage is to provide flexibility to the summary query building framework.
from dwopt import lt
lt.mtcars()
sql = "select cyl from mtcars group by cyl having count(1) > 10"
q = (
lt.qry('mtcars a')
.select('a.cyl, count(1) n, avg(a.mpg)')
.case('cat', "a.cyl = 8 then 1", els=0)
.join(f'({sql}) b', 'a.cyl = b.cyl', how='inner')
.group_by('a.cyl')
.having('count(1) > 10')
.order_by('n desc')
)
q.run()
cyl n avg(a.mpg) cat
0 8 14 15.100000 1
1 4 11 26.663636 0
q.print()
select a.cyl, count(1) n, avg(a.mpg)
,case when a.cyl = 8 then 1 else 0 end as cat
from mtcars a
inner join (select cyl from mtcars group by cyl having count(1) > 10) b
on a.cyl = b.cyl
group by a.cyl
having count(1) > 10
order by n desc
Templates: Excel-pivot-table-like API
Use the query object and it’s valc method to make and run a value counts summary query with custom groups and calcs, on top of arbituary sub-query, as part of the summary query building framework.
Then call the result dataframe’s pivot method to finalize the pivot table.
from dwopt import lt, make_test_tbl
_ = make_test_tbl(lt)
(
lt.qry('test')
.where('score>0.5', 'dte is not null', 'cat is not null')
.valc('dte,cat', 'avg(score) avgscore, round(sum(amt)/1e3,2) total')
.pivot('dte', 'cat')
)
Result:
cat |
n |
avgscore |
total |
|||
---|---|---|---|---|---|---|
dte |
test |
train |
test |
train |
test |
train |
2022-01-01 |
1140 |
1051 |
2.736275 |
2.800106 |
565.67 |
530.09 |
2022-02-02 |
1077 |
1100 |
2.759061 |
2.748898 |
536.68 |
544.10 |
2022-03-03 |
1037 |
1072 |
2.728527 |
2.743825 |
521.54 |
528.85 |
The final query used can be invoked by the valc method, or logged via standard logging.
with x as (
select * from test
where score>0.5
and dte is not null
and cat is not null
)
select
dte,cat
,count(1) n
,avg(score) avgscore, round(sum(amt)/1e3,2) total
from x
group by dte,cat
order by n desc
Templates: Dataframe-summary-methods-like API
Use the query object and it’s list of summary methods to make and run summary queries on top of arbituary sub-query, as part of the summary query building framework:
from dwopt import pg
pg.iris()
q = pg.qry('iris a').select('a.*').case('cat',
"petal_length > 5 then '5+'",
"petal_length between 2 and 5 then '2-5'",
"petal_length < 2 then '-2'",
)
#Column names:
q.cols()
['sepal_length', 'sepal_width', 'petal_length', 'petal_width', 'species', 'cat']
#Number of distinct combination:
q.dist(['species', 'petal_length'])
count 48
Name: 0, dtype: int64
#Head:
q.head()
sepal_length sepal_width petal_length petal_width species cat
0 5.1 3.5 1.4 0.2 setosa -2
1 4.9 3.0 1.4 0.2 setosa -2
2 4.7 3.2 1.3 0.2 setosa -2
3 4.6 3.1 1.5 0.2 setosa -2
4 5.0 3.6 1.4 0.2 setosa -2
#Length:
q.len()
150
#Min and max value:
q.mimx('petal_length')
max 6.9
min 1.0
Name: 0, dtype: float64
#Top record:
q.top()
sepal_length 5.1
sepal_width 3.5
petal_length 1.4
petal_width 0.2
species setosa
cat -2
Name: 0, dtype: object
#Value count followed by pivot:
q.valc('species, cat').pivot('species','cat','n')
cat -2 2-5 5+
species
rginica NaN 9.0 41.0
setosa 50.0 NaN NaN
sicolor NaN 49.0 1.0
#--All summary methods support output by printing or str:
q.valc('species, cat', out=1)
with x as (
select a.*
,case
when petal_length > 5 then '5+'
when petal_length between 2 and 5 then '2-5'
when petal_length < 2 then '-2'
else NULL
end as cat
from iris a
)
select
species, cat
,count(1) n
from x
group by species, cat
order by n desc
Templates: DDL/DML statements, metadata queries
Use the list of operation methods to make and run some DDL/DML statements with convenient or enhanced functionalities:
import pandas as pd
from dwopt import lt
tbl = pd.DataFrame({'col1': [1, 2], 'col2': ['a', 'b']})
tbl2 = pd.DataFrame({'col1': [1, 3], 'col2': ['a', 'c']})
lt.drop('test')
lt.create('test', col1='int', col2='text')
lt.write(tbl, 'test')
lt.write_nodup(tbl2, 'test', ['col1'], "col1 < 4")
lt.run("select * from test")
col1 col2
0 1 a
1 2 b
2 3 c
lt.drop('test')
lt.cwrite(tbl, 'test')
lt.qry('test').run()
col1 col2
0 1 a
1 2 b
Use the list of metadata methods to make and run some useful metadata queries:
from dwopt import pg
pg.iris()
pg.table_cols('public.iris')
column_name data_type
0 sepal_length real
1 sepal_width real
2 petal_length real
3 petal_width real
4 species character varying
from dwopt import lt
lt.iris()
lt.mtcars()
lt.list_tables().iloc[:,:-1]
type name tbl_name rootpage
0 table iris iris 2
1 table mtcars mtcars 5
Standard logging with reproducible sql
Many of the package’s methods are wired through the standard logging package. In particular, the run method emits sql used as INFO level message. The relevant logger object has standard naming and is called dwopt.dbo.
Example configuration to show logs in console:
import logging
logging.basicConfig(level = logging.INFO)
from dwopt import lt
lt.iris(q=1).valc('species', 'avg(petal_length)')
INFO:dwopt.dbo:dropping table via sqlalchemy: iris
INFO:dwopt.dbo:done
INFO:dwopt.dbo:creating table via sqlalchemy:
INFO:dwopt.dbo:('sepal_length', Column('sepal_length', REAL(), table=<iris>))
INFO:dwopt.dbo:('sepal_width', Column('sepal_width', REAL(), table=<iris>))
INFO:dwopt.dbo:('petal_length', Column('petal_length', REAL(), table=<iris>))
INFO:dwopt.dbo:('petal_width', Column('petal_width', REAL(), table=<iris>))
INFO:dwopt.dbo:('species', Column('species', String(), table=<iris>))
INFO:dwopt.dbo:done
INFO:dwopt.dbo:running:
INSERT INTO iris (sepal_length, sepal_width, petal_length, petal_width, species) VALUES (:sepal_length, :sepal_width, :petal_length, :petal_width, :species)
INFO:dwopt.dbo:args len=150, e.g.
{'sepal_length': 5.1, 'sepal_width': 3.5, 'petal_length': 1.4, 'petal_width': 0.2, 'species': 'setosa'}
INFO:dwopt.dbo:done
INFO:dwopt.dbo:running:
with x as (
select * from iris
)
select
species
,count(1) n
,avg(petal_length)
from x
group by species
order by n desc
INFO:dwopt.dbo:done
species n avg(petal_length)
0 sicolor 50 4.260
1 setosa 50 1.462
2 rginica 50 5.552
Alternatively, to avoid logging info messages from other packages:
import logging
logging.basicConfig()
logging.getLogger('dwopt.dbo').setLevel(logging.INFO)
Example configuration to print on console and store on file with timestamps:
import logging
logging.basicConfig(
format = "%(asctime)s [%(levelname)s] %(message)s"
,handlers=[
logging.FileHandler("E:/projects/logs.log"),
logging.StreamHandler()
]
)
logging.getLogger('dwopt.dbo').setLevel(logging.INFO)
Debug logging:
import logging
logging.basicConfig()
logging.getLogger('dwopt').setLevel(logging.DEBUG)
Sqlalchemy logger can also be used to obtain even more details:
import logging
logging.basicConfig()
logging.getLogger('sqlalchemy.engine').setLevel(logging.INFO)
Development
Installation
Testing, documentation building package:
#venv on linux
sudo apt-get install python3-venv
python3.11 -m venv dwopt_dev
source dwopt_dev/bin/activate
deactivate
#testing
python -m pip install pytest black flake8 tox
#doco and packaging
python -m pip install sphinx sphinx_rtd_theme build twine wheel
#depend
python -m pip install -U sqlalchemy pandas keyring
python -m pip install -U keyrings.alt
python -m pip install -U psycopg2
python -m pip install -U oracledb
#consider
python -m pip install -U psycopg2-binary
python -m pip install -U cx_Oracle
#package
python -m pip install -e .
Testing
Test:
python -m tox
Testing for specific databases. Set up environment based on dwopt.make_test_tbl function notes.
python -m pytest
python -m pytest --db=pg
python -m pytest --db=oc
Test code styles:
flake8 src/dwopt
Databases used for testings are:
Postgres 15 Oracle express 21c
Package versions tested are:
Name: keyring Version: 24.2.0 --- Name: keyrings.alt Version: 5.0.0 --- Name: oracledb Version: 1.3.2 --- Name: pandas Version: 2.0.3 --- Name: psycopg2-binary Version: 2.9.6 --- Name: SQLAlchemy Version: 2.0.19
Documentation
Build document:
cd docs
make html
Test examples across docs:
cd docs
make doctest
Documentation
Project details
Download files
Download the file for your platform. If you're not sure which to choose, learn more about installing packages.
Source Distribution
Built Distribution
File details
Details for the file dwopt-0.0.9.tar.gz
.
File metadata
- Download URL: dwopt-0.0.9.tar.gz
- Upload date:
- Size: 40.5 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/4.0.2 CPython/3.11.0rc1
File hashes
Algorithm | Hash digest | |
---|---|---|
SHA256 | fb20f8dcd2f5861a2378023d955fe2cf160bf8f5054216468067b7f107b19002 |
|
MD5 | 178842feb6d01cf4a9042ef5d07bdbd3 |
|
BLAKE2b-256 | 09058d8d6d66e7d667c2099b25ea93c25f9183dbd4a4887222d4da052c62cecb |
File details
Details for the file dwopt-0.0.9-py3-none-any.whl
.
File metadata
- Download URL: dwopt-0.0.9-py3-none-any.whl
- Upload date:
- Size: 34.2 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/4.0.2 CPython/3.11.0rc1
File hashes
Algorithm | Hash digest | |
---|---|---|
SHA256 | 5ddfa15de98612d36a08098dc8fa5df6b09cdf67d9ed6e5c591ae18a7955fe42 |
|
MD5 | 380990c904d4c9a925672f74ad906fbd |
|
BLAKE2b-256 | 9701c0abfa1ea093c871f99eababab8f732c315cd13e3d2d5025098efd6e8cec |