A python package to query data via amazon athena and bring it into a pandas df
Project description
pydbtools
This is a simple package that let's you query databases using Amazon Athena and get the s3 path to the athena out (as a csv). This is significantly faster than using the the database drivers so might be a good option when pulling in large data. By default, data is converted into a pandas dataframe with equivalent column data types as the Athena table - see "Meta Data" section below.
Note to use this package you need to be added to the StandardDatabaseAccess IAM Policy on the Analytical Platform. Please contact the team if you require access.
To install...
pip install pydbtools
Or from github...
pip install git+git://github.com/moj-analytical-services/pydbtools.git#egg=pydbtools
package requirements are:
pandas
(preinstalled)boto3
(preinstalled)numpy
(preinstalled)s3fs
gluejobutils
Usage
Most simple way to use pydbtools. This will return a pandas df reprentation of the data (with matching meta data).
import pydbtools as pydb
# Run SQL query and return as a pandas df
df = pydb.read_sql("SELECT * from database.table limit 10000")
df.head()
You might want to cast the data yourself or read all the columns as strings.
import pydbtools as pydb
# Run SQL query and return as a pandas df
df = pydb.read_sql("SELECT * from database.table limit 10000", cols_as_str=True)
df.head()
df.dtypes # all objects
You can also pass additional arguments to the pandas.read_csv that reads the resulting Athena SQL query.
Note you cannot pass dtype as this is specified within the read_sql
function.
import pydbtools as pydb
# pass nrows parameter to pandas.read_csv function
pydb.read_sql("SELECT * from database.table limit 10000", nrows=20)
If you didn't want to read the data into pandas you can run the SQL query and get the s3 path and meta data
of the output using the get_athena_query_response. The data is then read in using boto3
, io
and csv
.
import pydbtools as pydb
import io
import csv
import boto3
response = pydb.get_athena_query_response("SELECT * from database.table limit 10000")
# print out path to athena query output (as a csv)
print(response['s3_path'])
# print out meta data
print(response['meta'])
# Read the csv into a string in memory
s3_resource = boto3.resource('s3')
bucket, key = response['s3_path'].replace("s3://", "").split('/', 1)
obj = s3_resource.Object(bucket, key)
text = obj.get()['Body'].read().decode('utf-8')
# Use csv reader to print the outputting csv
reader = csv.reader(text.split('\n'), delimiter=',')
for row in reader:
print('\t'.join(row))
Meta data
The output from get_athena_query_response(...) is a dictionary one of it's keys is meta
. The meta key is a list where each element in this list is the name (name
) and data type (type
) for each column in your athena query output. For example for this table output:
col1 | col2 |
---|---|
1 | 2018-01-01 |
2 | 2018-01-02 |
... |
Would have a meta like:
for m in response['meta']:
print(m['name'], m['type'])
output:
> col1 int
> col1 date
The meta types follow those listed as the generic meta data types used in etl_manager. If you want the actual athena meta data instead you can get them instead of the generic meta data types by setting the return_athena_types
input parameter to True
e.g.
response = pydb.get_athena_query_response("SELECT * from database.table limit 10000", return_athena_types=True)
print(response['meta'])
If you wish to read your SQL query directly into a pandas dataframe you can use the read_sql function. You can apply *args
or **kwargs
into this function which are passed down to pd.read_csv()
.
import pydbtools as pydb
df = pydb.read_sql("SELECT * FROM database.table limit 1000")
df.head()
Meta data conversion
Below is a table that explains what the conversion is from our data types to a pandas df (using the read_sql
function):
data type | pandas column type | Comment |
---|---|---|
character | object | see here |
int | np.float64 | Pandas integers do not allow nulls so using floats |
long | np.float64 | Pandas integers do not allow nulls so using floats |
date | pandas timestamp | |
datetime | pandas timestamp | |
boolean | np.bool | |
float | np.float64 | |
double | np.float64 | |
decimal | np.float64 |
Unit tests
Unit tests run in unittest through Poetry. Run poetry run python -m unittest
to activate them. If you've changed any dependencies, run poetry update
first.
The tests run against a test Glue database callled dbtools
. They use data stored on s3 in alpha-dbtools-test-bucket
.
Notes:
- Amazon Athena using a flavour of SQL called presto docs can be found here
- To query a date column in Athena you need to specify that your value is a date e.g.
SELECT * FROM db.table WHERE date_col > date '2018-12-31'
- To query a datetime or timestamp column in Athena you need to specify that your value is a timestamp e.g.
SELECT * FROM db.table WHERE datetime_col > timestamp '2018-12-31 23:59:59'
- Note dates and datetimes formatting used above. See more specifics around date and datetimes here
- To specify a string in the sql query always use '' not "". Using ""'s means that you are referencing a database, table or col, etc.
- When data is pulled back into rStudio the column types are either R characters (for any col that was a dates, datetimes, characters) or doubles (for everything else).
See changelog for release changes
Project details
Release history Release notifications | RSS feed
Download files
Download the file for your platform. If you're not sure which to choose, learn more about installing packages.
Source Distribution
Built Distribution
File details
Details for the file pydbtools-2.0.2.tar.gz
.
File metadata
- Download URL: pydbtools-2.0.2.tar.gz
- Upload date:
- Size: 7.7 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: poetry/1.0.10 CPython/3.6.12 Darwin/19.6.0
File hashes
Algorithm | Hash digest | |
---|---|---|
SHA256 | e64a284e50849e316fcc60306c24f01c7e6dcfa0082efcfbbd5905b8878c4178 |
|
MD5 | 9077d1ff5c91de77ec58cbbdf9e9e101 |
|
BLAKE2b-256 | fa504bc185768e336d773e306456583a17ff20142eb0a9a6bf7a07a584666d73 |
File details
Details for the file pydbtools-2.0.2-py3-none-any.whl
.
File metadata
- Download URL: pydbtools-2.0.2-py3-none-any.whl
- Upload date:
- Size: 7.9 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: poetry/1.0.10 CPython/3.6.12 Darwin/19.6.0
File hashes
Algorithm | Hash digest | |
---|---|---|
SHA256 | 59997e9bcfe18ea3153db78c5c5662af7c15010e43eb3fd8cfe1908890d931f4 |
|
MD5 | 838330073c80126a7413911ecd14248a |
|
BLAKE2b-256 | c4ca2ce52f6766779c4c1dbdfe0b7fd12e078fc653328efdf45823d30afada43 |