A selection of tools for easier processing of data using Pandas and AWS
Dativa Tools
Provides useful libraries for processing large data sets. Developed by the team at www.dativa.com as we find them useful in our projects.
The key libraries included here are:
- dativa.tools.aws.S3Csv2Parquet - an AWS Glue based tool to transform CSV files to Parquet files
- dativa.tools.aws.AthenaClient - provides a simple wrapper to execute Athena queries and create tables. When combined with the S3Csv2Parquet handler, it can automatically convert Athena outputs to Parquet format
- dativa.tools.aws.PipelineClient - a client for the Pipeline API. Given an API key, a source S3 location, a destination S3 location, and rules, it cleans the source file and posts it to the destination.
- dativa.tools.aws.S3Client - a wrapper for AWS's boto library for S3, enabling easier iteration over S3 files, batch deletion, and uploading of multiple files
- dativa.tools.SQLClient - a wrapper for any PEP249 compliant database client with logging and splitting of queries
- dativa.tools.pandas.CSVHandler - improved CSV handling for Pandas
- dativa.tools.pandas.ParquetHandler - improved Parquet handling for pandas
There are also some useful support functions for Pandas date and time handling.
Installation
pip install dativatools
Description
dativa.tools.aws.AthenaClient
An easy-to-use client for AWS Athena that will create tables from S3 buckets (using AWS Glue) and run queries against these tables. It supports full customisation of SerDe and column names on table creation.
Examples:
Creating tables
The library creates a temporary Glue crawler which is deleted after use, and will also create the database if it does not exist.
from dativa.tools.aws import AthenaClient
ac = AthenaClient("us-east-1", "my_athena_db")
ac.create_table(table_name='my_first_table',
crawler_target={'S3Targets': [
{'Path': 's3://my-bucket/table-data'}]}
)
# Create a table with a custom SerDe and column names, typical for CSV files
ac.create_table(table_name='comcast_visio_match',
crawler_target={'S3Targets': [
{'Path': 's3://my-bucket/table-data-2', 'Exclusions': ['**._manifest']}]},
serde='org.apache.hadoop.hive.serde2.OpenCSVSerde',
columns=[{'Name': 'id', 'Type': 'string'}, {
'Name': 'device_id', 'Type': 'string'}, {'Name': 'subscriber_id', 'Type': 'string'}]
)
Running queries
from dativa.tools.aws import AthenaClient
ac = AthenaClient("us-east-1", "my_athena_db")
ac.add_query(sql="select * from table",
name="My first query",
output_location= "s3://my-bucket/query-location/")
ac.wait_for_completion()
Fetching the results of a query
from dativa.tools.aws import AthenaClient
ac = AthenaClient("us-east-1", "my_athena_db")
query = ac.add_query(sql="select * from table",
name="My first query",
output_location= "s3://my-bucket/query-location/")
ac.wait_for_completion()
ac.get_query_result(query)
Running queries with Parquet output and creating an Athena table
from dativa.tools.aws import AthenaClient, S3Csv2Parquet
scp = S3Csv2Parquet(region="us-east-1",
template_location="s3://my-bucket/glue-template-path/")
ac = AthenaClient("us-east-1", "my_athena_db", s3_parquet=scp)
ac.add_query(sql="select * from table",
name="my query that outputs Parquet",
output_location="s3://my-bucket/query-location/",
parquet=True)
ac.wait_for_completion()
ac.create_table({'S3Targets': [{'Path': "s3://my-bucket/query-location/"}]},
table_name="query_location")
dativa.tools.aws.S3Client
An easy-to-use client for AWS S3 that simplifies copying files to S3 and batch-deleting files on S3. Examples:
Batch deleting of files on S3
from dativa.tools.aws import S3Client
# Delete all files in a folder
s3 = S3Client()
s3.delete_files(bucket="bucket_name", prefix="/delete-this-folder/")
# Delete only .csv.metadata files in a folder
s3 = S3Client()
s3.delete_files(bucket="bucket_name", prefix="/delete-this-folder/", suffix=".csv.metadata")
Copy files from folder in local filesystem to s3 bucket
from dativa.tools.aws import S3Client
s3 = S3Client()
s3.put_folder(source="/home/user/my_folder", bucket="bucket_name", destination="backup/files")
# Copy all csv files from folder to s3
s3.put_folder(source="/home/user/my_folder", bucket="bucket_name", destination="backup/files", file_format="*.csv")
dativa.tools.SQLClient
A SQL client that wraps any PEP249-compliant connection object and provides detailed logging and simple query execution. It provides the following methods:
execute_query
Runs a query and ignores any output
Parameters:
- query - the query to run, either a SQL file or a SQL query
- parameters - a dict of parameters to substitute in the query
- replace - a dict of items to be replaced in the SQL text
- first_to_run - the index of the first query in a multi-command query to be executed
execute_query_to_df
Runs a query and returns the output of the final statement in a DataFrame.
Parameters:
- query - the query to run, either a SQL file or a SQL query
- parameters - a dict of parameters to substitute in the query
- replace - a dict of items to be replaced in the SQL text
execute_query_to_csv
Runs a query and writes the output of the final statement to a CSV file.
Parameters:
- query - the query to run, either a SQL file or a SQL query
- csvfile - the file name to save the query results to
- parameters - a dict of parameters to substitute in the query
- replace - a dict of items to be replaced in the SQL text
Example code
import os
import psycopg2
from dativa.tools import SqlClient
# set up the SQL client from environment variables
sql = SqlClient(psycopg2.connect(
database=os.environ["DB_NAME"],
user=os.environ["USER"],
password=os.environ["PASSWORD"],
host=os.environ["HOST"],
port=os.environ["PORT"],
client_encoding="UTF-8",
connect_timeout=10))
# run a parameterised query from a SQL file and return the result as a DataFrame
df = sql.execute_query_to_df(query="sql/my_query.sql",
parameters={"start_date": "2018-01-01",
"end_date": "2018-02-01"})
dativa.tools.log_to_stdout
A convenience function to redirect a specific logger and its children to stdout
import logging
from dativa.tools import log_to_stdout
log_to_stdout("dativa.tools", logging.DEBUG)
dativa.tools.pandas.CSVHandler
A wrapper for pandas CSV handling to read and write DataFrames with consistent CSV parameters by sniffing the parameters automatically. Includes reading a CSV into a DataFrame, and writing it out to a string. Files can be read/written from/to local file system or AWS S3.
For S3 access suitable credentials should be available in '~/.aws/credentials' or the AWS_ACCESS_KEY_ID/AWS_SECRET_ACCESS_KEY environment variables.
CSVHandler
- base_path - the base path for any CSV file read, defaults to ""
- detect_parameters - whether the parameters of the CSV file should be automatically detected, defaults to False
- csv_encoding - the encoding of the CSV files, defaults to UTF-8
- csv_delimiter - the delimiter used in the CSV, defaults to ','
- csv_header - the index of the header row, or -1 if there is no header
- csv_skiprows - the number of rows at the beginning of file to skip
- csv_quotechar - the quoting character to use, defaults to "
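For example, a handler for non-standard files might be constructed as follows (a sketch; the delimiter, header, and skiprows values are illustrative only):
from dativa.tools.pandas import CSVHandler
# handler for tab-delimited files with no header row, skipping two leading comment lines
csv = CSVHandler(base_path="s3://my-bucket-name/",
                 csv_delimiter="\t",
                 csv_header=-1,
                 csv_skiprows=2)
df = csv.load_df("my-file-name.tsv")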
load_df
Opens a CSV file using the specified configuration for the class and raises an exception if the encoding is unparseable. Detects if base_path is an S3 location and loads data from there if required.
Parameters:
- file - File path. Should begin with 's3://' to load from S3 location.
- force_dtype - Force data type for data or columns, defaults to None
Returns:
- dataframe
save_df
Writes a DataFrame to a file as a formatted string, using the specified configuration for the class. Detects if base_path is an S3 location and saves the data there if required.
Parameters:
- df - Dataframe to save
- file - File path. Should begin with 's3://' to save to an S3 location.
df_to_string
Returns a formatted string from a dataframe using the specified configuration for the class.
Parameters:
- df - Dataframe to convert to string
Returns:
- string
Example code
from dativa.tools.pandas import CSVHandler
# Create the CSV handler
csv = CSVHandler(base_path='s3://my-bucket-name/')
# Load a file
df = csv.load_df('my-file-name.csv')
# Create a string
str_df = csv.df_to_string(df)
# Save a file
csv.save_df(df, 'another-path/another-file-name.csv')
Support functions for Pandas
- dativa.tools.pandas.is_numeric - a function to check whether a series or string is numeric
- dativa.tools.pandas.string_to_datetime - a function to convert a string, or series of strings to a datetime, with a strptime date format that supports nanoseconds
- dativa.tools.pandas.datetime_to_string - a function to convert a datetime, or a series of datetimes to a string, with a strptime date format that supports nanoseconds
- dativa.tools.pandas.format_string_is_valid - a function to confirm whether a strptime format string returns a date
- dativa.tools.pandas.get_column_name - a function to return the name of a column from a passed column name or index.
- dativa.tools.pandas.get_unique_column_name - a function to return a unique column name when adding new columns to a DataFrame
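As a sketch of how the date helpers might be used (the exact argument names are assumed here; each is taken to accept a value or Series plus a strptime-style format string):
import pandas as pd
from dativa.tools.pandas import is_numeric, string_to_datetime, datetime_to_string
dates = pd.Series(["2018-01-01 00:00:00.123456789", "2018-01-02 09:30:00.000000001"])
# parse strings to datetimes using a strptime format (assumed to extend %f to nanoseconds)
parsed = string_to_datetime(dates, "%Y-%m-%d %H:%M:%S.%f")
# convert the datetimes back to strings with the same format
strings = datetime_to_string(parsed, "%Y-%m-%d %H:%M:%S.%f")
# check whether a series is numeric
print(is_numeric(pd.Series(["1", "2", "3"])))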
dativa.tools.pandas.ParquetHandler
ParquetHandler class: reads a Parquet file at a specified path into a pandas DataFrame for analysis and modification, and writes DataFrames back out to Parquet.
Parameters:
- base_path - str, the base location where the Parquet files are stored
- row_group_size - int, the size of the row groups used when writing out the Parquet file
- use_dictionary - bool, whether to use dictionary encoding
- use_deprecated_int96_timestamps - bool, write nanosecond resolution timestamps to INT96 Parquet format
- coerce_timestamps - str, cast timestamps to a particular resolution. Valid values: {None, 'ms', 'us'}
- compression - str, the compression codec to use
from dativa.tools.pandas import CSVHandler, ParquetHandler
# Read a parquet file
pq_obj = ParquetHandler()
df_parquet = pq_obj.load_df('data.parquet')
# save a csv_file to parquet
csv = CSVHandler(csv_delimiter=",")
df = csv.load_df('emails.csv')
pq_obj = ParquetHandler()
pq_obj.save_df(df, 'emails.parquet')
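The constructor parameters listed above can be combined when writing out a file; a sketch, assuming they are all passed to the constructor (the values shown are illustrative only):
from dativa.tools.pandas import ParquetHandler
# write with snappy compression, smaller row groups and millisecond timestamps
pq_obj = ParquetHandler(row_group_size=10000,
                        compression="snappy",
                        coerce_timestamps="ms")
pq_obj.save_df(df, 'emails_snappy.parquet')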
dativa.tools.aws.S3Csv2Parquet
An easy-to-use module for converting CSV files on S3 to Parquet using AWS Glue jobs. For S3 and Glue access, suitable credentials should be available in '~/.aws/credentials' or the AWS_ACCESS_KEY_ID/AWS_SECRET_ACCESS_KEY environment variables.
S3Csv2Parquet
Parameters:
- region - str, AWS region in which the Glue job is to be run
- template_location - str, S3 folder in which the template scripts are located or need to be copied, in the format s3://bucketname/folder/
- glue_role - str, name of the Glue role to be assigned to the Glue job
- max_jobs - int, default 5. Maximum number of jobs that can run concurrently in the queue
- retry_limit - int, default 3. Maximum number of retries allowed per job on failure
convert
Parameters:
- csv_path - str, or list of str for multiple files; S3 location of the CSV file in the format s3://bucketname/folder/file.csv. Pass a list to convert multiple files
- output_folder - str, defaults to the folder where the CSV files are located; S3 location to which the Parquet files should be written, in the format s3://bucketname/folder
- schema - list of tuples; if not specified, the schema is inferred from the file. Format: [(column1, datatype), (column2, datatype)]. Supported datatypes are boolean, double, float, integer, long, null, short, string
- name - str, default 'parquet_csv_convert'. Name to be assigned to the Glue job
- allocated_capacity - int, default 2. The number of AWS Glue data processing units (DPUs) to allocate to this job; from 2 to 100 DPUs can be allocated
- delete_csv - boolean, default False. If set, the source CSV files are deleted on successful completion of the job
- separator - character, default ','. Delimiter character in the CSV files
- withHeader - int, default 1. Specifies whether to treat the first line as a header; can take values 0 or 1
- compression - str, default None. If not specified, compression is not applied; can take values snappy, gzip, and lzo
- partition_by - list of str, default None. List of columns to partition the data by
- mode - str, default 'append'. Options: overwrite (remove data from output_folder before writing out the converted file), append (write out to output_folder without deleting existing data), ignore (silently ignore the operation if data already exists)
Example code
from dativa.tools.aws import S3Csv2Parquet
# Initial setup
csv2parquet_obj = S3Csv2Parquet("us-east-1", "s3://my-bucket/templatefolder")
# Create/update a glue job to convert csv files and execute it
csv2parquet_obj.convert("s3://my-bucket/file_to_be_converted_1.csv")
csv2parquet_obj.convert("s3://my-bucket/file_to_be_converted_2.csv")
# Wait for completion of jobs
csv2parquet_obj.wait_for_completion()
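convert also accepts the optional parameters described above; a sketch passing an explicit schema, compression and partitioning (the column names and paths are illustrative only):
# convert with an explicit schema, snappy compression and partitioned output
csv2parquet_obj.convert("s3://my-bucket/file_to_be_converted_3.csv",
                        output_folder="s3://my-bucket/parquet-output",
                        schema=[("id", "string"), ("value", "double")],
                        compression="snappy",
                        partition_by=["id"],
                        mode="overwrite")
csv2parquet_obj.wait_for_completion()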
dativa.tools.aws.PipelineClient
PipelineClient class: provide an API key, a source S3 location, a destination S3 location, and rules, and the source file will be cleaned and posted to the destination. Refer to https://www.dativa.com/tools/dativatools/aws-api/ for more details.
Arguments:
- api_key : str, the individual key provided by the Pipeline API
- source_s3_url : str, the S3 source where the CSV files are present
- destination_s3_url : str, the S3 destination where the files are to be posted after cleansing
- rules : list of dicts or str, the rules by which to clean the file; either a list of dicts specifying each rule to be applied, or a str giving the location of a rules file
- url : str, the URL of the Pipeline API, defaults to https://pipeline-api.dativa.com/clean
- status_url : str, the URL to query to check the status of the API call, defaults to https://pipeline-api.dativa.com/status/{0}
- source_delimiter : str, the delimiter of the source file, defaults to ","
- destination_delimiter : str, the delimiter of the destination file, defaults to ","
- source_encoding : str, the encoding of the source file, defaults to "utf-8"
- destination_encoding : str, the encoding of the destination file, defaults to "utf-8"
Example code
from dativa.tools.aws import PipelineClient
obj = PipelineClient(api_key=api_key,
rules=rules,
source_s3_url="https://s3-us-west-2.amazonaws.com/{0}/source_key".format(bucket),
destination_s3_url="https://s3-us-west-2.amazonaws.com/{0}/dest_key".format(bucket),
url="https://pipeline-api.dativa.com/clean",
status_url="https://pipeline-api.dativa.com/status/{0}",
)
obj.run_job()
To run tests
The Pipeline API key and AWS credentials must be present as environment variables for the tests to succeed:
export DATIVA_PIPELINE_API_KEY=API_KEY_HERE
export AWS_ACCESS_KEY_ID=AWS_CREDENTIALS_HERE
export AWS_SECRET_ACCESS_KEY=AWS_CREDENTIALS_HERE
Legacy classes
The modules in the dativatools namespace are legacy only and will be deprecated in future.