Common Python tools and utilities for data engineering.

hip-data-tools

Common Python tools and utilities for data engineering, ETL, exploration, and similar work. The package is published on PyPI so it can be dropped into a variety of environments, such as (but not limited to):

Running production workloads
ML training in Jupyter-like notebooks
Local machine for dev and exploration

Installation

Install from PyPI:

pip3 install hip-data-tools

Install from source:

pip3 install .

macOS dependencies:

brew install libev
brew install librdkafka
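
After installing, one way to confirm that the package is visible to the current interpreter is to query its distribution metadata. The following is a minimal sketch using only the standard library; the helper name installed_version is hypothetical and not part of hip-data-tools.

```python
from importlib.metadata import PackageNotFoundError, version
from typing import Optional


def installed_version(distribution: str = "hip-data-tools") -> Optional[str]:
    """Return the installed version of a distribution, or None if it is absent."""
    try:
        return version(distribution)
    except PackageNotFoundError:
        return None


if __name__ == "__main__":
    found = installed_version()
    print(found if found else "hip-data-tools is not installed in this environment")
```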

Connect to AWS

You will need to instantiate an AWS Connection:

from hip_data_tools.aws.common import AwsConnectionManager, AwsConnectionSettings, AwsSecretsManager

# To connect using an AWS CLI profile
conn = AwsConnectionManager(AwsConnectionSettings(region="ap-southeast-2", secrets_manager=None, profile="default"))

# OR if you want to connect using the standard AWS environment variables
conn = AwsConnectionManager(settings=AwsConnectionSettings(region="ap-southeast-2", secrets_manager=AwsSecretsManager(), profile=None))

# OR if you want to connect using a custom set of environment variables
conn = AwsConnectionManager(
    settings=AwsConnectionSettings(
        region="ap-southeast-2",
        secrets_manager=AwsSecretsManager(
            access_key_id_var="SOME_CUSTOM_AWS_ACCESS_KEY_ID",
            secret_access_key_var="SOME_CUSTOM_AWS_SECRET_ACCESS_KEY",
            use_session_token=True,
            aws_session_token_var="SOME_CUSTOM_AWS_SESSION_TOKEN",
        ),
        profile=None,
    )
)
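
When connecting through custom environment variables, it can help to check that they are actually set before building the connection. The sketch below uses only the standard library and does not call hip-data-tools; the helper missing_env_vars is hypothetical, and the variable names are simply the ones from the custom example above.

```python
import os
from typing import Iterable, List, Mapping


def missing_env_vars(names: Iterable[str], env: Mapping[str, str] = os.environ) -> List[str]:
    """Return the names that are unset or empty in the given environment mapping."""
    return [name for name in names if not env.get(name)]


# Variable names taken from the custom AwsSecretsManager example above.
REQUIRED = [
    "SOME_CUSTOM_AWS_ACCESS_KEY_ID",
    "SOME_CUSTOM_AWS_SECRET_ACCESS_KEY",
    "SOME_CUSTOM_AWS_SESSION_TOKEN",
]

if __name__ == "__main__":
    missing = missing_env_vars(REQUIRED)
    if missing:
        print("Missing environment variables:", ", ".join(missing))
    else:
        print("All custom AWS environment variables are set")
```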

Using this connection object, you can use the AWS utilities, for example the one for AWS Athena:

from hip_data_tools.aws.athena import AthenaUtil

au = AthenaUtil(database="default", conn=conn, output_bucket="example", output_key="tmp/scratch/")
result = au.run_query("SELECT * FROM temp limit 10", return_result=True)
print(result)

Connect to Cassandra

from cassandra import ConsistencyLevel
from cassandra.cqlengine import columns
from cassandra.cqlengine.management import sync_table
from cassandra.cqlengine.models import Model
from cassandra.policies import DCAwareRoundRobinPolicy

# Assumed module path for the hip_data_tools Cassandra classes; adjust if your version differs
from hip_data_tools.apache.cassandra import (
    CassandraConnectionManager,
    CassandraConnectionSettings,
    CassandraSecretsManager,
)

load_balancing_policy = DCAwareRoundRobinPolicy(local_dc='AWS_VPC_AP_SOUTHEAST_2')

# To connect without passing a secrets manager explicitly
conn = CassandraConnectionManager(
    settings=CassandraConnectionSettings(
        cluster_ips=["1.1.1.1", "2.2.2.2"],
        port=9042,
        load_balancing_policy=load_balancing_policy,
    ),
    consistency_level=ConsistencyLevel.LOCAL_QUORUM,
)

# OR if you want to read the username from a custom environment variable
conn = CassandraConnectionManager(
    CassandraConnectionSettings(
        cluster_ips=["1.1.1.1", "2.2.2.2"],
        port=9042,
        load_balancing_policy=load_balancing_policy,
        secrets_manager=CassandraSecretsManager(
            username_var="MY_CUSTOM_USERNAME_ENV_VAR"),
    ),
    consistency_level=ConsistencyLevel.LOCAL_ONE,
)

# For running Cassandra model operations
conn.setup_connection("dev_space")


class ExampleModel(Model):
    example_type = columns.Integer(primary_key=True)
    created_at = columns.DateTime()
    description = columns.Text(required=False)


sync_table(ExampleModel)

Connect to Google Sheets

How to connect

You need to go to the Google developer console and get credentials. The Google sheet then needs to be shared with the client email from those credentials. GoogleApiConnectionSettings needs to be provided with the Google API credentials key JSON. After that, you can access the Google sheet by using the workbook_url and the sheet name.
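
To find the address the sheet should be shared with, the client email can be read out of the credentials key JSON. The sketch below assumes the credentials are a Google service account key, which carries a "client_email" field; the helper client_email_from_key is hypothetical and the sample key content is a placeholder, not a working credential.

```python
import json


def client_email_from_key(key_json: str) -> str:
    """Extract the client email from a credentials key JSON string."""
    key = json.loads(key_json)
    email = key.get("client_email")
    if not email:
        raise ValueError("credentials key has no 'client_email' field")
    return email


if __name__ == "__main__":
    # Placeholder key content for illustration only; a real key has more fields.
    sample = json.dumps({
        "type": "service_account",
        "client_email": "etl@example-project.iam.gserviceaccount.com",
    })
    print(client_email_from_key(sample))
```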

How to instantiate Sheet Util

You can instantiate SheetUtil by providing a GoogleSheetConnectionManager, the workbook_url, and the sheet name.

sheet_util = SheetUtil(
    conn_manager=GoogleSheetConnectionManager(self.settings.source_connection_settings),
    workbook_url='https://docs.google.com/spreadsheets/d/cKyrzCBLfsQM/edit?usp=sharing',
    sheet='Sheet1')

How to read a dataframe using SheetUtil

You can get the data in the Google sheet as a Pandas DataFrame using the SheetUtil. We have defined a template for the Google sheet to use with this utility.

You need to provide "field_names_row_number" and "field_types_row_number" when calling the "get_data_frame()" method of SheetUtil.

sheet_data = sheet_util.get_data_frame(
    field_names_row_number=8,
    field_types_row_number=7,
    row_range="12:20",
    data_start_row_number=9)
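
To illustrate what the row-number parameters refer to, here is a small standalone sketch that turns raw sheet rows into records keyed by the field-name row. It is not the SheetUtil implementation: the function rows_to_records is hypothetical, it assumes 1-based row numbers, and it ignores the field types row and row_range for brevity.

```python
from typing import Dict, List


def rows_to_records(
    rows: List[List[str]],
    field_names_row_number: int,
    data_start_row_number: int,
) -> List[Dict[str, str]]:
    """Turn raw sheet rows into dicts keyed by the field-name row (1-based rows assumed)."""
    names = rows[field_names_row_number - 1]
    data = rows[data_start_row_number - 1:]
    return [dict(zip(names, row)) for row in data]


if __name__ == "__main__":
    sheet = [
        ["Example workbook"],   # row 1: free-text header
        ["STRING", "INT"],      # row 2: field types
        ["name", "amount"],     # row 3: field names
        ["apples", "3"],        # row 4: first data row
        ["pears", "5"],         # row 5: second data row
    ]
    print(rows_to_records(sheet, field_names_row_number=3, data_start_row_number=4))
```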

You can use the load_sheet_to_athena() method of GoogleSheetToAthena to load Google sheet data into an Athena table.

GoogleSheetToAthena(GoogleSheetsToAthenaSettings(
        source_workbook_url='https://docs.google.com/spreadsheets/d/cKyrzCBLfsQM/edit?usp=sharing',
        source_sheet='spec_example',
        source_row_range=None,
        source_fields=None,
        source_field_names_row_number=5,
        source_field_types_row_number=4,
        source_data_start_row_number=6,
        source_connection_settings=get_google_connection_settings(gcp_conn_id=GCP_CONN_ID),
        manual_partition_key_value={"column": "start_date", "value": START_DATE},
        target_database=athena_util.database,
        target_table_name=TABLE_NAME,
        target_s3_bucket=s3_util.bucket,
        target_s3_dir=s3_dir,
        target_connection_settings=get_aws_connection_settings(aws_conn_id=AWS_CONN_ID),
        target_table_ddl_progress=False
    )).load_sheet_to_athena()

There is an integration test called "integration_test_should__load_sheet_to_athena__when_using_sheetUtil" to test this functionality. You can run it by removing the "integration_" prefix from its name.
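
The prefix matters because common Python test runners collect only methods whose names start with "test". The sketch below shows this with the standard library's unittest loader, on the assumption that the project's test runner follows the same naming convention; the class and method names are illustrative only.

```python
import unittest


class ExampleSuite(unittest.TestCase):
    def test_runs_by_default(self):
        self.assertTrue(True)

    def integration_test_skipped_by_default(self):
        # Not collected: the method name does not start with "test".
        self.assertTrue(True)


def collected_names(case: type) -> list:
    """List the method names that unittest's default loader would run."""
    return list(unittest.TestLoader().getTestCaseNames(case))


if __name__ == "__main__":
    print(collected_names(ExampleSuite))
```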

Release history

This version

1.68.1

Sep 21, 2023

1.68.0

Sep 20, 2023

1.67.2

Mar 22, 2023

1.67.1

Mar 20, 2023

1.67.0

Mar 20, 2023

1.66.1

Jan 5, 2023

1.66.0

Jun 15, 2022

1.65.1

Jun 13, 2022

1.65.0

Jun 9, 2022

1.64.0

Jun 7, 2022

1.63.0

May 18, 2022

1.62.9

May 11, 2022

1.62.8

May 8, 2022

1.62.7

May 6, 2022

1.62.6

May 6, 2022

1.62.5

May 5, 2022

1.62.4

May 5, 2022

1.62.3

May 5, 2022

1.62.2

May 5, 2022

1.62.1

May 5, 2022

1.62.0

May 4, 2022

1.61.1

May 3, 2022

1.61.0

May 2, 2022

1.60.4

Apr 27, 2022

1.60.3

Apr 27, 2022

1.60.2

Apr 27, 2022

1.60.1

Apr 27, 2022

1.60.0

Apr 22, 2022

1.59.0

Apr 22, 2022

1.58.0

Aug 19, 2021

1.57.1

Aug 3, 2021

1.57.0

Jul 12, 2021

1.56.3

Mar 30, 2021

1.56.2

Mar 27, 2021

1.56.1

Mar 26, 2021

1.56.0

Mar 22, 2021

1.55.1

Mar 3, 2021

1.55.0

Mar 2, 2021

1.54.0

Jan 8, 2021

1.53.0

Jan 7, 2021

1.52.4

Nov 29, 2020

1.52.3

Nov 18, 2020

1.52.2

Nov 6, 2020

1.52.1

Sep 3, 2020

1.52.0

Aug 28, 2020

1.51.1

Aug 26, 2020

1.51.0

Aug 25, 2020

1.50.0

Aug 11, 2020

1.49.0

Aug 6, 2020

1.48.0

Aug 6, 2020

1.47.6

Jul 30, 2020

1.47.5

Jul 27, 2020

1.47.4

Jul 3, 2020

1.47.3

Jun 30, 2020

1.47.2

Jun 25, 2020

1.47.1

Jun 24, 2020

1.47.0

Jun 19, 2020

1.46.0

Jun 18, 2020

1.45.0

Jun 1, 2020

1.44.1

May 29, 2020

1.44.0

May 28, 2020

1.43.1

May 21, 2020

1.43.0

May 21, 2020

1.42.2

May 14, 2020

1.42.1

May 12, 2020

1.42.0

May 8, 2020

1.41.0

May 8, 2020

1.40.3

May 8, 2020

1.40.2

May 7, 2020

1.40.1

May 7, 2020

1.40.0

May 5, 2020

1.39.0

May 4, 2020

1.38.2

Apr 29, 2020

1.38.1

Apr 29, 2020

1.38.0

Apr 27, 2020

1.37.0

Apr 22, 2020

1.36.0

Apr 21, 2020

1.35.0

Apr 20, 2020

1.34.0

Apr 16, 2020

1.33.0

Apr 14, 2020

1.32.0

Mar 31, 2020

1.31.0

Mar 23, 2020

1.30.0

Mar 22, 2020

1.29.0

Mar 18, 2020

1.28.0

Mar 17, 2020

1.27.0

Mar 17, 2020

1.26.0

Mar 16, 2020

1.25.0

Mar 16, 2020

1.24.0

Mar 13, 2020

1.23.0

Mar 11, 2020

1.22.0

Mar 2, 2020

1.21.0

Feb 28, 2020

1.20.0

Feb 24, 2020

1.19.0

Feb 21, 2020

1.18.0

Feb 19, 2020

1.17.0

Feb 19, 2020

1.16.0

Feb 18, 2020

1.15.0

Feb 18, 2020

1.14.1

Feb 12, 2020

1.14.0

Feb 11, 2020

1.13.0

Feb 10, 2020

1.12.0

Feb 2, 2020

1.11.0

Jan 29, 2020

1.10.0

Jan 28, 2020

1.9.0

Jan 22, 2020

1.8.0

Jan 20, 2020

1.7.3

Dec 17, 2019

1.7.2

Dec 17, 2019

1.7.1

Oct 16, 2019

1.7.0

Oct 14, 2019

1.6.0

Oct 11, 2019

1.5.0

Sep 26, 2019

1.4.0

Sep 25, 2019

1.3.1

Aug 30, 2019

1.3.0

Aug 27, 2019

1.2.1

Aug 21, 2019

1.2.1.dev1 pre-release

Aug 21, 2019

1.2.0

Aug 21, 2019

1.1.0

Aug 19, 2019

0.0.0

Aug 21, 2019

Download files

Download the file for your platform.

Source Distribution

hip_data_tools-1.68.1.tar.gz (51.6 kB)

Uploaded Sep 21, 2023 Source

Built Distribution

hip_data_tools-1.68.1-py3-none-any.whl (61.6 kB)

Uploaded Sep 21, 2023 Python 3

Hashes for hip_data_tools-1.68.1.tar.gz
Algorithm	Hash digest
SHA256	`8fe07774acf6c1d2621c8e43f2bba11ab3436988497c9b583fa677c9aada0246`
MD5	`e5ef818dea3ac15137add6197c8c67ef`
BLAKE2b-256	`32e742fdfc7768782b38b72b92d2189414dfc5a7f290374d266965c1e4fbad3b`

Hashes for hip_data_tools-1.68.1-py3-none-any.whl
Algorithm	Hash digest
SHA256	`5f8ea7af17aab44b1160c2c9edcac78e67a3f115952f3f2316947ccbd53460fc`
MD5	`0133a9b2f99e30a0614e851665adf03e`
BLAKE2b-256	`067095edc99a91e542926037340252a36bb78ab1c8fb5170200f91c5fea85b01`
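
A downloaded file can be checked against the SHA256 digests listed above. The following is a minimal sketch using the standard library; the helper names sha256_of and matches_expected are hypothetical, and the file is assumed to be in the current working directory under its published name.

```python
import hashlib
from pathlib import Path

# SHA256 digests as listed in the hash tables above.
EXPECTED_SHA256 = {
    "hip_data_tools-1.68.1.tar.gz":
        "8fe07774acf6c1d2621c8e43f2bba11ab3436988497c9b583fa677c9aada0246",
    "hip_data_tools-1.68.1-py3-none-any.whl":
        "5f8ea7af17aab44b1160c2c9edcac78e67a3f115952f3f2316947ccbd53460fc",
}


def sha256_of(path: Path, chunk_size: int = 65536) -> str:
    """Compute the SHA256 hex digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_expected(path: Path) -> bool:
    """Return True if the file name is known and its digest matches the listed one."""
    expected = EXPECTED_SHA256.get(path.name)
    return expected is not None and sha256_of(path) == expected


if __name__ == "__main__":
    for name in EXPECTED_SHA256:
        candidate = Path(name)
        if candidate.exists():
            print(name, "OK" if matches_expected(candidate) else "MISMATCH")
        else:
            print(name, "not found in the current directory")
```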