Functionalities to interact with Google and Azure, and clean data

do-data-utils

This package provides utilities for connecting to different cloud sources, along with data cleaning functions. Package repo on PyPI: do-data-utils - PyPI

For a full list of functions, see the overview documentation.

Installation

Commands

To install the latest released version from PyPI, use the following command:

pip install do-data-utils

You can also install a specific version, for example:

pip install do-data-utils==2.7.0
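
To confirm which version ended up installed, the standard library's `importlib.metadata` can be queried. This is a minimal sketch; the helper name `installed_version` is illustrative and not part of do-data-utils.

```python
from importlib.metadata import PackageNotFoundError, version
from typing import Optional


def installed_version(dist_name: str) -> Optional[str]:
    """Return the installed version of a distribution, or None if it is absent."""
    try:
        return version(dist_name)
    except PackageNotFoundError:
        return None


# Prints the installed version string, or None if the package is not installed.
print(installed_version("do-data-utils"))
```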

Available Subpackages

google – Utilities for Google Cloud Platform.
azure – Utilities for Azure services.
pathutils – Utilities related to paths.
preprocessing – Utilities for data preprocessing.
sharepoint – Utilities for interacting with Microsoft SharePoint.

For a full list of functions, see the overview documentation.

Example Usage

The intended workflow is:

Keep the service account JSON secrets (for cloud services) in GCP Secret Manager.
Keep a local JSON secret file for accessing GCP Secret Manager.
Retrieve the secret you need from GCP Secret Manager to interact with the cloud platform.
Do your stuff...
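
Before starting this workflow, it can help to check that the local JSON file looks like a usable key. The sketch below assumes the local file is a standard Google Cloud service account key, which the package description does not state explicitly; the helper names `missing_key_fields` and `load_key_file` are hypothetical and not part of do-data-utils.

```python
import json
from pathlib import Path

# Fields found in a Google Cloud service account key file (assumed file type).
REQUIRED_FIELDS = ("type", "project_id", "private_key", "client_email")


def missing_key_fields(info: dict) -> list:
    """Return the required field names that are absent or empty in `info`."""
    return [name for name in REQUIRED_FIELDS if not info.get(name)]


def load_key_file(path) -> dict:
    """Read a JSON key file and raise ValueError if required fields are missing."""
    info = json.loads(Path(path).read_text(encoding="utf-8"))
    missing = missing_key_fields(info)
    if missing:
        raise ValueError(f"{path} is missing fields: {', '.join(missing)}")
    return info
```

Failing early on a malformed local key gives a clearer error than a failed call to the secret manager later on.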

Google

GCS

Download

from do_data_utils.google import get_secret, gcs_to_df


# Load secret key and get the secret to access GCS
secret_path = 'secrets/secret-manager-key.json'
secret = get_secret(secret_id='gcs-secret-id-dev', secret=secret_path, as_json=True)

# Download a csv file to DataFrame
gcspath = 'gs://my-ai-bucket/my-path-to-csv.csv'
df = gcs_to_df(gcspath, secret, polars=False)

from do_data_utils.google import get_secret, gcs_to_dict


# Load secret key and get the secret to access GCS
secret_path = 'secrets/secret-manager-key.json'
secret = get_secret(secret_id='gcs-secret-id-dev', secret=secret_path, as_json=True)

# Download the content from GCS
gcspath = 'gs://my-ai-bucket/my-path-to-json.json'
my_dict = gcs_to_dict(gcspath, secret=secret)
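
The `gcspath` values in the download examples follow the `gs://bucket/object` form. As a rough illustration of how such a path breaks down into a bucket and an object name, here is a minimal sketch; `split_gcs_path` is a hypothetical helper, not a function from do-data-utils, and the library may parse paths differently.

```python
def split_gcs_path(gcspath: str) -> tuple:
    """Split 'gs://bucket/path/to/object' into (bucket, object_name)."""
    prefix = "gs://"
    if not gcspath.startswith(prefix):
        raise ValueError(f"Not a GCS path: {gcspath!r}")
    # Everything up to the first slash is the bucket; the rest is the object name.
    bucket, _, object_name = gcspath[len(prefix):].partition("/")
    if not bucket:
        raise ValueError(f"Missing bucket name in: {gcspath!r}")
    return bucket, object_name


bucket, object_name = split_gcs_path("gs://my-ai-bucket/my-path-to-csv.csv")
print(bucket, object_name)
```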

Upload

import json

from do_data_utils.google import get_secret, dict_to_json_gcs


# Path to the local key used to access GCP Secret Manager
secret_path = 'secrets/secret-manager-key.json'

# From version 2.3.0 you no longer need to read the secret file yourself;
# it is loaded here only to show the dict form
with open(secret_path, 'r') as f:
    secret_info = json.load(f)

# You can pass either a dict or a path to the JSON file in the `secret` argument
secret = get_secret(secret_id='gcs-secret-id-dev', secret=secret_info, as_json=True)

my_setting_dict = {
    'param1': 'abc',
    'param2': 'xyz',
}

gcspath = 'gs://my-bucket/my-path-to-json.json'
dict_to_json_gcs(dict_data=my_setting_dict, gcspath=gcspath, secret=secret)
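
The upload example notes that the `secret` argument accepts either a dict or a path to a JSON file. One way such an argument could be normalized is sketched below; `load_secret_info` is a hypothetical helper written for illustration, and the actual handling inside do-data-utils may differ.

```python
import json
import tempfile
from pathlib import Path
from typing import Union


def load_secret_info(secret: Union[dict, str, Path]) -> dict:
    """Return secret contents as a dict, reading the JSON file if a path is given."""
    if isinstance(secret, dict):
        return secret
    with open(secret, "r", encoding="utf-8") as f:
        return json.load(f)


# Demonstrate that both input forms yield the same dict.
with tempfile.TemporaryDirectory() as tmp:
    key_path = Path(tmp) / "key.json"
    key_path.write_text(json.dumps({"project_id": "demo"}), encoding="utf-8")
    from_path = load_secret_info(key_path)

from_dict = load_secret_info({"project_id": "demo"})
print(from_path == from_dict)
```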

GBQ

import json

from do_data_utils.google import get_secret, gbq_to_df


# Load the local key and fetch the secret used to access BigQuery
with open('secrets/secret-manager-key.json', 'r') as f:
    secret_info = json.load(f)

# You can pass either a dict or a path to the JSON file in the `secret` argument
secret = get_secret(secret_id='gbq-secret-id-dev', secret=secret_info, as_json=True)

# Query (backticks quote identifiers that contain hyphens)
query = 'select * from `my-project.my-dataset.my-table`'
df = gbq_to_df(query, secret, polars=False)

Azure/Databricks

import json

from do_data_utils.azure import databricks_to_df
from do_data_utils.google import get_secret


# Load the local key and fetch the secret used to access Databricks
with open('secrets/secret-manager-key.json', 'r') as f:
    secret_info = json.load(f)

secret = get_secret(secret_id='databricks-secret-id-dev', secret=secret_info, as_json=True)

# Download from Databricks sql
query = 'select * from datadev.dsplayground.my_table'
df = databricks_to_df(query, secret, polars=False)

For more functions, see the overview documentation.

Path utils

from do_data_utils.pathutils import add_project_root

# Adds your root folder to sys.path,
# so you can do imports from the root directory
add_project_root(levels_up=1)
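
To make the idea concrete, here is a minimal sketch of what adding a parent directory to `sys.path` can look like. It assumes `levels_up` counts parent directories above a starting directory; the function `add_root_to_sys_path` and its `start` parameter are hypothetical and are not the library's implementation.

```python
import sys
from pathlib import Path


def add_root_to_sys_path(levels_up: int = 1, start: str = ".") -> str:
    """Insert the directory `levels_up` levels above `start` into sys.path."""
    root = Path(start).resolve()
    for _ in range(levels_up):
        root = root.parent
    root_str = str(root)
    # Avoid duplicate entries when the function is called more than once.
    if root_str not in sys.path:
        sys.path.insert(0, root_str)
    return root_str


root = add_root_to_sys_path(levels_up=1)
print(root in sys.path)
```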

Preprocessing

from do_data_utils.preprocessing import clean_phone, clean_citizenid

phone_numbers = '090-123-4567|0912345678|0901234567-9'
phones_valid = clean_phone(phone_numbers) # Gets the valid phone numbers

citizenid = '0123456789012'
citizenid_cleaned = clean_citizenid(citizenid)
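
For intuition about this kind of cleaning, the sketch below splits a pipe-delimited string and keeps entries that normalize to a 10-digit number starting with 0. These rules are assumptions made for illustration; `split_and_normalize_phones` is a hypothetical function, and the real `clean_phone` may apply different rules (for example, it may handle the `0901234567-9` entry differently).

```python
import re


def split_and_normalize_phones(raw: str, sep: str = "|") -> list:
    """Split on `sep`, strip non-digits, keep 10-digit numbers starting with 0."""
    valid = []
    for entry in raw.split(sep):
        digits = re.sub(r"\D", "", entry)
        if len(digits) == 10 and digits.startswith("0"):
            valid.append(digits)
    return valid


print(split_and_normalize_phones("090-123-4567|0912345678|0901234567-9"))
# ['0901234567', '0912345678']
```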

Sharepoint

import pandas as pd
from do_data_utils.google import get_secret
from do_data_utils.sharepoint import df_to_sharepoint

# Path to the local key used to fetch the SharePoint secrets from GCP Secret Manager
secret_path = "secrets/secret-manager-key.json"

ms_secret = get_secret(secret_id="sharepoint-secret", secret=secret_path, as_json=True)
refresh_token = get_secret(
    secret_id="sharepoint-refresh-token", secret=secret_path, as_json=False
)

# Example DataFrame
df = pd.DataFrame({"a": [1, 2, 3], "b": [4, 5, 6]})

site = "your-site"
sharepoint_dir = "Shared Documents/some/path"
file_name = "output.xlsx"  # or .csv if you wish

df_to_sharepoint(
    df,
    site=site,
    sharepoint_dir=sharepoint_dir,
    file_name=file_name,
    secret=ms_secret,
    refresh_token=refresh_token,
)

Release history

Version	Release date
4.2.1	Jun 17, 2025
4.2.0	Jun 12, 2025
4.1.0	Jun 6, 2025
4.0.0	Apr 18, 2025
3.2.4	Feb 25, 2025
3.2.3	Feb 21, 2025
3.2.2	Feb 20, 2025
3.2.1	Feb 18, 2025
3.2.0	Feb 6, 2025
3.1.0	Jan 18, 2025
3.0.0	Dec 25, 2024
3.0.0b1 (pre-release)	Dec 26, 2024
2.7.1 (this version)	Dec 25, 2024
2.7.0	Dec 25, 2024
2.6.0	Dec 24, 2024
2.5.0	Dec 11, 2024
2.4.0	Dec 11, 2024
2.3.2	Dec 9, 2024
2.3.1	Dec 8, 2024
2.3.0	Dec 7, 2024
2.2.0	Dec 6, 2024
2.1.0	Dec 6, 2024
2.0.0	Dec 6, 2024
1.2.2	Dec 6, 2024
1.2.1	Dec 6, 2024
1.2.0	Dec 6, 2024
1.1.4	Dec 6, 2024
1.1.3	Dec 6, 2024
1.1.2	Dec 6, 2024
1.1.1	Dec 6, 2024
1.1.0	Dec 6, 2024

Download files

Source Distribution

do_data_utils-2.7.1.tar.gz (28.9 kB)

Uploaded Dec 25, 2024 Source

Built Distribution

do_data_utils-2.7.1-py3-none-any.whl (39.3 kB)

Uploaded Dec 25, 2024 Python 3

File details

Details for the file do_data_utils-2.7.1.tar.gz.

File metadata

Download URL: do_data_utils-2.7.1.tar.gz
Upload date: Dec 25, 2024
Size: 28.9 kB
Tags: Source
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.0.1 CPython/3.12.8

File hashes

Hashes for do_data_utils-2.7.1.tar.gz
Algorithm	Hash digest
SHA256	`0e26b51a38d179615136af1f16b858947be45770778136ae0702a2ab93f213b0`
MD5	`f6716aba5bb3bda788a75620d435ed3f`
BLAKE2b-256	`258bf99ae421002f3c4b6c5c4a446110676b2ed033eb83b1b892115978a31690`

File details

Details for the file do_data_utils-2.7.1-py3-none-any.whl.

File metadata

Download URL: do_data_utils-2.7.1-py3-none-any.whl
Upload date: Dec 25, 2024
Size: 39.3 kB
Tags: Python 3
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.0.1 CPython/3.12.8

File hashes

Hashes for do_data_utils-2.7.1-py3-none-any.whl
Algorithm	Hash digest
SHA256	`b667bd9874b555626bd32c622f044b06dc39395c811a7dbfcffab3a9d95dbad3`
MD5	`3d4001e88a4b776c3d21f0ccd88a9abc`
BLAKE2b-256	`31f999271238477af9441b6319cd926e0b667415f413f91f9236670cc432a6af`

