do-data-utils

Functionalities to interact with Google and Azure, and to clean data.

This package provides functions for connecting to different cloud sources, plus data-cleaning utilities. It is published on PyPI as do-data-utils.
For a full list of functions, see the overview documentation.
Installation
Commands
To install the latest release from PyPI, run:

```shell
pip install do-data-utils
```

You can also pin a specific version, for example:

```shell
pip install do-data-utils==3.2.0
```
Available Subpackages
- google – Utilities for Google Cloud Platform.
- azure – Utilities for Azure services.
- pathutils – Utilities related to paths.
- preprocessing – Utilities for data preprocessing.
- sharepoint – Utilities for interacting with Microsoft SharePoint.
Example Usage
The package is designed around the following workflow:
- You keep service-account JSON secrets (for cloud services) in GCP Secret Manager.
- You have a local JSON key file for accessing GCP Secret Manager.
- You retrieve the secret for the cloud platform you want to interact with from GCP Secret Manager.
- Do your stuff...
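To make the handoff between these steps concrete, here is a minimal sketch with a stand-in for `get_secret` (the stub, the `sm_store` dict, and its contents are invented for illustration only; the real function talks to GCP Secret Manager):

```python
import json

# Stand-in for do_data_utils.google.get_secret, for illustration only:
# the real function fetches the payload from GCP Secret Manager.
def get_secret_stub(secret_id, secret, as_json=False):
    payload = secret["secrets"][secret_id]  # pretend Secret Manager lookup
    return json.loads(payload) if as_json else payload

# A fake Secret Manager store holding a service-account secret for GCS
sm_store = {"secrets": {"gcs-secret-id-dev": json.dumps({"type": "service_account"})}}

# Retrieve the cloud-service secret, then use it with the cloud helpers
gcs_secret = get_secret_stub("gcs-secret-id-dev", secret=sm_store, as_json=True)
print(gcs_secret["type"])  # service_account
```

The real examples below follow exactly this shape, with `get_secret` doing the Secret Manager lookup for you.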
GCS
Download
```python
from do_data_utils.google import get_secret, gcs_to_df

# Load the secret-manager key and get the secret to access GCS
secret_path = 'secrets/secret-manager-key.json'
secret = get_secret(secret_id='gcs-secret-id-dev', secret=secret_path, as_json=True)

# Download a CSV file into a DataFrame
gcspath = 'gs://my-ai-bucket/my-path-to-csv.csv'
df = gcs_to_df(gcspath, secret, polars=False)
```
```python
from do_data_utils.google import get_secret, gcs_to_dict

# Load the secret-manager key and get the secret to access GCS
secret_path = 'secrets/secret-manager-key.json'
secret = get_secret(secret_id='gcs-secret-id-dev', secret=secret_path, as_json=True)

# Download JSON content from GCS into a dict
gcspath = 'gs://my-ai-bucket/my-path-to-json.json'
my_dict = gcs_to_dict(gcspath, secret=secret)
```
Upload
```python
import json

from do_data_utils.google import get_secret, dict_to_json_gcs

# Load the secret-manager key and get the secret to access GCS.
# From version 2.3.0, the `secret` argument accepts either a dict
# or a path to a JSON file, so reading the file yourself is optional.
with open('secrets/secret-manager-key.json', 'r') as f:
    secret_info = json.load(f)

secret = get_secret(secret_id='gcs-secret-id-dev', secret=secret_info, as_json=True)

# Upload a dict as a JSON file to GCS
my_setting_dict = {
    'param1': 'abc',
    'param2': 'xyz',
}
gcspath = 'gs://my-bucket/my-path-to-json.json'
dict_to_json_gcs(dict_data=my_setting_dict, gcspath=gcspath, secret=secret)
```
GBQ
```python
import json

from do_data_utils.google import get_secret, gbq_to_df

# Load the secret-manager key and get the secret to access BigQuery
with open('secrets/secret-manager-key.json', 'r') as f:
    secret_info = json.load(f)

# You can pass either a dict or a path to a JSON file as the `secret` argument
secret = get_secret(secret_id='gbq-secret-id-dev', secret=secret_info, as_json=True)

# Query into a DataFrame (backticks are needed because the
# fully qualified table name contains hyphens)
query = 'select * from `my-project.my-dataset.my-table`'
df = gbq_to_df(query, secret, polars=False)
```
Azure/Databricks
```python
import json

from do_data_utils.azure import databricks_to_df
from do_data_utils.google import get_secret

# Load the secret-manager key and get the secret to access Databricks
with open('secrets/secret-manager-key.json', 'r') as f:
    secret_info = json.load(f)

secret = get_secret(secret_id='databricks-secret-id-dev', secret=secret_info, as_json=True)

# Download the results of a Databricks SQL query
query = 'select * from datadev.dsplayground.my_table'
df = databricks_to_df(query, secret, polars=False)
```
For more functions, see the overview documentation.
Path utils
```python
from do_data_utils.pathutils import add_project_root

# Adds your project root folder to sys.path,
# so you can do imports from the root directory
add_project_root(levels_up=1)
```
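What this does can be approximated with the standard library. The following is a sketch, not the package's actual implementation (which may resolve the root relative to the calling file rather than the working directory):

```python
import os
import sys

def add_project_root_sketch(levels_up=1):
    # Walk `levels_up` directories above the current working directory
    # and prepend that path to sys.path so top-level imports resolve.
    root = os.path.abspath(os.path.join(os.getcwd(), *[".."] * levels_up))
    if root not in sys.path:
        sys.path.insert(0, root)
    return root

root = add_project_root_sketch(levels_up=1)
```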
Preprocessing
```python
from do_data_utils.preprocessing import clean_phone, clean_citizenid

phone_numbers = '090-123-4567|0912345678|0901234567-9'
phones_valid = clean_phone(phone_numbers)  # Keeps only the valid phone numbers

citizenid = '0123456789012'
citizenid_cleaned = clean_citizenid(citizenid)
```
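For intuition, here is a rough stdlib sketch of the kind of validation `clean_phone` performs on pipe-separated input. The exact rules of the real function may differ; this is an illustrative guess, not the package's implementation:

```python
import re

def clean_phone_sketch(raw):
    # Illustrative guess at the cleaning logic: split on '|',
    # strip non-digit characters, keep 10-digit numbers starting with '0'.
    candidates = (re.sub(r"\D", "", part) for part in raw.split("|"))
    return [c for c in candidates if len(c) == 10 and c.startswith("0")]

result = clean_phone_sketch("090-123-4567|0912345678|0901234567-9")
print(result)  # ['0901234567', '0912345678']
```

Here the third entry is dropped because it normalizes to eleven digits.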
SharePoint
```python
import pandas as pd

from do_data_utils.google import get_secret
from do_data_utils.sharepoint import df_to_sharepoint

# Load the secret-manager key and get the secrets to access SharePoint
secret_path = "secrets/secret-manager-key.json"
ms_secret = get_secret(secret_id="sharepoint-secret", secret=secret_path, as_json=True)
refresh_token = get_secret(
    secret_id="sharepoint-refresh-token", secret=secret_path, as_json=False
)

# Example DataFrame
df = pd.DataFrame({"a": [1, 2, 3], "b": [4, 5, 6]})

# Upload it to SharePoint
site = "your-site"
sharepoint_dir = "Shared Documents/some/path"
file_name = "output.xlsx"  # or .csv if you wish

df_to_sharepoint(
    df,
    site=site,
    sharepoint_dir=sharepoint_dir,
    file_name=file_name,
    secret=ms_secret,
    refresh_token=refresh_token,
)
```
File details

Details for the file do_data_utils-3.2.0.tar.gz.

- Download URL: do_data_utils-3.2.0.tar.gz
- Upload date:
- Size: 81.7 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: uv/0.5.9

File hashes

| Algorithm | Hash digest |
|---|---|
| SHA256 | 41d7f61815417f894fc0731ebbac52312bf8b9c3c061dc594f53bc5d5507b47a |
| MD5 | 471690b27934f80c1322d5d660e31781 |
| BLAKE2b-256 | 5e72b664cff2e26f33a88004164b80b2d692570f08ff3ddf7402810f25e3fd6e |
Details for the file do_data_utils-3.2.0-py3-none-any.whl.

- Download URL: do_data_utils-3.2.0-py3-none-any.whl
- Upload date:
- Size: 23.9 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: uv/0.5.9

File hashes

| Algorithm | Hash digest |
|---|---|
| SHA256 | b617729e3a2082b5ce236e7803967296cfbc70731f832f3a84b7c40a2dc0097d |
| MD5 | f2d06a62fcce5ae253eae7c1534d621f |
| BLAKE2b-256 | 7c975dd9bd9a299ef90edb8e7a26f89190fc22d10e81ba9ad308e8a322ada4fe |