Project description
A Python library that processes JSON files stored in Google Cloud Storage (GCS), transforms them, and loads them into Google BigQuery. This README covers an overview, installation instructions, dependencies, example usage, and additional details to help you get started.
Features:
- Batch process JSON files from GCS.
- Optionally add record entry timestamps and original file names to the dataset.
- Move processed files to a new folder within the same GCS bucket.
- Load transformed data into Google BigQuery in manageable chunks.
Installation
Install the package via pip:
pip install JSON_file_streaming_GCS_BigQuery
Dependencies
- google-cloud-storage: To interact with Google Cloud Storage.
- google-cloud-bigquery: For operations related to Google BigQuery.
- pandas: For data manipulation and transformation.
- json: To parse JSON files (part of the Python standard library; no installation needed).
- os: For operating-system-dependent functionality (part of the Python standard library; no installation needed).
Ensure the third-party dependencies are installed using:
pip install google-cloud-storage google-cloud-bigquery pandas
Usage
Example: Processing JSON Files from GCS and Loading into BigQuery
from your_library import process_json_file_streaming
process_json_file_streaming(
    dataset_id='your_dataset_id',
    table_name='your_table_name',
    project_id='your_project_id',
    bucket_name='your_bucket_name',
    source_folder_name='source_folder',
    destination_folder_name='destination_folder',
    chunk_size=10000,
    add_record_entry_time=True,
    add_file_name=True
)
Parameters:
- dataset_id (str): The BigQuery dataset ID.
- table_name (str): The BigQuery table name where data will be loaded.
- project_id (str): The Google Cloud project ID.
- bucket_name (str): The GCS bucket containing the source JSON files.
- source_folder_name (str): Folder in the GCS bucket where the source JSON files are stored.
- destination_folder_name (str): Folder to which processed JSON files are moved.
- chunk_size (int, optional): Number of records per batch to be loaded into BigQuery.
- add_record_entry_time (bool, optional): If True, adds a timestamp column to the dataset.
- add_file_name (bool, optional): If True, adds the original file name as a column in the dataset.
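For orientation, below is a minimal sketch of how a chunked GCS-to-BigQuery load with the optional timestamp and file-name columns could be built from the listed dependencies. The function name, the assumption that each file contains a JSON array of objects, and the added column names are illustrative assumptions, not the library's internals.

from google.cloud import bigquery, storage
import pandas as pd
import json
from datetime import datetime, timezone

def load_json_blobs_in_chunks(project_id, bucket_name, source_folder_name,
                              dataset_id, table_name, chunk_size=10000,
                              add_record_entry_time=True, add_file_name=True):
    # Hypothetical helper, not the library's API: loads JSON blobs from GCS
    # into BigQuery in chunks of `chunk_size` rows.
    storage_client = storage.Client(project=project_id)
    bq_client = bigquery.Client(project=project_id)
    table_ref = f"{project_id}.{dataset_id}.{table_name}"

    for blob in storage_client.list_blobs(bucket_name, prefix=source_folder_name):
        if not blob.name.endswith(".json"):
            continue
        # Assumes each file holds a JSON array of objects.
        records = json.loads(blob.download_as_text())
        for start in range(0, len(records), chunk_size):
            chunk = pd.DataFrame(records[start:start + chunk_size])
            if add_record_entry_time:
                chunk["record_entry_time"] = datetime.now(timezone.utc)
            if add_file_name:
                chunk["source_file_name"] = blob.name
            # load_table_from_dataframe requires pyarrow to be installed.
            job = bq_client.load_table_from_dataframe(chunk, table_ref)
            job.result()  # wait for each chunk to finish loading
        # Moving the processed blob to the destination folder would follow here
        # (e.g. bucket.copy_blob(...) followed by blob.delete()), omitted for brevity.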
Configuration
Ensure your environment is configured with credentials for Google Cloud: authenticate with the Google Cloud SDK, or set the GOOGLE_APPLICATION_CREDENTIALS environment variable to the path of your service account key file.
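As a minimal example, the environment variable can be set from Python before any Google Cloud client is created (the key-file path below is a placeholder):

import os

# Placeholder path; point this at your own service account key file.
os.environ["GOOGLE_APPLICATION_CREDENTIALS"] = "/path/to/service-account-key.json"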
Download files
Download the file for your platform.
Source Distribution
Built Distribution
File details
Details for the file JSON_file_streaming_GCS_BigQuery-0.0.tar.gz.
File metadata
- Download URL: JSON_file_streaming_GCS_BigQuery-0.0.tar.gz
- Upload date:
- Size: 3.7 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/4.0.2 CPython/3.11.5
File hashes
Algorithm | Hash digest
---|---
SHA256 | b43132a9acadf9142cacb03691012905333df676ea492b951a893a28c9844891
MD5 | 7aea888ed07cf26199bb7de18ade5542
BLAKE2b-256 | 157c8dde20ff946a28b18131af9348c9bd079d90f13b7527e392cf5e6918d822
File details
Details for the file JSON_file_streaming_GCS_BigQuery-0.0-py3-none-any.whl.
File metadata
- Download URL: JSON_file_streaming_GCS_BigQuery-0.0-py3-none-any.whl
- Upload date:
- Size: 4.6 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/4.0.2 CPython/3.11.5
File hashes
Algorithm | Hash digest
---|---
SHA256 | 40a8f4751b115c82727bfc38c3a9d19a7998782b16189a0c9a34336709e9a99c
MD5 | a36192f907f6706b58fd1054998663ec
BLAKE2b-256 | 3cfe4c0f238d8fcc9177ccd50304b6abaf9aa104271b7dbe52ae60d334483593