
Project description

A Python library that processes JSON files stored in Google Cloud Storage (GCS), transforms them, and loads them into Google BigQuery. This README covers an overview, installation instructions, dependencies, example usage, and additional details to help users get started.

Features:

  1. Batch process JSON files from GCS.
  2. Optionally add record entry timestamps and original file names to the dataset.
  3. Move processed files to a new folder within the same GCS bucket (a sketch of this step follows this list).
  4. Load transformed data into Google BigQuery in manageable chunks.
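
For illustration, here is a minimal sketch (not the library's internal code) of how step 3, moving a processed file to another folder in the same bucket, can be done with google-cloud-storage. The bucket, folder, and file names are placeholders:

from google.cloud import storage

# Placeholders: replace with your bucket, source object, and destination path.
client = storage.Client()
bucket = client.bucket("your_bucket_name")
blob = bucket.blob("source_folder/data_001.json")

# Copy the object into the destination folder, then delete the original.
bucket.copy_blob(blob, bucket, "destination_folder/data_001.json")
blob.delete()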

Installation

Install the package via pip:

pip install JSON_file_streaming_GCS_BigQuery

Dependencies

  1. google-cloud-storage: to interact with Google Cloud Storage.
  2. google-cloud-bigquery: for operations related to Google BigQuery.
  3. pandas: for data manipulation and transformation.
  4. json (Python standard library): to parse JSON files.
  5. os (Python standard library): for operating-system-dependent functionality.

Install the third-party dependencies with the command below (json and os ship with Python and need no installation):

pip install google-cloud-storage google-cloud-bigquery pandas

Usage

Example: Processing JSON Files from GCS and Loading into BigQuery

# Replace your_library with the module name exposed by the installed package.
from your_library import process_json_file_streaming

process_json_file_streaming(
    dataset_id='your_dataset_id',
    table_name='your_table_name',
    project_id='your_project_id',
    bucket_name='your_bucket_name',
    source_folder_name='source_folder',
    destination_folder_name='destination_folder',
    chunk_size=10000,
    add_record_entry_time=True,
    add_file_name=True
)

Parameters:

  1. dataset_id (str): The BigQuery dataset ID.
  2. table_name (str): The BigQuery table name where data will be loaded.
  3. project_id (str): The Google Cloud project ID.
  4. bucket_name (str): The GCS bucket containing the source JSON files.
  5. source_folder_name (str): Folder in the GCS bucket where the source JSON files are stored.
  6. destination_folder_name (str): Folder to which processed JSON files are moved.
  7. chunk_size (int, optional): Number of records per batch to be loaded into BigQuery.
  8. add_record_entry_time (bool, optional): If True, adds a timestamp column to the dataset.
  9. add_file_name (bool, optional): If True, adds the original file name as a column in the dataset.
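
The package's internals are not documented here, but the following rough sketch, written with google-cloud-storage, google-cloud-bigquery, and pandas, illustrates the kind of workflow these parameters describe. The column names record_entry_time and file_name, the table-reference format, and the WRITE_APPEND disposition are assumptions for illustration, not the library's confirmed behavior:

import json

import pandas as pd
from google.cloud import bigquery, storage

storage_client = storage.Client(project="your_project_id")
bq_client = bigquery.Client(project="your_project_id")
bucket = storage_client.bucket("your_bucket_name")
table_ref = "your_project_id.your_dataset_id.your_table_name"

for blob in bucket.list_blobs(prefix="source_folder/"):
    if not blob.name.endswith(".json"):
        continue

    # Parse the file; a JSON array of objects is assumed here.
    records = json.loads(blob.download_as_text())
    df = pd.DataFrame(records)

    # Optional columns (assumed names) corresponding to the two flags above.
    df["record_entry_time"] = pd.Timestamp.now(tz="UTC")  # add_record_entry_time
    df["file_name"] = blob.name                           # add_file_name

    # Load into BigQuery chunk_size rows at a time.
    chunk_size = 10000
    for start in range(0, len(df), chunk_size):
        job = bq_client.load_table_from_dataframe(
            df.iloc[start:start + chunk_size],
            table_ref,
            job_config=bigquery.LoadJobConfig(write_disposition="WRITE_APPEND"),
        )
        job.result()  # wait for this chunk to finish loading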

Configuration

To interact with Google Cloud services, make sure your environment is configured with the appropriate credentials, either through the Google Cloud SDK or by setting the GOOGLE_APPLICATION_CREDENTIALS environment variable to the path of your service account key file.
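
For example, a script can point the client libraries at a key file before any client object is created; the path below is a placeholder:

import os

# Placeholder path to a service account key file with GCS and BigQuery access.
os.environ["GOOGLE_APPLICATION_CREDENTIALS"] = "/path/to/service-account-key.json"

Equivalently, export the variable in your shell, or run gcloud auth application-default login once before using the library.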

Release history

This version: 0.0

Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

JSON_file_streaming_GCS_BigQuery-0.0.tar.gz (3.7 kB)

Built Distribution

JSON_file_streaming_GCS_BigQuery-0.0-py3-none-any.whl

File details

Details for the file JSON_file_streaming_GCS_BigQuery-0.0.tar.gz.


File hashes

Hashes for JSON_file_streaming_GCS_BigQuery-0.0.tar.gz:

Algorithm    Hash digest
SHA256       b43132a9acadf9142cacb03691012905333df676ea492b951a893a28c9844891
MD5          7aea888ed07cf26199bb7de18ade5542
BLAKE2b-256  157c8dde20ff946a28b18131af9348c9bd079d90f13b7527e392cf5e6918d822

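To confirm that a downloaded archive matches the SHA256 digest listed above, a quick check with Python's hashlib looks like this (the file is assumed to be in the current directory):

import hashlib

expected = "b43132a9acadf9142cacb03691012905333df676ea492b951a893a28c9844891"
with open("JSON_file_streaming_GCS_BigQuery-0.0.tar.gz", "rb") as f:
    actual = hashlib.sha256(f.read()).hexdigest()
print("match" if actual == expected else "mismatch")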

File details

Details for the file JSON_file_streaming_GCS_BigQuery-0.0-py3-none-any.whl.


File hashes

Hashes for JSON_file_streaming_GCS_BigQuery-0.0-py3-none-any.whl:

Algorithm    Hash digest
SHA256       40a8f4751b115c82727bfc38c3a9d19a7998782b16189a0c9a34336709e9a99c
MD5          a36192f907f6706b58fd1054998663ec
BLAKE2b-256  3cfe4c0f238d8fcc9177ccd50304b6abaf9aa104271b7dbe52ae60d334483593

