Labelbox Connector for BigQuery

Labelbox Connector for Google BigQuery

The Labelbox Connector for Google BigQuery lets you upload a CSV of text snippets to BigQuery, select the columns you need, and add the resulting dataset to Labelbox for annotation in our text tool. The library currently targets text use cases only, although it may be expanded to support other BigQuery use cases as needed.

The demo code supplied in this GitHub repository is designed to run in Google Colab, but it can be adapted to any notebook environment.

Labelbox is the enterprise-grade training data solution with fast, AI-enabled labeling tools, labeling automation, a human workforce, data management, a powerful API for integration, and an SDK for extensibility. Visit Labelbox for more information.

This library is currently in beta. It may contain errors or inaccuracies and may not function as well as commercially released software. Please report any issues or bugs via GitHub Issues.

Requirements

Installation

Install LabelBigQuery into your Python environment. The installation also adds the Labelbox SDK and the BigQuery SDK.

pip install labelbigquery
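
To confirm the install worked, a quick check like the following can help. This is a minimal sketch using only the standard library; the distribution names `labelbox` and `google-cloud-bigquery` are assumptions about what the install pulls in, and `installed_version` is a hypothetical helper, not part of LabelBigQuery.

```python
from importlib import metadata
from typing import Optional


def installed_version(package: str) -> Optional[str]:
    """Return the installed version of `package`, or None if it is not installed."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


# The last two distribution names are assumptions, not taken from this README.
for name in ("labelbigquery", "labelbox", "google-cloud-bigquery"):
    print(f"{name}: {installed_version(name) or 'not installed'}")
```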

Documentation

LabelBigQuery includes several methods to help facilitate your workflow between BigQuery and Labelbox.

  1. Add your CSV contents to BigQuery (only necessary if your data is not in BigQuery yet):
    from google.cloud import bigquery
    import labelbigquery

    # Define the headers and schema fields for the BigQuery data load
    SELECTED_HEADERS = {
        'conversation_id',
        'normalized_query'
    }

    SCHEMA_FIELDS = [
        bigquery.SchemaField("conversation_id", "STRING"),
        bigquery.SchemaField("normalized_query", "STRING"),
    ]

    # bq_client is your BigQuery client; args holds the table and CSV file names
    labelbigquery.load_data_to_big_query(bq_client, args.table_name, args.csv_file_name,
                                         SELECTED_HEADERS=SELECTED_HEADERS,
                                         SCHEMA_FIELDS=SCHEMA_FIELDS)

Here, "SELECTED_HEADERS" and "SCHEMA_FIELDS" specify the columns of your CSV that you want to send to BigQuery, along with the type definitions used to store them properly in BigQuery.
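
Before loading, it can be worth checking that every selected header actually appears in the CSV. The sketch below assumes the CSV's first row is a header row; `missing_headers` is a hypothetical helper and not part of LabelBigQuery.

```python
import csv
from typing import Iterable, Set


def missing_headers(csv_path: str, selected_headers: Iterable[str]) -> Set[str]:
    """Return the selected headers that are absent from the CSV's first row."""
    with open(csv_path, newline="", encoding="utf-8") as f:
        # An empty file yields an empty header row, so every header is reported missing.
        header_row = next(csv.reader(f), [])
    return set(selected_headers) - {h.strip() for h in header_row}
```

Calling `missing_headers(args.csv_file_name, SELECTED_HEADERS)` and stopping when the result is non-empty would catch a misspelled column name before anything is sent to BigQuery.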

LabelBigQuery for text requires two columns of data: a unique identifier (which becomes the "External ID" in our system) and a corresponding text string. Here is a chatbot example table:

| conversation_id | normalized_query                    |
|-----------------|-------------------------------------|
| sample_1        | Some text string here for labeling. |
| sample_2        | Some text string here for labeling. |
| sample_3        | Some text string here for labeling. |
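
To try the workflow without your own data, the example table can be written out as a CSV. This is a minimal sketch: `write_sample_csv` and the file name you pass to it are hypothetical, and it assumes the loader expects a header row matching the column names above.

```python
import csv

# The three example rows from the table above.
SAMPLE_ROWS = [
    ("sample_1", "Some text string here for labeling."),
    ("sample_2", "Some text string here for labeling."),
    ("sample_3", "Some text string here for labeling."),
]


def write_sample_csv(path: str) -> None:
    """Write the example table to `path` with a header row."""
    with open(path, "w", newline="", encoding="utf-8") as f:
        writer = csv.writer(f)
        writer.writerow(["conversation_id", "normalized_query"])
        writer.writerows(SAMPLE_ROWS)
```
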

  2. Submit a query to BigQuery for your target columns. This also writes individual text files to a "data" folder. The file names are based on the unique identifier ("conversation_id" in the above example).
    query = fr'SELECT conversation_id, STRING_AGG(normalized_query, "\n") FROM {args.table_name} GROUP BY conversation_id'
    file_names = labelbigquery.fetch_and_write_rows(bq_client, query=query)
  3. Submit your files to Labelbox for annotation in the text editor.
    lb_dataset = labelbigquery.make_dataset_and_data_rows(lb_client, file_names, args.dataset_name)
    print("Dataset unique identifier: " + lb_dataset.uid)
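
The snippets above rely on `args`, `bq_client`, and `lb_client` without showing where they come from. One way to set them up is sketched below. The command-line flag names simply mirror the attribute names used above, and `parse_args` and `build_clients` are hypothetical helpers rather than part of LabelBigQuery. `bigquery.Client()` with no arguments relies on Google Application Default Credentials being available in your environment.

```python
import argparse


def parse_args(argv=None) -> argparse.Namespace:
    """Parse the three values the snippets above read from `args`."""
    parser = argparse.ArgumentParser(description="Send BigQuery text rows to Labelbox.")
    parser.add_argument("--table_name", required=True, help="BigQuery table to load and query")
    parser.add_argument("--csv_file_name", help="CSV file to load (only needed for step 1)")
    parser.add_argument("--dataset_name", required=True, help="Name for the new Labelbox dataset")
    return parser.parse_args(argv)


def build_clients(labelbox_api_key: str):
    """Create the BigQuery and Labelbox clients.

    Imports are deferred so the argument parser works even when
    neither SDK is installed.
    """
    from google.cloud import bigquery
    import labelbox

    bq_client = bigquery.Client()  # uses Application Default Credentials
    lb_client = labelbox.Client(api_key=labelbox_api_key)
    return bq_client, lb_client
```

In a notebook, where there is no command line, passing an explicit list such as `parse_args(["--table_name", "my_dataset.my_table", "--dataset_name", "chatbot-text"])` gives the same `args` object; both values here are placeholders.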

While using LabelBigQuery, you will likely also use the Labelbox SDK (e.g., for programmatic ontology creation). See the Labelbox Python SDK documentation to familiarize yourself with it.

Authentication

Labelbox uses API keys to validate requests. You can create and manage API keys on Labelbox.
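
To keep the key out of notebooks and source control, it can be read from an environment variable. This is a minimal sketch: the variable name `LABELBOX_API_KEY` and the helper `get_labelbox_api_key` are assumptions for illustration, not something this README specifies.

```python
import os


def get_labelbox_api_key(env_var: str = "LABELBOX_API_KEY") -> str:
    """Read the Labelbox API key from the environment, failing loudly if it is unset."""
    key = os.environ.get(env_var, "").strip()
    if not key:
        raise RuntimeError(f"Set the {env_var} environment variable to your Labelbox API key.")
    return key
```

The returned string can then be passed to the Labelbox SDK, for example `labelbox.Client(api_key=get_labelbox_api_key())`.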

Contribution

Please consult CONTRIB.md.
