Labelbox Connector for Databricks

Access the Labelbox Connector for Databricks to connect an unstructured dataset to Labelbox, programmatically set up an ontology for labeling, and return the labeled dataset in a Spark DataFrame. This library was designed to run in a Databricks environment, although it will function in any Spark environment with some modification.

Labelbox is the enterprise-grade training data solution with fast, AI-enabled labeling tools, labeling automation, a human workforce, data management, a powerful API for integration, and an SDK for extensibility. Visit Labelbox for more information.

This library is currently in beta. It may contain errors or inaccuracies and may not function as well as commercially released software. Please report any issues or bugs via GitHub Issues.

Requirements
Installation
Documentation
Authentication
Contribution

Requirements

Databricks: Runtime 10.4 LTS or later
Apache Spark: 3.1.2 or later
Labelbox account
Generate a Labelbox API key

Installation

Install LabelSpark on your cluster either by uploading a Python wheel or through a notebook-scoped library installation. Installing it also adds the Labelbox SDK, which LabelSpark requires. LabelSpark is available on PyPI:

pip install labelspark

Documentation

Please consult the demo notebook in the "Notebooks" directory. LabelSpark includes four methods to facilitate your workflow between Databricks and Labelbox.

Create your dataset in Labelbox from Databricks. You can specify an IAM integration if desired (or pass iam_integration=None to use none). The example below creates a dataset with the default IAM integration set in the Labelbox account.

LB_dataset = labelspark.create_dataset(labelbox_client, spark_dataframe, dataset_name="Sample Dataset", 
                                      iam_integration='DEFAULT', metadata_index = dictionary_of_metadata_columns)

Here, "spark_dataframe" is your DataFrame of unstructured data, with asset names and asset URLs in two columns named "external_id" and "row_data", respectively. The optional metadata_index parameter lets you import metadata from other columns of the Spark table into Labelbox.

external_id	row_data	optional_metadata_field	optional_metadata_field_2	...
image1.jpg	https://url_to_your_asset/image1.jpg	"Example string 1"	1234	...
image2.jpg	https://url_to_your_asset/image2.jpg	"Example string 2"	88.8	...
image3.jpg	https://url_to_your_asset/image3.jpg	"Example string 3"	123.5	...
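
As a minimal sketch of preparing such a table, the helper below derives an external_id from each asset URL and returns (external_id, row_data) pairs. The names build_rows and to_spark_dataframe are hypothetical and not part of LabelSpark, and using the URL's file name as the asset name is an assumed convention. Only spark.createDataFrame is a standard PySpark call.

```python
import posixpath
from urllib.parse import urlparse


def build_rows(asset_urls):
    """Return (external_id, row_data) tuples, using the URL's file name as the ID."""
    rows = []
    for url in asset_urls:
        # Assumed convention: the last path segment doubles as the asset name.
        external_id = posixpath.basename(urlparse(url).path)
        rows.append((external_id, url))
    return rows


def to_spark_dataframe(spark, rows):
    """Create the two-column DataFrame; `spark` is an existing SparkSession."""
    return spark.createDataFrame(rows, ["external_id", "row_data"])


rows = build_rows([
    "https://url_to_your_asset/image1.jpg",
    "https://url_to_your_asset/image2.jpg",
])
```

In a Databricks notebook, where spark is predefined, to_spark_dataframe(spark, rows) would yield a DataFrame with the two required columns.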

The metadata_index dictionary follows this pattern: {column_name: metadata_type}, where metadata_type is one of the following strings: "enum", "string", "number", or "datetime". For the example table above, the dictionary passed as metadata_index would look like this:

# Same name as the metadata_index argument in the create_dataset call above
dictionary_of_metadata_columns = {
  "optional_metadata_field" : "string",
  "optional_metadata_field_2" : "number"
}
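
Because a misspelled column name or metadata type may only surface at upload time, it can help to check the dictionary first. The following is a minimal sketch under that assumption; check_metadata_index is a hypothetical helper, not a LabelSpark function, and the allowed types are the four strings listed above.

```python
ALLOWED_METADATA_TYPES = {"enum", "string", "number", "datetime"}


def check_metadata_index(metadata_index, column_names):
    """Raise ValueError if a column is missing or a metadata type is not allowed."""
    for column, metadata_type in metadata_index.items():
        if column not in column_names:
            raise ValueError(f"Column not found in DataFrame: {column!r}")
        if metadata_type not in ALLOWED_METADATA_TYPES:
            raise ValueError(
                f"Unsupported metadata type for {column!r}: {metadata_type!r}"
            )


columns = ["external_id", "row_data",
           "optional_metadata_field", "optional_metadata_field_2"]
metadata_index = {
    "optional_metadata_field": "string",
    "optional_metadata_field_2": "number",
}
check_metadata_index(metadata_index, columns)  # returns None when everything is valid
```

With a Spark DataFrame, the column list can be taken from spark_dataframe.columns.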

Note: This library automatically sets the reserved field "lb_integration_source" in Labelbox metadata to "Databricks". This allows for enhanced search capabilities in Labelbox Catalog and more transparent data lineage.

Visit this page for more information about metadata in Labelbox.

Pull your raw annotations back into Databricks.

bronze_DF = labelspark.get_annotations(labelbox_client,"labelbox_project_id_here", spark, sc)

You can use our flattener to flatten the "Label" JSON column into component columns, or use the silver table method to produce a more queryable table of your labeled assets. Both methods take the bronze table of annotations from above as input:

flattened_bronze_DF = labelspark.flatten_bronze_table(bronze_DF)
queryable_silver_DF = labelspark.bronze_to_silver(bronze_DF)
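
To illustrate what flattening means in general, the sketch below turns a nested dictionary into dotted keys in plain Python. It is not LabelSpark's implementation, and the sample payload is hypothetical; the actual structure of the "Label" JSON depends on your project's ontology.

```python
def flatten(nested, prefix=""):
    """Flatten nested dicts into a single dict keyed by dotted paths."""
    flat = {}
    for key, value in nested.items():
        path = f"{prefix}.{key}" if prefix else key
        if isinstance(value, dict):
            flat.update(flatten(value, path))
        else:
            flat[path] = value  # lists and scalars are kept unchanged
    return flat


# Hypothetical label payload, for illustration only.
sample_label = {"classifications": {"weather": "sunny"}, "objects": ["car", "tree"]}
flat_label = flatten(sample_label)
```

Each dotted key then corresponds to what would become its own column in a flattened table.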

How To Get Video Project Annotations

Because Labelbox Video projects can contain multiple videos, you must use the get_videoframe_annotations method, which returns a list of DataFrames, one per video in your project. Each DataFrame contains the frame-by-frame annotations for one video:

bronze_video = labelspark.get_annotations(labelbox_client, "labelbox_video_project_id_here", spark, sc)
# Note this extra step for video projects
video_dataframes = labelspark.get_videoframe_annotations(bronze_video, API_KEY, spark, sc)

You may use standard LabelSpark methods iteratively to create your flattened bronze tables and silver tables:

flattened_bronze_video_dataframes = []
silver_video_dataframes = [] 
for frameset in video_dataframes: 
  flattened_bronze_video_dataframes.append(labelspark.flatten_bronze_table(frameset))
  silver_video_dataframes.append(labelspark.bronze_to_silver(frameset))

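This is how you would display the first video's frames and annotations, sorted by frame number:

# ascending is an argument of orderBy, not display; use ascending=True for first-to-last frame order
display(silver_video_dataframes[0]
        .join(bronze_video, ["DataRow ID"], "inner")
        .orderBy('frameNumber', ascending=False))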

While using LabelSpark, you will likely also use the Labelbox SDK (e.g. for programmatic ontology creation). These resources will help familiarize you with the Labelbox Python SDK:

Visit our docs to learn how the SDK works.
Check out our notebook examples to follow along with interactive tutorials.
View our API reference.

Authentication

Labelbox uses API keys to validate requests. You can create and manage API keys on Labelbox. We recommend using the Databricks Secrets API to store your key. If you don't have the Secrets API, you can store your API key in a separate notebook ignored by version control.
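
A minimal sketch of loading the key is shown below. It reads the secret through dbutils.secrets.get when a dbutils object is supplied and otherwise falls back to an environment variable. The helper load_api_key, the scope name "labelbox", the key name "api_key", and the variable LABELBOX_API_KEY are all assumptions for illustration; substitute the names used in your workspace.

```python
import os


def load_api_key(dbutils=None, scope="labelbox", key="api_key",
                 env_var="LABELBOX_API_KEY"):
    """Fetch the API key from Databricks secrets, else from an environment variable."""
    if dbutils is not None:
        # dbutils.secrets.get reads a secret stored in a Databricks secret scope.
        return dbutils.secrets.get(scope=scope, key=key)
    api_key = os.environ.get(env_var)
    if not api_key:
        raise RuntimeError(f"No API key found; set {env_var} or pass dbutils")
    return api_key


# In a Databricks notebook, where dbutils is predefined:
#   API_KEY = load_api_key(dbutils)
#   labelbox_client = labelbox.Client(api_key=API_KEY)
```

Keeping the lookup in one function avoids hard-coding the key in notebooks that are committed to version control.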

Contribution

Please consult CONTRIB.md

Download files

Source Distribution

labelspark-0.5.941.tar.gz (24.3 kB)

Uploaded Dec 17, 2022 Source

Built Distribution

labelspark-0.5.941-py3-none-any.whl (28.2 kB)

Uploaded Dec 17, 2022 Python 3

Hashes for labelspark-0.5.941.tar.gz
Algorithm	Hash digest
SHA256	`41b90d5ed14507572afa9c77b99e6960523466ef63af4d47a0ba7d5d43c0b0e2`
MD5	`68e662ce0afa12ad64724b15db05f904`
BLAKE2b-256	`c476bab31288b817b2a6ecd739da49b71c504c57a0c71773af7ebb7a4acb9e76`

Hashes for labelspark-0.5.941-py3-none-any.whl
Algorithm	Hash digest
SHA256	`f067d42810e47a56dacd99815baf04c8a89905cc29b1ae89d75f3767da9c5602`
MD5	`ca8bccd1f900075612c23913d6b0abd9`
BLAKE2b-256	`cf147c6d1cd4a78620d8dd994dddfe92202c10db48791d793e4179965af43ad3`