Unofficial demo datasets for Weaviate

UNOFFICIAL Weaviate demo data uploader

This is an educational project that aims to make it easy to upload demo data to your instance of Weaviate. The target audience is developers learning how to use Weaviate.

Usage

pip install weaviate-demo-datasets

All datasets are based on the Dataset superclass, which provides a number of built-in methods that make them easier to work with.

Each dataset includes a default vectorizer configuration for convenience, which can be:

  • viewed via the .get_class_definitions method and
  • changed via the .set_vectorizer method. The target Weaviate instance must include the specified vectorizer module.
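As a rough illustration of the two methods above, the sketch below sets a different vectorizer and then returns the class definitions so the change can be checked before anything is uploaded. The helper name apply_vectorizer is hypothetical, and the shape of module_config (a dict keyed by module name, as in a Weaviate class moduleConfig) is an assumption rather than something this README confirms.

```python
from typing import Any


def apply_vectorizer(dataset: Any, vectorizer_name: str, module_config: dict) -> Any:
    """Set a vectorizer on a dataset and return its updated class definitions.

    `dataset` is any object exposing .set_vectorizer and .get_class_definitions,
    such as an instantiated class from weaviate_datasets.
    """
    # Change the vectorizer and its module configuration.
    dataset.set_vectorizer(vectorizer_name, module_config)
    # Return the definitions so the caller can confirm what will be added to the schema.
    return dataset.get_class_definitions()
```

With a real dataset, a call might look like apply_vectorizer(weaviate_datasets.WineReviews(), "text2vec-openai", {"text2vec-openai": {"model": "ada", "type": "text"}}), provided the target instance has that module enabled.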

Once you instantiate a dataset, you can upload it to Weaviate with the following:

import weaviate_datasets
dataset = weaviate_datasets.JeopardyQuestions10k()  # Instantiate dataset
dataset.upload_dataset(client)  # Add class to schema & upload objects (uses batch uploads by default)

Here, client is an instantiated weaviate.Client object, for example:

import weaviate
import os

wv_url = "https://some-endpoint.weaviate.network"
api_key = os.environ.get("OPENAI_API_KEY")

# If authentication is required (e.g. when using WCS)
auth = weaviate.AuthClientPassword(
    username=os.environ.get("WCS_USER"),
    password=os.environ.get("WCS_PASS"),
)

client = weaviate.Client(
    url=wv_url,
    auth_client_secret=auth,  # If authentication required
    additional_headers={"X-OpenAI-Api-Key": api_key},  # If using OpenAI inference
)
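Note that os.environ.get returns None when a variable is unset, so the header above would carry None instead of a key. A minimal sketch of an early check follows; the helper name openai_headers is hypothetical and not part of this package or of the Weaviate client.

```python
import os
from typing import Mapping, Optional


def openai_headers(env: Optional[Mapping[str, str]] = None) -> dict:
    """Build the additional_headers dict, failing early if the key is missing."""
    # Default to the real process environment; a plain dict can be passed for testing.
    env = os.environ if env is None else env
    api_key = env.get("OPENAI_API_KEY")
    if not api_key:
        raise RuntimeError("OPENAI_API_KEY is not set; OpenAI inference would not work.")
    return {"X-OpenAI-Api-Key": api_key}
```

The result can be passed as additional_headers=openai_headers() when creating the client.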

Built-in methods

  • .add_to_schema(client): Add the dataset's classes to the schema; returns the status and any classes already present

  • .upload_objects(client, batch_size): Upload the objects; the batch size must be specified

  • .upload_dataset(client): Run .add_to_schema and .upload_objects, with a default batch size of 100

  • .get_class_definitions(): See the schema definition to be added

  • .get_class_names(): See class names in the dataset

  • .get_sample(): See a sample data object

  • .classes_in_schema(client): Check whether each class is already in the Weaviate schema

  • .delete_existing_dataset_classes(client): Delete any of the dataset's classes that already exist in the Weaviate instance.

  • .set_vectorizer(vectorizer_name, module_config): Set the vectorizer and corresponding module configuration for the dataset. Datasets come pre-configured with a vectorizer & module configuration.
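The methods above can be combined into a "clean re-upload" routine. The following is only a sketch: refresh_dataset is a hypothetical helper, and the exact return values of the underlying methods are not documented here, so the function simply passes through whatever .add_to_schema returns.

```python
from typing import Any


def refresh_dataset(dataset: Any, client: Any, batch_size: int = 100) -> Any:
    """Delete any existing copies of the dataset's classes, then upload again."""
    # Preview what is about to be written.
    print("Classes:", dataset.get_class_names())
    print("Sample object:", dataset.get_sample())

    # Remove earlier copies so the schema step starts clean.
    dataset.delete_existing_dataset_classes(client)

    # The same two steps that .upload_dataset runs, with an explicit batch size.
    schema_status = dataset.add_to_schema(client)
    dataset.upload_objects(client, batch_size)
    return schema_status
```

Because the first step deletes classes from the instance, this is only appropriate for a demo or scratch instance.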

Available classes

Not including vectors

  • WikiArticles (A handful of Wikipedia summaries)
  • WineReviews (50 wine reviews)

Including vectors

  • WikiCities (500 large cities + Wikipedia summaries, vectorized with OpenAI text-embedding-ada-002)
  • JeopardyQuestions1k (1,000 Jeopardy questions & answers, vectorized with OpenAI text-embedding-ada-002)
  • JeopardyQuestions10k (10,000 Jeopardy questions & answers, vectorized with OpenAI text-embedding-ada-002)

Data sources

https://www.kaggle.com/datasets/zynicide/wine-reviews
https://www.kaggle.com/datasets/tunguz/200000-jeopardy-questions

Source code

https://github.com/databyjp/wv_demo_uploader

Download files

Source Distribution

weaviate-demo-datasets-0.0.14.tar.gz (71.2 MB view details)

Uploaded Source

Built Distribution

weaviate_demo_datasets-0.0.14-py3-none-any.whl (75.8 MB view details)

Uploaded Python 3

File details

Details for the file weaviate-demo-datasets-0.0.14.tar.gz.

File hashes

Hashes for weaviate-demo-datasets-0.0.14.tar.gz
Algorithm Hash digest
SHA256 ec46628ad63c87b4fa385e0e27cdc74672e6f1c16e218be16ea0b81aa7a7e6b3
MD5 8ef50b4c9b722f6ad52e99b137c36f12
BLAKE2b-256 3ba9dadcaa21dd9890d148cd9b1f9595e89bfb2d1739c87cfb6d3c3f3820752e

File details

Details for the file weaviate_demo_datasets-0.0.14-py3-none-any.whl.

File hashes

Hashes for weaviate_demo_datasets-0.0.14-py3-none-any.whl
Algorithm Hash digest
SHA256 a74b43aedf60fc6cd5a691b37e3668c673ba0072315c77dc5e01efa395e90b90
MD5 e012271e9a6354fce38b3e67f2c418e8
BLAKE2b-256 10540d1fec48fb4312871104639c75b9538a3cdc1a4dccb639d2c08ba69e5df0
