
Unofficial demo datasets for Weaviate


UNOFFICIAL Weaviate demo data uploader

This is an educational project that aims to make it easy to upload demo data to your instance of Weaviate. The target audience is developers learning how to use Weaviate.

Usage

pip install weaviate-demo-datasets

All datasets are based on the Dataset superclass, which includes a number of built-in methods that make the datasets easier to work with.

Each dataset includes a default vectorizer configuration for convenience, which can be:

  • viewed via the .get_class_definitions method and
  • changed via the .set_vectorizer method. The target Weaviate instance must include the specified vectorizer module.
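As a minimal sketch of the two bullets above, the helper below shows the current class definitions, switches the vectorizer, and returns the updated definitions. The helper name switch_vectorizer is hypothetical, and the shape of module_config (a dictionary passed through as the class-level module configuration) is an assumption, not something this page confirms.

```python
def switch_vectorizer(dataset, vectorizer_name, module_config):
    """Change a dataset's vectorizer and return the updated class definitions.

    `dataset` is any object exposing the .get_class_definitions and
    .set_vectorizer methods described above.
    """
    # Show the default configuration that ships with the dataset
    print("Before:", dataset.get_class_definitions())

    # Assumption: module_config is a dict used as the module configuration
    dataset.set_vectorizer(vectorizer_name, module_config)

    return dataset.get_class_definitions()
```

A call might look like switch_vectorizer(dataset, "text2vec-cohere", {"text2vec-cohere": {}}), provided the target Weaviate instance has that module enabled. The module name and the empty configuration are illustrative only.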

Once you instantiate a dataset, you can upload it to Weaviate with the following:

import weaviate_datasets
dataset = weaviate_datasets.JeopardyQuestions10k()  # Instantiate dataset
dataset.upload_dataset(client)  # Add class to schema & upload objects (uses batch uploads by default)

Here, client is an instantiated weaviate.Client object, for example:

import weaviate
import os

wv_url = "https://some-endpoint.weaviate.network"
api_key = os.environ.get("OPENAI_API_KEY")

# If authentication required (e.g. using WCS)
auth = weaviate.AuthClientPassword(
    username=os.environ.get("WCS_USER"),
    password=os.environ.get("WCS_PASS"),
)

client = weaviate.Client(
    url=wv_url,
    auth_client_secret=auth,  # If authentication required
    additional_headers={"X-OpenAI-Api-Key": api_key},  # If using OpenAI inference
)
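Because both authentication and the OpenAI header are optional, the client arguments can be assembled conditionally. The following is a rough sketch rather than part of this package: the function name client_kwargs, the WEAVIATE_URL variable, and the local default URL are all assumptions made for illustration.

```python
import os


def client_kwargs(env=os.environ):
    """Build keyword arguments for weaviate.Client from environment variables."""
    # WEAVIATE_URL and the localhost default are hypothetical choices
    kwargs = {"url": env.get("WEAVIATE_URL", "http://localhost:8080")}

    # Only add authentication when both credentials are present
    user, password = env.get("WCS_USER"), env.get("WCS_PASS")
    if user and password:
        import weaviate  # imported lazily; only needed for authentication

        kwargs["auth_client_secret"] = weaviate.AuthClientPassword(
            username=user, password=password
        )

    # Only add the header when an OpenAI key is available
    openai_key = env.get("OPENAI_API_KEY")
    if openai_key:
        kwargs["additional_headers"] = {"X-OpenAI-Api-Key": openai_key}

    return kwargs
```

The result can then be passed on as weaviate.Client(**client_kwargs()).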

Built-in methods

  • .upload_dataset(client): Add the defined classes to the schema and upload the objects

  • .get_class_definitions(): See the schema definition to be added

  • .get_class_names(): See class names in the dataset

  • .get_sample(): See a sample data object

  • .classes_in_schema(client): Check whether each class is already in the Weaviate schema

  • .delete_existing_dataset_classes(client): Delete any of the dataset's classes that already exist in the Weaviate instance.

  • .set_vectorizer(vectorizer_name, module_config): Set the vectorizer and corresponding module configuration for the dataset. Datasets come pre-configured with a vectorizer & module configuration.
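The methods above can be combined into a simple inspect-and-reload routine. This is only a sketch: the helper name refresh_dataset is hypothetical, and since the return values of these methods are not documented here, the code just prints them. Note that it deletes any existing classes belonging to the dataset, along with their objects.

```python
def refresh_dataset(dataset, client):
    """Inspect a dataset, remove its classes from Weaviate if present, and upload it."""
    # Look at what is about to be uploaded
    print("Classes:", dataset.get_class_names())
    print("Sample object:", dataset.get_sample())

    # Report which of the dataset's classes are already in the schema
    print("Already in schema:", dataset.classes_in_schema(client))

    # Destructive step: removes existing copies of the dataset's classes
    dataset.delete_existing_dataset_classes(client)

    # Add the classes to the schema and upload the objects
    dataset.upload_dataset(client)
```

For example, refresh_dataset(weaviate_datasets.WineReviews(), client) would reload the wine reviews from scratch.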

Available classes

Not including vectors

  • WikiArticles (A handful of Wikipedia summaries)
  • WineReviews (50 wine reviews)

Including vectors

  • JeopardyQuestions1k (1,000 Jeopardy questions & answers, vectorized with OpenAI text-embedding-ada-002)
  • JeopardyQuestions10k (10,000 Jeopardy questions & answers, vectorized with OpenAI text-embedding-ada-002)
  • NewsArticles (News articles, including their corresponding publications, authors & categories, vectorized with OpenAI text-embedding-ada-002)

Data sources

  • https://www.kaggle.com/datasets/zynicide/wine-reviews
  • https://www.kaggle.com/datasets/tunguz/200000-jeopardy-questions
  • https://github.com/weaviate/DEMO-NewsPublications

Source code

https://github.com/databyjp/wv_demo_uploader
