Unofficial demo datasets for Weaviate

UNOFFICIAL Weaviate demo data uploader

This is an educational project that aims to make it easy to upload demo data to your instance of Weaviate. The target audience is developers learning how to use Weaviate.

Usage

pip install weaviate-demo-datasets

All datasets are based on the Dataset superclass, which provides a number of built-in methods that make the datasets easier to work with.

Each dataset includes a default vectorizer configuration for convenience, which can be:

  • viewed via the .get_class_definitions method and
  • changed via the .set_vectorizer method. The target Weaviate instance must include the specified vectorizer module.
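As a minimal sketch of viewing and then overriding the default vectorizer: the helper below is hypothetical (it is not part of the package) and relies only on the two methods named above. It assumes that .get_class_definitions() returns JSON-like data, which is not confirmed here.

```python
import json


def preview_and_set_vectorizer(dataset, vectorizer_name, module_config):
    """Print the dataset's current class definitions, then override its vectorizer.

    `dataset` is any object exposing .get_class_definitions() and
    .set_vectorizer(vectorizer_name, module_config).
    Returns the class definitions as reported after the change.
    """
    # Assumption: the definitions are JSON-like; default=str guards against other types.
    print(json.dumps(dataset.get_class_definitions(), indent=2, default=str))

    # The target Weaviate instance must have this vectorizer module enabled.
    dataset.set_vectorizer(vectorizer_name, module_config)
    return dataset.get_class_definitions()
```

For instance, it could be called as preview_and_set_vectorizer(weaviate_datasets.WineReviews(), "text2vec-cohere", {"text2vec-cohere": {}}). Both the module name and the shape of module_config in that call are illustrative assumptions, so check them against the modules available on your instance.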

Once you instantiate a dataset, you can upload it to Weaviate with the following:

import weaviate_datasets
dataset = weaviate_datasets.JeopardyQuestions1k()  # Instantiate dataset
dataset.upload_dataset(client)  # Add class to schema & upload objects (uses batch uploads by default)

Here, client is an instantiated weaviate.Client object, for example:

import weaviate
import os

wv_url = "https://some-endpoint.weaviate.network"  # URL of your Weaviate instance
openai_api_key = os.environ.get("OPENAI_API_KEY")  # OpenAI key, read from the environment

# If authentication required (e.g. using WCS)
auth = weaviate.AuthApiKey("your-weaviate-apikey")

client = weaviate.Client(
    url=wv_url,
    auth_client_secret=auth,  # If authentication required
    additional_headers={"X-OpenAI-Api-Key": openai_api_key},  # If using OpenAI inference
)

Built-in methods

  • .upload_dataset(client): Add the dataset's classes to the schema and upload its objects
  • .get_class_definitions(): See the schema definition to be added
  • .get_class_names(): See class names in the dataset
  • .get_sample(): See a sample data object
  • .classes_in_schema(client): Check whether each class is already in the Weaviate schema
  • .delete_existing_dataset_classes(client): Delete any of the dataset's classes that already exist in the Weaviate instance.
  • .set_vectorizer(vectorizer_name, module_config): Set the vectorizer and corresponding module configuration for the dataset. Datasets come pre-configured with a vectorizer & module configuration.
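The methods above can be combined into a simple "inspect, clean up, upload" routine. The sketch below is a hypothetical helper, not part of the package. Because the return types of these methods are not described here, it only prints their results instead of interpreting them.

```python
def refresh_dataset(dataset, client):
    """Inspect a dataset, remove any of its classes already in Weaviate, then upload it.

    `dataset` is an instantiated dataset (e.g. weaviate_datasets.WineReviews())
    and `client` is an instantiated weaviate.Client.
    """
    # Look at what is about to be uploaded.
    print("Classes:", dataset.get_class_names())
    print("Sample object:", dataset.get_sample())

    # Report which classes are already present; the result is printed as-is.
    print("Already in schema:", dataset.classes_in_schema(client))

    # Destructive step: removes the dataset's classes (and their objects) if present.
    dataset.delete_existing_dataset_classes(client)

    # Add the classes to the schema and upload the objects.
    dataset.upload_dataset(client)
```

Note that the delete step discards any existing data in those classes, so a routine like this is best kept to throwaway learning instances.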

Available classes

Not including vectors

  • WikiArticles (A handful of Wikipedia summaries)
  • WineReviews (50 wine reviews)
  • WineReviewsMT (50 wine reviews, multi-tenancy enabled)

Including vectors

  • JeopardyQuestions1k (1,000 Jeopardy questions & answers, vectorized with OpenAI text-embedding-ada-002)
  • JeopardyQuestions1kMT (1,000 Jeopardy questions & answers, multi-tenancy enabled, vectorized with OpenAI text-embedding-ada-002)
  • JeopardyQuestions10k (10,000 Jeopardy questions & answers, vectorized with OpenAI text-embedding-ada-002)
  • NewsArticles (News articles, including their corresponding publications, authors & categories, vectorized with OpenAI text-embedding-ada-002)

Data sources

  • https://www.kaggle.com/datasets/zynicide/wine-reviews
  • https://www.kaggle.com/datasets/tunguz/200000-jeopardy-questions
  • https://github.com/weaviate/DEMO-NewsPublications

Source code

https://github.com/databyjp/wv_demo_uploader

Download files

Source Distribution

weaviate-demo-datasets-0.2.0.tar.gz (71.2 MB)

Uploaded Source

Built Distribution

weaviate_demo_datasets-0.2.0-py3-none-any.whl (75.8 MB)

Uploaded Python 3

File details

Details for the file weaviate-demo-datasets-0.2.0.tar.gz.

File hashes

Hashes for weaviate-demo-datasets-0.2.0.tar.gz
Algorithm Hash digest
SHA256 25eb63985d3933ebe47c292f9d0d0637196230f725e9d142747c78eea88403e2
MD5 4cfc938386e05373906fa843d4a60ac8
BLAKE2b-256 b7c99068b282c82057ba81ab1216aa56354593c6344dd6e8109575a75620b22f
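
A downloaded copy of the archive can be checked against the SHA256 digest listed above. The following is a minimal sketch using only the Python standard library. The function names are hypothetical, and the local file path is whatever location you downloaded the file to.

```python
import hashlib


def sha256_of_file(path, chunk_size=1 << 20):
    """Return the hex SHA256 digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        # Read in 1 MiB chunks so large archives are not loaded into memory at once.
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_expected(path, expected_hex):
    """Compare a file's SHA256 digest with an expected hex string."""
    return sha256_of_file(path) == expected_hex.strip().lower()
```

Calling matches_expected with the path to the downloaded weaviate-demo-datasets-0.2.0.tar.gz and the SHA256 value from the table should return True for an intact download.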

File details

Details for the file weaviate_demo_datasets-0.2.0-py3-none-any.whl.

File hashes

Hashes for weaviate_demo_datasets-0.2.0-py3-none-any.whl
Algorithm Hash digest
SHA256 89178bb99ce58a7559e5ba980782d22101fc741904fa8260eccc6d4f5f7f8659
MD5 41902742fc9604d0edb44522d4a247d3
BLAKE2b-256 2a54031c2f712b3c2517778ec8024bc4ff7343b28d70f5132b27d3d734813c04
