Skip to main content

Unofficial demo datasets for Weaviate

Project description

UNOFFICIAL Weaviate demo data uploader

This is an educational project that aims to make it easy to upload demo data to your instance of Weaviate. The target audience is developers learning how to use Weaviate.

Usage

pip install weaviate-demo-datasets

All datasets are based on the Dataset superclass, which includes a number of built-in methods to make it easier to work with it.

Each dataset includes a default vectorizer configuration for convenience, which can be:

  • viewed via the .get_class_definitions method and
  • changed via the .set_vectorizer method. The target Weaviate instance must include the specified vectorizer module.

Once you instantiate a dataset, you can upload it to Weaviate with the following:

import weaviate_datasets
dataset = weaviate_datasets.JeopardyQuestions10k()  # Instantiate dataset
dataset.upload_dataset(client)  # Add class to schema & upload objects (uses batch uploads by default)

Where client is the instantiated weaviate.Client object, such as:

import weaviate
import os
import json

wv_url = "https://some-endpoint.weaviate.network"
api_key = os.environ.get("OPENAI_API_KEY")

# If authentication required (e.g. using WCS)
auth = weaviate.AuthClientPassword(
    username=os.environ.get("WCS_USER"),
    password=os.environ.get("WCS_PASS"),
)

client = weaviate.Client(
    url=wv_url,
    auth_client_secret=auth,  # If authentication required
    additional_headers={"X-OpenAI-Api-Key": api_key},  # If using OpenAI inference
)

Built-in methods

  • .upload_dataset(client) - add defined classes to schema, adds objects

  • .get_class_definitions(): See the schema definition to be added

  • .get_class_names(): See class names in the dataset

  • .get_sample(): See a sample data object

  • .classes_in_schema(client): Check whether each class is already in the Weaviate schema

  • .delete_existing_dataset_classes(client): If dataset classes are already in the Weaviate instance, delete them from the Weaviate instance.

  • .set_vectorizer(vectorizer_name, module_config): Set the vectorizer and corresponding module configuration for the dataset. Datasets come pre-configured with a vectorizer & module configuration.

Available classes

Not including vectors

  • WikiArticles (A handful of Wikipedia summaries)
  • WineReviews (50 wine reviews)

Including vectors

  • JeopardyQuestions1k (1,000 Jeopardy questions & answers, vectorized with OpenAI text-embedding-ada-002)
  • JeopardyQuestions10k (10,000 Jeopardy questions & answers, vectorized with OpenAI text-embedding-ada-002)
  • NewsArticles (News articles, including their corresponding publications, authors & categories, vectorized with OpenAI text-embedding-ada-002)

Data sources

https://www.kaggle.com/datasets/zynicide/wine-reviews https://www.kaggle.com/datasets/tunguz/200000-jeopardy-questions https://github.com/weaviate/DEMO-NewsPublications

Source code

https://github.com/databyjp/wv_demo_uploader

Project details


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

weaviate-demo-datasets-0.1.0.tar.gz (71.2 MB view details)

Uploaded Source

Built Distribution

weaviate_demo_datasets-0.1.0-py3-none-any.whl (75.8 MB view details)

Uploaded Python 3

File details

Details for the file weaviate-demo-datasets-0.1.0.tar.gz.

File metadata

File hashes

Hashes for weaviate-demo-datasets-0.1.0.tar.gz
Algorithm Hash digest
SHA256 65e3b56ed71af66a8273c95b3bcdc7a295ed74a1bca478a4c381091646dff2d6
MD5 a6bacabdb60bec200567aa3133ec227b
BLAKE2b-256 d7ce7187edf00d6c8ba2a6b69b029d6f80359d8c20fe424e7e668b01297613e1

See more details on using hashes here.

File details

Details for the file weaviate_demo_datasets-0.1.0-py3-none-any.whl.

File metadata

File hashes

Hashes for weaviate_demo_datasets-0.1.0-py3-none-any.whl
Algorithm Hash digest
SHA256 9ab1ead9a05fe919e0b461c94ffc117d65f29390a8ea9ec11a6de06d8bf27589
MD5 7954617199bafbbd81a1e7f5bd6ed86e
BLAKE2b-256 16e04c9ef8ee8566923a1ed25db82920249e038b80b5d63913e30cefb746a54a

See more details on using hashes here.

Supported by

AWS AWS Cloud computing and Security Sponsor Datadog Datadog Monitoring Fastly Fastly CDN Google Google Download Analytics Microsoft Microsoft PSF Sponsor Pingdom Pingdom Monitoring Sentry Sentry Error logging StatusPage StatusPage Status page