Unofficial demo datasets for Weaviate

UNOFFICIAL Weaviate demo data uploader

This is an educational project that aims to make it easy to upload demo data to your instance of Weaviate. The target audience is developers learning how to use Weaviate.

Usage

pip install -U weaviate-demo-datasets

Each dataset includes a default vectorizer configuration for convenience. The target Weaviate instance must include the specified vectorizer module.

Once you instantiate a dataset, you can upload it to Weaviate with the following:

import weaviate_datasets as wd
dataset = wd.JeopardyQuestions1k()  # Instantiate dataset
dataset.upload_dataset(client)  # Pass the Weaviate client instance

Here, client is an instantiated weaviate.WeaviateClient object, such as:

import weaviate
import os

client = weaviate.connect_to_local(
    headers={"X-OpenAI-Api-Key": os.getenv("OPENAI_APIKEY")}
)
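
If the OPENAI_APIKEY environment variable is unset, os.getenv returns None and the header value in the example above would be None as well. A minimal sketch of a guard that fails early instead; build_openai_headers is a hypothetical helper for illustration, not part of weaviate_datasets or the Weaviate client:

```python
import os


def build_openai_headers(env_var: str = "OPENAI_APIKEY") -> dict[str, str]:
    """Return the OpenAI header dict, or raise if the key is not set.

    Hypothetical helper: the header name and the variable name are
    taken from the connection example above.
    """
    api_key = os.getenv(env_var)
    if not api_key:
        raise RuntimeError(f"Environment variable {env_var} is not set")
    return {"X-OpenAI-Api-Key": api_key}
```

The returned dict can then be passed as headers=build_openai_headers() to weaviate.connect_to_local.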

To use a weaviate.Client object, as used in the Weaviate Python client v3.x, import the dataset class from weaviate_datasets.v3_datasets instead:

import weaviate_datasets.v3_datasets as wd_v3
dataset = wd_v3.JeopardyQuestions1k()  # Instantiate dataset
dataset.upload_dataset(client)  # Pass the Weaviate client instance

Built-in methods

  • .upload_dataset(client) - adds the defined classes to the schema and adds the objects
  • .get_sample() - yields sample data object(s)
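
The two methods can be combined to preview a few objects before uploading. The sketch below assumes only what the list above states: .get_sample() yields objects and .upload_dataset(client) performs the upload. preview_and_upload is a hypothetical helper, and the exact return types of both methods are not specified here.

```python
from itertools import islice
from typing import Any


def preview_and_upload(dataset: Any, client: Any, n_preview: int = 1) -> Any:
    """Print up to n_preview sample objects, then upload the dataset.

    Hypothetical helper: relies only on dataset.get_sample() being
    iterable and dataset.upload_dataset(client) accepting a client.
    """
    for sample in islice(dataset.get_sample(), n_preview):
        print(sample)
    # Return whatever upload_dataset returns, without interpreting it
    return dataset.upload_dataset(client)
```

For example, preview_and_upload(wd.JeopardyQuestions1k(), client) would print one sample object and then upload the dataset.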

Available classes

  • Wiki100 (Top 100 Wikipedia articles)

    • WikiChunk collection
    • Various chunking options available:
      • Default: wiki_sections (sections of the Wikipedia article)
      • wiki_section_chunked (sections of the Wikipedia article, split into 200-character chunks)
      • wiki_heading_only (only the headings of the Wikipedia article sections)
      • fixed (fixed-length chunks of 200 characters)
    • Use it as follows:
      d = wd.Wiki100()
      d.collection_name = "WikiChunk"
      d.set_chunking("wiki_section_chunked")
      upload_responses = d.upload_dataset(client, overwrite=True)
      
  • WineReviews (50 wine reviews)

    • WineReview collection
  • WineReviewsNV (50 wine reviews)

    • WineReviewNV collection, with named vectors ("title", "review_body", and "title_country")
      • "title_country" -> Vector from concatenation of "title" + "country"
  • WineReviewsMT (50 wine reviews)

    • WineReviewMT collection, tenants tenantA and tenantB
  • JeopardyQuestions1k (1,000 Jeopardy questions & answers, vectorized with OpenAI text-embedding-ada-002)

    • JeopardyQuestion and JeopardyCategory collections
  • JeopardyQuestions10k (10,000 Jeopardy questions & answers, vectorized with OpenAI text-embedding-ada-002)

    • JeopardyQuestion and JeopardyCategory collections
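
To illustrate what the fixed chunking option for Wiki100 means, here is a minimal sketch of fixed-length chunking at 200 characters. It is not the package's implementation: chunk_fixed is a hypothetical function, and details such as overlap or whitespace handling in the real chunker are not covered here.

```python
def chunk_fixed(text: str, chunk_size: int = 200) -> list[str]:
    """Split text into consecutive chunks of at most chunk_size characters."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]


chunks = chunk_fixed("x" * 450)
print([len(c) for c in chunks])  # [200, 200, 50]
```

The last chunk is shorter whenever the text length is not a multiple of the chunk size.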

Available classes - V3 collection

These are available with a V3 suffix, and are compatible with the Weaviate Python client v3.x.

Not including vectors

  • WineReviews (50 wine reviews)
  • WineReviewsMT (50 wine reviews, multi-tenancy enabled)

Including vectors

  • JeopardyQuestions1k (1,000 Jeopardy questions & answers, vectorized with OpenAI text-embedding-ada-002)
  • JeopardyQuestions10k (10,000 Jeopardy questions & answers, vectorized with OpenAI text-embedding-ada-002)
  • JeopardyQuestions1kMT (1,000 Jeopardy questions & answers, multi-tenancy enabled, vectorized with OpenAI text-embedding-ada-002)
  • NewsArticles (News articles, including their corresponding publications, authors & categories, vectorized with OpenAI text-embedding-ada-002)

Data sources

  • https://www.kaggle.com/datasets/zynicide/wine-reviews
  • https://www.kaggle.com/datasets/tunguz/200000-jeopardy-questions
  • https://github.com/weaviate/DEMO-NewsPublications

Source code

https://github.com/databyjp/wv_demo_uploader
