
Load datasets to explore Pinecone


Pinecone Datasets

Install

pip install pinecone-datasets

Loading public datasets

Pinecone hosts a public catalog of datasets. You can list them and load one by name using the list_datasets and load_dataset functions, which use the default catalog endpoint (currently GCS).

from pinecone_datasets import list_datasets, load_dataset

list_datasets()
# ["quora_all-MiniLM-L6-bm25", ... ]

dataset = load_dataset("quora_all-MiniLM-L6-bm25")

dataset.head()

# Prints
# ┌─────┬───────────────────────────┬─────────────────────────────────────┬───────────────────┬──────┐
# │ id  ┆ values                    ┆ sparse_values                       ┆ metadata          ┆ blob │
# │     ┆                           ┆                                     ┆                   ┆      │
# │ str ┆ list[f32]                 ┆ struct[2]                           ┆ struct[3]         ┆      │
# ╞═════╪═══════════════════════════╪═════════════════════════════════════╪═══════════════════╪══════╡
# │ 0   ┆ [0.118014, -0.069717, ... ┆ {[470065541, 52922727, ... 22364... ┆ {2017,12,"other"} ┆ .... │
# │     ┆ 0.0060...                 ┆                                     ┆                   ┆      │
# └─────┴───────────────────────────┴─────────────────────────────────────┴───────────────────┴──────┘

Usage - Accessing data

Each dataset has three main attributes: documents, queries, and metadata. These are lazily loaded, so you may notice a delay the first time you access them while the underlying Parquet files are downloaded.

Pinecone Datasets is built on top of pandas: documents and queries are lazily loaded pandas dataframes, so the full pandas API is available for working with the data. In addition, some helper functions make access more convenient.

The documents and queries dataframes are exposed through properties of the same name, which load the data only when first accessed:

import pandas as pd

from pinecone_datasets import load_dataset

dataset = load_dataset("quora_all-MiniLM-L6-bm25")

document_df: pd.DataFrame = dataset.documents

query_df: pd.DataFrame = dataset.queries
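Because these are ordinary pandas DataFrames, standard pandas operations apply directly. The snippet below is a self-contained illustration using a tiny stand-in frame with the same column layout shown in the head() output above (the struct field names inside sparse_values and metadata are illustrative, not taken from the dataset):

```python
import pandas as pd

# A miniature stand-in for dataset.documents, matching the schema above
documents = pd.DataFrame({
    "id": ["0", "1"],
    "values": [[0.118014, -0.069717], [0.22, -0.01]],
    "sparse_values": [
        {"indices": [470065541], "values": [0.5]},
        {"indices": [52922727], "values": [0.3]},
    ],
    "metadata": [{"year": 2017}, {"year": 2018}],
})

# Any pandas operation works: column selection, filtering, iteration, ...
ids = documents["id"].tolist()
print(ids)  # ['0', '1']

subset = documents[documents["id"] == "0"]
print(len(subset))  # 1
```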

Usage - Iterating over documents

The Dataset class has helpers for iterating over your dataset. This is useful for upserting a dataset to an index, or for benchmarking.

# List iterator: each item is a list of up to batch_size dicts
# with keys ("id", "values", "sparse_values", "metadata")
dataset.iter_documents(batch_size=n)

# Dict iterator: each dict has keys ("vector", "sparse_vector", "filter", "top_k")
dataset.iter_queries()
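As a rough, offline sketch of what iter_documents(batch_size=n) yields (assumed behavior for illustration, not the library's exact implementation), batching a dataframe into lists of row dicts looks like this:

```python
import pandas as pd

def iter_batches(df: pd.DataFrame, batch_size: int):
    """Yield successive lists of row dicts, mimicking iter_documents."""
    records = df.to_dict(orient="records")
    for start in range(0, len(records), batch_size):
        yield records[start:start + batch_size]

df = pd.DataFrame({"id": ["a", "b", "c"], "values": [[0.1], [0.2], [0.3]]})
batches = list(iter_batches(df, batch_size=2))
print([len(b) for b in batches])  # [2, 1]
```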

Upserting to an index

To upsert data to an index, first install the Pinecone SDK:

pip install pinecone

from pinecone import Pinecone, ServerlessSpec
from pinecone_datasets import load_dataset, list_datasets

# See what datasets are available
for ds in list_datasets():
    print(ds)

# Download embeddings data
dataset = load_dataset("quora_all-MiniLM-L6-bm25")

# Instantiate a Pinecone client using API key from app.pinecone.io
pc = Pinecone(api_key="YOUR_API_KEY")

# Create a Pinecone index
index_config = pc.create_index(
    name="demo-index",
    dimension=dataset.metadata.dense_model.dimension,
    spec=ServerlessSpec(cloud="aws", region="us-east-1")
)

# Instantiate an index client
index = pc.Index(host=index_config.host)

# Upsert data from the dataset
index.upsert_from_dataframe(df=dataset.documents)
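The dicts yielded by iter_queries() map naturally onto keyword arguments for a query call against the index. A hedged, offline sketch of that unpacking (fake_query below is a stand-in for illustration, not the SDK's index.query):

```python
# A query record shaped like those yielded by iter_queries()
q = {
    "vector": [0.1, 0.2, 0.3],
    "filter": {"year": {"$eq": 2017}},
    "top_k": 5,
}

# With a live index this would be: index.query(**q)
# Here a stand-in demonstrates the keyword unpacking:
def fake_query(vector, top_k, filter=None, sparse_vector=None):
    return {"matches": [], "requested_k": top_k}

result = fake_query(**q)
print(result["requested_k"])  # 5
```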

Project details

Download files

Source distribution: pinecone_datasets-1.0.2.tar.gz (10.5 kB)
Built distribution: pinecone_datasets-1.0.2-py3-none-any.whl (12.7 kB)

