
Load datasets to explore Pinecone


Pinecone Datasets

Install

pip install pinecone-datasets

Loading public datasets

Pinecone hosts a public catalog of datasets. You can list them and load one by name using the list_datasets and load_dataset functions, which use the default catalog endpoint (currently GCS).

from pinecone_datasets import list_datasets, load_dataset

list_datasets()
# ["quora_all-MiniLM-L6-bm25", ... ]

dataset = load_dataset("quora_all-MiniLM-L6-bm25")

dataset.head()

# Prints
# ┌─────┬───────────────────────────┬─────────────────────────────────────┬───────────────────┬──────┐
# │ id  ┆ values                    ┆ sparse_values                       ┆ metadata          ┆ blob │
# │     ┆                           ┆                                     ┆                   ┆      │
# │ str ┆ list[f32]                 ┆ struct[2]                           ┆ struct[3]         ┆      │
# ╞═════╪═══════════════════════════╪═════════════════════════════════════╪═══════════════════╪══════╡
# │ 0   ┆ [0.118014, -0.069717, ... ┆ {[470065541, 52922727, ... 22364... ┆ {2017,12,"other"} ┆ .... │
# │     ┆ 0.0060...                 ┆                                     ┆                   ┆      │
# └─────┴───────────────────────────┴─────────────────────────────────────┴───────────────────┴──────┘

Usage - Accessing data

Each dataset has three main attributes: documents, queries, and metadata. These are lazily loaded, so you may notice a delay the first time you access them while the underlying Parquet files are downloaded.

Pinecone Datasets is built on top of pandas: documents and queries are lazily loaded pandas dataframes, so the full pandas API is available for working with the data. In addition, some helper functions make access more convenient.

The documents and queries dataframes are exposed through properties of the same name, which load the data only when first accessed:

import pandas as pd

from pinecone_datasets import load_dataset

dataset = load_dataset("quora_all-MiniLM-L6-bm25")

document_df: pd.DataFrame = dataset.documents

query_df: pd.DataFrame = dataset.queries
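Because these are ordinary pandas DataFrames, standard pandas operations apply directly. The snippet below is a self-contained illustration using a tiny stand-in frame with the same column layout shown in the head() output above (the struct field names inside sparse_values and metadata are illustrative, not taken from the dataset):

```python
import pandas as pd

# A miniature stand-in for dataset.documents, matching the schema above
documents = pd.DataFrame({
    "id": ["0", "1"],
    "values": [[0.118014, -0.069717], [0.22, -0.01]],
    "sparse_values": [
        {"indices": [470065541], "values": [0.5]},
        {"indices": [52922727], "values": [0.3]},
    ],
    "metadata": [{"year": 2017}, {"year": 2018}],
})

# Any pandas operation works: column selection, filtering, iteration, ...
ids = documents["id"].tolist()
print(ids)  # ['0', '1']

subset = documents[documents["id"] == "0"]
print(len(subset))  # 1
```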

Usage - Iterating over documents

The Dataset class has helpers for iterating over your dataset. This is useful for upserting a dataset to an index, or for benchmarking.

# List iterator: each item is a list of up to batch_size dicts
# with keys ("id", "values", "sparse_values", "metadata")
dataset.iter_documents(batch_size=n)

# Dict iterator: each dict has keys ("vector", "sparse_vector", "filter", "top_k")
dataset.iter_queries()
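As a rough, offline sketch of what iter_documents(batch_size=n) yields (assumed behavior for illustration, not the library's exact implementation), batching a dataframe into lists of row dicts looks like this:

```python
import pandas as pd

def iter_batches(df: pd.DataFrame, batch_size: int):
    """Yield successive lists of row dicts, mimicking iter_documents."""
    records = df.to_dict(orient="records")
    for start in range(0, len(records), batch_size):
        yield records[start:start + batch_size]

df = pd.DataFrame({"id": ["a", "b", "c"], "values": [[0.1], [0.2], [0.3]]})
batches = list(iter_batches(df, batch_size=2))
print([len(b) for b in batches])  # [2, 1]
```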

Upserting to an index

To upsert data to an index, first install the Pinecone SDK:

pip install pinecone

from pinecone import Pinecone, ServerlessSpec
from pinecone_datasets import load_dataset, list_datasets

# See what datasets are available
for ds in list_datasets():
    print(ds)

# Download embeddings data
dataset = load_dataset("quora_all-MiniLM-L6-bm25")

# Instantiate a Pinecone client using API key from app.pinecone.io
pc = Pinecone(api_key="YOUR_API_KEY")

# Create a Pinecone index
index_config = pc.create_index(
    name="demo-index",
    dimension=dataset.metadata.dense_model.dimension,
    spec=ServerlessSpec(cloud="aws", region="us-east-1")
)

# Instantiate an index client
index = pc.Index(host=index_config.host)

# Upsert data from the dataset
index.upsert_from_dataframe(df=dataset.documents)
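The dicts yielded by iter_queries() map naturally onto keyword arguments for a query call against the index. A hedged, offline sketch of that unpacking (fake_query below is a stand-in for illustration, not the SDK's index.query):

```python
# A query record shaped like those yielded by iter_queries()
q = {
    "vector": [0.1, 0.2, 0.3],
    "filter": {"year": {"$eq": 2017}},
    "top_k": 5,
}

# With a live index this would be: index.query(**q)
# Here a stand-in demonstrates the keyword unpacking:
def fake_query(vector, top_k, filter=None, sparse_vector=None):
    return {"matches": [], "requested_k": top_k}

result = fake_query(**q)
print(result["requested_k"])  # 5
```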

Project details

Download files

Source distribution: pinecone_datasets-1.0.2.tar.gz (10.5 kB)
Built distribution: pinecone_datasets-1.0.2-py3-none-any.whl (12.7 kB)

