Pinecone Datasets

Load datasets to explore Pinecone.
Install
pip install pinecone-datasets
Loading public datasets
Pinecone hosts a public catalog of datasets. You can list the available datasets with the list_datasets function and load one by name with load_dataset. Both use the default catalog endpoint (currently GCS).
from pinecone_datasets import list_datasets, load_dataset
list_datasets()
# ["quora_all-MiniLM-L6-bm25", ... ]
dataset = load_dataset("quora_all-MiniLM-L6-bm25")
dataset.head()
# Prints
# ┌─────┬───────────────────────────┬─────────────────────────────────────┬───────────────────┬──────┐
# │ id ┆ values ┆ sparse_values ┆ metadata ┆ blob │
# │ ┆ ┆ ┆ ┆ │
# │ str ┆ list[f32] ┆ struct[2] ┆ struct[3] ┆ │
# ╞═════╪═══════════════════════════╪═════════════════════════════════════╪═══════════════════╪══════╡
# │ 0 ┆ [0.118014, -0.069717, ... ┆ {[470065541, 52922727, ... 22364... ┆ {2017,12,"other"} ┆ .... │
# │ ┆ 0.0060... ┆ ┆ ┆ │
# └─────┴───────────────────────────┴─────────────────────────────────────┴───────────────────┴──────┘
Usage - Accessing data
Each dataset has three main attributes: documents, queries, and metadata. These are loaded lazily the first time they are accessed, so you may notice a delay while the underlying Parquet files are downloaded.
Pinecone Datasets is built on top of pandas: documents and queries are lazily loaded pandas DataFrames, which means the full pandas API is available for working with the data. In addition, some helper functions make common access patterns more convenient.
The documents and queries DataFrames are accessed through the properties of the same name:
import pandas as pd

from pinecone_datasets import load_dataset

dataset = load_dataset("quora_all-MiniLM-L6-bm25")
document_df: pd.DataFrame = dataset.documents
query_df: pd.DataFrame = dataset.queries
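Because documents and queries are plain pandas DataFrames, any pandas operation applies. A minimal sketch of the idea, using a hand-built toy DataFrame with the documents column layout in place of a downloaded dataset so it runs offline:

```python
import pandas as pd

# Toy stand-in for dataset.documents; a real dataset is downloaded
# via load_dataset and has the same column layout.
documents = pd.DataFrame({
    "id": ["0", "1", "2"],
    "values": [[0.1, 0.2], [0.3, 0.4], [0.5, 0.6]],
    "metadata": [{"year": 2017}, {"year": 2018}, {"year": 2017}],
})

# Any pandas API works: boolean indexing, apply, selection, etc.
from_2017 = documents[documents["metadata"].apply(lambda m: m["year"] == 2017)]
print(len(from_2017))  # 2

# Check the vector dimensions present in the data
print(documents["values"].apply(len).unique())  # [2]
```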
Usage - Iterating over documents
The Dataset class has helpers for iterating over your dataset. This is useful for upserting a dataset to an index, or for benchmarking.
# List iterator: yields lists of up to batch_size dicts,
# each with keys ("id", "values", "sparse_values", "metadata")
dataset.iter_documents(batch_size=n)
# Dict iterator: yields dicts with keys ("vector", "sparse_vector", "filter", "top_k")
dataset.iter_queries()
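The typical upsert loop over iter_documents looks like the sketch below. The hard-coded batches stand in for what dataset.iter_documents(batch_size=2) would yield (same keys, same list-of-dicts shape), so the sketch runs standalone without a download or an index:

```python
# Stand-in for dataset.iter_documents(batch_size=2): lists of dicts
# with keys ("id", "values", "sparse_values", "metadata").
fake_batches = [
    [{"id": "0", "values": [0.1, 0.2], "sparse_values": None, "metadata": {}},
     {"id": "1", "values": [0.3, 0.4], "sparse_values": None, "metadata": {}}],
    [{"id": "2", "values": [0.5, 0.6], "sparse_values": None, "metadata": {}}],
]

upserted = 0
for batch in fake_batches:
    # With a real index this would be: index.upsert(vectors=batch)
    upserted += len(batch)

print(upserted)  # 3
```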
Upserting to Index
To upsert data to an index, first install the Pinecone Python SDK (pip install pinecone).
from pinecone import Pinecone, ServerlessSpec
from pinecone_datasets import list_datasets, load_dataset

# See what datasets are available
for ds in list_datasets():
    print(ds)

# Download embeddings data
dataset = load_dataset("quora_all-MiniLM-L6-bm25")

# Instantiate a Pinecone client using an API key from app.pinecone.io
pc = Pinecone(api_key="key")

# Create a Pinecone index sized to the dataset's dense embeddings
index_config = pc.create_index(
    name="demo-index",
    dimension=dataset.metadata.dense_model.dimension,
    spec=ServerlessSpec(cloud="aws", region="us-east-1"),
)

# Instantiate an index client
index = pc.Index(host=index_config.host)

# Upsert data from the dataset
index.upsert_from_dataframe(df=dataset.documents)
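After upserting, the queries side of the dataset can drive retrieval or benchmarking against the same index. A sketch of the pattern, using a stub index and hand-written query dicts (matching the shape iter_queries yields) so it runs standalone; with a live Pinecone index the query call takes the same vector/top_k/filter arguments:

```python
# Stub standing in for a live Pinecone index client.
class StubIndex:
    def query(self, vector, top_k, filter=None):
        # Returns a fake result set of top_k matches.
        return {"matches": [{"id": str(i), "score": 1.0} for i in range(top_k)]}

stub_index = StubIndex()

# Hand-written stand-ins for what dataset.iter_queries() yields.
queries = [
    {"vector": [0.1, 0.2], "sparse_vector": None, "filter": None, "top_k": 2},
    {"vector": [0.5, 0.6], "sparse_vector": None, "filter": None, "top_k": 2},
]

results = [
    stub_index.query(vector=q["vector"], top_k=q["top_k"], filter=q["filter"])
    for q in queries
]
print(len(results))  # one result set per query
```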
File details
Details for the file pinecone_datasets-1.0.2.tar.gz.
File metadata
- Download URL: pinecone_datasets-1.0.2.tar.gz
- Upload date:
- Size: 10.5 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: poetry/1.5.0 CPython/3.12.3 Linux/6.8.0-1021-azure
File hashes

| Algorithm | Hash digest |
|---|---|
| SHA256 | 75f97fa4fe913583ecf9d7115ab0f324dcf0211658c472394b57657a0be78e50 |
| MD5 | 9cb1aeb404286067a1b3dd0d04d97785 |
| BLAKE2b-256 | 4d46ad45dc3c1d5236f2f11c29cb9b656a11af4ce7e9dd807e86223cb03976a2 |
File details
Details for the file pinecone_datasets-1.0.2-py3-none-any.whl.
File metadata
- Download URL: pinecone_datasets-1.0.2-py3-none-any.whl
- Upload date:
- Size: 12.7 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: poetry/1.5.0 CPython/3.12.3 Linux/6.8.0-1021-azure
File hashes

| Algorithm | Hash digest |
|---|---|
| SHA256 | e0be0e2a6ab720d9a43243882f7f352af9cebf62193dadf6f2fa4188d012d072 |
| MD5 | 98b758f6d9bbf88088b68e59c82420fc |
| BLAKE2b-256 | ca94e594376517c2152560144e122f4274c69bbb2558f2b3dc2b5f2b6dec21fc |