Pinecone Datasets lets you easily load datasets into your Pinecone index.

Pinecone Datasets

Usage

You can use Pinecone Datasets to load our public datasets or to work with your own dataset.

Loading Pinecone Public Datasets

from datasets import list_datasets, load_dataset

list_datasets()
# ["cc-news_msmarco-MiniLM-L6-cos-v5", ... ]

dataset = load_dataset("cc-news_msmarco-MiniLM-L6-cos-v5")

dataset.head()

# Prints
┌─────┬───────────────────────────┬─────────────────────────────────────┬───────────────────┬──────┐
│ id  ┆ values                    ┆ sparse_values                       ┆ metadata          ┆ blob │
│ --- ┆ ---                       ┆ ---                                 ┆ ---               ┆ ---  │
│ str ┆ list[f32]                 ┆ struct[2]                           ┆ struct[3]         ┆      │
╞═════╪═══════════════════════════╪═════════════════════════════════════╪═══════════════════╪══════╡
│ 0   ┆ [0.118014, -0.069717, ... ┆ {[470065541, 52922727, ... 22364... ┆ {2017,12,"other"} ┆ .... │
│     ┆ 0.0060...                 ┆                                     ┆                   ┆      │
└─────┴───────────────────────────┴─────────────────────────────────────┴───────────────────┴──────┘

Iterating over a Dataset's documents

# Returns an iterator of lists. Each list holds up to n dicts
# with the keys "id", "metadata", "values", and "sparse_values".
dataset.iter_documents(batch_size=n)
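
The batching described above can be pictured with a small pure-Python sketch. This is not the library's implementation: `iter_batches` and `sample_documents` are hypothetical names, and the layout of `sparse_values` (an `indices` list and a `values` list) is an assumption based on Pinecone's sparse vector format. The sketch only shows the shape of what each iteration yields, namely a list of at most `batch_size` dicts.

```python
from itertools import islice
from typing import Any, Dict, Iterable, Iterator, List

Document = Dict[str, Any]


def iter_batches(documents: Iterable[Document], batch_size: int) -> Iterator[List[Document]]:
    """Yield consecutive lists of at most `batch_size` documents."""
    if batch_size < 1:
        raise ValueError("batch_size must be a positive integer")
    iterator = iter(documents)
    while True:
        batch = list(islice(iterator, batch_size))
        if not batch:
            return
        yield batch


# Hypothetical documents carrying the four keys listed above.
sample_documents = [
    {
        "id": str(i),
        "values": [0.0, 0.0, 0.0],
        # Assumed layout for sparse values: parallel index and value lists.
        "sparse_values": {"indices": [1, 2], "values": [0.5, 0.5]},
        "metadata": {},
    }
    for i in range(5)
]

batch_sizes = [len(batch) for batch in iter_batches(sample_documents, batch_size=2)]
print(batch_sizes)  # prints [2, 2, 1]
```

The last batch is shorter whenever the number of documents is not a multiple of the batch size, so code that consumes batches should not assume a fixed length.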

Upserting to an Index

Install the Pinecone client first:

pip install pinecone-client

Then create an index and upsert the documents:

import pinecone
pinecone.init(api_key="API_KEY", environment="us-west1-gcp")

pinecone.create_index(name="my-index", dimension=384, pod_type='s1')

index = pinecone.Index("my-index")

# Iterate over the documents in batches and upsert each batch
for batch in dataset.iter_documents(batch_size=100):
    index.upsert(vectors=batch)
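
Because the index above is created with dimension=384, every vector in a batch needs exactly 384 values. The following is a minimal sketch of a check that could run before each upsert, assuming batches are lists of dicts with "id" and "values" keys as described earlier. `upsert_batches` and `RecordingIndex` are hypothetical names; `RecordingIndex` stands in for a real index so the example runs without an API key.

```python
from typing import Any, Dict, Iterable, List

Document = Dict[str, Any]


class RecordingIndex:
    """Hypothetical stand-in for an index: it only records what it receives."""

    def __init__(self) -> None:
        self.received: List[Document] = []

    def upsert(self, vectors: List[Document]) -> None:
        self.received.extend(vectors)


def upsert_batches(index: Any, batches: Iterable[List[Document]], dimension: int) -> int:
    """Check vector lengths, upsert each batch, and return the number of documents sent."""
    total = 0
    for batch in batches:
        for doc in batch:
            if len(doc["values"]) != dimension:
                raise ValueError(
                    f"document {doc['id']!r} has {len(doc['values'])} values, "
                    f"expected {dimension}"
                )
        index.upsert(vectors=batch)
        total += len(batch)
    return total


index = RecordingIndex()
batches = [
    [{"id": "a", "values": [0.0] * 384}, {"id": "b", "values": [0.0] * 384}],
    [{"id": "c", "values": [0.0] * 384}],
]
sent = upsert_batches(index, batches, dimension=384)
print(sent)  # prints 3
```

With a real index, the same helper could be given dataset.iter_documents(batch_size=100) as its batches argument.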

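Upserting to an Index with gRPC

To upsert over gRPC, use GRPCIndex in place of Index (this requires the client's gRPC extras, installed with pip install "pinecone-client[grpc]"):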

index = pinecone.GRPCIndex("my-index")

# Iterating over documents in batches
for batch in dataset.iter_documents(batch_size=100):
    index.upsert(vectors=batch)
