
Pinecone Datasets lets you easily load datasets into your Pinecone index.


Pinecone Datasets

Install

pip install pinecone-datasets

Usage - Loading

You can use Pinecone Datasets to load our public datasets or your own datasets. The library can be used in two main ways: ad-hoc loading of a dataset from a path, or as a catalog loader for datasets.

Loading Pinecone Public Datasets (catalog)

Pinecone hosts a public datasets catalog; you can list the available datasets and load one by name using the list_datasets and load_dataset functions. These use the default catalog endpoint (currently GCS) to list and load datasets.

from pinecone_datasets import list_datasets, load_dataset

list_datasets()
# ["quora_all-MiniLM-L6-bm25", ... ]

dataset = load_dataset("quora_all-MiniLM-L6-bm25")

dataset.head()

# Prints
# ┌─────┬───────────────────────────┬─────────────────────────────────────┬───────────────────┬──────┐
# │ id  ┆ values                    ┆ sparse_values                       ┆ metadata          ┆ blob │
# │     ┆                           ┆                                     ┆                   ┆      │
# │ str ┆ list[f32]                 ┆ struct[2]                           ┆ struct[3]         ┆      │
# ╞═════╪═══════════════════════════╪═════════════════════════════════════╪═══════════════════╪══════╡
# │ 0   ┆ [0.118014, -0.069717, ... ┆ {[470065541, 52922727, ... 22364... ┆ {2017,12,"other"} ┆ .... │
# │     ┆ 0.0060...                 ┆                                     ┆                   ┆      │
# └─────┴───────────────────────────┴─────────────────────────────────────┴───────────────────┴──────┘

Expected dataset structure

Pinecone Datasets can load a dataset from any storage it has access to (using the default credentials for S3, GCS, or the local filesystem).

Data is expected to be uploaded with the following directory structure:

├── my-subdir                     # path to where all datasets are stored
│   ├── my-dataset                # name of dataset
│   │   ├── metadata.json         # dataset metadata (optional, only for listed)
│   │   ├── documents             # datasets documents
│   │   │   ├── file1.parquet      
│   │   │   └── file2.parquet      
│   │   ├── queries               # dataset queries
│   │   │   ├── file1.parquet  
│   │   │   └── file2.parquet   
└── ...

The data schema is expected to be as follows (see the sketch after this list):

  • documents directory contains parquet files with the following schema:
    • Mandatory: id: str, values: list[float]
    • Optional: sparse_values: Dict: indices: List[int], values: List[float], metadata: Dict, blob: dict
      • note: blob is a dict that can contain any data. It is not returned when iterating over the dataset and is intended to be used for storing additional data that is not part of the dataset schema. However, it is sometimes useful to store such additional data in the dataset, for example a document's text. In a future version this may become a first-class citizen in the dataset schema.
  • queries directory contains parquet files with the following schema:
    • Mandatory: vector: list[float], top_k: int
    • Optional: sparse_vector: Dict: indices: List[int], values: List[float], filter: Dict
      • note: filter is a dict that contains Pinecone filters; for more information see the Pinecone metadata filtering documentation
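
For illustration, here is a minimal sketch of documents and queries files that follow this schema, written with pandas. The directory and file names are placeholders, and the vectors are tiny toy values:

import os
import pandas as pd

os.makedirs("my-dataset/documents", exist_ok=True)
os.makedirs("my-dataset/queries", exist_ok=True)

# Documents: mandatory id/values plus the optional sparse_values, metadata and blob columns.
documents = pd.DataFrame([
    {
        "id": "doc-0",
        "values": [0.118, -0.069, 0.006],
        "sparse_values": {"indices": [470065541, 52922727], "values": [0.5, 0.7]},
        "metadata": {"year": 2017, "category": "other"},
        "blob": {"text": "full document text kept outside the metadata"},
    }
])
documents.to_parquet("my-dataset/documents/file1.parquet")

# Queries: mandatory vector/top_k plus the optional sparse_vector and filter columns.
queries = pd.DataFrame([
    {
        "vector": [0.118, -0.069, 0.006],
        "sparse_vector": {"indices": [470065541], "values": [0.5]},
        "filter": {"category": {"$eq": "other"}},
        "top_k": 5,
    }
])
queries.to_parquet("my-dataset/queries/file1.parquet")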

In addition, a metadata file is expected to be in the dataset directory, for example: s3://my-bucket/my-dataset/metadata.json

from pinecone_datasets.catalog import DatasetMetadata

meta = DatasetMetadata(
    name="test_dataset",
    created_at="2023-02-17 14:17:01.481785",
    documents=2,
    queries=2,
    source="manual",
    bucket="LOCAL",
    task="unittests",
    dense_model={"name": "bert", "dimension": 3},
    sparse_model={"name": "bm25"},
)

The full metadata schema can be found in pinecone_datasets.catalog.DatasetMetadata.schema.
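
If you are preparing a dataset directory by hand, the metadata can be written next to the documents and queries directories. A minimal sketch, assuming DatasetMetadata supports pydantic-style serialization (an assumption; check the class for the exact method):

# Write the metadata next to the documents/ and queries/ directories.
# `meta` is the DatasetMetadata instance from the example above;
# .json() is assumed to be available (pydantic-style model).
with open("my-dataset/metadata.json", "w") as f:
    f.write(meta.json())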

Loading your own dataset from catalog

To set your own catalog endpoint, set the environment variable DATASETS_CATALOG_BASEPATH to your bucket. Note that Pinecone Datasets uses the default authentication method for the storage type (gcsfs for GCS and s3fs for S3).

export DATASETS_CATALOG_BASEPATH="s3://my-bucket/my-subdir"
from pinecone_datasets import list_datasets, load_dataset

list_datasets()

# ["my-dataset", ... ]

dataset = load_dataset("my-dataset")
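
The export line above is a shell command. If you prefer to set the variable from Python (for example in a notebook), a minimal sketch is below; set it before listing or loading datasets:

import os

# Point the catalog at your own bucket before using the catalog functions.
os.environ["DATASETS_CATALOG_BASEPATH"] = "s3://my-bucket/my-subdir"

from pinecone_datasets import list_datasets, load_dataset

list_datasets()
# ["my-dataset", ... ]

dataset = load_dataset("my-dataset")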

Additionally, you can load a dataset using the Dataset class:

from pinecone_datasets import Dataset

dataset = Dataset.from_catalog("my-dataset")

Loading your own dataset from path

You can load your own dataset from a local path or a remote path (GCS or S3). Note that Pinecone Datasets uses the default authentication method for the storage type (gcsfs for GCS and s3fs for S3).

from pinecone_datasets import Dataset

dataset = Dataset.from_path("s3://my-bucket/my-subdir/my-dataset")

This assumes that the path is structured as described in the Expected dataset structure section above.

Loading from a pandas dataframe

Pinecone Datasets enables you to load a dataset from a pandas dataframe. This is useful for loading a dataset from a local file and saving it to remote storage. The minimal required data is a documents dataset, and the minimal required columns are id and values. The id column is a unique identifier for the document, and the values column is a list of floats representing the document vector.

import pandas as pd

from pinecone_datasets import Dataset
from pinecone_datasets.catalog import DatasetMetadata

df = pd.read_parquet("my-dataset.parquet")

metadata = DatasetMetadata(**metadata_dict)  # metadata_dict holds the fields shown in the DatasetMetadata example above

dataset = Dataset.from_pandas(documents=df, queries=None, metadata=metadata)

Please check the documentation for more information on the expected dataframe schema. There's also a column mapping variable that can be used to map the dataframe columns to the expected schema.
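
The exact name of the column-mapping argument is not shown here, so as a simple alternative you can rename the dataframe columns with pandas before calling from_pandas. A minimal sketch with hypothetical source column names:

# Rename hypothetical source columns ("doc_id", "embedding") to the expected
# schema ("id", "values") before handing the dataframe to Dataset.from_pandas.
df = df.rename(columns={"doc_id": "id", "embedding": "values"})

dataset = Dataset.from_pandas(documents=df, queries=None, metadata=metadata)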

Usage - Accessing data

Pinecone Datasets is built on top of pandas. This means that you can use the full pandas API to access the data. In addition, we provide some helper functions to access the data in a more convenient way.

Accessing documents and queries dataframes

Accessing the documents and queries dataframes is done using the documents and queries properties. These properties are lazy and will only load the data when accessed.

document_df: pd.DataFrame = dataset.documents

query_df: pd.DataFrame = dataset.queries
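
Because these are regular pandas DataFrames, the usual pandas operations apply. For example, using the column names from the schema above:

# Number of documents and the dense vector dimension.
print(len(document_df))
print(len(document_df["values"].iloc[0]))

# Regular pandas filtering on the queries dataframe.
top_10_queries = query_df[query_df["top_k"] == 10]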

Usage - Iterating

One of the main use cases for Pinecone Datasets is iterating over a dataset. This is useful for upserting a dataset to an index, or for benchmarking. It is also useful for iterating over large datasets; as of today, datasets are not yet lazy, but we are working on it.

# List iterator, where every batch is a list of N dicts with ("id", "values", "sparse_values", "metadata")
dataset.iter_documents(batch_size=n)

# Dict Iterator, where every dict has ("vector", "sparse_vector", "filter", "top_k")
dataset.iter_queries()
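
For example, here is a minimal upsert-loop sketch using the Pinecone Python client. The index name and API key are placeholders, the index is assumed to already exist with a matching dimension, and the exact client calls may differ between client versions:

from pinecone import Pinecone

pc = Pinecone(api_key="YOUR_API_KEY")
index = pc.Index("my-index")

# Each batch is a list of dicts with "id", "values", "sparse_values" and "metadata",
# which can be passed to the client's upsert() call.
for batch in dataset.iter_documents(batch_size=100):
    index.upsert(vectors=batch)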

The 'blob' column

Pinecone datasets ship with a blob column, which is intended to be used for storing additional data that is not part of the dataset schema. However, it is sometimes useful to store such additional data in the dataset, for example a document's text. We added a utility function to move data from the blob column to the metadata column. This is useful, for example, when upserting a dataset to an index and you want to use the metadata to store text data.

from pinecone_datasets import import_documents_keys_from_blob_to_metadata

new_dataset = import_documents_keys_from_blob_to_metadata(dataset, keys=["text"])

Usage - Saving

You can save your dataset to a catalog managed by you, to a local path, or to a remote path (GCS or S3).

Saving to Catalog

To set your own catalog endpoint, set the environment variable DATASETS_CATALOG_BASEPATH to your bucket. Note that Pinecone Datasets uses the default authentication method for the storage type (gcsfs for GCS and s3fs for S3).

After this environment variable is set, you can save your dataset to the catalog using the to_catalog function:

from pinecone_datasets import Dataset

metadata = DatasetMetadata(**{"name": "my-dataset", ...})

🚨 NOTE: The dataset name in the metadata must match the dataset_id parameter you pass to the catalog, in this example 'my-dataset'.


dataset = Dataset.from_pandas(documents, queries, metadata)
dataset.to_catalog("my-dataset")

Saving to Path

You can save your dataset to a local path or a remote path (GCS or S3). Note that Pinecone Datasets uses the default authentication method for the storage type (gcsfs for GCS and s3fs for S3).

dataset = Dataset.from_pandas(documents, queries, metadata)
dataset.to_path("s3://my-bucket/my-subdir/my-dataset")

Upserting to an Index

When upserting a Dataset to an Index, only the document data will be upserted to the index. The queries data will be ignored.

TODO: add example for API Key and Environment Variables

ds = load_dataset("dataset_name")

ds.to_pinecone_index("index_name")

# or, if you run in a notebook environment

await ds.to_pinecone_index_async("index_name")

The to_pinecone_index function also accepts additional parameters (see the sketch after this list):

  • batch_size for controlling the upsert process
  • api_key for passing your API key; otherwise you can provide it via an environment variable
  • kwargs - for passing additional parameters to the index creation process
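
For illustration, a sketch combining these parameters (the index name, batch size, and key value are placeholders):

ds = load_dataset("dataset_name")

# Upsert only the documents, in batches of 100, passing the API key explicitly.
ds.to_pinecone_index("index_name", batch_size=100, api_key="YOUR_API_KEY")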

For developers

This project uses Poetry for dependency management. Supported Python versions are 3.8+. To start developing, run the following from the project root directory:

poetry install --with dev

To run tests locally, run:

poetry run pytest --cov pinecone_datasets
