Pinecone Datasets

Pinecone Datasets lets you easily load datasets into your Pinecone index.
Install
pip install pinecone-datasets
Usage
You can use Pinecone Datasets to load our public datasets or to work with your own datasets.
Loading Pinecone Public Datasets
from pinecone_datasets import list_datasets, load_dataset
list_datasets()
# ["quora_all-MiniLM-L6-bm25", ... ]
dataset = load_dataset("quora_all-MiniLM-L6-bm25")
dataset.head()
# Prints
# ┌─────┬───────────────────────────┬─────────────────────────────────────┬───────────────────┬──────┐
# │ id ┆ values ┆ sparse_values ┆ metadata ┆ blob │
# │ ┆ ┆ ┆ ┆ │
# │ str ┆ list[f32] ┆ struct[2] ┆ struct[3] ┆ │
# ╞═════╪═══════════════════════════╪═════════════════════════════════════╪═══════════════════╪══════╡
# │ 0 ┆ [0.118014, -0.069717, ... ┆ {[470065541, 52922727, ... 22364... ┆ {2017,12,"other"} ┆ .... │
# │ ┆ 0.0060... ┆ ┆ ┆ │
# └─────┴───────────────────────────┴─────────────────────────────────────┴───────────────────┴──────┘
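The dataset also exposes its underlying data as dataframes. A minimal sketch, assuming the documents and queries accessors are available in the version shown above:
# access the full documents and queries dataframes directly
docs_df = dataset.documents
queries_df = dataset.queries
print(len(docs_df), len(queries_df))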
Iterating over a Dataset's documents and queries
Iterating over documents is useful for upserting and for updating vectors. Iterating over queries is helpful for benchmarking.
# Returns an iterator of lists; each list contains up to N dicts with keys ("id", "values", "sparse_values", "metadata")
dataset.iter_documents(batch_size=n)
dataset.iter_queries()
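For example, a rough benchmarking loop might look like the following. This is a sketch only; it assumes an index object created as in the upserting section below, and it assumes each yielded query is a dict exposing vector and top_k fields, which may differ in your version:
import time

latencies = []
for query in dataset.iter_queries():
    start = time.perf_counter()
    index.query(vector=query["vector"], top_k=query.get("top_k", 10))
    latencies.append(time.perf_counter() - start)
print(f"average query latency: {sum(latencies) / len(latencies):.4f}s")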
Upserting to a Pinecone index
pip install pinecone-client
import pinecone
pinecone.init(api_key="API_KEY", environment="us-west1-gcp")
pinecone.create_index(name="my-index", dimension=384, pod_type='s1')
index = pinecone.Index("my-index")
# you can iterate over documents in batches
for batch in dataset.iter_documents(batch_size=100):
    index.upsert(vectors=batch)
# or upsert the dataset as a dataframe
index.upsert_from_dataframe(dataset.drop(columns=["blob"]))
# using gRPC
index = pinecone.GRPCIndex("my-index")
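The gRPC index exposes the same upsert interface, and it can also send upserts asynchronously. A hedged sketch (async_req support depends on your pinecone-client version):
# send batches without blocking on each response, then wait for all of them
async_results = [
    index.upsert(vectors=batch, async_req=True)
    for batch in dataset.iter_documents(batch_size=100)
]
[result.result() for result in async_results]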
Advanced Usage
Working with your own dataset storage
By default, Pinecone Datasets uses Pinecone's public datasets bucket on GCS. You can use your own bucket by setting the PINECONE_DATASETS_ENDPOINT environment variable.
export PINECONE_DATASETS_ENDPOINT="gs://my-bucket"
This changes the default endpoint to your bucket; upon calling list_datasets or load_dataset, the library will scan your bucket and list all datasets it finds. Note that you can also use an s3:// prefix for your bucket.
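You can also set the endpoint from Python. A minimal sketch, assuming the variable is read when the catalog is first accessed, so it should be set before calling list_datasets or load_dataset:
import os

# point the catalog at your own bucket before listing/loading datasets
os.environ["PINECONE_DATASETS_ENDPOINT"] = "gs://my-bucket"

from pinecone_datasets import list_datasets
print(list_datasets())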
Authentication to your own bucket
For now, Pinecone Datasets supports only GCS and S3 buckets, with default authentication as provided by the respective fsspec implementations: gcsfs and s3fs.
Using AWS key/secret authentication
First, set the PINECONE_DATASETS_ENDPOINT environment variable to your bucket.
export PINECONE_DATASETS_ENDPOINT="s3://my-bucket"
Then, pass your credentials to the list_datasets and load_dataset functions using the key and secret parameters.
import os

st = list_datasets(
    key=os.environ.get("S3_ACCESS_KEY"),
    secret=os.environ.get("S3_SECRET"),
)

ds = load_dataset(
    "test_dataset",
    key=os.environ.get("S3_ACCESS_KEY"),
    secret=os.environ.get("S3_SECRET"),
)
For developers
This project uses Poetry for dependency management. Supported Python versions are 3.8+. To start developing, run the following from the project root directory:
poetry install --with dev
To run tests locally, run:
poetry run pytest --cov pinecone_datasets
To create a Pinecone public dataset, you may need to generate dataset metadata. For example:
from pinecone_datasets.catalog import DatasetMetadata
meta = DatasetMetadata(
    name="test_dataset",
    created_at="2023-02-17 14:17:01.481785",
    documents=2,
    queries=2,
    source="manual",
    bucket="LOCAL",
    task="unittests",
    dense_model={"name": "bert", "dimension": 3},
    sparse_model={"name": "bm25"},
)
To see the complete schema, run:
meta.schema()
In order to list a dataset, you can save its dataset metadata (NOTE: write permission to the location is needed):
from pinecone_datasets import Dataset

dataset = Dataset("non-listed-dataset")
dataset._save_metadata(meta)
Uploading and listing a dataset
Pinecone Datasets can load datasets from any storage where it has access (using the default access: S3, GCS, or local permissions). Data is expected to be uploaded with the following directory structure:
├── base_path # path to where all datasets are stored
│ ├── dataset_id # name of dataset
│ │ ├── metadata.json # dataset metadata (optional, only for listed)
│ │ ├── documents # datasets documents
│ │ │ ├── file1.parquet
│ │ │ └── file2.parquet
│ │ ├── queries # dataset queries
│ │ │ ├── file1.parquet
│ │ │ └── file2.parquet
└── ...
A listed dataset is a dataset that can be loaded and listed using load_dataset and list_datasets. Pinecone Datasets scans the storage and lists every dataset that has a metadata file, for example: s3://my-bucket/my-dataset/metadata.json.
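For example, preparing this layout locally before uploading it to your bucket might look like the following sketch (paths and column values are illustrative only):
import os
import pandas as pd

base_path = "my-datasets"      # hypothetical local base path
dataset_id = "my-dataset"

os.makedirs(f"{base_path}/{dataset_id}/documents", exist_ok=True)
os.makedirs(f"{base_path}/{dataset_id}/queries", exist_ok=True)

# write a documents parquet file matching the columns shown earlier
docs = pd.DataFrame(
    [{"id": "1", "values": [0.1, 0.2, 0.3], "sparse_values": None, "metadata": None, "blob": None}]
)
docs.to_parquet(f"{base_path}/{dataset_id}/documents/file1.parquet")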
Accessing a non-listed dataset
To access a non-listed dataset, you can load it directly via:
from pinecone_datasets import Dataset
dataset = Dataset("non-listed-dataset")