llama-index packs diff private simple dataset

LlamaIndex Packs: `DiffPrivateSimpleDatasetPack`

The DiffPrivateSimpleDatasetPack llama pack creates differentially private synthetic examples from an original, sensitive dataset.

Differential privacy is a privacy-preserving technique that obscures the source data while retaining its overall attributes, so that processes consuming the data suffer only a minimal loss in performance.

The main motivation for this pack is thus to provide a means of creating privacy-safe versions of datasets that can be used in subsequent downstream processing steps (e.g., in a prompt passed to an LLM). As noted in the original paper (named below), the synthetic observations can be reused as many times as one desires without any additional privacy cost!

The paper appeared at ICLR 2024 and is entitled: PRIVACY-PRESERVING IN-CONTEXT LEARNING WITH DIFFERENTIALLY PRIVATE FEW-SHOT GENERATION.

How it works

The pack operates on a dataset represented by the LabelledSimpleDataset type. A dataset of this type consists of LabelledSimpleDataExample objects, each of which is a data class with two fields: text and reference_label. For example, a news dataset may have example texts whose reference_labels belong to {"World", "Business", "Sports", etc.}.

The output of this pack's run() (and arun()) method is another LabelledSimpleDataset, this time containing privacy-safe, synthetically generated examples.
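To make the shape of the data concrete, here is a minimal sketch that mirrors the two fields described above. ExampleSketch and DatasetSketch are hypothetical stand-ins written with the standard library only; they are not the real LabelledSimpleDataExample and LabelledSimpleDataset classes, which may carry additional fields.

```python
from collections import Counter
from dataclasses import dataclass
from typing import List


@dataclass
class ExampleSketch:
    """Hypothetical stand-in for LabelledSimpleDataExample."""

    text: str
    reference_label: str


@dataclass
class DatasetSketch:
    """Hypothetical stand-in for LabelledSimpleDataset."""

    examples: List[ExampleSketch]


# A tiny, made-up news dataset with one example per label.
news = DatasetSketch(
    examples=[
        ExampleSketch(
            text="Stocks rallied after the earnings report.",
            reference_label="Business",
        ),
        ExampleSketch(
            text="The home team won in extra time.",
            reference_label="Sports",
        ),
        ExampleSketch(
            text="Leaders met to discuss the trade agreement.",
            reference_label="World",
        ),
    ]
)

# Count how many examples carry each reference_label.
label_counts = Counter(ex.reference_label for ex in news.examples)
print(dict(label_counts))
```

The synthetic dataset returned by the pack has the same shape, so downstream code that reads text and reference_label should not need to change.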

Supported LLMs

To use this pack, you must supply an LLM that returns log-probabilities (LogProbs), because the differential-privacy generation logic uses them when choosing the next token. The demos found in the examples folder use OpenAI completion LLMs (chat completion LLMs were also tried, but these did not produce quality results).
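As a rough illustration of why log-probabilities are needed, the sketch below turns a handful of candidate-token log-probabilities into a normalized distribution, which is the kind of quantity a next-token privacy mechanism can then work with. The function name and the numbers are assumptions made for this example; this is not the pack's own code.

```python
import math
from typing import Dict


def logprobs_to_probs(logprobs: Dict[str, float]) -> Dict[str, float]:
    """Turn token log-probabilities into a normalized distribution."""
    # Subtract the maximum before exponentiating for numerical stability.
    max_lp = max(logprobs.values())
    weights = {tok: math.exp(lp - max_lp) for tok, lp in logprobs.items()}
    total = sum(weights.values())
    return {tok: w / total for tok, w in weights.items()}


# Made-up log-probabilities for three candidate next tokens.
candidates = {
    "goal": math.log(0.6),
    "match": math.log(0.3),
    "stock": math.log(0.1),
}
probs = logprobs_to_probs(candidates)
print(probs)
```

An LLM that only returns the sampled token gives no such distribution, which is why it cannot be used here.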

CLI Usage

You can download llamapacks directly using llamaindex-cli, which comes installed with the llama-index python package:

llamaindex-cli download-llamapack DiffPrivateSimpleDatasetPack --download-dir ./pack

You can then inspect the files at ./pack and use them as a template for your own project!

Code Usage

You can download the pack from PyPI and then use it in your llama-index applications.

pip install llama-index-packs-diff-private-simple-dataset

A DiffPrivateSimpleDatasetPack object is constructed with the following params:

an LLM (must return CompletionResponse),
its associated tokenizer,
a PromptBundle object that contains the parameters required for prompting the LLM to produce the synthetic observations,
a LabelledSimpleDataset,
[Optional] sephamore_counter_size, used to help reduce the chance of hitting a RateLimitError when calling the LLM's completions API,
[Optional] sleep_time_in_seconds, used to help reduce the chance of hitting a RateLimitError when calling the LLM's completions API.

from llama_index.packs.diff_private_simple_dataset import (
    DiffPrivateSimpleDatasetPack,
)
from llama_index.packs.diff_private_simple_dataset.base import PromptBundle

llm = ...
tokenizer = ...
prompt_bundle = PromptBundle(
    instruction=..., text_heading=..., label_heading=...
)
simple_dataset = ...  # the LabelledSimpleDataset holding the sensitive examples

dp_simple_dataset_pack = DiffPrivateSimpleDatasetPack(
    llm=llm,
    tokenizer=tokenizer,
    prompt_bundle=prompt_bundle,
    simple_dataset=simple_dataset,
)
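The two optional parameters both throttle calls to the LLM. One plausible way to picture them, assuming a semaphore that caps concurrent requests and a pause after each request, is the standard-library sketch below. throttled_gather and fake_llm_call are hypothetical names for this illustration, and the pack's internal implementation may differ.

```python
import asyncio
from typing import Awaitable, Callable, Dict, List

# Bookkeeping used only to observe how many calls overlap.
stats: Dict[str, int] = {"in_flight": 0, "peak": 0}


async def fake_llm_call() -> str:
    """Stand-in for a real completion request."""
    stats["in_flight"] += 1
    stats["peak"] = max(stats["peak"], stats["in_flight"])
    await asyncio.sleep(0.01)  # simulate network latency
    stats["in_flight"] -= 1
    return "ok"


async def throttled_gather(
    jobs: List[Callable[[], Awaitable[str]]],
    semaphore_size: int = 1,
    sleep_time_in_seconds: float = 0.0,
) -> List[str]:
    """Run jobs with at most `semaphore_size` in flight at once."""
    semaphore = asyncio.Semaphore(semaphore_size)

    async def worker(job: Callable[[], Awaitable[str]]) -> str:
        async with semaphore:
            result = await job()
            # Pause before releasing the slot to space out requests.
            await asyncio.sleep(sleep_time_in_seconds)
            return result

    return await asyncio.gather(*(worker(job) for job in jobs))


results = asyncio.run(throttled_gather([fake_llm_call] * 6, semaphore_size=2))
print(results, "peak concurrency:", stats["peak"])
```

A smaller semaphore size or a longer sleep lowers the request rate at the cost of a longer total run time.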

If you would like to customize this pack further, then you can download it as a template:

from llama_index.core.llama_pack import download_llama_pack

# download and install dependencies
DiffPrivateSimpleDatasetPack = download_llama_pack(
    "DiffPrivateSimpleDatasetPack", "./dense_pack"
)

dp_simple_dataset_pack = DiffPrivateSimpleDatasetPack(
    llm=llm,
    tokenizer=tokenizer,
    prompt_bundle=prompt_bundle,
    simple_dataset=simple_dataset,
)

The run() method generates the synthetic dataset. A few params are required:

t_max: The maximum number of tokens to generate for a synthetic example (the algorithm applies extra logic at each token in order to satisfy differential privacy).
sigma: Controls the variance of the noise distribution used by the differential-privacy noise mechanism. Each value of sigma corresponds to a level of epsilon satisfied in differential privacy; more noise gives stronger privacy.
num_splits: The number of disjoint splits of the original dataset that the differential-privacy algorithm implemented here relies on.
num_samples_per_split: The number of private, in-context examples from each split to include when generating a synthetic example.

synthetic_dataset = dp_simple_dataset_pack.run(
    sizes={"World": 1, "Sports": 1, "Sci/Tech": 0, "Business": 0},  # synthetic examples per label
    t_max=10,  # max tokens per synthetic example
    sigma=0.5,
    num_splits=2,
    num_samples_per_split=8,
)

print(synthetic_dataset)
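To give a feel for how sigma, num_splits, and num_samples_per_split interact, here is a simplified sketch of a single generation step: the private examples are partitioned into disjoint splits, each split yields a next-token distribution, the distributions are averaged, Gaussian noise is added, and the highest-scoring token is kept. This is a sketch under stated assumptions and not the pack's implementation: the helper names are hypothetical, the per-split probabilities are made up rather than produced by an LLM, and the exact noise scaling in the paper may differ.

```python
import random
from typing import Dict, List, Sequence


def disjoint_splits(
    items: Sequence[str],
    num_splits: int,
    num_samples_per_split: int,
    rng: random.Random,
) -> List[List[str]]:
    """Draw num_splits non-overlapping groups of private examples."""
    needed = num_splits * num_samples_per_split
    if needed > len(items):
        raise ValueError("not enough examples for the requested splits")
    # Sampling without replacement guarantees the groups do not overlap.
    shuffled = rng.sample(list(items), needed)
    return [
        shuffled[i * num_samples_per_split : (i + 1) * num_samples_per_split]
        for i in range(num_splits)
    ]


def noisy_next_token(
    per_split_probs: List[Dict[str, float]],
    sigma: float,
    rng: random.Random,
) -> str:
    """Average per-split probabilities, add Gaussian noise, take the argmax."""
    vocab = sorted(per_split_probs[0])
    noisy = {}
    for tok in vocab:
        mean_prob = sum(p[tok] for p in per_split_probs) / len(per_split_probs)
        noisy[tok] = mean_prob + rng.gauss(0.0, sigma)
    return max(noisy, key=noisy.get)


rng = random.Random(0)
private_texts = [f"private example {i}" for i in range(20)]
splits = disjoint_splits(private_texts, num_splits=2, num_samples_per_split=8, rng=rng)

# Made-up next-token distributions, one per split.
per_split_probs = [
    {"goal": 0.7, "match": 0.2, "stock": 0.1},
    {"goal": 0.5, "match": 0.4, "stock": 0.1},
]
exact_token = noisy_next_token(per_split_probs, sigma=0.0, rng=rng)
noisy_token = noisy_next_token(per_split_probs, sigma=0.5, rng=rng)
print(exact_token, noisy_token)
```

Repeating such a step up to t_max times would yield one synthetic text. With sigma set to 0 the most likely token always wins; as sigma grows, the choice depends less on any individual private example.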

Examples

See the examples/basic_demo folder for a notebook that consists of a basic demo on how to use the DiffPrivateSimpleDatasetPack.
Also see examples/symptom_2_disease for a Python program that generates a synthetic version of the Symptom2Disease dataset.

Release history

0.1.0 (this version): Mar 18, 2024
0.1.0a0 (pre-release): Mar 17, 2024

