LlamaIndex Packs: DiffPrivateSimpleDatasetPack
The DiffPrivateSimpleDatasetPack llama pack creates differentially private synthetic examples from an original, sensitive dataset.
Differential privacy is a privacy-preserving technique that obscures the source data while preserving its original attributes and minimizing the performance impact on processes that consume the data.
The main motivation for this pack is thus to provide the means to create privacy-safe versions of datasets that can be used in subsequent downstream processing steps (e.g., in a prompt passed to an LLM). As noted in the original paper (linked below), the synthetic observations can be used as many times as one desires without any additional privacy cost!
The paper appeared at ICLR 2024 and is entitled: PRIVACY-PRESERVING IN-CONTEXT LEARNING WITH DIFFERENTIALLY PRIVATE FEW-SHOT GENERATION.
How it works
The pack operates on a dataset represented with the LabelledSimpleDataset type. This type consists of examples of type LabelledSimpleDataExample, a data class that contains two fields: text and reference_label. For example, a news dataset may have example texts with reference_labels belonging to {"World", "Business", "Sports", etc.}.
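The shape of this data can be sketched in plain Python. The two classes below are stand-ins written only for illustration; the real LabelledSimpleDataExample and LabelledSimpleDataset ship with llama-index and do more than this. Only the field names text and reference_label come from the description above, and label_counts is a hypothetical helper.

```python
from collections import Counter
from dataclasses import dataclass
from typing import List


@dataclass
class ExampleSketch:
    """Stand-in for LabelledSimpleDataExample (illustrative, not the real class)."""

    text: str
    reference_label: str


@dataclass
class DatasetSketch:
    """Stand-in for LabelledSimpleDataset: an ordered collection of examples."""

    examples: List[ExampleSketch]

    def label_counts(self) -> Counter:
        # Hypothetical helper: count how many examples carry each label.
        return Counter(ex.reference_label for ex in self.examples)


news = DatasetSketch(
    examples=[
        ExampleSketch(text="Markets rally after rate decision", reference_label="Business"),
        ExampleSketch(text="Home side wins the cup final", reference_label="Sports"),
        ExampleSketch(text="Leaders meet for climate summit", reference_label="World"),
        ExampleSketch(text="Striker signs record transfer", reference_label="Sports"),
    ]
)
print(news.label_counts())
```

Each synthetic example the pack produces has the same two fields, so downstream code that reads text and reference_label should not need to distinguish original data from synthetic data.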
The output of this pack's run() (and arun()) method is another LabelledSimpleDataset, but one that represents privacy-safe, synthetically generated examples.
Supported LLMs
To use this pack, you need an LLM that returns LogProbs, because the log probabilities are used in the differential-privacy generation logic for the next token. The demos found in the examples folder use OpenAI completion LLMs (chat completion LLMs were also tried, but these did not produce quality results).
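To illustrate why log probabilities are needed, here is a simplified sketch of one noisy next-token step. It is not the pack's implementation. It assumes that each disjoint split of private examples yields its own next-token logprobs, that these are averaged into a single distribution over a small candidate vocabulary, and that Gaussian noise with standard deviation sigma is added before picking the highest-scoring token. The function names and the toy vocabulary are made up for this example.

```python
import math
import random
from typing import Dict, List


def to_probs(logprobs: Dict[str, float], vocab: List[str]) -> List[float]:
    """Turn token logprobs into a normalized distribution over vocab."""
    # Tokens missing from the LLM response get probability 0.
    raw = [math.exp(logprobs[tok]) if tok in logprobs else 0.0 for tok in vocab]
    total = sum(raw)
    if total == 0.0:
        # No overlap with vocab: fall back to a uniform distribution.
        return [1.0 / len(vocab)] * len(vocab)
    return [p / total for p in raw]


def noisy_next_token(
    per_split_logprobs: List[Dict[str, float]],
    vocab: List[str],
    sigma: float,
    rng: random.Random,
) -> str:
    """Average per-split distributions, add Gaussian noise, take the argmax."""
    n_splits = len(per_split_logprobs)
    mean = [0.0] * len(vocab)
    for logprobs in per_split_logprobs:
        for i, p in enumerate(to_probs(logprobs, vocab)):
            mean[i] += p / n_splits
    # Assumption: independent Gaussian noise per candidate token.
    noisy = [m + rng.gauss(0.0, sigma) for m in mean]
    best = max(range(len(vocab)), key=lambda i: noisy[i])
    return vocab[best]


vocab = ["stocks", "goal", "summit"]
per_split = [
    {"stocks": math.log(0.7), "goal": math.log(0.2), "summit": math.log(0.1)},
    {"stocks": math.log(0.6), "goal": math.log(0.3), "summit": math.log(0.1)},
]
print(noisy_next_token(per_split, vocab, sigma=0.5, rng=random.Random(0)))
```

With sigma set to 0 the sketch always returns the most likely token; larger values of sigma make the choice noisier, which is the trade-off between privacy and output quality.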
CLI Usage
You can download llamapacks directly using llamaindex-cli, which comes installed with the llama-index Python package:
llamaindex-cli download-llamapack DiffPrivateSimpleDatasetPack --download-dir ./pack
You can then inspect the files at ./pack and use them as a template for your own project!
Code Usage
You can download the pack from PyPI and then use it in your llama-index applications.
pip install llama-index-packs-diff-private-simple-dataset
A DiffPrivateSimpleDatasetPack object is constructed with the following params:
- an LLM (must return CompletionResponse),
- its associated tokenizer,
- a PromptBundle object that contains the parameters required for prompting the LLM to produce the synthetic observations,
- a LabelledSimpleDataset,
- [Optional] sephamore_counter_size, used to help reduce the chances of experiencing a RateLimitError when calling the LLM's completions API,
- [Optional] sleep_time_in_seconds, used to help reduce the chances of experiencing a RateLimitError when calling the LLM's completions API.
from llama_index.packs.diff_private_simple_dataset import (
    DiffPrivateSimpleDatasetPack,
)
from llama_index.packs.diff_private_simple_dataset.base import PromptBundle

llm = ...  # an LLM that returns CompletionResponse with LogProbs
tokenizer = ...  # the tokenizer associated with the LLM
simple_dataset = ...  # the original, sensitive LabelledSimpleDataset
prompt_bundle = PromptBundle(instruction=..., text_heading=..., label_heading=...)

dp_simple_dataset_pack = DiffPrivateSimpleDatasetPack(
    llm=llm,
    tokenizer=tokenizer,
    prompt_bundle=prompt_bundle,
    simple_dataset=simple_dataset,
)
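The two optional throttling parameters follow a common pattern for avoiding rate limits: cap the number of requests in flight with a semaphore and pause briefly after each one. The sketch below shows that general pattern with asyncio. It is an assumption about how such parameters are typically used, not a copy of the pack's internals, and throttled_calls and fake_completion are hypothetical names.

```python
import asyncio
from typing import Awaitable, Callable, List


async def throttled_calls(
    calls: List[Callable[[], Awaitable[str]]],
    counter_size: int = 1,
    sleep_time_in_seconds: float = 0.0,
) -> List[str]:
    """Run all calls, allowing at most counter_size of them at once."""
    semaphore = asyncio.Semaphore(counter_size)

    async def guarded(call: Callable[[], Awaitable[str]]) -> str:
        async with semaphore:
            result = await call()
            # Pause before releasing the slot so requests are spaced out.
            await asyncio.sleep(sleep_time_in_seconds)
            return result

    return await asyncio.gather(*(guarded(c) for c in calls))


async def fake_completion() -> str:
    # Placeholder for a real LLM completion request.
    await asyncio.sleep(0.01)
    return "ok"


results = asyncio.run(throttled_calls([fake_completion] * 4, counter_size=2))
print(results)
```

A smaller counter size or a longer sleep time lowers the request rate at the cost of a longer total run time.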
If you would like to customize this pack further, then you can download it as a template:
from llama_index.core import SimpleDirectoryReader
from llama_index.core.llama_pack import download_llama_pack

# download and install dependencies
DiffPrivateSimpleDatasetPack = download_llama_pack(
    "DiffPrivateSimpleDatasetPack", "./dense_pack"
)

dp_simple_dataset_pack = DiffPrivateSimpleDatasetPack(
    llm=llm,
    tokenizer=tokenizer,
    prompt_bundle=prompt_bundle,
    simple_dataset=simple_dataset,
)
The run() method (and its async counterpart arun()) generates the synthetic dataset. A few params are required:
- sizes: As in the example below, a mapping from each label to the number of synthetic examples to generate for that label.
- t_max: The max number of tokens you would like to generate (the algorithm adds some logic per token in order to satisfy differential privacy).
- sigma: Controls the variance of the noise distribution associated with the differential privacy noise mechanism. A value of sigma amounts to a level of epsilon satisfied in differential privacy.
- num_splits: The differential privacy algorithm implemented here relies on disjoint splits of the original dataset.
- num_samples_per_split: The number of private, in-context examples to include in the generation of the synthetic example.
synthetic_dataset = dp_simple_dataset_pack.run(
    sizes={"World": 1, "Sports": 1, "Sci/Tech": 0, "Business": 0},
    t_max=10,  # max number of tokens to generate per synthetic example
    sigma=0.5,
    num_splits=2,
    num_samples_per_split=8,
)
print(synthetic_dataset)
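To make num_splits and num_samples_per_split more concrete, the sketch below draws disjoint groups of private examples, one group per split. It only illustrates what "disjoint splits" means under simple assumptions (a single shuffle followed by slicing); how the pack itself partitions and samples the data may differ, and disjoint_samples is a hypothetical helper.

```python
import random
from typing import List, Sequence, TypeVar

T = TypeVar("T")


def disjoint_samples(
    examples: Sequence[T],
    num_splits: int,
    num_samples_per_split: int,
    rng: random.Random,
) -> List[List[T]]:
    """Return num_splits non-overlapping groups of num_samples_per_split examples."""
    needed = num_splits * num_samples_per_split
    if needed > len(examples):
        raise ValueError("not enough examples for the requested splits")
    shuffled = list(examples)
    rng.shuffle(shuffled)
    # Consecutive slices of a shuffled list never share an element.
    return [
        shuffled[i * num_samples_per_split : (i + 1) * num_samples_per_split]
        for i in range(num_splits)
    ]


private_texts = [f"example {i}" for i in range(20)]
splits = disjoint_samples(
    private_texts, num_splits=2, num_samples_per_split=8, rng=random.Random(0)
)
for split in splits:
    print(len(split), split[:2])
```

With the values used above (num_splits=2, num_samples_per_split=8), at least 16 private examples are needed under this scheme, because no example may appear in more than one split.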
Examples
- See the examples/basic_demo folder for a notebook that consists of a basic demo on how to use the DiffPrivateSimpleDatasetPack.
- Also see examples/symptom_2_disease for a Python program that generates a synthetic version of the Symptom2Disease dataset.