LlamaIndex Packs: DiffPrivateSimpleDatasetPack

The DiffPrivateSimpleDatasetPack llama pack creates differentially private synthetic examples from an original, sensitive dataset.

Differential privacy is a privacy-preserving technique that obscures the source data while preserving its original attributes and minimizing the performance impact on processes that consume the data.

The main motivation for this pack is thus to provide a means of creating privacy-safe versions of datasets that can be used in downstream processing steps (e.g., in a prompt passed to an LLM). As noted in the original paper (cited below), the synthetic observations can be used as many times as one desires without any additional privacy cost!

The paper appeared at ICLR 2024 and is entitled: PRIVACY-PRESERVING IN-CONTEXT LEARNING WITH DIFFERENTIALLY PRIVATE FEW-SHOT GENERATION.

How it works

The pack operates on a dataset represented with the LabelledSimpleDataset type. This type consists of examples called LabelledSimpleDataExample, which is a data class that contains two fields, namely: text and reference_label. For example, a news dataset may have example texts with reference_labels belonging to {"World", "Business", "Sports", etc.}.
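
As a rough illustration of this structure, the sketch below uses a stand-in dataclass with the same two fields. The class name Example and the sample texts are hypothetical and are not the library's own types; the point is only to show the text/reference_label pairing.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass
class Example:
    """Stand-in for LabelledSimpleDataExample (hypothetical, for illustration)."""

    text: str
    reference_label: str


# A tiny, made-up news dataset with two labels.
examples = [
    Example(text="Stocks rallied after the earnings report.", reference_label="Business"),
    Example(text="The home team won in extra time.", reference_label="Sports"),
    Example(text="Markets closed lower on Friday.", reference_label="Business"),
]

# Count how many examples carry each reference_label.
label_counts = Counter(ex.reference_label for ex in examples)
print(label_counts)
```

Grouping by reference_label matters because synthetic examples are requested per label when the pack is run.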

The output of this pack's run() (and arun()) method is another LabelledSimpleDataset, but one that contains privacy-safe, synthetically generated examples.

Supported LLMs

To use this pack, you must use an LLM that returns LogProbs, since the differential-privacy generation logic relies on them to choose each next token. The demos found in the examples folder use OpenAI completion LLMs (chat completion LLMs were also tried, but these did not produce quality results).
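
To give a feel for why per-token probabilities are needed, here is a simplified sketch of one noisy next-token step: the next-token distributions obtained from several disjoint groups of private examples are averaged, Gaussian noise is added, and the highest-scoring token is emitted. The function name noisy_next_token, the toy vocabulary, the probabilities, and the treatment of sigma as a standard deviation are all assumptions made for illustration. This is not the pack's actual implementation, which follows the paper.

```python
import math
import random


def noisy_next_token(logprobs_per_split, sigma, rng):
    """Pick a next token from noisy, averaged per-split probabilities.

    logprobs_per_split: list of dicts mapping token -> log-probability,
    one dict per disjoint split of the private data (toy input).
    """
    vocab = sorted({tok for lp in logprobs_per_split for tok in lp})
    num_splits = len(logprobs_per_split)
    scores = {}
    for tok in vocab:
        # Convert logprobs to probabilities; tokens a split did not report get 0.
        probs = [math.exp(lp[tok]) if tok in lp else 0.0 for lp in logprobs_per_split]
        mean_prob = sum(probs) / num_splits
        # Gaussian noise (scale sigma) masks the influence of any single split.
        scores[tok] = mean_prob + rng.gauss(0.0, sigma)
    return max(scores, key=scores.get)


# Toy logprobs from two splits (made-up numbers).
toy_splits = [
    {"goal": math.log(0.7), "stock": math.log(0.3)},
    {"goal": math.log(0.6), "stock": math.log(0.4)},
]
noiseless = noisy_next_token(toy_splits, sigma=0.0, rng=random.Random(0))
noisy = noisy_next_token(toy_splits, sigma=0.5, rng=random.Random(0))
print(noiseless, noisy)
```

With sigma set to 0 the most likely token always wins; as sigma grows, the choice depends less on the private examples, which is the trade-off between privacy and output quality.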

CLI Usage

You can download llamapacks directly using llamaindex-cli, which comes installed with the llama-index python package:

llamaindex-cli download-llamapack DiffPrivateSimpleDatasetPack --download-dir ./pack

You can then inspect the files at ./pack and use them as a template for your own project!

Code Usage

You can download the pack from PyPI and then use it in your llama-index applications.

pip install llama-index-packs-diff-private-simple-dataset

A DiffPrivateSimpleDatasetPack object is constructed with the following params:

  1. an LLM (must return CompletionResponse),
  2. its associated tokenizer,
  3. a PromptBundle object that contains the parameters required for prompting the LLM to produce the synthetic observations,
  4. a LabelledSimpleDataset,
  5. [Optional] sephamore_counter_size, used to help reduce the chances of experiencing a RateLimitError when calling the LLM's completions API,
  6. [Optional] sleep_time_in_seconds, used to help reduce the chances of experiencing a RateLimitError when calling the LLM's completions API.
from llama_index.packs.diff_private_simple_dataset import (
    DiffPrivateSimpleDatasetPack,
)
from llama_index.packs.diff_private_simple_dataset.base import PromptBundle

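llm = ...  # an LLM that returns CompletionResponse with LogProbs
tokenizer = ...  # the tokenizer associated with the LLM
simple_dataset = ...  # a LabelledSimpleDataset holding the private examples
prompt_bundle = PromptBundle(
    instruction=..., text_heading=..., label_heading=...
)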

dp_simple_dataset_pack = DiffPrivateSimpleDatasetPack(
    llm=llm,
    tokenizer=tokenizer,
    prompt_bundle=prompt_bundle,
    simple_dataset=simple_dataset,
)
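
To show how the two optional rate-limiting params could work together, here is a minimal sketch using asyncio.Semaphore: the semaphore caps how many calls are in flight, and a sleep spaces out successive requests. The helper names run_limited and fake_llm_call are hypothetical, and the pack's internals may differ from this pattern.

```python
import asyncio


async def run_limited(num_calls, semaphore_size, sleep_time_in_seconds):
    """Run fake LLM calls with bounded concurrency; return results and peak concurrency."""
    semaphore = asyncio.Semaphore(semaphore_size)
    in_flight = 0
    peak = 0

    async def fake_llm_call(i):
        nonlocal in_flight, peak
        async with semaphore:  # at most semaphore_size calls run at once
            in_flight += 1
            peak = max(peak, in_flight)
            await asyncio.sleep(0.01)  # stands in for the network round trip
            in_flight -= 1
            # Pause before releasing the slot to space out successive requests.
            await asyncio.sleep(sleep_time_in_seconds)
        return i

    results = await asyncio.gather(*(fake_llm_call(i) for i in range(num_calls)))
    return results, peak


results, peak = asyncio.run(
    run_limited(num_calls=6, semaphore_size=2, sleep_time_in_seconds=0.01)
)
print(results, peak)
```

A smaller semaphore size or a longer sleep lowers the request rate at the cost of a longer overall run.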

If you would like to customize this pack further, then you can download it as a template:

from llama_index.core.llama_pack import download_llama_pack

# download and install dependencies
DiffPrivateSimpleDatasetPack = download_llama_pack(
    "DiffPrivateSimpleDatasetPack", "./dense_pack"
)

dp_simple_dataset_pack = DiffPrivateSimpleDatasetPack(
    llm=llm,
    tokenizer=tokenizer,
    prompt_bundle=prompt_bundle,
    simple_dataset=simple_dataset,
)

The run() method generates the synthetic dataset and returns it as a LabelledSimpleDataset. A few params are required:

  • sizes: The number of synthetic examples to generate for each label, given as a mapping from label to count (as in the call below).
  • t_max: The max number of tokens you would like to generate (the algorithm adds some logic per token in order to satisfy differential privacy).
  • sigma: Controls the variance of the noise distribution used by the differential privacy noise mechanism. Each value of sigma corresponds to a level of epsilon satisfied in differential privacy.
  • num_splits: The number of disjoint splits of the original dataset that the differential privacy algorithm implemented here relies on.
  • num_samples_per_split: The number of private, in-context examples from each split to include in the generation of a synthetic example.
synthetic_dataset = dp_simple_dataset_pack.run(
    sizes={"World": 1, "Sports": 1, "Sci/Tech": 0, "Business": 0},
    t_max=10,  # max number of tokens to generate per synthetic example
    sigma=0.5,
    num_splits=2,
    num_samples_per_split=8,
)

print(synthetic_dataset)
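
The roles of num_splits and num_samples_per_split can be pictured with the sketch below, which shuffles a list of private examples and cuts it into non-overlapping groups. The helper make_disjoint_splits and the placeholder texts are hypothetical; how the pack actually draws its splits is not shown here.

```python
import random


def make_disjoint_splits(items, num_splits, num_samples_per_split, rng):
    """Shuffle items and cut num_splits non-overlapping groups of equal size."""
    needed = num_splits * num_samples_per_split
    if needed > len(items):
        raise ValueError(f"need {needed} items but only {len(items)} are available")
    shuffled = list(items)
    rng.shuffle(shuffled)
    return [
        shuffled[i * num_samples_per_split : (i + 1) * num_samples_per_split]
        for i in range(num_splits)
    ]


# Placeholder private examples (made up for illustration).
texts = [f"private example {i}" for i in range(20)]
splits = make_disjoint_splits(
    texts, num_splits=2, num_samples_per_split=8, rng=random.Random(0)
)
print([len(s) for s in splits])
```

Because no example appears in more than one group, each private example can influence only one of the per-split prompts, which is why the original dataset must hold at least num_splits × num_samples_per_split examples in this sketch.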

Examples
