Shared dataset loading and prompt formatting for implicit-personalization projects

persona-data

Dataset loaders and prompt utilities for the implicit-personalization research effort, built around SynthPersona — an open synthetic-persona dataset for studying, steering, and personalizing language models.

The SynthPersona dataset

implicit-personalization/synth-persona is a fully open synthetic persona dataset (~1.41 GB, English) for research on implicit personalization, persona steering, and persona-grounded evaluation.

  • 1,000 personas built from structured seed attributes and expanded into biographies, interview transcripts, and supporting statements, plus a baseline_assistant control.
  • 788k QA rows across three axes:
    • type: explicit (supported by a seed/interview/statement) vs. implicit (inferred from the biography).
    • scope: individual (one persona) vs. shared (same item across all personas, directly comparable).
    • item_type: FRQ (free-response, for training) vs. MCQ (multiple-choice, for evaluation).
  • Shared MCQ banks: 418 implicit + 57 explicit items reused across personas, with a curated study_model_evaluable_v1 subset (231 items) for 7B-scale evaluation.
  • 18 topic groups (e.g. future_hopes_and_values, stress_coping_and_support) for sliced analyses.
  • Leakage-aware splits: each MCQ tracks its source FRQs/seeds (bank_id, related_frq_qids), so FRQ-train / MCQ-test splits avoid contamination.
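
| QA rows     | Implicit / FRQ | Explicit / FRQ | Explicit / MCQ | Implicit / Shared MCQ | Explicit / Shared MCQ |
|-------------|----------------|----------------|----------------|-----------------------|-----------------------|
| Count       | 40,000         | 174,336        | 98,156         | 418,000               | 57,000                |
| Per persona | 40             | ~174           | ~98            | 418 (shared bank)     | 57 (shared bank)      |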
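
As a quick sanity check on the table, the sketch below sums the five counts and divides each by the number of personas. It assumes that the first three columns are individual-scope rows and that the baseline_assistant control is excluded from the 1,000-persona denominator; neither assumption is stated by the table.

```python
# QA row counts copied from the table above.
# Assumption: the three non-shared columns are individual-scope rows.
counts = {
    ("implicit", "individual", "frq"): 40_000,
    ("explicit", "individual", "frq"): 174_336,
    ("explicit", "individual", "mcq"): 98_156,
    ("implicit", "shared", "mcq"): 418_000,
    ("explicit", "shared", "mcq"): 57_000,
}
N_PERSONAS = 1_000  # assumption: the baseline_assistant control is not counted

total = sum(counts.values())
per_persona = {key: n / N_PERSONAS for key, n in counts.items()}

print(f"total QA rows: {total:,}")  # total QA rows: 787,492
for (qa_type, scope, item_type), mean in per_persona.items():
    print(f"{qa_type:8s} {scope:10s} {item_type}: {mean:g} per persona")
```

The total of 787,492 matches the "788k QA rows" figure after rounding, and the shared banks come out at exactly 418 and 57 items per persona.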

See the dataset card for the full schema.
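
The leakage-aware split described in the list above can be illustrated on toy rows. This is only a sketch of one possible policy (drop every training FRQ that a test MCQ lists as a source), not the package's implementation. The `QARow` class and the `qid` field are hypothetical; `related_frq_qids` is the field name mentioned above.

```python
from dataclasses import dataclass, field


@dataclass
class QARow:
    """Toy stand-in for a QA row; not the real dataset schema."""
    qid: str        # hypothetical identifier field
    item_type: str  # "frq" or "mcq"
    related_frq_qids: list[str] = field(default_factory=list)


def leakage_free_split(rows):
    """Return (train_frqs, test_mcqs), removing source FRQs of test MCQs from train."""
    test = [r for r in rows if r.item_type == "mcq"]
    # Every FRQ that some test MCQ was derived from is blocked from training.
    blocked = {qid for r in test for qid in r.related_frq_qids}
    train = [r for r in rows if r.item_type == "frq" and r.qid not in blocked]
    return train, test


rows = [
    QARow("frq-1", "frq"),
    QARow("frq-2", "frq"),
    QARow("frq-3", "frq"),
    QARow("mcq-1", "mcq", related_frq_qids=["frq-2"]),
]
train, test = leakage_free_split(rows)
print([r.qid for r in train])  # ['frq-1', 'frq-3']
```

An alternative policy would keep all FRQs and drop the affected MCQs from the test set instead; which one applies depends on whether training or evaluation coverage matters more.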

Installation

pip install persona-data    # or: uv add persona-data

The dataset is downloaded from Hugging Face on first use and cached locally.
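
To warm the cache ahead of time (for example on a machine that will later run offline), something like the following may work. It is a sketch that assumes the package reads from the standard Hugging Face cache and that the repository ID is the dataset name given above; neither is confirmed here. `prefetch_dataset` is a hypothetical helper, while `snapshot_download` is part of the `huggingface_hub` library.

```python
REPO_ID = "implicit-personalization/synth-persona"  # assumption: the Hugging Face repo ID


def prefetch_dataset(repo_id: str = REPO_ID) -> str:
    """Download a dataset snapshot into the local Hugging Face cache and return its path."""
    # Imported lazily so that defining this helper does not require huggingface_hub.
    from huggingface_hub import snapshot_download

    return snapshot_download(repo_id=repo_id, repo_type="dataset")
```

Calling `prefetch_dataset()` once downloads the full snapshot (about 1.41 GB according to the description above), so it is best run where bandwidth and disk space allow.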

Quick start

from persona_data.prompts import format_messages, format_prompt
from persona_data.synth_persona import SynthPersonaDataset
from persona_data.templated import swap_attribute

# Load the dataset (downloaded and cached on first use) and pick one persona.
dataset = SynthPersonaDataset()
persona = dataset[0]

# Build a system prompt from the persona's biography view.
system_prompt = format_prompt(persona, "biography")
messages = [
    {"role": "system", "content": system_prompt},
    {"role": "user", "content": "Where did you grow up?"},
]

# Leakage-aware split: individual FRQs for train, shared MCQs for test.
train_qa, test_qa = dataset.train_test_split(persona.id)

# Slice by topic or by curated evaluation subset.
religion = dataset.get_qa(persona.id, type="implicit",
                          topic_group_id="religion_spirituality_and_meaning")
eval_mc = dataset.get_qa(persona.id, item_type="mcq",
                         question_set="study_model_evaluable_v1")

# Minimal-pair counterfactual: same persona, one attribute swapped in the
# templated view (binary attributes default to the opposite value).
base, swapped = swap_attribute(dataset, persona.id, "speak_other_language")

Pass sample_size=N to load only the first N personas.
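
Once a model has answered the MCQs, topic slices like the ones above lend themselves to per-topic accuracy. The sketch below works on plain dictionaries rather than on the package's own row objects; the `predicted` and `answer` keys are hypothetical, and `topic_group_id` mirrors the keyword used in the quick start.

```python
from collections import defaultdict


def accuracy_by_topic(results):
    """Map each topic_group_id to the fraction of correctly answered items."""
    hits = defaultdict(int)
    totals = defaultdict(int)
    for row in results:
        topic = row["topic_group_id"]
        totals[topic] += 1
        hits[topic] += int(row["predicted"] == row["answer"])
    return {topic: hits[topic] / totals[topic] for topic in totals}


# Toy results; real rows would come from running a model on the MCQ items.
results = [
    {"topic_group_id": "future_hopes_and_values", "predicted": "A", "answer": "A"},
    {"topic_group_id": "future_hopes_and_values", "predicted": "B", "answer": "C"},
    {"topic_group_id": "stress_coping_and_support", "predicted": "D", "answer": "D"},
]
scores = accuracy_by_topic(results)
print(scores)  # {'future_hopes_and_values': 0.5, 'stress_coping_and_support': 1.0}
```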

What else is in the package

  • SynthPersonaDataset — personas + QA pairs (docs)
  • NemotronPersonasFranceDataset / NemotronPersonasUSADataset — NVIDIA persona-only datasets (docs)
  • prompts — roleplay and multiple-choice formatting helpers (docs)
  • templated — single-attribute counterfactual swaps on the templated view (docs)
  • environment: set_seed, get_device, get_artifacts_dir
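
The single-attribute counterfactual idea behind templated can be shown on a plain dictionary. This is a toy illustration, not the package's swap_attribute (which, per the quick start, takes the dataset, a persona ID, and an attribute name). `toy_swap_attribute` and the attribute values are hypothetical.

```python
def toy_swap_attribute(attributes: dict, key: str, new_value=None):
    """Return (base, swapped) copies that differ only in `key`.

    Boolean attributes default to the opposite value; any other attribute
    needs an explicit new_value.
    """
    base = dict(attributes)
    swapped = dict(attributes)
    if new_value is None:
        if not isinstance(base[key], bool):
            raise ValueError(f"{key!r} is not binary; pass new_value explicitly")
        new_value = not base[key]
    swapped[key] = new_value
    return base, swapped


attrs = {"speak_other_language": True, "age": 34}  # toy values
base_attrs, swapped_attrs = toy_swap_attribute(attrs, "speak_other_language")
changed = [k for k in base_attrs if base_attrs[k] != swapped_attrs[k]]
print(changed)  # ['speak_other_language']
```

Because the two copies differ in exactly one attribute, any difference in model behavior between them can be attributed to that attribute alone.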

Full API reference: https://implicit-personalization.github.io/persona-data/.

Used by

Citation

If you use SynthPersona, please cite the dataset card and link back to this repo.
