Shared dataset loading and prompt formatting for implicit-personalization projects
Project description
persona-data
Dataset loaders and prompt utilities for the implicit-personalization research effort, built around SynthPersona — an open synthetic-persona dataset for studying, steering, and personalizing language models.
The SynthPersona dataset
implicit-personalization/synth-persona is a fully open synthetic persona dataset (~1.41 GB, English) for research on implicit personalization, persona steering, and persona-grounded evaluation.
- 1,000 personas built from structured seed attributes and expanded into biographies, interview transcripts, and supporting statements, plus a baseline_assistant control.
- 788k QA rows across three axes:
  - type: explicit (supported by a seed/interview/statement) vs. implicit (inferred from the biography).
  - scope: individual (one persona) vs. shared (same item across all personas, directly comparable).
  - item_type: FRQ (free-response, for training) vs. MCQ (multiple-choice, for evaluation).
- Shared MCQ banks: 418 implicit + 57 explicit items reused across personas, with a curated study_model_evaluable_v1 subset (231 items) for 7B-scale evaluation.
- 18 topic groups (e.g. future_hopes_and_values, stress_coping_and_support) for sliced analyses.
- Leakage-aware splits: each MCQ tracks its source FRQs/seeds (bank_id, related_frq_qids), so FRQ-train / MCQ-test splits avoid contamination.
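The idea behind the leakage-aware splits can be sketched in plain Python. This is an illustration only, not the package's implementation: the rows are made up, and every field name other than bank_id and related_frq_qids (for example qid) is an assumption of this sketch rather than the confirmed schema.

```python
# Hypothetical rows. Only bank_id and related_frq_qids are named in the
# description above; qid and the row contents are assumptions.
frq_rows = [
    {"qid": "frq-1", "question": "Where did you grow up?"},
    {"qid": "frq-2", "question": "What do you do to unwind?"},
    {"qid": "frq-3", "question": "Which languages do you speak?"},
]
mcq_rows = [
    {"qid": "mcq-1", "bank_id": "bank-a", "related_frq_qids": ["frq-1"]},
    {"qid": "mcq-2", "bank_id": "bank-b", "related_frq_qids": []},
]


def leakage_free_train(frqs, test_mcqs):
    """Drop every FRQ that a held-out MCQ lists as one of its sources."""
    blocked = {qid for mcq in test_mcqs for qid in mcq["related_frq_qids"]}
    return [row for row in frqs if row["qid"] not in blocked]


train = leakage_free_train(frq_rows, mcq_rows)
print([row["qid"] for row in train])  # ['frq-2', 'frq-3']
```

Here frq-1 is excluded from training because mcq-1 was derived from it, so a model never trains on the free-response answer that a test item probes.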
| QA rows | Implicit / FRQ | Explicit / FRQ | Explicit / MCQ | Implicit / Shared MCQ | Explicit / Shared MCQ |
|---|---|---|---|---|---|
| Count | 40,000 | 174,336 | 98,156 | 418,000 | 57,000 |
| Per persona | 40 | ~174 | ~98 | 418 (shared bank) | 57 (shared bank) |
See the dataset card for the full schema.
Installation
pip install persona-data # or: uv add persona-data
The dataset is downloaded from Hugging Face on first use and cached locally.
Quick start
from persona_data.synth_persona import SynthPersonaDataset
from persona_data.prompts import format_prompt, format_messages

dataset = SynthPersonaDataset()
persona = dataset[0]
system_prompt = format_prompt(persona, "biography")
messages = [
    {"role": "system", "content": system_prompt},
    {"role": "user", "content": "Where did you grow up?"},
]

# Leakage-aware split: individual FRQs for train, shared MCQs for test.
train_qa, test_qa = dataset.train_test_split(persona.id)

# Slice by topic or by curated evaluation subset.
religion = dataset.get_qa(persona.id, type="implicit",
                          topic_group_id="religion_spirituality_and_meaning")
eval_mc = dataset.get_qa(persona.id, item_type="mcq",
                         question_set="study_model_evaluable_v1")

# Minimal-pair counterfactual: same persona, one attribute swapped in the
# templated view (binary attributes default to the opposite value).
from persona_data.templated import swap_attribute
base, swapped = swap_attribute(dataset, persona.id, "speak_other_language")
Pass sample_size=N to load only the first N personas.
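To make the minimal-pair idea concrete, here is a small self-contained sketch on a plain dictionary. It is not how swap_attribute is implemented: the helper swap_binary_attribute, the bool encoding of the attribute, and the age field are all assumptions made for illustration.

```python
from copy import deepcopy


def swap_binary_attribute(attributes, name):
    """Return (base, swapped): two copies that differ only in `name`.

    Hypothetical helper: assumes binary attributes are stored as bools.
    """
    if not isinstance(attributes[name], bool):
        raise ValueError(f"{name!r} is not a binary attribute")
    base = deepcopy(attributes)
    swapped = deepcopy(attributes)
    swapped[name] = not swapped[name]  # flip to the opposite value
    return base, swapped


# Made-up persona attributes for the sketch.
attrs = {"speak_other_language": True, "age": 34}
base, swapped = swap_binary_attribute(attrs, "speak_other_language")

changed = [key for key in base if base[key] != swapped[key]]
print(changed)  # ['speak_other_language']
```

Because exactly one attribute differs between the two views, any difference in model behavior across the pair can be attributed to that attribute.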
What else is in the package
- SynthPersonaDataset: personas + QA pairs (docs)
- NemotronPersonasFranceDataset / NemotronPersonasUSADataset: NVIDIA persona-only datasets (docs)
- prompts: roleplay and multiple-choice formatting helpers (docs)
- templated: single-attribute counterfactual swaps on the templated view (docs)
- environment: set_seed, get_device, get_artifacts_dir
Full API reference: https://implicit-personalization.github.io/persona-data/.
Used by
- persona-vectors — activation extraction and steering
- persona-2-lora — LoRA-based persona internalization
Citation
If you use SynthPersona, please cite the dataset card and link back to this repo.
File details
Details for the file persona_data-0.7.0.tar.gz.
File metadata
- Download URL: persona_data-0.7.0.tar.gz
- Upload date:
- Size: 13.3 kB
- Tags: Source
- Uploaded using Trusted Publishing? Yes
- Uploaded via: uv/0.11.24 {"installer":{"name":"uv","version":"0.11.24","subcommand":["publish"]},"python":null,"implementation":{"name":null,"version":null},"distro":{"name":"Ubuntu","version":"24.04","id":"noble","libc":null},"system":{"name":null,"release":null},"cpu":null,"openssl_version":null,"setuptools_version":null,"rustc_version":null,"ci":true}
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | f592b8d58e3c4154de69c66f2a9c099ecabf2655ad5f2de9ff2a7bc809ec3149 |
| MD5 | 3afd320ec3c60d0a785c24a69270cff1 |
| BLAKE2b-256 | ff7f7b2da3be51f5fadb38ffe3c403b785a93a16ec6b700f0f825f74ba1229a7 |
File details
Details for the file persona_data-0.7.0-py3-none-any.whl.
File metadata
- Download URL: persona_data-0.7.0-py3-none-any.whl
- Upload date:
- Size: 16.3 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? Yes
- Uploaded via: uv/0.11.24 {"installer":{"name":"uv","version":"0.11.24","subcommand":["publish"]},"python":null,"implementation":{"name":null,"version":null},"distro":{"name":"Ubuntu","version":"24.04","id":"noble","libc":null},"system":{"name":null,"release":null},"cpu":null,"openssl_version":null,"setuptools_version":null,"rustc_version":null,"ci":true}
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 1c2a88e37045c536444e1704f0b4f2d884cfa82d407a7d1d119c60c52b14a199 |
| MD5 | 535a96d2a25e8f9c10f3bfdffa677280 |
| BLAKE2b-256 | 7f09dfda2ca7add73cd8ed405433541bc2b3b1c062fac92f20255e6f6a04caa7 |