Shared dataset loading and prompt formatting for implicit-personalization projects

Project description

persona-data

Dataset loaders and prompt utilities for the implicit-personalization research effort, built around SynthPersona — an open synthetic-persona dataset for studying, steering, and personalizing language models.

The SynthPersona dataset

implicit-personalization/synth-persona is a fully open synthetic persona dataset (~1.41 GB, English) for research on implicit personalization, persona steering, and persona-grounded evaluation.

- 1,000 personas built from structured seed attributes and expanded into biographies, interview transcripts, and supporting statements, plus a baseline_assistant control.
- 788k QA rows across three axes:
  - type: explicit (supported by a seed/interview/statement) vs. implicit (inferred from the biography).
  - scope: individual (one persona) vs. shared (same item across all personas, directly comparable).
  - item_type: FRQ (free-response, for training) vs. MCQ (multiple-choice, for evaluation).
- Shared MCQ banks: 418 implicit + 57 explicit items reused across personas, with a curated study_model_evaluable_v1 subset (231 items) for 7B-scale evaluation.
- 18 topic groups (e.g. future_hopes_and_values, stress_coping_and_support) for sliced analyses.
- Leakage-aware splits: each MCQ tracks its source FRQs/seeds (bank_id, related_frq_qids), so FRQ-train / MCQ-test splits avoid contamination.
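
The leakage-aware idea can be sketched on plain dictionaries. This is only an illustration, not the package's implementation: the row layout (keys qid, item_type, related_frq_qids) is assumed from the field names mentioned above, and leakage_free_split is a hypothetical helper.

```python
from typing import Dict, List, Tuple

Row = Dict[str, object]


def leakage_free_split(rows: List[Row]) -> Tuple[List[Row], List[Row]]:
    """Put MCQs in test and keep only FRQs that no test MCQ was derived from."""
    test = [r for r in rows if r["item_type"] == "mcq"]
    # Collect every FRQ id that some test MCQ lists as a source.
    blocked = {qid for r in test for qid in r.get("related_frq_qids", [])}
    train = [
        r for r in rows
        if r["item_type"] == "frq" and r["qid"] not in blocked
    ]
    return train, test


# Hypothetical rows; the real schema is described on the dataset card.
rows = [
    {"qid": "frq-1", "item_type": "frq"},
    {"qid": "frq-2", "item_type": "frq"},
    {"qid": "mcq-1", "item_type": "mcq", "related_frq_qids": ["frq-2"]},
]
train, test = leakage_free_split(rows)
print([r["qid"] for r in train])  # ['frq-1']
print([r["qid"] for r in test])   # ['mcq-1']
```

Here frq-2 is dropped from training because mcq-1 names it as a source, which is the kind of contamination the tracked ids are meant to prevent.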

QA rows	Implicit / FRQ	Explicit / FRQ	Explicit / MCQ	Implicit / Shared MCQ	Explicit / Shared MCQ
Count	40,000	174,336	98,156	418,000	57,000
Per persona	40	~174	~98	418 (shared bank)	57 (shared bank)
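
As a quick arithmetic check, the counts in the table can be summed to confirm the "788k" figure and the per-persona averages. The dictionary keys below are illustrative labels, not dataset field names, and the division assumes the 1,000 personas stated above.

```python
# Row counts copied from the table above.
counts = {
    "implicit_frq": 40_000,
    "explicit_frq": 174_336,
    "explicit_mcq": 98_156,
    "implicit_shared_mcq": 418_000,
    "explicit_shared_mcq": 57_000,
}
n_personas = 1_000

total = sum(counts.values())
per_persona = {name: n / n_personas for name, n in counts.items()}

print(total)                        # 787492, i.e. roughly 788k
print(per_persona["explicit_frq"])  # 174.336
```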

See the dataset card for the full schema.

Installation

pip install persona-data    # or: uv add persona-data

The dataset is downloaded from Hugging Face on first use and cached locally.

Quick start

from persona_data.synth_persona import SynthPersonaDataset
from persona_data.prompts import format_prompt, format_messages

dataset = SynthPersonaDataset()
persona = dataset[0]

system_prompt = format_prompt(persona, "biography")
messages = [
    {"role": "system", "content": system_prompt},
    {"role": "user", "content": "Where did you grow up?"},
]

# Leakage-aware split: individual FRQs for train, shared MCQs for test.
train_qa, test_qa = dataset.train_test_split(persona.id)

# Slice by topic or by curated evaluation subset.
religion = dataset.get_qa(persona.id, type="implicit",
                          topic_group_id="religion_spirituality_and_meaning")
eval_mc  = dataset.get_qa(persona.id, item_type="mcq",
                          question_set="study_model_evaluable_v1")

# Minimal-pair counterfactual: same persona, one attribute swapped in the
# templated view (binary attributes default to the opposite value).
from persona_data.templated import swap_attribute
base, swapped = swap_attribute(dataset, persona.id, "speak_other_language")

Pass sample_size=N to load only the first N personas.
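
The minimal-pair idea behind swap_attribute can be illustrated without the package. The sketch below is an assumption about the general technique, not the library's code: swap_one is a hypothetical helper, and the attribute dictionary (including the age field) is made up for the example.

```python
from typing import Dict, Optional, Tuple

Attributes = Dict[str, object]


def swap_one(
    attrs: Attributes, key: str, new_value: Optional[object] = None
) -> Tuple[Attributes, Attributes]:
    """Return (base, swapped) copies that differ only in `key`."""
    if key not in attrs:
        raise KeyError(key)
    old = attrs[key]
    if new_value is None:
        # Binary attributes default to the opposite value.
        if not isinstance(old, bool):
            raise ValueError(f"{key!r} is not binary; pass new_value explicitly")
        new_value = not old
    swapped = dict(attrs)
    swapped[key] = new_value
    return dict(attrs), swapped


# Hypothetical attribute view of one persona.
attrs = {"speak_other_language": True, "age": 34}
base, swapped = swap_one(attrs, "speak_other_language")
changed = [k for k in base if base[k] != swapped[k]]
print(changed)  # ['speak_other_language']
```

Because everything except the chosen key is copied unchanged, any difference in downstream model behavior can be attributed to that single attribute.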

What else is in the package

- SynthPersonaDataset: personas + QA pairs
- NemotronPersonasFranceDataset / NemotronPersonasUSADataset: NVIDIA persona-only datasets
- prompts: roleplay and multiple-choice formatting helpers
- templated: single-attribute counterfactual swaps on the templated view
- environment: set_seed, get_device, get_artifacts_dir

Full API reference: https://implicit-personalization.github.io/persona-data/.

Used by

- persona-vectors: activation extraction and steering
- persona-2-lora: LoRA-based persona internalization

Citation

If you use SynthPersona, please cite the dataset card and link back to this repo.

Release history

Version	Release date
0.7.0 (this version)	Jun 25, 2026
0.6.0	May 16, 2026
0.5.2	May 14, 2026
0.5.1	May 13, 2026
0.5.0	May 12, 2026
0.4.2	May 8, 2026
0.4.1	May 7, 2026
0.4.0	May 6, 2026
0.3.4	May 4, 2026
0.3.3	May 3, 2026
0.3.1	May 3, 2026
0.2.7	May 1, 2026
0.2.6	May 1, 2026
0.2.5	Apr 29, 2026
0.2.4	Apr 29, 2026
0.2.3	Apr 29, 2026
0.2.2	Apr 20, 2026
0.2.1	Apr 20, 2026
0.2.0	Apr 20, 2026
0.1.0	Apr 9, 2026

Download files

Source Distribution

persona_data-0.7.0.tar.gz (13.3 kB)

Uploaded Jun 25, 2026 Source

Built Distribution

persona_data-0.7.0-py3-none-any.whl (16.3 kB)

Uploaded Jun 25, 2026 Python 3

File details

Details for the file persona_data-0.7.0.tar.gz.

File metadata

Download URL: persona_data-0.7.0.tar.gz
Upload date: Jun 25, 2026
Size: 13.3 kB
Tags: Source
Uploaded using Trusted Publishing? Yes
Uploaded via: uv/0.11.24

File hashes

Hashes for persona_data-0.7.0.tar.gz
Algorithm	Hash digest
SHA256	`f592b8d58e3c4154de69c66f2a9c099ecabf2655ad5f2de9ff2a7bc809ec3149`
MD5	`3afd320ec3c60d0a785c24a69270cff1`
BLAKE2b-256	`ff7f7b2da3be51f5fadb38ffe3c403b785a93a16ec6b700f0f825f74ba1229a7`
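
A downloaded archive can be checked against the SHA256 digest listed above using only the standard library. This is a minimal sketch: sha256_of and matches are illustrative helper names, and the local file path is an assumption about where the download was saved.

```python
import hashlib
from pathlib import Path


def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Hash the file in chunks so large archives need not fit in memory."""
    digest = hashlib.sha256()
    with path.open("rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches(path: Path, expected_hex: str) -> bool:
    """Compare the file's digest with a published hex digest, ignoring case."""
    return sha256_of(path) == expected_hex.strip().lower()


# SHA256 digest copied from the table above.
EXPECTED = "f592b8d58e3c4154de69c66f2a9c099ecabf2655ad5f2de9ff2a7bc809ec3149"

archive = Path("persona_data-0.7.0.tar.gz")  # assumed local download path
if archive.exists():
    print("OK" if matches(archive, EXPECTED) else "MISMATCH")
```

pip can also enforce digests at install time through its hash-checking mode (the --require-hashes option with pinned hashes in a requirements file).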

File details

Details for the file persona_data-0.7.0-py3-none-any.whl.

File metadata

Download URL: persona_data-0.7.0-py3-none-any.whl
Upload date: Jun 25, 2026
Size: 16.3 kB
Tags: Python 3
Uploaded using Trusted Publishing? Yes
Uploaded via: uv/0.11.24

File hashes

Hashes for persona_data-0.7.0-py3-none-any.whl
Algorithm	Hash digest
SHA256	`1c2a88e37045c536444e1704f0b4f2d884cfa82d407a7d1d119c60c52b14a199`
MD5	`535a96d2a25e8f9c10f3bfdffa677280`
BLAKE2b-256	`7f09dfda2ca7add73cd8ed405433541bc2b3b1c062fac92f20255e6f6a04caa7`
