persona-data

Shared dataset loading, prompt formatting, and environment utilities for the implicit-personalization projects.

Overview

persona-data provides the common dataset and prompt helpers used across the persona projects:

SynthPersonaDataset for persona profiles plus QA pairs
PersonaGuessDataset for turn-based persona games
NemotronPersonasFranceDataset for French persona profiles from NVIDIA
NemotronPersonasUSADataset for US persona profiles from NVIDIA
prompt helpers for roleplay and multiple-choice evaluation
environment helpers for seeds, devices, and artifact paths

Installation

Add persona-data as a uv git source in your project's pyproject.toml:

[project]
dependencies = ["persona-data"]

[tool.uv.sources]
persona-data = { git = "ssh://git@github.com/implicit-personalization/persona-data.git" }

Then run uv sync.

For local development alongside other repos, use an editable path source:

[tool.uv.sources]
persona-data = { path = "../persona-data", editable = true }

Testing

uv run --with pytest pytest tests/test_datasets.py

The release workflow also runs tests/smoke_test.py against the built wheel and source distribution.

Package layout

src/persona_data/
├── __init__.py
├── synth_persona.py       # SynthPersonaDataset, PersonaDataset, PersonaData, QAPair, Statement
├── persona_guess.py       # PersonaGuessDataset, GameRecord, Turn
├── nemotron_personas.py   # NemotronPersonasFranceDataset, NemotronPersonasUSADataset
├── prompts.py             # format_prompt, format_mc_question, format_messages
└── environment.py         # set_seed, get_device, get_artifacts_dir

Datasets

Each dataset lives in its own module, with its own types and a loader that downloads the data from Hugging Face. Downloads are cached under HF_HOME.

SynthPersona

from persona_data.synth_persona import SynthPersonaDataset

dataset = SynthPersonaDataset()

persona = dataset[0]
persona.name              # "Ethan Robinson"

qa_pairs = dataset.get_qa(persona.id, type="implicit", item_type="mcq")

# Leakage-aware split: train on individual FRQs, test on shared MCQs.
train_qa, test_qa = dataset.train_test_split(persona.id)

# Optional cap if you want a smaller train slice:
# train_qa, test_qa = dataset.train_test_split(persona.id, n_train=50)
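
The leakage-aware split mentioned in the comments above can be pictured with a small standalone sketch. It is not the library's implementation: the `QA` dataclass, its `item_type` values ("frq" and "mcq"), and the `split_by_item_type` helper are assumptions made for illustration. The idea is that free-response items go to the training side and multiple-choice items to the test side, so no item lands on both.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass(frozen=True)
class QA:
    """Hypothetical stand-in for a QA record; not the persona_data QAPair type."""
    question: str
    answer: str
    item_type: str  # assumed values: "frq" (free response) or "mcq" (multiple choice)


def split_by_item_type(
    items: List[QA], n_train: Optional[int] = None
) -> Tuple[List[QA], List[QA]]:
    """Put FRQs in train and MCQs in test; optionally cap the train slice."""
    train = [qa for qa in items if qa.item_type == "frq"]
    test = [qa for qa in items if qa.item_type == "mcq"]
    if n_train is not None:
        train = train[:n_train]
    return train, test


# Made-up example records, for illustration only.
items = [
    QA("Where did you grow up?", "Little Rock, Arkansas.", "frq"),
    QA("What do you do on weekends?", "I go hiking.", "frq"),
    QA("Which city are you from?", "B", "mcq"),
]
train_qa, test_qa = split_by_item_type(items, n_train=1)
```

Splitting on item type rather than at random keeps the two sides disjoint by construction, which is the property the real train_test_split is described as providing.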

PersonaGuess

from persona_data.persona_guess import PersonaGuessDataset

games = PersonaGuessDataset()
game = games[0]
turns = games.get_qa(game.game_id, player="A")

Prompt formatting

from persona_data.prompts import format_messages, format_prompt

# `persona` is a PersonaData, e.g. dataset[0] from the SynthPersona example.
system_prompt = format_prompt(persona, "biography")

messages = [
    {"role": "system", "content": system_prompt},
    {"role": "user", "content": "Where did you grow up?"},
    {"role": "assistant", "content": "I grew up in Little Rock, Arkansas."},
]
# `tokenizer` must already be loaded by the caller; it is not defined in this snippet.
full_prompt, response_start_idx = format_messages(messages, tokenizer)

format_prompt accepts either a PersonaData plus one of the standard variants ("templated" or "biography"), or raw profile text. Its mode argument is "roleplay" (the default) or "conversational".

The persona-less Assistant baseline is just another persona in the dataset under BASELINE_PERSONA_ID ("baseline_assistant"). It appears in normal iteration when loaded, and dataset.baseline retrieves it directly:

dataset = SynthPersonaDataset()
baseline = dataset.baseline  # PersonaData | None
system_prompt = format_prompt(baseline, "templated")

Use BASELINE_PERSONA_ID and BASELINE_PERSONA_NAME from persona_data.synth_persona for artifact naming and UI labels.

For multiple-choice prompts, use format_mc_question(qa) to render the question, choices, and trailing answer-only instruction. Use mc_answer_only_instruction(n_choices) if you need just the instruction text, and mc_correct_letter(qa) to get the gold label.
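
As a rough picture of what a multiple-choice prompt with a trailing answer-only instruction might look like, here is a standalone sketch. The function names, the `question`, `choices`, and `correct_index` arguments, and the exact instruction wording are all assumptions for illustration; the real format_mc_question, mc_answer_only_instruction, and mc_correct_letter may take different inputs and render things differently.

```python
import string
from typing import List


def answer_only_instruction(n_choices: int) -> str:
    """Hypothetical instruction text; the library's wording may differ."""
    letters = string.ascii_uppercase[:n_choices]
    return f"Answer with a single letter ({', '.join(letters)}) and nothing else."


def render_mc_question(question: str, choices: List[str]) -> str:
    """Render the question, lettered choices, and the answer-only instruction."""
    lines = [question]
    for letter, choice in zip(string.ascii_uppercase, choices):
        lines.append(f"{letter}. {choice}")
    lines.append(answer_only_instruction(len(choices)))
    return "\n".join(lines)


def correct_letter(correct_index: int) -> str:
    """Map the zero-based index of the gold choice to its letter label."""
    return string.ascii_uppercase[correct_index]


# Made-up question and choices, for illustration only.
prompt = render_mc_question(
    "Where did you grow up?",
    ["Austin, Texas", "Little Rock, Arkansas", "Portland, Oregon"],
)
gold = correct_letter(1)
```

Keeping the instruction in its own function mirrors the split described above: the same text can be reused when only the instruction is needed.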

format_messages handles tokenizers that do not support the "system" role (for example, Gemma 2) by merging the system content into the first user message. To render an inference-ready prompt, pass add_generation_prompt=True with messages that end in a user turn; the returned response_start_idx then equals the prompt length, so it can be used to slice the output of model.generate.
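
The system-role fallback can be sketched without a tokenizer. This illustrates the general technique only, not the code inside format_messages: the helper name `merge_system_into_first_user` and the blank-line separator between system and user text are assumptions.

```python
from typing import Dict, List

Message = Dict[str, str]


def merge_system_into_first_user(messages: List[Message]) -> List[Message]:
    """Fold all system content into the first user message (hypothetical helper)."""
    system_parts = [m["content"] for m in messages if m["role"] == "system"]
    # Copy the remaining messages so the caller's list is left untouched.
    rest = [dict(m) for m in messages if m["role"] != "system"]
    if not system_parts:
        return rest
    system_text = "\n\n".join(system_parts)
    for m in rest:
        if m["role"] == "user":
            m["content"] = f"{system_text}\n\n{m['content']}"
            return rest
    # No user turn to merge into: fall back to a leading user message.
    return [{"role": "user", "content": system_text}] + rest


messages = [
    {"role": "system", "content": "You are Ethan Robinson."},
    {"role": "user", "content": "Where did you grow up?"},
    {"role": "assistant", "content": "I grew up in Little Rock, Arkansas."},
]
merged = merge_system_into_first_user(messages)
```

The merged list contains only user and assistant turns, so it can be passed to a chat template that rejects the "system" role.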

Environment helpers

from persona_data.environment import set_seed, get_device, get_artifacts_dir

set_seed(1337)        # sets random, numpy, and torch seeds
device = get_device() # cuda > mps > cpu
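
For readers who want to see what helpers like these typically do, here is a minimal sketch under stated assumptions. `seed_everything` and `pick_device` are hypothetical names, not the persona_data functions. The sketch seeds `random`, also seeds NumPy and PyTorch when they happen to be installed, and then picks a device in the cuda > mps > cpu order given in the comment above.

```python
import random


def seed_everything(seed: int) -> None:
    """Seed Python's RNG, plus NumPy and PyTorch if they are importable."""
    random.seed(seed)
    try:
        import numpy as np
        np.random.seed(seed)
    except ImportError:
        pass
    try:
        import torch
        torch.manual_seed(seed)
    except ImportError:
        pass


def pick_device() -> str:
    """Return "cuda", "mps", or "cpu", preferring them in that order."""
    try:
        import torch
    except ImportError:
        return "cpu"
    if torch.cuda.is_available():
        return "cuda"
    # Older PyTorch builds may not expose the MPS backend at all.
    mps = getattr(torch.backends, "mps", None)
    if mps is not None and mps.is_available():
        return "mps"
    return "cpu"


seed_everything(1337)
first_draw = random.random()
seed_everything(1337)
second_draw = random.random()
device = pick_device()
```

Reseeding with the same value reproduces the same draw, which is the property a shared seed helper is meant to give experiments that span several repos.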

Used by

persona-vectors — activation extraction and steering
cues_attribution — section-level ablation attribution
persona-2-lora — LoRA-based persona internalization

Project details

Release history

0.7.0: Jun 25, 2026
0.6.0: May 16, 2026
0.5.2: May 14, 2026
0.5.1: May 13, 2026
0.5.0: May 12, 2026 (this version)
0.4.2: May 8, 2026
0.4.1: May 7, 2026
0.4.0: May 6, 2026
0.3.4: May 4, 2026
0.3.3: May 3, 2026
0.3.1: May 3, 2026
0.2.7: May 1, 2026
0.2.6: May 1, 2026
0.2.5: Apr 29, 2026
0.2.4: Apr 29, 2026
0.2.3: Apr 29, 2026
0.2.2: Apr 20, 2026
0.2.1: Apr 20, 2026
0.2.0: Apr 20, 2026
0.1.0: Apr 9, 2026

Download files

Source Distribution

persona_data-0.5.0.tar.gz (10.0 kB)

Uploaded May 12, 2026 Source

Built Distribution

persona_data-0.5.0-py3-none-any.whl (12.7 kB)

Uploaded May 12, 2026 Python 3

File details

Details for the file persona_data-0.5.0.tar.gz.

File metadata

Download URL: persona_data-0.5.0.tar.gz
Upload date: May 12, 2026
Size: 10.0 kB
Tags: Source
Uploaded using Trusted Publishing? Yes
Uploaded via: uv/0.11.14 (publish subcommand, run in CI on Ubuntu 24.04 noble)

File hashes

Hashes for persona_data-0.5.0.tar.gz
Algorithm	Hash digest
SHA256	`16500b6aa2423a1adcc5f9a7f411a4b5df70a493897f1c86818e7d438b43d8f9`
MD5	`a927113618f6af5668155aec1565f8e5`
BLAKE2b-256	`2dbff1e7b1ac192818b631fe4e408965d70e36e4b10eea8527b210d8b9793576`

File details

Details for the file persona_data-0.5.0-py3-none-any.whl.

File metadata

Download URL: persona_data-0.5.0-py3-none-any.whl
Upload date: May 12, 2026
Size: 12.7 kB
Tags: Python 3
Uploaded using Trusted Publishing? Yes
Uploaded via: uv/0.11.14 (publish subcommand, run in CI on Ubuntu 24.04 noble)

File hashes

Hashes for persona_data-0.5.0-py3-none-any.whl
Algorithm	Hash digest
SHA256	`2683efbe0bc3b240d0ef5c70469fff7f5fe7baba40f2dd8a337d1164766e5a01`
MD5	`fd857d8a61af9de4e13ac229f5d30556`
BLAKE2b-256	`9f69da9c4a7e5e9291eae6551d12790791cd5b59ea5946a0d72c553793f092e3`
