Shared dataset loading and prompt formatting for implicit-personalization projects
persona-data
Shared dataset loading, prompt formatting, and environment utilities for the implicit-personalization projects.
Overview
persona-data provides the common dataset and prompt helpers used across the persona projects:
- SynthPersonaDataset for persona profiles plus QA pairs
- PersonaGuessDataset for turn-based persona games
- NemotronPersonasFranceDataset for French persona profiles from NVIDIA
- NemotronPersonasUSADataset for US persona profiles from NVIDIA
- prompt helpers for roleplay and multiple-choice evaluation
- environment helpers for seeds, devices, and artifact paths
Installation
Add as a uv git source in your project's pyproject.toml:
[project]
dependencies = ["persona-data"]
[tool.uv.sources]
persona-data = { git = "ssh://git@github.com/implicit-personalization/persona-data.git" }
Then run uv sync.
For local development alongside other repos, use an editable path source:
[tool.uv.sources]
persona-data = { path = "../persona-data", editable = true }
Testing
uv run --with pytest pytest tests/test_datasets.py
The release workflow also runs tests/smoke_test.py against the built wheel and source distribution.
Package layout
src/persona_data/
├── __init__.py
├── synth_persona.py # SynthPersonaDataset, PersonaDataset, PersonaData, QAPair, Statement
├── persona_guess.py # PersonaGuessDataset, GameRecord, Turn
├── nemotron_personas.py # NemotronPersonasFranceDataset, NemotronPersonasUSADataset
├── prompts.py # format_prompt, format_mc_question, format_messages
└── environment.py # set_seed, get_device, get_artifacts_dir
Datasets
Each dataset lives in its own module, with its own types and a loader. Loaders download from Hugging Face, and the downloads are cached under HF_HOME.
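Since the loaders cache through Hugging Face, the cache location can be redirected by pointing HF_HOME somewhere else. The following is a minimal sketch, assuming a hypothetical cache directory under the system temp folder; the persona_data import is left commented out so the snippet runs without the package installed.

```python
import os
import tempfile
from pathlib import Path

# Hypothetical cache location; any writable directory works.
cache_dir = Path(tempfile.gettempdir()) / "persona-data-hf-cache"
cache_dir.mkdir(parents=True, exist_ok=True)

# Set HF_HOME before importing anything that pulls in huggingface_hub,
# because the cache path is typically resolved at import time.
os.environ["HF_HOME"] = str(cache_dir)

# from persona_data.synth_persona import SynthPersonaDataset
# dataset = SynthPersonaDataset()  # downloads would then land under cache_dir
```

Setting the variable in the shell or in a job script before launching Python has the same effect and avoids import-order surprises.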
SynthPersona
from persona_data.synth_persona import SynthPersonaDataset
dataset = SynthPersonaDataset()
persona = dataset[0]
persona.name # "Ethan Robinson"
persona.templated_view # short attribute-based system prompt
persona.biography_view # full biography text
persona.statements # list of Statement
qa_pairs = dataset.get_qa(persona.id, type="implicit", item_type="mcq")
# Leakage-aware split: train on individual FRQs, test on shared MCQs.
train_qa, test_qa = dataset.train_test_split(persona.id, n_train=50)
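The idea behind a leakage-aware split can be illustrated without the package. The sketch below is not the library's implementation: ToyQA, its item_type and scope fields, and the random sampling of the training set are all assumptions made for illustration. It only shows the principle that training items (persona-specific free-response questions) and test items (shared multiple-choice questions) come from disjoint pools.

```python
import random
from dataclasses import dataclass


@dataclass(frozen=True)
class ToyQA:
    """Hypothetical stand-in for a QA item; not the library's QAPair."""
    question: str
    item_type: str  # "frq" or "mcq"
    scope: str      # "individual" or "shared"


def toy_train_test_split(items, n_train, seed=0):
    """Train on persona-specific FRQs, test on shared MCQs."""
    train_pool = [q for q in items if q.item_type == "frq" and q.scope == "individual"]
    test = [q for q in items if q.item_type == "mcq" and q.scope == "shared"]
    # Sample without replacement; cap at the pool size to avoid ValueError.
    rng = random.Random(seed)
    train = rng.sample(train_pool, k=min(n_train, len(train_pool)))
    return train, test


items = (
    [ToyQA(f"frq-{i}", "frq", "individual") for i in range(5)]
    + [ToyQA(f"mcq-{i}", "mcq", "shared") for i in range(3)]
)
toy_train, toy_test = toy_train_test_split(items, n_train=2)
```

Because the two pools are filtered on different item types, no question can appear on both sides of the split.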
PersonaGuess
from persona_data.persona_guess import PersonaGuessDataset
games = PersonaGuessDataset()
game = games[0]
turns = games.get_qa(game.game_id, player="A")
Prompt formatting
from persona_data.prompts import format_messages, format_prompt
system_prompt = format_prompt(persona, "biography")
messages = [
{"role": "system", "content": system_prompt},
{"role": "user", "content": "Where did you grow up?"},
{"role": "assistant", "content": "I grew up in Little Rock, Arkansas."},
]
full_prompt, response_start_idx = format_messages(messages, tokenizer)
format_prompt accepts a PersonaData plus one of the standard variants ("templated" or "biography"), or raw profile text. It also accepts mode="roleplay" (default) and mode="conversational".
The persona-less Assistant baseline is just another persona in the dataset under BASELINE_PERSONA_ID ("baseline_assistant"). It appears in normal iteration when loaded, and dataset.baseline retrieves it directly:
dataset = SynthPersonaDataset()
baseline = dataset.baseline # PersonaData | None
system_prompt = format_prompt(baseline, "templated")
Use BASELINE_PERSONA_ID and BASELINE_PERSONA_NAME (both in persona_data.prompts) for artifact naming and UI labels.
For multiple-choice prompts, use format_mc_question(qa) to render the question, choices, and trailing answer-only instruction. Use mc_answer_only_instruction(n_choices) if you need just the instruction text, and mc_correct_letter(qa) to get the gold label.
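To make the layout concrete, here is a rough sketch of what a rendered multiple-choice prompt could look like. The function names, the "A. choice" line format, and the instruction wording are assumptions for illustration; the output of format_mc_question and mc_answer_only_instruction may differ.

```python
import string


def answer_only_instruction(n_choices: int) -> str:
    """Hypothetical wording; the library's instruction text may differ."""
    letters = ", ".join(string.ascii_uppercase[:n_choices])
    return f"Answer with a single letter ({letters}) and nothing else."


def render_mc_question(question: str, choices: list[str]) -> str:
    """Render the question, lettered choices, and a trailing instruction."""
    lines = [question]
    for letter, choice in zip(string.ascii_uppercase, choices):
        lines.append(f"{letter}. {choice}")
    lines.append(answer_only_instruction(len(choices)))
    return "\n".join(lines)


def correct_letter(correct_index: int) -> str:
    """Map a zero-based gold index to its choice letter."""
    return string.ascii_uppercase[correct_index]


mc_text = render_mc_question(
    "Where did you grow up?",
    ["Little Rock, Arkansas", "Denver, Colorado", "Boston, Massachusetts"],
)
```

Keeping the instruction in its own helper makes it possible to reuse the same answer constraint when the question text is assembled elsewhere.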
format_messages handles tokenizers that do not support the "system" role (for example Gemma 2) by merging system content into the first user message. Pass add_generation_prompt=True to render an inference-ready prompt (messages ending in a user turn); the returned response_start_idx then equals the prompt length, ready to slice model.generate output.
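The system-role fallback described above can be sketched in plain Python. This is an illustration of the merging idea only, not the code inside format_messages: the function name, the blank-line separator, and the fallback when no user turn exists are assumptions.

```python
def merge_system_into_first_user(messages):
    """Return a copy of messages with system content folded into the first user turn."""
    system_parts = [m["content"] for m in messages if m["role"] == "system"]
    # Copy each dict so the caller's messages are left untouched.
    rest = [dict(m) for m in messages if m["role"] != "system"]
    if not system_parts:
        return rest
    prefix = "\n\n".join(system_parts)
    for m in rest:
        if m["role"] == "user":
            m["content"] = f"{prefix}\n\n{m['content']}"
            break
    else:
        # No user turn to merge into; fall back to a leading user message.
        rest.insert(0, {"role": "user", "content": prefix})
    return rest


example = [
    {"role": "system", "content": "You are Ethan Robinson."},
    {"role": "user", "content": "Where did you grow up?"},
]
merged = merge_system_into_first_user(example)
```

After merging, the conversation contains only user and assistant turns, so it can be passed to a chat template that rejects the "system" role.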
Environment helpers
from persona_data.environment import set_seed, get_device, get_artifacts_dir
set_seed(1337) # sets random, numpy, and torch seeds
device = get_device() # cuda > mps > cpu
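For readers who want to see what such helpers usually involve, the sketch below shows one plausible way to seed the common libraries and pick a device in cuda > mps > cpu order. The sketch_ names are hypothetical, the optional imports are an assumption so the snippet runs without numpy or torch, and returning a plain string is a simplification; the package's own helpers may behave differently.

```python
import random


def sketch_set_seed(seed: int) -> None:
    """Seed random, and numpy/torch when they are installed."""
    random.seed(seed)
    try:
        import numpy as np
        np.random.seed(seed)
    except ImportError:
        pass
    try:
        import torch
        torch.manual_seed(seed)
        if torch.cuda.is_available():
            torch.cuda.manual_seed_all(seed)
    except ImportError:
        pass


def sketch_get_device() -> str:
    """Prefer CUDA, then Apple MPS, then CPU."""
    try:
        import torch
    except ImportError:
        return "cpu"
    if torch.cuda.is_available():
        return "cuda"
    # Older torch builds may not expose the mps backend at all.
    mps = getattr(torch.backends, "mps", None)
    if mps is not None and mps.is_available():
        return "mps"
    return "cpu"


sketch_set_seed(1337)
chosen_device = sketch_get_device()
```

Seeding alone does not guarantee bit-identical results on GPU, since some kernels are nondeterministic, but it removes the most common sources of run-to-run variation.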
Used by
- persona-vectors — activation extraction and steering
- cues_attribution — section-level ablation attribution
- persona-2-lora — LoRA-based persona internalization