Shared dataset loading and prompt formatting for implicit-personalization projects
persona-data
Shared dataset loading, prompt formatting, and environment utilities for the implicit-personalization projects.
Overview
persona-data provides the common dataset and prompt helpers used across the persona projects:
- SynthPersonaDataset for persona profiles plus QA pairs
- PersonaGuessDataset for turn-based persona games
- NemotronPersonasFranceDataset for French persona profiles from NVIDIA
- NemotronPersonasUSADataset for US persona profiles from NVIDIA
- prompt helpers for roleplay and multiple-choice evaluation
- environment helpers for seeds, devices, and artifact paths
Installation
Add as a uv git source in your project's pyproject.toml:
[project]
dependencies = ["persona-data"]
[tool.uv.sources]
persona-data = { git = "ssh://git@github.com/implicit-personalization/persona-data.git" }
Then run uv sync.
For local development alongside other repos, use an editable path source:
[tool.uv.sources]
persona-data = { path = "../persona-data", editable = true }
Testing
uv run --with pytest pytest tests/test_datasets.py
Package layout
src/persona_data/
├── __init__.py
├── synth_persona.py # SynthPersonaDataset, PersonaDataset, PersonaData, QAPair, BiographySection
├── persona_guess.py # PersonaGuessDataset, GameRecord, Turn
├── nemotron_personas.py # NemotronPersonasFranceDataset, NemotronPersonasUSADataset
├── prompts.py # format_roleplay_prompt, system_prompt_for_variant, format_mc_question, format_messages
└── environment.py # load_env, set_seed, get_device, get_artifacts_dir
Datasets
Each dataset module defines its own types and a loader that downloads the data from Hugging Face; downloads are cached under HF_HOME.
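Because downloads land under HF_HOME, it helps to know where that resolves on a given machine. The sketch below mirrors the usual Hugging Face lookup order (HF_HOME, else $XDG_CACHE_HOME/huggingface, else ~/.cache/huggingface). `resolve_hf_home` is a hypothetical helper written for illustration and is not part of persona-data.

```python
import os
from pathlib import Path


def resolve_hf_home(env=None):
    """Return the directory Hugging Face libraries would use as their cache root.

    Hypothetical helper for illustration only; not part of persona-data.
    """
    env = os.environ if env is None else env
    # An explicit HF_HOME always wins.
    if env.get("HF_HOME"):
        return Path(env["HF_HOME"]).expanduser()
    # Otherwise fall back to the XDG cache dir, then to ~/.cache.
    xdg = env.get("XDG_CACHE_HOME")
    base = Path(xdg).expanduser() if xdg else Path.home() / ".cache"
    return base / "huggingface"


cache_root = resolve_hf_home()
```

To move the cache, set HF_HOME in the environment before the Python process starts, since Hugging Face libraries typically read it at import time.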
SynthPersona
from persona_data.synth_persona import SynthPersonaDataset
dataset = SynthPersonaDataset()
persona = dataset[0]
persona.name # "Ethan Robinson"
persona.templated_view # short attribute-based system prompt
persona.biography_view # full biography text
persona.sections # list of BiographySection
qa_pairs = dataset.get_qa(persona.id, type="implicit", difficulty=[1, 2])
questions = dataset.questions(persona.id, type="explicit")
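The type and difficulty arguments above act as filters over a persona's QA pairs. The sketch below shows one way such filtering could work on plain in-memory data. `ToyQA` and `filter_qa` are hypothetical stand-ins with toy rows; the real QAPair fields and the exact get_qa semantics may differ.

```python
from dataclasses import dataclass


@dataclass
class ToyQA:
    # Hypothetical fields; the real QAPair may be shaped differently.
    question: str
    answer: str
    type: str
    difficulty: int


def filter_qa(pairs, type=None, difficulty=None):
    """Keep pairs matching an optional type and an optional set of difficulties."""
    # Accept a single int as shorthand for a one-element list.
    if isinstance(difficulty, int):
        difficulty = [difficulty]
    wanted = set(difficulty) if difficulty is not None else None
    return [
        qa
        for qa in pairs
        if (type is None or qa.type == type)
        and (wanted is None or qa.difficulty in wanted)
    ]


# Toy rows, invented purely for the example.
pairs = [
    ToyQA("question 1", "answer 1", "explicit", 1),
    ToyQA("question 2", "answer 2", "implicit", 2),
    ToyQA("question 3", "answer 3", "implicit", 3),
]
easy_implicit = filter_qa(pairs, type="implicit", difficulty=[1, 2])
```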
PersonaGuess
from persona_data.persona_guess import PersonaGuessDataset
games = PersonaGuessDataset()
game = games[0]
turns = games.get_qa(game.game_id, player="A")
questions = games.questions(game.game_id, player="B")
Prompt formatting
from persona_data.prompts import format_messages, format_roleplay_prompt
system_prompt = format_roleplay_prompt(persona.biography_view)
messages = [
{"role": "system", "content": system_prompt},
{"role": "user", "content": "Where did you grow up?"},
{"role": "assistant", "content": "I grew up in Little Rock, Arkansas."},
]
full_prompt, response_start_idx = format_messages(messages, tokenizer)
format_roleplay_prompt supports mode="roleplay" (default) and mode="conversational".
Use system_prompt_for_variant(persona, variant) when iterating over persona variants: it returns a persona-less prompt for "baseline" and reads the persona's <variant>_view attribute otherwise.
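The variant lookup described above could be sketched as follows. This is an illustrative guess rather than the library's code: `ToyPersona`, `BASELINE_PROMPT`, and `pick_system_prompt` are hypothetical, and only the <variant>_view attribute convention comes from the description.

```python
from dataclasses import dataclass

# Hypothetical persona-less prompt; the library's actual baseline text is not shown here.
BASELINE_PROMPT = "You are a helpful assistant."


@dataclass
class ToyPersona:
    # Stand-in for a persona object exposing *_view attributes.
    templated_view: str
    biography_view: str


def pick_system_prompt(persona, variant):
    """Return a persona-less prompt for "baseline", else the persona's <variant>_view."""
    if variant == "baseline":
        return BASELINE_PROMPT
    attr = f"{variant}_view"
    if not hasattr(persona, attr):
        raise ValueError(f"unknown variant: {variant!r}")
    return getattr(persona, attr)


persona = ToyPersona(
    templated_view="short attribute summary",
    biography_view="full biography text",
)
prompts = {v: pick_system_prompt(persona, v) for v in ("baseline", "templated", "biography")}
```

Failing loudly on an unknown variant is a design choice of this sketch; it keeps a typo in a variant name from silently producing a baseline run.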
For multiple-choice prompts, use format_mc_question(qa) to render the question, choices, and trailing answer-only instruction. Use mc_answer_only_instruction(n_choices) if you need just the instruction text, and mc_correct_letter(qa) to get the gold label.
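To make the three pieces concrete (rendered question, answer-only instruction, gold letter), here is a minimal sketch of one possible layout. `ToyMCQuestion`, `render_mc_question`, `answer_only_instruction`, and `correct_letter` are hypothetical names, and the instruction wording and choice layout are assumptions; the library's actual output may look different.

```python
from dataclasses import dataclass
from string import ascii_uppercase


@dataclass
class ToyMCQuestion:
    # Hypothetical fields; the real QA type may store choices and the gold answer differently.
    question: str
    choices: list
    correct_index: int


def answer_only_instruction(n_choices):
    """Ask for a single letter among the first n_choices letters."""
    options = "/".join(ascii_uppercase[:n_choices])
    return f"Answer with a single letter ({options}) and nothing else."


def render_mc_question(qa):
    """Render the question, lettered choices, and the trailing instruction."""
    lines = [qa.question]
    lines += [f"{ascii_uppercase[i]}. {choice}" for i, choice in enumerate(qa.choices)]
    lines.append(answer_only_instruction(len(qa.choices)))
    return "\n".join(lines)


def correct_letter(qa):
    """Map the gold choice index to its letter label."""
    return ascii_uppercase[qa.correct_index]


# Toy question, invented purely for the example.
qa = ToyMCQuestion("Which option is first?", ["first", "second", "third"], 0)
text = render_mc_question(qa)
gold = correct_letter(qa)
```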
format_messages handles tokenizers that do not support the "system" role (for example Gemma 2) by merging system content into the first user message.
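The merging step can be sketched on plain message dicts, independent of any tokenizer. `merge_system_into_first_user` is a hypothetical helper; the blank-line separator and the fallback for conversations without a user turn are assumptions, and how format_messages detects missing system-role support is not shown here.

```python
def merge_system_into_first_user(messages):
    """Return a new message list with system content folded into the first user turn."""
    system_parts = [m["content"] for m in messages if m["role"] == "system"]
    # Copy the remaining messages so the caller's list is left untouched.
    rest = [dict(m) for m in messages if m["role"] != "system"]
    if not system_parts:
        return rest
    system_text = "\n\n".join(system_parts)
    for m in rest:
        if m["role"] == "user":
            m["content"] = f"{system_text}\n\n{m['content']}"
            return rest
    # No user turn to merge into: fall back to a leading user message.
    return [{"role": "user", "content": system_text}] + rest


messages = [
    {"role": "system", "content": "You are roleplaying a persona."},
    {"role": "user", "content": "Where did you grow up?"},
    {"role": "assistant", "content": "I grew up in Little Rock, Arkansas."},
]
merged = merge_system_into_first_user(messages)
```

The merged list contains only user and assistant turns, so it can be passed to a chat template that rejects the "system" role.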
Environment helpers
from persona_data.environment import load_env, set_seed, get_device, get_artifacts_dir
load_env() # loads .env from cwd (searches parent dirs)
set_seed(1337) # sets random, numpy, and torch seeds
device = get_device() # cuda > mps > cpu
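For readers curious what helpers like these typically do, here is a minimal sketch of the seeding and the cuda > mps > cpu priority. `pick_device` and `seed_everything` are hypothetical names, not the library's implementation; NumPy and PyTorch are treated as optional so the snippet runs with the standard library alone.

```python
import random


def pick_device(cuda_available, mps_available):
    """Apply the cuda > mps > cpu priority to two availability flags."""
    if cuda_available:
        return "cuda"
    if mps_available:
        return "mps"
    return "cpu"


def seed_everything(seed):
    """Seed Python's RNG, plus NumPy and PyTorch when they are installed."""
    random.seed(seed)
    try:
        import numpy as np

        np.random.seed(seed)
    except ImportError:
        pass
    try:
        import torch

        torch.manual_seed(seed)
    except ImportError:
        pass


seed_everything(1337)
device = pick_device(cuda_available=False, mps_available=False)
```

With PyTorch installed, the two flags would usually come from `torch.cuda.is_available()` and `torch.backends.mps.is_available()`.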
Used by
- persona-vectors — activation extraction and steering
- cues_attribution — section-level ablation attribution
- persona-2-lora — LoRA-based persona internalization
File details
Details for the file persona_data-0.2.2.tar.gz.
File metadata
- Download URL: persona_data-0.2.2.tar.gz
- Upload date:
- Size: 8.5 kB
- Tags: Source
- Uploaded using Trusted Publishing? Yes
- Uploaded via: uv/0.11.7 {"installer":{"name":"uv","version":"0.11.7","subcommand":["publish"]},"python":null,"implementation":{"name":null,"version":null},"distro":{"name":"Ubuntu","version":"24.04","id":"noble","libc":null},"system":{"name":null,"release":null},"cpu":null,"openssl_version":null,"setuptools_version":null,"rustc_version":null,"ci":true}
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 668abd6b1fb5bdf92de89195d0de69a1f1a89b0d18d87c9fd9f50b2cae84dc6f |
| MD5 | 15d040b36cdd3b437a82139f264d24f6 |
| BLAKE2b-256 | 35aee5648d098773786d667d80fc6fa786158a61991abe43cf1d9921060cdb08 |
File details
Details for the file persona_data-0.2.2-py3-none-any.whl.
File metadata
- Download URL: persona_data-0.2.2-py3-none-any.whl
- Upload date:
- Size: 11.2 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? Yes
- Uploaded via: uv/0.11.7 {"installer":{"name":"uv","version":"0.11.7","subcommand":["publish"]},"python":null,"implementation":{"name":null,"version":null},"distro":{"name":"Ubuntu","version":"24.04","id":"noble","libc":null},"system":{"name":null,"release":null},"cpu":null,"openssl_version":null,"setuptools_version":null,"rustc_version":null,"ci":true}
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 89704980f770502ad2208e7873ebbd81734f7d168f5c075641d21d38d91d65a1 |
| MD5 | 9857a588f7fea0f0a84e1f360e48be60 |
| BLAKE2b-256 | 4bbd3c2f360ffd4e5d18ce2355b1bdfc494febd6ebdea7da9c5f4833d7131ee3 |