persona-data

Shared dataset loading, prompt formatting, and environment utilities for the implicit-personalization projects.

Overview

persona-data provides the common dataset and prompt helpers used across the persona projects:

  • SynthPersonaDataset for persona profiles plus QA pairs
  • PersonaGuessDataset for turn-based persona games
  • NemotronPersonasFranceDataset for French persona profiles from NVIDIA
  • NemotronPersonasUSADataset for US persona profiles from NVIDIA
  • prompt helpers for roleplay and multiple-choice evaluation
  • environment helpers for seeds, devices, and artifact paths

Installation

Add persona-data as a uv git source in your project's pyproject.toml:

[project]
dependencies = ["persona-data"]

[tool.uv.sources]
persona-data = { git = "ssh://git@github.com/implicit-personalization/persona-data.git" }

Then run uv sync.

For local development alongside other repos, use an editable path source:

[tool.uv.sources]
persona-data = { path = "../persona-data", editable = true }

Testing

uv run --with pytest pytest tests/test_datasets.py

The release workflow also runs tests/smoke_test.py against the built wheel and source distribution.

Package layout

src/persona_data/
├── __init__.py
├── synth_persona.py       # SynthPersonaDataset, PersonaDataset, PersonaData, QAPair, BiographySection, Statement
├── persona_guess.py       # PersonaGuessDataset, GameRecord, Turn
├── nemotron_personas.py   # NemotronPersonasFranceDataset, NemotronPersonasUSADataset
├── prompts.py             # format_prompt, format_mc_question, format_messages
└── environment.py         # load_env, set_seed, get_device, get_artifacts_dir

Datasets

Each dataset lives in its own module, with its own types and a loader that downloads the data from Hugging Face. Downloads are cached under HF_HOME.
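
Because the loaders cache through HF_HOME, the cache location can be redirected by setting that variable before any Hugging Face library (or persona_data) is imported. A minimal sketch, assuming a hypothetical cache directory under the system temp directory:

```python
import os
import tempfile
from pathlib import Path

# Hypothetical cache location (an assumption); any writable directory works.
cache_dir = Path(tempfile.gettempdir()) / "hf-cache"
cache_dir.mkdir(parents=True, exist_ok=True)

# Set HF_HOME before importing Hugging Face libraries, since they
# typically read it once at import time.
os.environ["HF_HOME"] = str(cache_dir)

print(os.environ["HF_HOME"])
```

Setting the variable in the shell or in a .env file loaded at startup should have the same effect, as long as it happens before the first import.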

SynthPersona

from persona_data.synth_persona import SynthPersonaDataset

dataset = SynthPersonaDataset()

persona = dataset[0]
persona.name              # "Ethan Robinson"
persona.templated_view    # short attribute-based system prompt
persona.biography_view    # full biography text
persona.sections          # list of BiographySection

qa_pairs = dataset.get_qa(persona.id, type="implicit", difficulty=[1, 2])
questions = dataset.questions(persona.id, type="explicit")
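
The get_qa call above filters by question type and by a list of difficulty levels. The sketch below illustrates that kind of filter on in-memory records. QARecord, its field names, and filter_qa are hypothetical stand-ins for illustration, not the library's QAPair type or its implementation.

```python
from dataclasses import dataclass
from typing import Iterable, List, Optional, Sequence


@dataclass(frozen=True)
class QARecord:
    """Hypothetical stand-in for a QA pair; the field names are assumptions."""
    persona_id: str
    qa_type: str
    difficulty: int
    question: str
    answer: str


def filter_qa(
    records: Iterable[QARecord],
    persona_id: str,
    qa_type: Optional[str] = None,
    difficulty: Optional[Sequence[int]] = None,
) -> List[QARecord]:
    """Keep records for one persona, optionally narrowed by type and difficulty."""
    selected = []
    for record in records:
        if record.persona_id != persona_id:
            continue
        if qa_type is not None and record.qa_type != qa_type:
            continue
        if difficulty is not None and record.difficulty not in difficulty:
            continue
        selected.append(record)
    return selected


records = [
    QARecord("p1", "implicit", 1, "Q1", "A1"),
    QARecord("p1", "implicit", 3, "Q2", "A2"),
    QARecord("p1", "explicit", 2, "Q3", "A3"),
    QARecord("p2", "implicit", 1, "Q4", "A4"),
]
easy_implicit = filter_qa(records, "p1", qa_type="implicit", difficulty=[1, 2])
```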

PersonaGuess

from persona_data.persona_guess import PersonaGuessDataset

games = PersonaGuessDataset()
game = games[0]
turns = games.get_qa(game.game_id, player="A")
questions = games.questions(game.game_id, player="B")

Prompt formatting

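from persona_data.prompts import format_messages, format_prompt

# `persona` is a PersonaData, e.g. dataset[0] from the SynthPersona example.
# `tokenizer` is a Hugging Face tokenizer for the target model, loaded elsewhere.
system_prompt = format_prompt(persona, "biography")

messages = [
    {"role": "system", "content": system_prompt},
    {"role": "user", "content": "Where did you grow up?"},
    {"role": "assistant", "content": "I grew up in Little Rock, Arkansas."},
]
# Returns the rendered prompt and the index where the response begins.
full_prompt, response_start_idx = format_messages(messages, tokenizer)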

format_prompt accepts one of three inputs: raw profile text; a PersonaData together with one of the standard variants ("templated" or "biography"); or no persona at all, which produces the "baseline" prompt. The mode argument selects "roleplay" (the default) or "conversational".

When iterating over variants in an experiment, pass "templated" or "biography" to format_prompt(persona, variant). Calling format_prompt() yields the persona-less Assistant baseline prompt. Use BASELINE_PERSONA_ID and BASELINE_PERSONA_NAME from persona_data.prompts for the shared baseline identity in artifacts and UI labels.

For multiple-choice prompts, use format_mc_question(qa) to render the question, choices, and trailing answer-only instruction. Use mc_answer_only_instruction(n_choices) if you need just the instruction text, and mc_correct_letter(qa) to get the gold label.
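
To illustrate what a rendered multiple-choice prompt with a trailing answer-only instruction can look like, here is a small sketch. The helpers render_mc, answer_only_instruction, and correct_letter are hypothetical; the exact wording and layout produced by format_mc_question may differ.

```python
import string
from typing import Sequence


def answer_only_instruction(n_choices: int) -> str:
    """Build an instruction listing the valid answer letters (assumed wording)."""
    letters = string.ascii_uppercase[:n_choices]
    return f"Answer with a single letter ({', '.join(letters)}) and nothing else."


def render_mc(question: str, choices: Sequence[str]) -> str:
    """Render the question, lettered choices, and the answer-only instruction."""
    if not 2 <= len(choices) <= 26:
        raise ValueError("expected between 2 and 26 choices")
    lines = [question]
    for letter, choice in zip(string.ascii_uppercase, choices):
        lines.append(f"{letter}. {choice}")
    lines.append(answer_only_instruction(len(choices)))
    return "\n".join(lines)


def correct_letter(correct_index: int) -> str:
    """Map a zero-based index of the correct choice to its letter label."""
    return string.ascii_uppercase[correct_index]


prompt = render_mc("Where did you grow up?", ["Little Rock", "Austin", "Boise"])
gold = correct_letter(0)
```

Keeping the instruction in its own function mirrors the split described above, where the instruction text can be requested separately from the full prompt.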

format_messages handles tokenizers that do not support the "system" role (for example Gemma 2) by merging system content into the first user message. Pass add_generation_prompt=True to render an inference-ready prompt (messages ending in a user turn); the returned response_start_idx then equals the prompt length, ready to slice model.generate output.
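
The system-role fallback described above can be sketched as follows. This is an illustration of the idea, not the library's implementation: merge_system_into_first_user is a hypothetical helper, and the blank-line separator between system and user text is an assumption.

```python
from typing import Dict, List

Message = Dict[str, str]


def merge_system_into_first_user(messages: List[Message]) -> List[Message]:
    """Return a copy of messages with system content folded into the first user turn."""
    system_parts = [m["content"] for m in messages if m["role"] == "system"]
    rest = [dict(m) for m in messages if m["role"] != "system"]
    if not system_parts:
        return rest

    system_text = "\n\n".join(system_parts)
    for message in rest:
        if message["role"] == "user":
            # Prepend the system text to the first user message only.
            message["content"] = f"{system_text}\n\n{message['content']}"
            return rest

    # No user turn exists: carry the system text in a new leading user turn.
    return [{"role": "user", "content": system_text}] + rest


original = [
    {"role": "system", "content": "You are Ethan Robinson."},
    {"role": "user", "content": "Where did you grow up?"},
    {"role": "assistant", "content": "I grew up in Little Rock, Arkansas."},
]
merged = merge_system_into_first_user(original)
```

The input list is left unmodified, so the same messages can still be rendered for tokenizers that do support the "system" role.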

Environment helpers

from persona_data.environment import load_env, set_seed, get_device, get_artifacts_dir

load_env()            # loads .env, searching cwd and then parent dirs
set_seed(1337)        # sets random, numpy, and torch seeds
device = get_device() # cuda > mps > cpu
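
For readers who want to see what the seeding and device-preference behavior amounts to, here is a rough sketch. seed_everything and pick_device are hypothetical names, and treating numpy and torch as optional imports is an assumption made so the sketch runs without them; the library's own helpers may behave differently.

```python
import random


def seed_everything(seed: int) -> None:
    """Seed Python's random module, plus numpy and torch when installed."""
    random.seed(seed)
    try:
        import numpy as np
        np.random.seed(seed)
    except ImportError:
        pass
    try:
        import torch
        torch.manual_seed(seed)
        if torch.cuda.is_available():
            torch.cuda.manual_seed_all(seed)
    except ImportError:
        pass


def pick_device() -> str:
    """Prefer cuda, then mps, then cpu; fall back to cpu without torch."""
    try:
        import torch
    except ImportError:
        return "cpu"
    if torch.cuda.is_available():
        return "cuda"
    mps = getattr(torch.backends, "mps", None)
    if mps is not None and mps.is_available():
        return "mps"
    return "cpu"


seed_everything(1337)
first_draw = random.random()
seed_everything(1337)
second_draw = random.random()
device_name = pick_device()
```

Re-seeding with the same value reproduces the same draw from Python's random module, which is the property an experiment script typically relies on.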

Used by
