Process, transform, and analyze survey-format data with a clean and simple API.

survy

A Python library for survey data that treats multiselect questions as first-class variables, not DataFrame workarounds.

graph TD
    A[📂 Data Sources] --> B[⚙️ survy]
    B --> C[🔄 Transform]
    C --> D[📤 Export]
    C --> E[🔍 Explore]
    E --> F[📊 Analyze]

📦 Why survy?

Survey data has a construct that no general-purpose Python tool handles correctly: multiselect ("choose all that apply") questions, where one respondent selects multiple answers. Raw data stores these as either multiple columns (hobby_1, hobby_2, hobby_3) or delimited strings ("Sport;Book"), but pandas treats them as unrelated columns or plain text. Every project, you rewrite the same boilerplate to split, group, count, and export them, and get it subtly wrong (counting responses instead of respondents, losing column groupings, breaking on format changes).
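
The difference between counting responses and counting respondents can be seen without any library. The following is a minimal sketch in plain Python on made-up data; the variable names are hypothetical and this is not survy's implementation.

```python
# Hypothetical compact-format answers for three respondents (not survy's API).
raw_hobbies = ["Book;Sport", "Movie;Sport", "Movie"]

# Split each cell into the set of options that respondent selected.
selections = [set(cell.split(";")) for cell in raw_hobbies]

respondents = len(selections)                # number of people who answered
responses = sum(len(s) for s in selections)  # number of options ticked in total

sport_count = sum("Sport" in s for s in selections)

# Share of respondents who chose Sport (the usual reporting base).
share_of_respondents = sport_count / respondents
# Share of all ticks that were Sport (a different, usually unintended figure).
share_of_responses = sport_count / responses

print(respondents, responses, share_of_respondents, share_of_responses)
```

With a respondent base, the shares across options of one multiselect question can add up to more than 100%, which is expected.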

SPSS solved this decades ago with native multiple response sets. R has partial solutions scattered across expss, MRCV, and surveydata. But Python, the language AI coding tools actually generate, had nothing.

survy makes MULTISELECT a first-class variable type. Load your data, and survy auto-detects the format, merges columns into logical variables, and carries that type awareness through frequencies, crosstabs, filtering, and export. The correct code is also the simple code, which means AI assistants can generate it reliably too.


✨ Features

  • 🔹 Multiselect as a first-class concept: wide format is auto-detected, and compact format is detected on request (auto_detect=True)
  • 🔹 Read & write multiple formats: CSV, Excel, JSON, SPSS
  • 🔹 Built-in tools for validation, tracking, and analysis
  • 🔹 Cross-tabulation with significance testing
  • 🔹 AI-ready: ships with an agent skill so LLM coding assistants generate correct survy code

🚀 Installation

pip install survy

⚡ Quick Demo

import survy

# Load a CSV with wide multiselect columns (hobby_1, hobby_2, ...)
survey = survy.read_csv("data.csv")

# Or load a CSV with compact multiselect columns ("Sport;Book")
survey = survy.read_csv("data_compact.csv", auto_detect=True, compact_separator=";")

print(survey)
# Survey (4 variables)
#   Variable(id=gender, label=gender, value_indices={'Female': 1, 'Male': 2}, base=3)
#   Variable(id=yob, label=yob, value_indices={}, base=3)
#   Variable(id=hobby, label=hobby, value_indices={'Book': 1, 'Movie': 2, 'Sport': 3}, base=3)
#   Variable(id=animal, label=animal, value_indices={'Cat': 1, 'Dog': 2}, base=3)

# Both formats produce the same result
print(survey.get_df())
# ┌────────┬──────┬────────────────────┬────────────────┐
# │ gender ┆ yob  ┆ hobby              ┆ animal         │
# │ ---    ┆ ---  ┆ ---                ┆ ---            │
# │ str    ┆ i64  ┆ list[str]          ┆ list[str]      │
# ╞════════╪══════╪════════════════════╪════════════════╡
# │ Male   ┆ 2000 ┆ ["Book", "Sport"]  ┆ ["Cat", "Dog"] │
# │ Female ┆ 1999 ┆ ["Movie", "Sport"] ┆ ["Dog"]        │
# │ Male   ┆ 1998 ┆ ["Movie"]          ┆ ["Cat"]        │
# └────────┴──────┴────────────────────┴────────────────┘

# Frequencies
print(survey["gender"].frequencies)
# ┌────────┬───────┬────────────┐
# │ gender ┆ count ┆ proportion │
# │ ---    ┆ ---   ┆ ---        │
# │ str    ┆ u32   ┆ f64        │
# ╞════════╪═══════╪════════════╡
# │ Female ┆ 1     ┆ 0.333333   │
# │ Male   ┆ 2     ┆ 0.666667   │
# └────────┴───────┴────────────┘

# Crosstab with significance testing
print(survy.crosstab(survey["gender"], survey["hobby"]))
# {'Total': shape: (3, 3)
# ┌───────┬──────────┬────────────┐
# │ hobby ┆ Male (A) ┆ Female (B) │
# │ ---   ┆ ---      ┆ ---        │
# │ str   ┆ str      ┆ str        │
# ╞═══════╪══════════╪════════════╡
# │ Book  ┆ 1        ┆ 0          │
# │ Movie ┆ 1        ┆ 1          │
# │ Sport ┆ 1        ┆ 1          │
# └───────┴──────────┴────────────┘}

📥 Usage

Understanding Multiselect Formats

The key challenge with survey data is multiselect questions, where a respondent can choose multiple answers. Raw data encodes these in two different layouts, and survy handles both.

Wide format spreads each answer across its own column, using a shared prefix plus a separator and numeric suffix (e.g. _1, _2, ...):

gender yob hobby_1 hobby_2 hobby_3 animal_1 animal_2
Male 2000 Book Sport Cat Dog
Female 1999 Movie Dog
Male 1998 Movie Cat

survy groups columns by parsing the name with a name_pattern template (default "id(_multi)?"). The tokens id and multi are named placeholders, and _, ., : are recognized separators. So hobby_1 is parsed as id="hobby", multi="1", and all columns sharing the same id are merged into one multiselect variable. This happens automatically; no extra parameters are needed.
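
The grouping idea can be sketched with a regular expression that has named groups. This is only an illustration: the regex below is an assumed approximation of the "id(_multi)?" template, and group_wide_columns is a hypothetical helper, not survy's parser.

```python
import re
from collections import defaultdict

# Assumed regex equivalent of the default template: a base name, optionally
# followed by one of the separators _ . : and a numeric suffix.
WIDE_COLUMN = re.compile(r"^(?P<id>.+?)(?:[_.:](?P<multi>\d+))?$")


def group_wide_columns(columns):
    """Group non-empty column names by their parsed id, keeping column order."""
    groups = defaultdict(list)
    for name in columns:
        match = WIDE_COLUMN.match(name)
        groups[match.group("id")].append(name)
    return dict(groups)


columns = ["gender", "yob", "hobby_1", "hobby_2", "hobby_3", "animal_1", "animal_2"]
groups = group_wide_columns(columns)

# Ids that collected more than one column are the multiselect candidates.
multiselect_ids = [key for key, names in groups.items() if len(names) > 1]
print(groups)
print(multiselect_ids)
```

In this sketch a single-answer column that happens to end in a separator and a number (for example a hypothetical wave_2) would also be parsed as id plus suffix, which is the kind of case a custom name_pattern is meant to handle.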

Compact format stores all selected answers in a single cell, joined by a separator (typically ;):

gender  yob   hobby        animal
Male    2000  Book;Sport   Cat;Dog
Female  1999  Movie;Sport  Dog
Male    1998  Movie        Cat

survy splits these cells on the separator to recover individual choices. Because a semicolon could be regular text, compact format is not auto-detected by default: you must either list the compact columns with compact_ids or enable auto_detect=True.

After reading, both formats produce the exact same internal representation: a MULTISELECT variable with a sorted list of chosen values per respondent.
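
As a rough illustration of how both layouts can converge on the same sorted lists, here is a minimal sketch in plain Python. The helper names (detect_compact_columns, from_compact, from_wide) and the sample rows are hypothetical, and survy's actual detection logic may differ.

```python
def detect_compact_columns(table, separator=";"):
    """Return names of columns where at least one cell contains the separator."""
    return [
        name
        for name, cells in table.items()
        if any(isinstance(cell, str) and separator in cell for cell in cells)
    ]


def from_compact(cell, separator=";"):
    """Split one compact cell into a sorted list of choices."""
    return sorted(part.strip() for part in cell.split(separator) if part.strip())


def from_wide(cells):
    """Collect the non-empty cells of one respondent's wide columns, sorted."""
    return sorted(cell for cell in cells if cell)


compact_table = {
    "gender": ["Male", "Female", "Male"],
    "hobby": ["Book;Sport", "Movie;Sport", "Movie"],
    "animal": ["Cat;Dog", "Dog", "Cat"],
}
# The same hobby answers laid out as three wide columns per respondent.
wide_hobby_rows = [["Book", None, "Sport"], [None, "Movie", "Sport"], [None, "Movie", None]]

compact_columns = detect_compact_columns(compact_table)
hobby_from_compact = [from_compact(cell) for cell in compact_table["hobby"]]
hobby_from_wide = [from_wide(row) for row in wide_hobby_rows]
print(compact_columns, hobby_from_compact, hobby_from_wide)
```

A scan like this cannot see a compact column in which every respondent happened to pick a single option, which is one reason listing the columns explicitly can be the safer choice.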


Load Data

# Available read functions
survy.read_csv       # CSV files
survy.read_excel     # Excel files (.xlsx)
survy.read_json      # survy-format JSON
survy.read_polars    # Polars DataFrame already in memory

CSV / Excel

# Wide format: auto-detected, no special parameters needed
survey = survy.read_csv("data.csv")

# Compact format: explicitly specify which columns are compact
survey = survy.read_csv(
    "data_compact.csv",
    compact_ids=["hobby", "animal"],  # columns that use compact encoding
    compact_separator=";",            # delimiter inside cells
)

# Compact format: let survy scan for the separator automatically
survey = survy.read_csv(
    "data_compact.csv",
    auto_detect=True,          # scans all columns for the separator
    compact_separator=";",
)

# Mixed: some columns are wide, some are compact
# Wide is always auto-detected; just specify the compact ones
survey = survy.read_csv("data_mixed.csv", compact_ids=["Q5"], compact_separator=";")

# Custom name_pattern for wide detection if columns use a different naming convention
# Tokens: "id" (base name), "multi" (suffix). Separators: _ . :
# Example: "id.multi" would match "Q1.1", "Q1.2", etc.
survey = survy.read_csv("data.csv", name_pattern="id.multi")

# Excel: identical API
survey = survy.read_excel("data.xlsx", auto_detect=True, compact_separator=";")

Important: Do not combine auto_detect=True with compact_ids in the same call. Use one approach or the other.

JSON

The JSON file must follow survy's format: a "variables" array where each entry has "id", "data", "label", and "value_indices":

# Expected format: data.json
# {
#     "variables": [
#         {
#             "id": "gender",
#             "data": ["Male", "Female", "Male"],
#             "label": "",
#             "value_indices": {"Female": 1, "Male": 2}
#         },
#         {
#             "id": "yob",
#             "data": [2000, 1999, 1998],
#             "label": "",
#             "value_indices": {}
#         },
#         {
#             "id": "hobby",
#             "data": [["Book", "Sport"], ["Movie", "Sport"], ["Movie"]],
#             "label": "",
#             "value_indices": {"Book": 1, "Movie": 2, "Sport": 3}
#         }
#     ]
# }

survey = survy.read_json("data.json")

The "data" field varies by type: a flat list of strings for SELECT, a flat list of numbers for NUMBER, and a list of lists for MULTISELECT. The "value_indices" field should be {} for NUMBER variables.
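
A file in this layout can be produced with the standard json module. The sketch below writes the example payload to a temporary directory and reads it back; infer_kind is a hypothetical helper that only mirrors the data-shape rule described above and is not part of survy.

```python
import json
import tempfile
from pathlib import Path

payload = {
    "variables": [
        {"id": "gender", "data": ["Male", "Female", "Male"],
         "label": "", "value_indices": {"Female": 1, "Male": 2}},
        {"id": "yob", "data": [2000, 1999, 1998],
         "label": "", "value_indices": {}},
        {"id": "hobby", "data": [["Book", "Sport"], ["Movie", "Sport"], ["Movie"]],
         "label": "", "value_indices": {"Book": 1, "Movie": 2, "Sport": 3}},
    ]
}


def infer_kind(data):
    """Guess the variable kind from the shape of "data" (illustrative only)."""
    if all(isinstance(value, list) for value in data):
        return "MULTISELECT"
    if all(isinstance(value, (int, float)) for value in data):
        return "NUMBER"
    return "SELECT"


with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "data.json"
    path.write_text(json.dumps(payload, indent=4), encoding="utf-8")
    loaded = json.loads(path.read_text(encoding="utf-8"))

kinds = {entry["id"]: infer_kind(entry["data"]) for entry in loaded["variables"]}
print(kinds)
```

A file written this way could then be passed to survy.read_json as shown above.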

Polars DataFrame

import polars

df = polars.DataFrame({
    "gender": ["Male", "Female", "Male"],
    "yob": [2000, 1999, 1998],
    "hobby": ["Sport;Book", "Sport;Movie", "Movie"],
    "animal_1": ["Cat", "", "Cat"],
    "animal_2": ["Dog", "Dog", ""],
})

# Supports the same parameters as read_csv: compact_ids, auto_detect, name_pattern
survey = survy.read_polars(df, auto_detect=True, compact_separator=";")

Work with Survey

print(survey)
# Survey (4 variables)
#   Variable(id=gender, label=gender, value_indices={'Female': 1, 'Male': 2}, base=3)
#   Variable(id=yob, label=yob, value_indices={}, base=3)
#   Variable(id=hobby, label=hobby, value_indices={'Book': 1, 'Movie': 2, 'Sport': 3}, base=3)
#   Variable(id=animal, label=animal, value_indices={'Cat': 1, 'Dog': 2}, base=3)

Methods & Properties

Method / Property   Description
get_df()            Return survey data as a polars.DataFrame
update()            Update metadata (labels, value indices) of variables
add()               Add a variable to the survey
drop()              Remove a variable from the survey
filter()            Filter respondents by variable values (returns a new Survey)
sort()              Sort variables by given logic
to_csv()            Export to CSV (3 files: data + variable info + value mappings)
to_excel()          Export to Excel (same structure as CSV)
to_json()           Export to JSON
to_spss()           Export to SPSS format (.sav + .sps)
.variables          Collection of all variables
.sps                Render SPSS syntax string

DataFrame Output Formats

The get_df() method supports flexible output through select_dtype and multiselect_dtype parameters:

# Compact multiselect (list columns), the default
print(survey.get_df(select_dtype="text", multiselect_dtype="compact"))
# ┌────────┬──────┬────────────────────┬────────────────┐
# │ gender ┆ yob  ┆ hobby              ┆ animal         │
# │ ---    ┆ ---  ┆ ---                ┆ ---            │
# │ str    ┆ i64  ┆ list[str]          ┆ list[str]      │
# ╞════════╪══════╪════════════════════╪════════════════╡
# │ Male   ┆ 2000 ┆ ["Book", "Sport"]  ┆ ["Cat", "Dog"] │
# │ Female ┆ 1999 ┆ ["Movie", "Sport"] ┆ ["Dog"]        │
# │ Male   ┆ 1998 ┆ ["Movie"]          ┆ ["Cat"]        │
# └────────┴──────┴────────────────────┴────────────────┘

# Wide text format (split columns, numeric category codes)
print(survey.get_df(select_dtype="number", multiselect_dtype="text"))
# ┌────────┬──────┬─────────┬─────────┬─────────┬──────────┬──────────┐
# │ gender ┆ yob  ┆ hobby_1 ┆ hobby_2 ┆ hobby_3 ┆ animal_1 ┆ animal_2 │
# │ ---    ┆ ---  ┆ ---     ┆ ---     ┆ ---     ┆ ---      ┆ ---      │
# │ i64    ┆ i64  ┆ str     ┆ str     ┆ str     ┆ str      ┆ str      │
# ╞════════╪══════╪═════════╪═════════╪═════════╪══════════╪══════════╡
# │ 2      ┆ 2000 ┆ Book    ┆ null    ┆ Sport   ┆ Cat      ┆ Dog      │
# │ 1      ┆ 1999 ┆ null    ┆ Movie   ┆ Sport   ┆ null     ┆ Dog      │
# │ 2      ┆ 1998 ┆ null    ┆ Movie   ┆ null    ┆ Cat      ┆ null     │
# └────────┴──────┴─────────┴─────────┴─────────┴──────────┴──────────┘

# Fully numeric (binary-encoded multiselect)
print(survey.get_df(select_dtype="number", multiselect_dtype="number"))
# ┌────────┬──────┬─────────┬─────────┬─────────┬──────────┬──────────┐
# │ gender ┆ yob  ┆ hobby_1 ┆ hobby_2 ┆ hobby_3 ┆ animal_1 ┆ animal_2 │
# │ ---    ┆ ---  ┆ ---     ┆ ---     ┆ ---     ┆ ---      ┆ ---      │
# │ i64    ┆ i64  ┆ i8      ┆ i8      ┆ i8      ┆ i8       ┆ i8       │
# ╞════════╪══════╪═════════╪═════════╪═════════╪══════════╪══════════╡
# │ 2      ┆ 2000 ┆ 1       ┆ 0       ┆ 1       ┆ 1        ┆ 1        │
# │ 1      ┆ 1999 ┆ 0       ┆ 1       ┆ 1       ┆ 0        ┆ 1        │
# │ 2      ┆ 1998 ┆ 0       ┆ 1       ┆ 0       ┆ 1        ┆ 0        │
# └────────┴──────┴─────────┴─────────┴─────────┴──────────┴──────────┘

Updating Survey Metadata

survey.update(
    [
        {"id": "gender", "label": "Please indicate your gender."},
        {"id": "hobby", "value_indices": {"Sport": 1, "Book": 2, "Movie": 3}},
    ]
)
print(survey)
# Survey (4 variables)
#   Variable(id=gender, label=Please indicate your gender., value_indices={'Female': 1, 'Male': 2}, base=3)
#   Variable(id=yob, label=yob, value_indices={}, base=3)
#   Variable(id=hobby, label=hobby, value_indices={'Sport': 1, 'Book': 2, 'Movie': 3}, base=3)
#   Variable(id=animal, label=animal, value_indices={'Cat': 1, 'Dog': 2}, base=3)

Adding, Dropping, Sorting, and Filtering

import polars

# Add a variable (auto-wrapped from polars.Series)
survey.add(polars.Series("region", ["North", "South", "North"]))

# Drop a variable (silently ignored if not found)
survey.drop("region")

# Sort variables in-place
survey.sort()                                      # alphabetical by id
survey.sort(key=lambda v: v.base, reverse=True)    # by response count

# Filter respondents: returns a new Survey, original is not mutated
filtered = survey.filter("hobby", ["Sport", "Book"])
filtered = survey.filter("gender", "Male")         # single value also works

For multiselect variables, filter() keeps a row if any of its selected values appears in the filter list.
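
The any-match rule can be written out in a few lines of plain Python. This is a sketch of the described semantics on made-up rows; keep_row is a hypothetical helper, not the library's code.

```python
def keep_row(selected, wanted):
    """True when at least one selected value appears in the wanted values."""
    if isinstance(wanted, str):
        wanted = [wanted]          # a single filter value also works
    if isinstance(selected, str):
        selected = [selected]      # single-select answers are one-item lists
    return any(value in wanted for value in selected)


hobby = [["Book", "Sport"], ["Movie", "Sport"], ["Movie"]]
gender = ["Male", "Female", "Male"]

# Row indices that survive each filter.
kept_hobby = [i for i, row in enumerate(hobby) if keep_row(row, ["Sport", "Book"])]
kept_gender = [i for i, value in enumerate(gender) if keep_row(value, "Male")]
print(kept_hobby, kept_gender)
```

Under this rule the third respondent is dropped by the hobby filter because "Movie" is their only selection.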


Work with Variables

hobby = survey["hobby"]
print(hobby)
# Variable(id=hobby, label=hobby, value_indices={'Book': 1, 'Movie': 2, 'Sport': 3}, base=3)

Methods & Properties

Method / Property   Description
get_df()            Return variable data as a polars.DataFrame
to_dict()           Serialize variable to a dictionary
replace()           Remap values using a given mapping
.series             Variable data as a polars.Series
.id                 Variable identifier (read/write)
.label              Variable label string (read/write)
.value_indices      Mapping of response values to numeric codes (read/write)
.vtype              Variable type: select, multi_select, or number
.base               Count of valid (non-null) responses
.len                Total number of responses
.dtype              Underlying Polars data type
.frequencies        DataFrame of counts and proportions per value
.sps                SPSS syntax string for this variable

Variable DataFrame Formats

hobby = survey["hobby"]

# Compact (list column)
hobby.get_df("compact")
# shape: (3, 1)
# ┌────────────────────┐
# │ hobby              │
# │ ---                │
# │ list[str]          │
# ╞════════════════════╡
# │ ["Book", "Sport"]  │
# │ ["Movie", "Sport"] │
# │ ["Movie"]          │
# └────────────────────┘

# Wide text (split columns)
hobby.get_df("text")
# shape: (3, 3)
# ┌─────────┬─────────┬─────────┐
# │ hobby_1 ┆ hobby_2 ┆ hobby_3 │
# │ ---     ┆ ---     ┆ ---     │
# │ str     ┆ str     ┆ str     │
# ╞═════════╪═════════╪═════════╡
# │ Book    ┆ null    ┆ Sport   │
# │ null    ┆ Movie   ┆ Sport   │
# │ null    ┆ Movie   ┆ null    │
# └─────────┴─────────┴─────────┘

# Binary-encoded (split columns, 0/1)
hobby.get_df("number")
# shape: (3, 3)
# ┌─────────┬─────────┬─────────┐
# │ hobby_1 ┆ hobby_2 ┆ hobby_3 │
# │ ---     ┆ ---     ┆ ---     │
# │ i8      ┆ i8      ┆ i8      │
# ╞═════════╪═════════╪═════════╡
# │ 1       ┆ 0       ┆ 1       │
# │ 0       ┆ 1       ┆ 1       │
# │ 0       ┆ 1       ┆ 0       │
# └─────────┴─────────┴─────────┘
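
The relationship between the three layouts can be reproduced in plain Python. The sketch below assumes, based on the outputs shown above, that the column suffix is the numeric code from value_indices; to_wide_text and to_wide_binary are hypothetical helpers, not survy functions.

```python
value_indices = {"Book": 1, "Movie": 2, "Sport": 3}
hobby = [["Book", "Sport"], ["Movie", "Sport"], ["Movie"]]


def to_wide_text(rows, value_indices, prefix):
    """One column per option: the option text if selected, otherwise None."""
    ordered = sorted(value_indices.items(), key=lambda item: item[1])
    return {
        f"{prefix}_{index}": [value if value in row else None for row in rows]
        for value, index in ordered
    }


def to_wide_binary(rows, value_indices, prefix):
    """One column per option: 1 if selected, otherwise 0."""
    ordered = sorted(value_indices.items(), key=lambda item: item[1])
    return {
        f"{prefix}_{index}": [1 if value in row else 0 for row in rows]
        for value, index in ordered
    }


wide_text = to_wide_text(hobby, value_indices, "hobby")
wide_binary = to_wide_binary(hobby, value_indices, "hobby")
print(wide_text)
print(wide_binary)
```

Reading the dictionaries column by column gives the same cells as the "text" and "number" tables above.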

Updating Variables

hobby.value_indices = {"Sport": 1, "Book": 2, "Movie": 3}
hobby.label = "Please tell us your hobbies."
print(hobby)
# Variable(id=hobby, label=Please tell us your hobbies., value_indices={'Sport': 1, 'Book': 2, 'Movie': 3}, base=3)

# Remap values: works for both SELECT and MULTISELECT
hobby.replace({"Book": "Reading"})
print(hobby)
# Variable(id=hobby, label=Please tell us your hobbies., value_indices={'Movie': 1, 'Reading': 2, 'Sport': 3}, base=3)

Note: The value_indices setter validates that your mapping covers every value present in the data. If any value is missing, it raises a DataStructureError.
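
The coverage rule in the note can be checked ahead of time with a small helper. This is a sketch of the idea only: missing_values is hypothetical, and the library raises its own DataStructureError rather than returning a list.

```python
def missing_values(rows, value_indices):
    """Return the data values that the proposed mapping does not cover."""
    present = set()
    for row in rows:
        # Multiselect rows are lists; single-select rows are plain values.
        present.update(row if isinstance(row, list) else [row])
    return sorted(present - set(value_indices))


hobby = [["Book", "Sport"], ["Movie", "Sport"], ["Movie"]]

complete = missing_values(hobby, {"Sport": 1, "Book": 2, "Movie": 3})
incomplete = missing_values(hobby, {"Sport": 1, "Book": 2})
print(complete, incomplete)
```

An empty result means the mapping covers every value in the data; a non-empty result lists what would need to be added before assigning it.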


Analyze

Frequency Table

print(survey["gender"].frequencies)
# ┌────────┬───────┬────────────┐
# │ gender ┆ count ┆ proportion │
# │ ---    ┆ ---   ┆ ---        │
# │ str    ┆ u32   ┆ f64        │
# ╞════════╪═══════╪════════════╡
# │ Female ┆ 1     ┆ 0.333333   │
# │ Male   ┆ 2     ┆ 0.666667   │
# └────────┴───────┴────────────┘

Cross-tabulation

The survy.crosstab() function supports count, percent, and numeric aggregations, with optional significance testing and filtering.

Signature:

survy.crosstab(
    column,           # Column variable: the grouping dimension (e.g., gender)
    row,              # Row variable: the analyzed dimension (e.g., hobby)
    filter=None,      # Optional variable to segment into multiple tables
    aggfunc="count",  # "count", "percent", "mean", "median", or "sum"
    alpha=0.05,       # Significance level for statistical tests
)
# Returns: dict[str, polars.DataFrame]
# Key is "Total" when no filter, or each filter-value when filter is provided

Count:

print(survy.crosstab(survey["gender"], survey["hobby"], aggfunc="count"))
# {'Total': shape: (3, 3)
# ┌───────┬──────────┬────────────┐
# │ hobby ┆ Male (A) ┆ Female (B) │
# │ ---   ┆ ---      ┆ ---        │
# │ str   ┆ str      ┆ str        │
# ╞═══════╪══════════╪════════════╡
# │ Book  ┆ 1        ┆ 0          │
# │ Movie ┆ 1        ┆ 1          │
# │ Sport ┆ 1        ┆ 1          │
# └───────┴──────────┴────────────┘}

Percent:

print(survy.crosstab(survey["gender"], survey["hobby"], aggfunc="percent"))
# {'Total': shape: (3, 3)
# ┌───────┬──────────┬────────────┐
# │ hobby ┆ Male (A) ┆ Female (B) │
# │ ---   ┆ ---      ┆ ---        │
# │ str   ┆ str      ┆ str        │
# ╞═══════╪══════════╪════════════╡
# │ Book  ┆ 0.5      ┆ 0.0        │
# │ Movie ┆ 0.5      ┆ 1.0        │
# │ Sport ┆ 0.5      ┆ 1.0        │
# └───────┴──────────┴────────────┘}

Mean (numeric variable):

print(survy.crosstab(survey["gender"], survey["yob"], aggfunc="mean"))
# {'Total': shape: (1, 3)
# ┌─────┬─────────┬─────────┐
# │ yob ┆ Female  ┆ Male    │
# │ --- ┆ ---     ┆ ---     │
# │ str ┆ str     ┆ str     │
# ╞═════╪═════════╪═════════╡
# │ yob ┆ 1999.0  ┆ 1999.0  │
# └─────┴─────────┴─────────┘}

With filter variable (produces one table per filter category):

print(
    survy.crosstab(
        survey["gender"],
        survey["hobby"],
        filter=survey["animal"],
        aggfunc="count",
    )
)
# {'Cat': shape: (3, 2)
# ┌───────┬──────────┐
# │ hobby ┆ Male (A) │
# ...
# 'Dog': shape: (3, 3)
# ┌───────┬──────────┬────────────┐
# │ hobby ┆ Male (A) ┆ Female (B) │
# ...
# └───────┴──────────┴────────────┘}

Significance testing uses a two-proportion z-test for "count"/"percent" and Welch's t-test for numeric aggregations. Significant differences are indicated by column letter labels (e.g. "A", "B").
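
For readers unfamiliar with the two tests, the textbook formulas can be sketched with the standard library. The pooled-variance form of the z-test and the sample numbers are assumptions made for illustration; the exact variant survy applies is not stated here.

```python
import math
from statistics import mean, variance


def two_proportion_z_test(x1, n1, x2, n2):
    """Pooled two-proportion z-test; returns (z, two-sided p-value)."""
    p1, p2 = x1 / n1, x2 / n2
    pooled = (x1 + x2) / (n1 + n2)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
    z = (p1 - p2) / se
    # Two-sided p-value from the standard normal distribution.
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value


def welch_t(sample_a, sample_b):
    """Welch's t statistic and Welch-Satterthwaite degrees of freedom."""
    a = variance(sample_a) / len(sample_a)
    b = variance(sample_b) / len(sample_b)
    t = (mean(sample_a) - mean(sample_b)) / math.sqrt(a + b)
    df = (a + b) ** 2 / (a**2 / (len(sample_a) - 1) + b**2 / (len(sample_b) - 1))
    return t, df


# Hypothetical columns A and B: 45 of 100 versus 30 of 100 chose an option.
z, p_value = two_proportion_z_test(45, 100, 30, 100)
significant = p_value < 0.05  # compare against alpha

# Hypothetical numeric answers for two columns.
t, df = welch_t([1, 2, 3, 4, 5], [2, 4, 6, 8, 10])
print(round(z, 4), round(p_value, 4), significant, round(t, 4), round(df, 4))
```

The z-test helper divides by zero when the pooled proportion is 0 or 1, so a real implementation would need to guard that case. Turning the Welch statistic into a p-value requires a t-distribution, which the standard library does not provide.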


Export

All export methods take a directory path (not a file path) and an optional name parameter for the base filename. Do not pass a full file path like "output/results.csv"; pass the directory and use name=.

CSV / Excel

Writes three files:

  • {name}_data.csv: survey responses (format depends on the compact parameter)
  • {name}_variables_info.csv: variable metadata with columns id, vtype (SINGLE / MULTISELECT / NUMBER), label
  • {name}_values_info.csv: value-to-index mappings with columns id, text, index

The compact parameter (default False) controls how multiselect columns appear in the data file. When False, multiselect variables are expanded to wide columns (hobby_1, hobby_2, ...). When True, values are joined into a single cell ("Book;Sport").

# Default (compact=False): multiselect expanded to wide columns
survey.to_csv("output/", name="results")

# Compact mode: multiselect joined into single cells
survey.to_csv("output/", name="results", compact=True, compact_separator=";")

# Excel: identical API, writes .xlsx files instead
survey.to_excel("output/", name="results")
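
To make the three-file layout concrete, the sketch below derives the file names from a directory and a name, and builds the two metadata tables from plain dictionaries. The column names follow the list above; export_paths, to_csv_text, and the sample metadata are hypothetical, and the real files may differ in detail.

```python
import csv
import io
from pathlib import Path


def export_paths(directory, name):
    """File paths implied by a directory plus a base name."""
    base = Path(directory)
    suffixes = ["data", "variables_info", "values_info"]
    return [base / f"{name}_{suffix}.csv" for suffix in suffixes]


# Hypothetical metadata for three variables.
variables = [
    {"id": "gender", "vtype": "SINGLE", "label": "gender",
     "value_indices": {"Female": 1, "Male": 2}},
    {"id": "yob", "vtype": "NUMBER", "label": "yob", "value_indices": {}},
    {"id": "hobby", "vtype": "MULTISELECT", "label": "hobby",
     "value_indices": {"Book": 1, "Movie": 2, "Sport": 3}},
]

# One row per variable, and one row per (variable, value) pair.
variables_info = [(v["id"], v["vtype"], v["label"]) for v in variables]
values_info = [
    (v["id"], text, index)
    for v in variables
    for text, index in v["value_indices"].items()
]


def to_csv_text(header, rows):
    """Render a header and rows as CSV text in memory."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(header)
    writer.writerows(rows)
    return buffer.getvalue()


paths = export_paths("output", "results")
values_csv = to_csv_text(["id", "text", "index"], values_info)
print([p.name for p in paths])
print(values_csv)
```

NUMBER variables contribute no rows to the value mapping table because their value_indices are empty.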

SPSS

Writes {name}.sav (data) and {name}.sps (syntax). Requires pyreadstat.

survey.to_spss("output/", name="results")

# You can also get the SPSS syntax string directly
print(survey.sps)
# VARIABLE LABELS gender 'gender'.
# VALUE LABELS gender 1 'Female'
# 2 'Male'.
# VARIABLE LEVEL gender (NOMINAL).
# ...

JSON

Writes {name}.json in the same structure that read_json expects. The output includes an extra "vtype" field per variable that read_json ignores on re-read (the type is re-inferred from the data).

survey.to_json("output/", name="results")

🤖 Agent Skill

survy ships with a structured agent skill (SKILL.md), a reference document designed for LLM-based coding assistants like Claude, Copilot, and similar tools. When an AI agent reads this file, it can generate correct survy code without hallucinating parameters, confusing compact and wide formats, or inventing methods that don't exist.

The skill covers the full public API with correct signatures, defaults, and examples, plus a numbered gotchas section addressing the most common mistakes (like passing a file path instead of a directory to export methods, or combining auto_detect with compact_ids).

The skill package includes:

  • SKILL.md: complete API reference with compact-vs-wide format explanation, JSON schema, and gotchas
  • references/api_reference.md: quick-lookup method signatures
  • scripts/validate_survey.py: check a survey file for missing labels and unset value indices
  • scripts/batch_export.py: export a survey to all formats in one pass
  • assets/sample_data.csv / assets/sample_data_compact.csv: sample datasets for testing

Install the agent skill

The skill files are included in the repo under /skills. If your AI tool supports skill installation:

npx skills add https://github.com/hoanghaoha/survy

Or manually copy the skills/ directory into your project's .claude/skills/ or equivalent location.


🧠 Design Philosophy

  • Keep survey logic explicit: variables, labels, and value mappings are first-class objects
  • Treat multiselect questions as a native data type, not a post-processing concern
  • Provide a clean abstraction over high-performance data processing (powered by Polars)

๐Ÿค Contributing

Contributions are welcome! Feel free to open issues or submit pull requests on GitHub.


📄 License

MIT License. See LICENSE for details.


🔗 References

  • Powered by Polars for fast data processing
