Transform survey data (KoboToolbox, LimeSurvey) into DDI-adjacent formats

Project description

This repository is part of the Civic Data Lab's Survey Toolbox. In the Toolbox you'll find information and guidance on everything related to surveys — from design to analysis.

survey2ddi

Bridge the gap between raw survey exports and archival-grade metadata.

Survey platforms like KoboToolbox and LimeSurvey are excellent for data collection, but their raw exports are often difficult to use for long-term archiving or secondary analysis. They frequently lack clear labels, structured metadata, and standardized formats.

survey2ddi transforms these raw exports into a standardized pair of files:

DDI-Codebook 2.5 XML: A machine-readable schema containing all question texts, choice labels, and group structures. Compatible with the qwac question bank.
DDI-Aligned CSV: Clean response data where column headers match the XML variable names exactly, and multi-select questions are expanded into binary indicators.
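To make the multi-select expansion concrete, here is a minimal sketch. It assumes the raw cell holds space-separated choice codes (the XLSForm convention for select_multiple) and uses a hypothetical `<question>_<choice>` column naming scheme; the column names survey2ddi actually emits may differ.

```python
def expand_multi_select(rows, question, choices):
    """Replace one space-separated multi-select column with 0/1 indicator columns."""
    expanded = []
    for row in rows:
        # Copy every column except the raw multi-select one.
        new_row = {key: value for key, value in row.items() if key != question}
        raw = row.get(question)
        answered = bool(raw)
        selected = set(raw.split()) if answered else set()
        for choice in choices:
            # Hypothetical naming scheme: "<question>_<choice>".
            # An unanswered question stays empty instead of becoming all zeros.
            new_row[f"{question}_{choice}"] = int(choice in selected) if answered else ""
        expanded.append(new_row)
    return expanded


rows = [
    {"id": "1", "materials": "wood metal"},
    {"id": "2", "materials": ""},
]
for row in expand_multi_select(rows, "materials", ["wood", "metal", "glass"]):
    print(row)
```

Keeping unanswered rows empty rather than all-zero is a design choice of this sketch, not a documented behaviour of the tool.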

Why use this?

Long-term Archiving: Move away from cryptic CSVs to self-documenting DDI metadata.
Interoperability: Import your survey structure directly into tools like qwac.
Data Analysis: Use the generated XML to automatically apply labels to your data in R, Python, or Stata (see our example notebook).
Reproducibility: Maintain a strict link between your data collection instrument (XLSForm/TSV) and the resulting dataset.
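As a rough illustration of the labelling use case, the sketch below reads variable and category labels from a DDI-Codebook 2.5 document with the standard library. The inline sample XML is made up for this example, and the assumption that labels sit in `var/labl` and `var/catgry` follows the DDI-Codebook schema rather than a confirmed description of survey2ddi's output; the example notebook in the repository remains the reference.

```python
import xml.etree.ElementTree as ET

NS = {"ddi": "ddi:codebook:2_5"}  # DDI-Codebook 2.5 namespace

# Made-up sample document, not actual survey2ddi output.
SAMPLE_XML = """<codeBook xmlns="ddi:codebook:2_5">
  <dataDscr>
    <var name="gender">
      <labl>What is your gender?</labl>
      <catgry><catValu>f</catValu><labl>Female</labl></catgry>
      <catgry><catValu>m</catValu><labl>Male</labl></catgry>
    </var>
    <var name="age">
      <labl>How old are you?</labl>
    </var>
  </dataDscr>
</codeBook>
"""


def read_labels(xml_text):
    """Return (variable labels, value labels per variable) from a DDI codebook."""
    root = ET.fromstring(xml_text)
    var_labels, value_labels = {}, {}
    for var in root.iterfind(".//ddi:var", NS):
        name = var.get("name")
        labl = var.find("ddi:labl", NS)
        if labl is not None and labl.text:
            var_labels[name] = labl.text.strip()
        categories = {}
        for cat in var.iterfind("ddi:catgry", NS):
            value = cat.findtext("ddi:catValu", namespaces=NS)
            label = cat.findtext("ddi:labl", namespaces=NS)
            if value is not None and label is not None:
                categories[value.strip()] = label.strip()
        if categories:
            value_labels[name] = categories
    return var_labels, value_labels


var_labels, value_labels = read_labels(SAMPLE_XML)
print(var_labels)
print(value_labels)
```

The resulting dictionaries can then be used to rename columns or recode values in whichever analysis library you prefer.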

Setup

Requires Python 3.13+. Install with uv:

uv sync

Copy .env.example to .env and fill in your credentials.

Usage

1. Pull data from your platform

First, download the raw responses and structure from your survey platform.

KoboToolbox:

uv run kobo2ddi pull <asset_uid>

LimeSurvey:

uv run limesurvey2ddi pull <survey_id>
# Then export the "Survey structure" (TSV) from LimeSurvey's admin UI
# and place it in the output folder as survey.tsv

The LimeSurvey schema source is the survey-structure TSV (Surveys → Export → Survey structure). XLSForm input is no longer supported on the LimeSurvey side; use the xlsform2lstsv bridge if you author surveys in XLSForm.

2. Transform to DDI + CSV

Once the data is cached locally, generate the standardized outputs.

# KoboToolbox
uv run kobo2ddi transform <asset_uid>

# OR use a raw CSV export from the GUI instead of the API JSON
# (Must be exported with "XML values and headers" and use , or ; as delimiter)
uv run kobo2ddi transform <asset_uid> --data path/to/export.csv --title "My Study"

# LimeSurvey (uses cached responses.json + survey.tsv)
uv run limesurvey2ddi transform <survey_id> --title "My Research Study"

# OR use a raw CSV export from the GUI instead of the API JSON
# (Must be exported with "Question codes" as headers)
uv run limesurvey2ddi transform <survey_id> --data path/to/export.csv --title "My Study"

The output will be saved in output/<id>/ as <id>.xml and <id>.csv.
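Since the CSV headers are meant to match the XML variable names exactly, a quick sanity check on a generated pair can be sketched as follows. The helper name `compare_headers` and the inline sample data are hypothetical, and a comma delimiter is assumed; for real files you would read `output/<id>/<id>.xml` and `output/<id>/<id>.csv` from disk instead.

```python
import csv
import io
import xml.etree.ElementTree as ET

NS = {"ddi": "ddi:codebook:2_5"}  # DDI-Codebook 2.5 namespace


def compare_headers(xml_text, csv_text):
    """Return (names only in the XML, names only in the CSV header)."""
    root = ET.fromstring(xml_text)
    xml_names = {var.get("name") for var in root.iterfind(".//ddi:var", NS)}
    # Only the first CSV row (the header) is needed.
    header = next(csv.reader(io.StringIO(csv_text)), [])
    csv_names = set(header)
    return xml_names - csv_names, csv_names - xml_names


# Made-up sample pair with one deliberate mismatch ("region").
xml_text = """<codeBook xmlns="ddi:codebook:2_5">
  <dataDscr>
    <var name="gender"/>
    <var name="age"/>
    <var name="region"/>
  </dataDscr>
</codeBook>"""
csv_text = "gender,age\nf,34\n"

only_in_xml, only_in_csv = compare_headers(xml_text, csv_text)
print("only in XML:", sorted(only_in_xml))
print("only in CSV:", sorted(only_in_csv))
```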

Metadata only (no responses)

If you only need the DDI codebook — e.g. before the survey is fielded, or to import the structure into qwac — skip the response data entirely. No API call, no CSV output.

# KoboToolbox: from a local form.xlsx
uv run kobo2ddi metadata path/to/form.xlsx --title "My Survey"

# LimeSurvey: survey-structure TSV
uv run limesurvey2ddi metadata path/to/survey.tsv --title "My Survey"

Output XML lands next to the schema file (override with -o).

Examples

The examples/ directory contains a basic example of the generated output files and a Jupyter notebook showing how to use them for analysis.

To run the example notebook:

uv sync --group notebook
cd examples/basic
uv run jupyter notebook analysis_example.ipynb

Running tests

uv run pytest

Tests include XSD validation of generated XML against the official DDI-Codebook 2.5 schema (requires xmllint, auto-skipped if not available).
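The skip-if-missing behaviour can be approximated like this. It is a sketch, not the repository's actual test code: the helper name `validate_with_xmllint` and its `binary` parameter are made up here, while the xmllint flags are the ones used elsewhere in this README.

```python
import shutil
import subprocess


def validate_with_xmllint(xml_path, xsd_path, binary="xmllint"):
    """Return True if valid, False if invalid, None if xmllint is not installed."""
    exe = shutil.which(binary)
    if exe is None:
        # Caller decides what to do; a pytest test would typically skip here.
        return None
    result = subprocess.run(
        [exe, "--noout", "--schema", str(xsd_path), str(xml_path)],
        capture_output=True,
        text=True,
    )
    # xmllint exits with 0 when the document validates against the schema.
    return result.returncode == 0
```

In a pytest test, a `None` result would map to `pytest.skip("xmllint not available")` and a `False` result to a failed assertion.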

Integration tests (Schematron)

Schematron rules go beyond the XSD (uniqueness of IDs, consistency between variable groups and their members, _other conventions, etc.). They live in qwacback and are checked by a Java worker exposed behind POST /api/validate. The integration suite boots that stack via docker compose and posts generated XML to it.

uv run pytest -m integration

Requires Docker. The session fixture pulls ghcr.io/correlaid/qwacback{,-schematron-worker}:latest, waits for readiness, then tears the stack down. Cold start is ~20-30s; subsequent tests in the same session are ~1s.

To iterate faster, keep the stack up and point tests at it:

docker compose -f tests/integration/docker-compose.validate.yml up -d --wait
S2D_VALIDATE_URL=http://127.0.0.1:8090 uv run pytest -m integration

Pin to a specific qwacback build with QWACBACK_TAG=sha-abc1234 (or a semver like 0.1.0). Change PB_PORT if 8090 is taken.
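For orientation, here is a hedged sketch of how a client might address the validation endpoint. Only the `S2D_VALIDATE_URL` variable, the port 8090, and the `POST /api/validate` route come from this README; the fallback URL, the raw-XML request body, and the `application/xml` content type are assumptions, and the response format is not documented here, so it is left uninterpreted.

```python
import os
import urllib.request

# Assumed fallback; the real fixture boots its own stack when the variable is unset.
DEFAULT_URL = "http://127.0.0.1:8090"


def build_validate_request(xml_text, env=None):
    """Build (but do not send) a POST request for the validation endpoint."""
    env = os.environ if env is None else env
    base_url = env.get("S2D_VALIDATE_URL", DEFAULT_URL).rstrip("/")
    return urllib.request.Request(
        f"{base_url}/api/validate",
        data=xml_text.encode("utf-8"),
        headers={"Content-Type": "application/xml"},  # assumed content type
        method="POST",
    )


request = build_validate_request("<codeBook/>", env={"S2D_VALIDATE_URL": "http://127.0.0.1:8090"})
print(request.get_method(), request.full_url)
```

Sending it is a matter of `urllib.request.urlopen(request)` once the stack is up.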

Validating XML manually

xmllint --noout --schema tests/schemas/codebook.xsd output/<id>/<id>.xml

The schema files in tests/schemas/ are the official DDI-Codebook 2.5 XSD from the DDI Alliance.

Releasing to PyPI

Releases are published by /.github/workflows/publish.yml, which runs on any tag matching v*. Authentication is via PyPI Trusted Publishers (OIDC) — no API token is stored in the repo.

To cut a new release:

# 1. Bump version in pyproject.toml
# 2. Commit and tag
git add pyproject.toml
git commit -m "chore: release vX.Y.Z"
git tag vX.Y.Z
git push origin main --tags

The workflow builds with uv build, runs the test suite, and publishes the resulting sdist + wheel to https://pypi.org/project/survey2ddi/. Versions are hand-bumped — no automated semantic-release.

One-time setup (already done for this repo):

PyPI → account → Publishing → add Trusted Publisher pointing at CorrelAid/survey2ddi workflow publish.yml in environment pypi.
GitHub → repo settings → Environments → create environment pypi.

Known limitations

Multi-language forms: Only the first label::* column in the XLSForm is used. For bilingual forms, place the preferred language column first.
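The column selection described above amounts to something like the following sketch. Treating a plain `label` column as a candidate too is an assumption of this example, and `first_label_column` is a hypothetical name.

```python
def first_label_column(headers):
    """Return the first label column in sheet order, or None if there is none."""
    for header in headers:
        # Assumption: a plain "label" column counts alongside "label::<language>".
        if header == "label" or header.startswith("label::"):
            return header
    return None


headers = ["type", "name", "label::Deutsch (de)", "label::English (en)"]
print(first_label_column(headers))
```

Reordering the columns in the XLSForm is therefore enough to switch the language used in the output.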

Repeat groups: Variables inside begin_repeat/end_repeat blocks are silently skipped. KoboToolbox stores repeat data as nested arrays, which require a different data model; this is not currently supported.
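To see which rows of a form would be affected, a depth counter over the survey sheet is enough. This is an illustrative sketch with a hypothetical helper name, not the library's own filter; it only recognises the underscore spellings `begin_repeat` and `end_repeat`.

```python
def rows_outside_repeats(survey_rows):
    """Keep only survey rows that are not inside a begin_repeat/end_repeat block."""
    depth = 0
    kept = []
    for row in survey_rows:
        row_type = (row.get("type") or "").strip()
        if row_type == "begin_repeat":
            depth += 1
        elif row_type == "end_repeat":
            depth = max(depth - 1, 0)
        elif depth == 0:
            kept.append(row)
    return kept


survey_rows = [
    {"type": "text", "name": "household_id"},
    {"type": "begin_repeat", "name": "members"},
    {"type": "integer", "name": "member_age"},
    {"type": "end_repeat", "name": ""},
    {"type": "text", "name": "comments"},
]
print([row["name"] for row in rows_outside_repeats(survey_rows)])
```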

Plain groups in DDI XML: begin_group/end_group blocks without appearance="table-list" are not emitted as <varGrp> in the XML — their variables appear as standalone <var> elements. Groups with appearance="table-list" become <varGrp type="grid">.

LimeSurvey select_multiple bracket keys: LimeSurvey truncates option codes to 5 characters in its export (e.g. metall → metal). The transform recovers the original code via prefix matching. This fails if two choice codes share the same first 5 characters; a ValueError is raised in that case. The recovery also fails without raising an error (only a warning is emitted) if LimeSurvey uses internal answer codes that have no relation to the schema's choice names.
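The prefix-matching recovery can be sketched as below. The function name is hypothetical and the real implementation may differ in detail; the sketch mirrors the two failure modes described above by raising ValueError on an ambiguous prefix and returning None when nothing matches, leaving the warning to the caller.

```python
def recover_choice_code(truncated, choice_names):
    """Map a truncated export code back to the full choice name via prefix match."""
    matches = [name for name in choice_names if name.startswith(truncated)]
    if len(matches) > 1:
        raise ValueError(
            f"Truncated code {truncated!r} is ambiguous: {sorted(matches)}"
        )
    # None signals "no related choice name"; a caller would warn here.
    return matches[0] if matches else None


print(recover_choice_code("metal", ["holz", "metall", "glas"]))
```

Choosing choice codes that differ within their first 5 characters avoids the ambiguity altogether.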

Design

The transform and DDI XML modules (kobo2ddi/transform.py, kobo2ddi/ddi_xml.py) are schema-source-agnostic — they consume parsed survey rows + choices regardless of where they came from. The LimeSurvey adapter (limesurvey2ddi/lstsv.py) parses LimeSurvey's survey-structure TSV into the same shape; limesurvey2ddi/transform.py then normalises LimeSurvey's response export quirks (underscore stripping, select_multiple sub-columns) before passing data to the same core functions.
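As a loose illustration of the underscore normalisation, the sketch below maps export columns back to schema variable names. It assumes the quirk is that LimeSurvey export columns lack the underscores present in the schema names; the helper name and the sample names are made up, and the actual logic in limesurvey2ddi/transform.py may work differently.

```python
def match_columns(schema_names, export_columns):
    """Map export column names to schema names, ignoring underscores."""
    by_stripped = {}
    for name in schema_names:
        # If two schema names collide after stripping, the first one wins.
        by_stripped.setdefault(name.replace("_", ""), name)
    return {col: by_stripped[col] for col in export_columns if col in by_stripped}


mapping = match_columns(
    ["age_group", "hh_size"],
    ["agegroup", "hhsize", "submitdate"],
)
print(mapping)
```

Columns without a schema counterpart, such as platform metadata, are simply left out of the mapping.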

As a Python library

# KoboToolbox
from pathlib import Path

from kobo2ddi.client import KoboClient
from kobo2ddi.data import build_data_csv
from kobo2ddi.ddi_xml import build_ddi_xml
from kobo2ddi.transform import parse_xlsform, extract_variables

client = KoboClient()
asset = client.get_asset("your_asset_uid")
submissions = client.get_submissions("your_asset_uid")
survey_rows, choices, settings = parse_xlsform(Path("form.xlsx"))
xml_string = build_ddi_xml(asset["name"], survey_rows, choices, settings, submissions)

variables = extract_variables(survey_rows, choices)
csv_string = build_data_csv(variables, submissions)

# LimeSurvey
from pathlib import Path

from limesurvey2ddi.transform import build_data_csv, build_ddi_xml

responses = [...] # from LimeSurvey API
xml_string = build_ddi_xml("My Survey", Path("survey.tsv"), responses)
csv_string = build_data_csv(Path("survey.tsv"), responses)

Project details

Maintainers

CorrelAid

Release history

0.3.0 (May 12, 2026)
0.2.0 (May 12, 2026), this version
0.1.0 (May 12, 2026)

Download files

Source Distribution

survey2ddi-0.2.0.tar.gz (229.1 kB)

Uploaded May 12, 2026 Source

Built Distribution

survey2ddi-0.2.0-py3-none-any.whl (28.5 kB)

Uploaded May 12, 2026 Python 3

File details

Details for the file survey2ddi-0.2.0.tar.gz.

File metadata

Download URL: survey2ddi-0.2.0.tar.gz
Upload date: May 12, 2026
Size: 229.1 kB
Tags: Source
Uploaded using Trusted Publishing? Yes
Uploaded via: twine/6.1.0 CPython/3.13.12

File hashes

Hashes for survey2ddi-0.2.0.tar.gz
Algorithm	Hash digest
SHA256	`a6fabb745b2b1e4694eef18f300edac0c63315d3de1d235abd01681ea402eec4`
MD5	`90e27202689e578ceac039ced9a73653`
BLAKE2b-256	`25813ede4522f3d56ed3d2693a097844f1c669ec49feb60083a6c9ef2db8cc6f`

Provenance

The following attestation bundles were made for survey2ddi-0.2.0.tar.gz:

Publisher: publish.yml on CorrelAid/survey2ddi

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

Statement:
- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: survey2ddi-0.2.0.tar.gz
- Subject digest: a6fabb745b2b1e4694eef18f300edac0c63315d3de1d235abd01681ea402eec4
- Sigstore transparency entry: 1518609180
- Sigstore integration time: May 12, 2026
Source repository:
- Permalink: CorrelAid/survey2ddi@eb551f1f383045503e62c1e3660b13481432a127
- Branch / Tag: refs/tags/v0.2.0
- Owner: https://github.com/CorrelAid
- Access: public
Publication detail:
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: publish.yml@eb551f1f383045503e62c1e3660b13481432a127
- Trigger Event: push

File details

Details for the file survey2ddi-0.2.0-py3-none-any.whl.

File metadata

Download URL: survey2ddi-0.2.0-py3-none-any.whl
Upload date: May 12, 2026
Size: 28.5 kB
Tags: Python 3
Uploaded using Trusted Publishing? Yes
Uploaded via: twine/6.1.0 CPython/3.13.12

File hashes

Hashes for survey2ddi-0.2.0-py3-none-any.whl
Algorithm	Hash digest
SHA256	`e096b3852a25a09401dc708a3879346f2ea60ff2c8401f246585a3641cc11ef0`
MD5	`567d2ea8e5bf5b0841e0bf4c40ad03c4`
BLAKE2b-256	`16c3e101d62a9c4c9b1cd49e35d4ce1b0c24108e5865379cc415d7e90f8994e0`

Provenance

The following attestation bundles were made for survey2ddi-0.2.0-py3-none-any.whl:

Publisher: publish.yml on CorrelAid/survey2ddi

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

Statement:
- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: survey2ddi-0.2.0-py3-none-any.whl
- Subject digest: e096b3852a25a09401dc708a3879346f2ea60ff2c8401f246585a3641cc11ef0
- Sigstore transparency entry: 1518609247
- Sigstore integration time: May 12, 2026
Source repository:
- Permalink: CorrelAid/survey2ddi@eb551f1f383045503e62c1e3660b13481432a127
- Branch / Tag: refs/tags/v0.2.0
- Owner: https://github.com/CorrelAid
- Access: public
Publication detail:
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: publish.yml@eb551f1f383045503e62c1e3660b13481432a127
- Trigger Event: push
