Transform survey data (KoboToolbox, LimeSurvey) into DDI-adjacent formats
This repository is part of the Civic Data Lab's Survey Toolbox. In the Toolbox you'll find information and guidance on everything related to surveys — from design to analysis.
survey2ddi
Bridge the gap between raw survey exports and archival-grade metadata.
Survey platforms like KoboToolbox and LimeSurvey are excellent for data collection, but their raw exports are often difficult to use for long-term archiving or secondary analysis. They frequently lack clear labels, structured metadata, and standardized formats.
survey2ddi transforms these raw exports into a standardized pair of files:
- DDI-Codebook 2.5 XML: A machine-readable schema containing all question texts, choice labels, and group structures. Compatible with the qwac question bank.
- DDI-Aligned CSV: Clean response data where column headers match the XML variable names exactly, and multi-select questions are expanded into binary indicators.
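To make the "binary indicators" idea concrete, here is a minimal sketch of how a multi-select answer could be expanded. It assumes the raw answer is a space-separated string of choice codes, and the `<question>_<choice>` column naming is purely illustrative; the function `expand_multi_select` is hypothetical and not part of survey2ddi, whose actual column names may differ.

```python
def expand_multi_select(rows, question, choices):
    """Expand a space-separated multi-select answer into 0/1 indicator columns."""
    expanded = []
    for row in rows:
        # An empty or missing answer yields an empty selection set.
        selected = set((row.get(question) or "").split())
        new_row = {key: value for key, value in row.items() if key != question}
        for choice in choices:
            # Hypothetical naming scheme: <question>_<choice>
            new_row[f"{question}_{choice}"] = 1 if choice in selected else 0
        expanded.append(new_row)
    return expanded


rows = [
    {"id": "1", "fruits": "apple pear"},
    {"id": "2", "fruits": ""},
]
expanded = expand_multi_select(rows, "fruits", ["apple", "banana", "pear"])
```

Note that this sketch codes an unanswered question as all zeros; a real pipeline may want to distinguish "nothing selected" from "not answered".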
Why use this?
- Long-term Archiving: Move away from cryptic CSVs to self-documenting DDI metadata.
- Interoperability: Import your survey structure directly into tools like qwac.
- Data Analysis: Use the generated XML to automatically apply labels to your data in R, Python, or Stata (see our example notebook).
- Reproducibility: Maintain a strict link between your data collection instrument (XLSForm/TSV) and the resulting dataset.
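As a rough illustration of the data-analysis point, the sketch below reads question and value labels out of a DDI-Codebook 2.5 document using only the standard library. The sample XML is hand-written for this example and is not actual survey2ddi output; it assumes labels live in `var/labl` and `var/catgry` elements in the `ddi:codebook:2_5` namespace. The helper `read_labels` is hypothetical.

```python
import xml.etree.ElementTree as ET

NS = {"ddi": "ddi:codebook:2_5"}

# Hand-written sample, not generated by survey2ddi.
SAMPLE_XML = """<codeBook xmlns="ddi:codebook:2_5">
  <dataDscr>
    <var name="gender">
      <labl>What is your gender?</labl>
      <catgry><catValu>f</catValu><labl>Female</labl></catgry>
      <catgry><catValu>m</catValu><labl>Male</labl></catgry>
    </var>
  </dataDscr>
</codeBook>"""


def read_labels(xml_text):
    """Return ({var: question label}, {var: {code: choice label}})."""
    root = ET.fromstring(xml_text)
    question_labels, value_labels = {}, {}
    for var in root.iterfind(".//ddi:var", NS):
        name = var.get("name")
        question_labels[name] = var.findtext("ddi:labl", default="", namespaces=NS)
        value_labels[name] = {
            cat.findtext("ddi:catValu", default="", namespaces=NS):
                cat.findtext("ddi:labl", default="", namespaces=NS)
            for cat in var.iterfind("ddi:catgry", NS)
        }
    return question_labels, value_labels


question_labels, value_labels = read_labels(SAMPLE_XML)
# Replace raw codes with labels; unknown codes are left unchanged.
labelled = [value_labels["gender"].get(code, code) for code in ["f", "m", "x"]]
```

The same mapping can be applied to a data frame column in whatever analysis library you use.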
Setup
Requires Python 3.13+. Install with uv:
uv sync
Copy .env.example to .env and fill in your credentials.
Usage
1. Pull data from your platform
First, download the raw responses and structure from your survey platform.
KoboToolbox:
uv run kobo2ddi pull <asset_uid>
LimeSurvey:
uv run limesurvey2ddi pull <survey_id>
# Then export the "Survey structure" (TSV) from LimeSurvey's admin UI
# and place it in the output folder as survey.tsv
The LimeSurvey schema source is the survey-structure TSV (Surveys → Export → Survey structure). XLSForm input is no longer supported on the LimeSurvey side; if you author surveys in XLSForm, use the xlsform2lstsv bridge.
2. Transform to DDI + CSV
Once the data is cached locally, generate the standardized outputs.
# KoboToolbox
uv run kobo2ddi transform <asset_uid>
# OR use a raw CSV export from the GUI instead of the API JSON
# (Must be exported with "XML values and headers" and use , or ; as delimiter)
uv run kobo2ddi transform <asset_uid> --data path/to/export.csv --title "My Study"
# LimeSurvey (uses cached responses.json + survey.tsv)
uv run limesurvey2ddi transform <survey_id> --title "My Research Study"
# OR use a raw CSV export from the GUI instead of the API JSON
# (Must be exported with "Question codes" as headers)
uv run limesurvey2ddi transform <survey_id> --data path/to/export.csv --title "My Study"
The output will be saved in output/<id>/ as <id>.xml and <id>.csv.
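Since the CSV headers are meant to match the XML variable names, a quick sanity check on the two output files might look like the following sketch. It assumes a comma-delimited CSV and variables declared as `var` elements with a `name` attribute in the `ddi:codebook:2_5` namespace; `header_mismatches` is a hypothetical helper, and the sample strings are invented for illustration.

```python
import csv
import io
import xml.etree.ElementTree as ET

NS = {"ddi": "ddi:codebook:2_5"}


def header_mismatches(xml_text, csv_text):
    """Return (columns missing from the XML, variables missing from the CSV)."""
    root = ET.fromstring(xml_text)
    var_names = {var.get("name") for var in root.iterfind(".//ddi:var", NS)}
    # Only the first CSV row (the header) is needed.
    header = next(csv.reader(io.StringIO(csv_text)), [])
    columns = set(header)
    return sorted(columns - var_names), sorted(var_names - columns)


# Invented sample data; "region" is deliberately absent from the XML.
SAMPLE_XML = (
    '<codeBook xmlns="ddi:codebook:2_5"><dataDscr>'
    '<var name="age"/><var name="gender"/>'
    "</dataDscr></codeBook>"
)
SAMPLE_CSV = "age,gender,region\n34,f,north\n"

extra_columns, missing_columns = header_mismatches(SAMPLE_XML, SAMPLE_CSV)
```

For real files, pass in the text of output/<id>/<id>.xml and output/<id>/<id>.csv instead of the sample strings.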
Metadata only (no responses)
If you only need the DDI codebook — e.g. before the survey is fielded, or to import the structure into qwac — skip the response data entirely. No API call, no CSV output.
# KoboToolbox: from a local form.xlsx
uv run kobo2ddi metadata path/to/form.xlsx --title "My Survey"
# LimeSurvey: survey-structure TSV
uv run limesurvey2ddi metadata path/to/survey.tsv --title "My Survey"
Output XML lands next to the schema file (override with -o).
Examples
The examples/ directory contains a basic example of the generated output and a Jupyter notebook showing how to use it for analysis.
To run the example notebook:
uv sync --group notebook
cd examples/basic
uv run jupyter notebook analysis_example.ipynb
Running tests
uv run pytest
Tests include XSD validation of generated XML against the official DDI-Codebook 2.5 schema (requires xmllint, auto-skipped if not available).
Integration tests (Schematron)
Schematron rules go beyond the XSD (uniqueness of IDs, consistency between variable groups and their members, _other conventions, etc.). They live in qwacback and are checked by a Java worker exposed behind POST /api/validate. The integration suite boots that stack via docker compose and posts generated XML to it.
uv run pytest -m integration
Requires Docker. The session fixture pulls ghcr.io/correlaid/qwacback{,-schematron-worker}:latest, waits for readiness, then tears the stack down. Cold start is ~20-30s; subsequent tests in the same session are ~1s.
To iterate faster, keep the stack up and point tests at it:
docker compose -f tests/integration/docker-compose.validate.yml up -d --wait
S2D_VALIDATE_URL=http://127.0.0.1:8090 uv run pytest -m integration
Pin to a specific qwacback build with QWACBACK_TAG=sha-abc1234 (or a semver like 0.1.0). Change PB_PORT if 8090 is taken.
Validating XML manually
xmllint --noout --schema tests/schemas/codebook.xsd output/<id>/<id>.xml
The schema files in tests/schemas/ are the official DDI-Codebook 2.5 XSD from the DDI Alliance.
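If you want to run the same check from Python, one possible wrapper around the xmllint command is sketched below. It mirrors the skip-if-unavailable behaviour described for the test suite, but it is not the project's own test code; `xmllint_command` and `validate` are hypothetical names.

```python
import shutil
import subprocess


def xmllint_command(xsd_path, xml_path):
    """Build the same xmllint invocation as the manual command."""
    return ["xmllint", "--noout", "--schema", str(xsd_path), str(xml_path)]


def validate(xsd_path, xml_path):
    """Return True/False for valid/invalid, or None if xmllint is not installed."""
    if shutil.which("xmllint") is None:
        return None
    result = subprocess.run(
        xmllint_command(xsd_path, xml_path), capture_output=True, text=True
    )
    # xmllint exits with status 0 when the document validates.
    return result.returncode == 0


command = xmllint_command("tests/schemas/codebook.xsd", "output/demo/demo.xml")
```

The `output/demo/demo.xml` path is a placeholder; substitute your own `<id>`.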
Releasing to PyPI
Releases are published by /.github/workflows/publish.yml, which runs on any tag matching v*. Authentication is via PyPI Trusted Publishers (OIDC) — no API token is stored in the repo.
To cut a new release:
# 1. Bump version in pyproject.toml
# 2. Commit and tag
git add pyproject.toml
git commit -m "chore: release vX.Y.Z"
git tag vX.Y.Z
git push origin main --tags
The workflow builds with uv build, runs the test suite, and publishes the resulting sdist + wheel to https://pypi.org/project/survey2ddi/. Versions are hand-bumped — no automated semantic-release.
One-time setup (already done for this repo):
- PyPI → account → Publishing → add Trusted Publisher pointing at CorrelAid/survey2ddi workflow publish.yml in environment pypi.
- GitHub → repo settings → Environments → create environment pypi.
Known limitations
Multi-language forms: Only the first label::* column in the XLSForm is used. For bilingual forms, place the preferred language column first.
Repeat groups: Variables inside begin_repeat/end_repeat blocks are silently skipped. KoboToolbox stores repeat data as nested arrays which require a different data model; this is not currently supported.
Plain groups in DDI XML: begin_group/end_group blocks without appearance="table-list" are not emitted as <varGrp> in the XML — their variables appear as standalone <var> elements. Groups with appearance="table-list" become <varGrp type="grid">.
LimeSurvey select_multiple bracket keys: LimeSurvey truncates option codes to 5 characters in its export (e.g. metall → metal). The transform recovers the original code via prefix matching. This fails if two choice codes share the same first 5 characters; a ValueError is raised in that case. It also fails without raising an error (only a warning is emitted) if LimeSurvey uses internal answer codes that have no relation to the schema's choice names.
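The prefix-matching recovery for truncated LimeSurvey codes can be pictured with a sketch like the following. It is an illustration of the behaviour described in this limitation, not the project's actual implementation; `recover_choice_code` and `TRUNCATED_LENGTH` are hypothetical names.

```python
import warnings

TRUNCATED_LENGTH = 5  # LimeSurvey's truncation length, as stated in the limitation


def recover_choice_code(truncated, choice_names):
    """Map a truncated export code back to the full choice name."""
    matches = [
        name for name in choice_names if name[:TRUNCATED_LENGTH] == truncated
    ]
    if len(matches) > 1:
        # Two choices share the same first 5 characters: cannot disambiguate.
        raise ValueError(f"Ambiguous truncated code {truncated!r}: {matches}")
    if not matches:
        # Unrelated internal answer code: warn and give up on this key.
        warnings.warn(f"No choice matches truncated code {truncated!r}")
        return None
    return matches[0]


recovered = recover_choice_code("metal", ["metall", "holz"])
```

Calling it with `["metall", "metalx"]` as the choice names would raise the ValueError, since both truncate to `metal`.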
Design
The transform and DDI XML modules (kobo2ddi/transform.py, kobo2ddi/ddi_xml.py) are schema-source-agnostic — they consume parsed survey rows + choices regardless of where they came from. The LimeSurvey adapter (limesurvey2ddi/lstsv.py) parses LimeSurvey's survey-structure TSV into the same shape; limesurvey2ddi/transform.py then normalises LimeSurvey's response export quirks (underscore stripping, select_multiple sub-columns) before passing data to the same core functions.
As a Python library
# KoboToolbox
from pathlib import Path
from kobo2ddi.client import KoboClient
from kobo2ddi.data import build_data_csv
from kobo2ddi.ddi_xml import build_ddi_xml
from kobo2ddi.transform import parse_xlsform, extract_variables
client = KoboClient()
asset = client.get_asset("your_asset_uid")
submissions = client.get_submissions("your_asset_uid")
survey_rows, choices, settings = parse_xlsform(Path("form.xlsx"))
xml_string = build_ddi_xml(asset["name"], survey_rows, choices, settings, submissions)
variables = extract_variables(survey_rows, choices)
csv_string = build_data_csv(variables, submissions)
# LimeSurvey
from pathlib import Path
from limesurvey2ddi.transform import build_data_csv, build_ddi_xml
responses = [...] # from LimeSurvey API
xml_string = build_ddi_xml("My Survey", Path("survey.tsv"), responses)
csv_string = build_data_csv(Path("survey.tsv"), responses)