DDI to Knowledge Graph toolkit - transform DDI metadata into graph databases (Neo4j, RDF, Gremlin, NetworkX)
ddigraph
A modern Python toolkit that transforms DDI (Data Documentation Initiative) XML metadata into knowledge graphs. Supports DDI Codebook and DDI-L FragmentInstance formats with streaming parsing, batched writes, and full async I/O across multiple graph backends.
Documentation | Getting Started | PyPI | Source Code
Features
- Multi-backend support -- Neo4j, RDF/SPARQL, Gremlin, NetworkX, and pandas
- Streaming XML processing -- Memory-bounded iterparse for files of any size
- Batched writes -- UNWIND-based Cypher for 10-100x fewer database round trips
- Async I/O -- Concurrent parsing and writing with back-pressure control
- Format auto-detection -- Automatically identifies DDI Codebook vs Lifecycle format
- Unified schema -- Single source of truth for all node and relationship definitions
- Adapter pattern -- Plug in custom graph backends via the GraphWriteAdapter protocol
- Production-ready -- Retry logic, observability hooks, pydantic-based configuration
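To make the streaming and batching bullets concrete, here is a minimal standard-library sketch of the general technique. It is not ddigraph's internal code: the `stream_rows` and `batched` helpers, the Cypher text, and the un-namespaced sample document are all assumptions made for illustration.

```python
import io
import xml.etree.ElementTree as ET
from itertools import islice
from typing import Iterable, Iterator

# Hypothetical Cypher statement; the label and property names are illustrative.
# One statement handles a whole list of rows, so N rows cost one round trip.
UNWIND_QUERY = (
    "UNWIND $rows AS row "
    "MERGE (v:Variable {id: row.id}) "
    "SET v.label = row.label"
)


def stream_rows(source, tag: str) -> Iterator[dict]:
    """Yield one dict per matching element without building the full tree."""
    for _event, elem in ET.iterparse(source, events=("end",)):
        if elem.tag == tag:
            yield {"id": elem.get("ID"), "label": elem.findtext("labl", default="")}
            # Release the element's children and text once it has been read.
            elem.clear()


def batched(rows: Iterable[dict], size: int) -> Iterator[list[dict]]:
    """Group an iterable of rows into lists of at most `size` items."""
    it = iter(rows)
    while chunk := list(islice(it, size)):
        yield chunk


# Tiny Codebook-like sample (namespaces omitted to keep the sketch short).
SAMPLE = (
    b"<codeBook><dataDscr>"
    b'<var ID="V1"><labl>Age</labl></var>'
    b'<var ID="V2"><labl>Sex</labl></var>'
    b'<var ID="V3"><labl>Region</labl></var>'
    b"</dataDscr></codeBook>"
)

batches = list(batched(stream_rows(io.BytesIO(SAMPLE), "var"), size=2))
print([len(b) for b in batches])  # [2, 1]
```

Each batch would then be sent as the `rows` parameter of a single query, so the number of round trips drops from one per row to one per batch.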
Quick Start
Install
pip install ddigraph
Load DDI metadata (CLI)
# Set Neo4j connection
export DDIGRAPH_NEO4J_URI=bolt://localhost:7687
export DDIGRAPH_NEO4J_USER=neo4j
export DDIGRAPH_NEO4J_PASSWORD=secret
# Bootstrap schema and load data (format is auto-detected)
ddigraph bootstrap
ddigraph load survey.xml --dataset-id my-survey
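A small preflight check can catch a missing connection variable before the CLI is run. The sketch below uses only the standard library and the three variable names shown above; the `missing_settings` helper is hypothetical and is not part of ddigraph.

```python
import os

# The three variables exported in the CLI example above.
REQUIRED = (
    "DDIGRAPH_NEO4J_URI",
    "DDIGRAPH_NEO4J_USER",
    "DDIGRAPH_NEO4J_PASSWORD",
)


def missing_settings(env=None) -> list[str]:
    """Return the required variable names that are unset or empty."""
    env = os.environ if env is None else env
    return [name for name in REQUIRED if not env.get(name)]


missing = missing_settings()
if missing:
    print("Missing environment variables:", ", ".join(missing))
```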
Load DDI metadata (Python)
import asyncio

from neo4j import AsyncGraphDatabase

from ddigraph import DDILoader, DDIFragmentLoader, detect_ddi_format
from ddigraph.config import Settings


async def main():
    settings = Settings()
    driver = AsyncGraphDatabase.driver(
        settings.neo4j_uri,
        auth=(settings.neo4j_user, settings.neo4j_password.get_secret_value()),
    )
    try:
        path = "survey.xml"
        # Pick the loader that matches the detected DDI flavor.
        if detect_ddi_format(path) == "lifecycle":
            loader = DDIFragmentLoader(driver, settings=settings)
            result = await loader.load(path)
        else:
            loader = DDILoader(driver, settings=settings)
            result = await loader.load(path, dataset_id="my-survey")
        print(result)  # e.g. {'Instrument': 1, 'Sequence': 388, 'QuestionItem': 373, ...}
    finally:
        # Close the driver even if loading fails.
        await driver.close()


asyncio.run(main())
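The branch on detect_ddi_format relies on telling the two flavors apart cheaply. One common way to do that is to look at the root element only. The sketch below illustrates the idea with the standard library; it is not the library's implementation, and both the `guess_format` helper and its "codebook" and "unknown" return values are assumptions (only "lifecycle" appears in the example above).

```python
import io
import xml.etree.ElementTree as ET


def sniff_root(source) -> tuple[str, str]:
    """Return (namespace, local_name) of the root, stopping at the first start event."""
    for _event, elem in ET.iterparse(source, events=("start",)):
        tag = elem.tag
        if tag.startswith("{"):
            # ElementTree encodes qualified names as "{namespace}local".
            namespace, _, local = tag[1:].partition("}")
            return namespace, local
        return "", tag
    raise ValueError("document has no root element")


def guess_format(source) -> str:
    """Classify a document by the local name of its root element."""
    _namespace, local = sniff_root(source)
    if local == "codeBook":
        return "codebook"
    if local == "FragmentInstance":
        return "lifecycle"
    return "unknown"


print(guess_format(io.BytesIO(b'<codeBook xmlns="ddi:codebook:2_5"/>')))
print(guess_format(io.BytesIO(b'<FragmentInstance xmlns="ddi:instance:3_3"/>')))
```

Matching on the local name rather than the full namespace keeps the check independent of the minor schema version.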
Supported Formats
| Format | Description | Use Case |
|---|---|---|
| DDI Codebook | Traditional flat format with central Dataset node | Survey archives, data catalogs |
| DDI-L FragmentInstance | Lifecycle 3.x format with reusable fragments | Questionnaire design, CAPI/CAWI instruments |
| DDI-CDI 1.0 | Cross-Domain Integration metadata | Data integration, statistical production |
XSD Coverage
ddigraph ships with 100 % coverage of every concrete identifiable element
declared in the bundled XSD schemas (schemas/). Coverage is enforced by the
audit script and a pytest guardrail so new schema releases surface any gaps:
| Flavor | Scope | Target | Covered |
|---|---|---|---|
| DDI-L 3.x | Concrete Maintainable + Versionable + Identifiable elements | 189 | 100 % |
| DDI-C 2.x | Codebook elements with the GLOBALS attribute group (no layout tags) | 73 | 100 % |
| DDI-CDI 1.0 | Concrete top-level entity elements (associations excluded) | 210 | 100 % |
Run python scripts/xsd_coverage.py to regenerate the audit or
python scripts/xsd_coverage.py --json for machine-readable output.
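The idea behind such an audit can be sketched in a few lines: collect the concrete element names a schema declares, then compare them with the names the toolkit handles. This is an illustration under simplifying assumptions, not the logic of scripts/xsd_coverage.py. The helper names and the inline schema are hypothetical, and the real scoping rules (Maintainable, Versionable, GLOBALS, and so on) are not modeled.

```python
import xml.etree.ElementTree as ET

XS = "{http://www.w3.org/2001/XMLSchema}"

# Inline stand-in for a bundled schema file.
SAMPLE_XSD = """
<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema">
  <xs:element name="Variable"/>
  <xs:element name="Category"/>
  <xs:element name="Identifiable" abstract="true"/>
</xs:schema>
"""


def concrete_elements(xsd_text: str) -> set[str]:
    """Names of top-level xs:element declarations that are not abstract."""
    root = ET.fromstring(xsd_text)
    return {
        el.get("name")
        for el in root.findall(f"{XS}element")
        if el.get("name") and el.get("abstract", "false") not in ("true", "1")
    }


def coverage(declared: set[str], handled: set[str]) -> tuple[float, set[str]]:
    """Return (percent covered, names still missing)."""
    missing = declared - handled
    if not declared:
        return 100.0, missing
    return 100.0 * (len(declared) - len(missing)) / len(declared), missing


declared = concrete_elements(SAMPLE_XSD)
percent, missing = coverage(declared, handled={"Variable"})
print(percent, sorted(missing))  # 50.0 ['Category']
```

A pytest guardrail in this style would simply assert that `missing` is empty, so a new schema release that adds an element fails the test until the element is handled.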
Supported Backends
| Backend | Description | Use Case |
|---|---|---|
| Neo4j | Native graph database (Bolt) | Production deployments, complex queries |
| RDF/SPARQL | Semantic web triplestores | Linked data, ontology integration |
| Gremlin | Graph traversal language | JanusGraph, Neptune, Cosmos DB |
| NetworkX | Python graph library | Local analysis, prototyping |
| pandas | DataFrame-based | Tabular analysis, Excel export |
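Supporting several backends behind one loader usually comes down to a small write interface that each backend implements. The sketch below shows that pattern with `typing.Protocol`. The method names, the `SketchWriteAdapter` name, and the `HAS_VARIABLE` relationship are all hypothetical; the README does not show the real GraphWriteAdapter signature, so consult the project documentation before writing an actual adapter.

```python
from typing import Any, Protocol


class SketchWriteAdapter(Protocol):
    """Hypothetical write interface; NOT the real GraphWriteAdapter signature."""

    def write_nodes(self, label: str, rows: list[dict[str, Any]]) -> int: ...

    def write_edges(self, rel_type: str, rows: list[dict[str, Any]]) -> int: ...


class InMemoryAdapter:
    """Toy backend that keeps the graph in plain Python containers."""

    def __init__(self) -> None:
        self.nodes: dict[tuple[str, str], dict[str, Any]] = {}
        self.edges: list[tuple[str, str, str]] = []

    def write_nodes(self, label: str, rows: list[dict[str, Any]]) -> int:
        for row in rows:
            # Keyed by (label, id), so writing the same node twice is idempotent.
            self.nodes[(label, row["id"])] = row
        return len(rows)

    def write_edges(self, rel_type: str, rows: list[dict[str, Any]]) -> int:
        for row in rows:
            self.edges.append((row["source"], rel_type, row["target"]))
        return len(rows)


def load_demo(adapter: SketchWriteAdapter) -> int:
    """Write a tiny graph through the interface and return the row count."""
    written = adapter.write_nodes("Dataset", [{"id": "my-survey"}])
    written += adapter.write_nodes("Variable", [{"id": "V1"}, {"id": "V2"}])
    written += adapter.write_edges(
        "HAS_VARIABLE",
        [
            {"source": "my-survey", "target": "V1"},
            {"source": "my-survey", "target": "V2"},
        ],
    )
    return written


adapter = InMemoryAdapter()
print(load_demo(adapter))  # 5
```

Because the loader depends only on the protocol, swapping the in-memory backend for a database-backed one requires no change to the loading code.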
Docker Quick Start
docker run --rm --name neo4j-demo \
-p 7474:7474 -p 7687:7687 \
-e NEO4J_AUTH=neo4j/password \
neo4j:5
# docker run stays in the foreground; run the following in a second terminal
export DDIGRAPH_NEO4J_URI=bolt://localhost:7687
export DDIGRAPH_NEO4J_USER=neo4j
export DDIGRAPH_NEO4J_PASSWORD=password
ddigraph bootstrap
ddigraph load your-file.xml --dataset-id demo
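A freshly started container needs some time before it accepts connections, so running ddigraph bootstrap immediately can fail. The helper below is a hypothetical standard-library sketch that polls the Bolt port until it opens. An open port does not guarantee that the database is fully initialized, so treat it as a first-pass check only.

```python
import socket
import time


def wait_for_port(host: str, port: int, timeout: float = 60.0, interval: float = 1.0) -> bool:
    """Poll a TCP port until it accepts a connection or the timeout expires."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        try:
            # A successful connect means something is listening on the port.
            with socket.create_connection((host, port), timeout=interval):
                return True
        except OSError:
            time.sleep(interval)
    return False
```

Calling `wait_for_port("localhost", 7687)` before the bootstrap step returns True once the port published by the container is reachable, and False if it stays closed for the whole timeout.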
Documentation
Full documentation is available at pbisson44.github.io/ddigraph in English and French.
- Getting Started -- Installation, quick start, 10-minute tutorial
- User Guide -- Architecture, DDI formats, relationships, adapters
- Graph Backends -- Neo4j, RDF/SPARQL, Gremlin, NetworkX
- Reference -- CLI commands, configuration
- Advanced -- Performance tuning, AI readiness, standards interoperability
- Contributing -- How to contribute
Development
git clone https://github.com/pbisson44/ddigraph.git
cd ddigraph
pip install -e ".[dev,docs]"
ruff check . && ruff format .
# Docstring linting is currently enforced for src/ddigraph only.
pydocstyle src/ddigraph
mypy .
pytest
mkdocs serve
License
MIT -- see LICENSE for details.