Data transformation framework for LinkML data models

Koza - Knowledge Graph Transformation and Operations Toolkit

Documentation

Disclaimer: Koza is in beta - we are looking for testers!

Overview

Koza is a Python library and CLI tool for transforming biomedical data and performing graph operations on Knowledge Graph Exchange (KGX) files. It provides two main capabilities:

📊 Graph Operations (New!)

Powerful DuckDB-based operations for KGX knowledge graphs:

  • Join multiple KGX files with schema harmonization
  • Split files by field values with format conversion
  • Prune dangling edges and handle singleton nodes
  • Append new data to existing databases with schema evolution
  • Multi-format support for TSV, JSONL, and Parquet files
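
To make the pruning terms concrete, here is a minimal in-memory sketch of what "dangling edges" and "singleton nodes" mean. It is not Koza's implementation (which the list above describes as DuckDB-based); the function name `find_integrity_issues` and the sample identifiers are hypothetical.

```python
def find_integrity_issues(nodes, edges):
    """Return (dangling_edges, singleton_nodes) for a small in-memory graph.

    Assumed definitions for this sketch:
    - a dangling edge has a subject or object that is not a known node id
    - a singleton node is not referenced by any edge
    """
    node_ids = {node["id"] for node in nodes}
    dangling = [
        edge for edge in edges
        if edge["subject"] not in node_ids or edge["object"] not in node_ids
    ]
    referenced = set()
    for edge in edges:
        referenced.add(edge["subject"])
        referenced.add(edge["object"])
    singletons = [node for node in nodes if node["id"] not in referenced]
    return dangling, singletons


# Hypothetical sample data using KGX-style field names.
nodes = [{"id": "GENE:1"}, {"id": "GENE:2"}, {"id": "GENE:3"}]
edges = [
    {"subject": "GENE:1", "predicate": "biolink:interacts_with", "object": "GENE:2"},
    {"subject": "GENE:1", "predicate": "biolink:interacts_with", "object": "GENE:9"},
]
dangling, singletons = find_integrity_issues(nodes, edges)
```

In this sample the second edge is dangling because `GENE:9` is not in the node list, and `GENE:3` is a singleton because no edge mentions it.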

🔄 Data Transformation (Core)

Transform biomedical data sources into KGX format:

  • Transform CSV, JSON, YAML, JSONL, and XML to target formats
  • Output in KGX format
  • Write data transforms in semi-declarative Python
  • Configure source files, columns/properties, and metadata in YAML
  • Create mapping files and translation tables between vocabularies
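
As a rough illustration of the kind of row-level mapping a transform performs, the sketch below turns rows of a delimited source into KGX-style edge records using only the standard library. It does not use Koza's API; the column names `protein1` and `protein2`, the `EXAMPLE:` prefix, and the helper `row_to_edge` are all assumptions made for this example.

```python
import csv
import io

# Hypothetical two-column source; real sources and column names will differ.
SOURCE = "protein1,protein2\nP1,P2\nP1,P3\n"


def row_to_edge(row, prefix="EXAMPLE:"):
    """Map one source row to a KGX-style edge record (a plain dict here)."""
    subject = prefix + row["protein1"]
    obj = prefix + row["protein2"]
    return {
        "id": f"{subject}-{obj}",
        "subject": subject,
        "predicate": "biolink:interacts_with",
        "object": obj,
    }


# Read the source row by row and emit one edge per row.
edges = [row_to_edge(row) for row in csv.DictReader(io.StringIO(SOURCE))]
```

In a real ingest the source file, columns, and metadata would be declared in YAML rather than hard-coded as they are here.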

Installation

Koza is available on PyPI and can be installed with pip or pipx:

[pip|pipx] install koza

Usage

Quick Start with Graph Operations

Koza's graph operations work seamlessly across multiple KGX formats (TSV, JSONL, Parquet):

# Join multiple KGX files into a unified database
koza join --nodes genes.tsv pathways.jsonl --edges interactions.parquet --output merged_graph.duckdb

# Prune dangling edges and handle singleton nodes
koza prune --database merged_graph.duckdb --keep-singletons

# Append new data to existing database with schema evolution
koza append --database merged_graph.duckdb --nodes new_genes.tsv --edges new_interactions.jsonl

# Split database by source with format conversion
koza split --database merged_graph.duckdb --split-on provided_by --output-format parquet

NOTE: As of version 0.2.0, there is a new method for getting your ingest's KozaApp instance. Please see the updated documentation for details.

See the Koza documentation for complete usage information.

Examples

Validate

Give Koza a local or remote delimited file (for example, CSV or TSV) to get basic information about it (headers, number of rows):

koza validate \
  --file https://raw.githubusercontent.com/monarch-initiative/koza/main/examples/data/string.tsv \
  --delimiter ' '
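
For readers who want to see what "headers and number of rows" amounts to, here is a small standard-library sketch. It is not how `koza validate` is implemented; `summarize_delimited` and the sample text are hypothetical.

```python
import csv
import io


def summarize_delimited(text, delimiter=","):
    """Return (headers, row_count) for delimited text with a header line."""
    reader = csv.reader(io.StringIO(text), delimiter=delimiter)
    headers = next(reader, [])          # first line is treated as the header
    row_count = sum(1 for _ in reader)  # remaining lines are data rows
    return headers, row_count


# Hypothetical space-delimited sample, mirroring the ' ' delimiter above.
sample = "protein1 protein2 combined_score\nA B 900\nC D 700\n"
headers, row_count = summarize_delimited(sample, delimiter=" ")
```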

Passing a JSON or JSONL file confirms whether the file is valid JSON or JSONL:

koza validate \
  --file ./examples/data/ZFIN_PHENOTYPE_0.jsonl.gz \
  --format jsonl
koza validate \
  --file ./examples/data/ddpheno.json.gz \
  --format json
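
The JSONL check can be pictured as parsing each line independently. The sketch below shows that idea with the standard `json` module; it is an illustration under that assumption, not Koza's validation code, and `invalid_jsonl_lines` is a hypothetical helper.

```python
import json


def invalid_jsonl_lines(text):
    """Return the 1-based line numbers that fail to parse as JSON."""
    bad = []
    for lineno, line in enumerate(text.splitlines(), start=1):
        if not line.strip():
            continue  # blank lines are skipped in this sketch
        try:
            json.loads(line)
        except json.JSONDecodeError:
            bad.append(lineno)
    return bad


good = '{"id": "A"}\n{"id": "B"}\n'
broken = '{"id": "A"}\n{"id": \n'

good_result = invalid_jsonl_lines(good)
broken_result = invalid_jsonl_lines(broken)
```

For gzip-compressed inputs such as the `.jsonl.gz` file above, the text could first be read with `gzip.open(path, "rt")`.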

Transform

Run the example ingest, "string/protein-links-detailed":

koza transform \
  --source examples/string/protein-links-detailed.yaml \
  --global-table examples/translation_table.yaml

koza transform \
  --source examples/string-declarative/protein-links-detailed.yaml \
  --global-table examples/translation_table.yaml

Note: Koza expects the directory structure shown below,
with the source config file and transform code in the same directory:

.
├── ...
│   ├── your_source
│   │   ├── your_ingest.yaml
│   │   └── your_ingest.py
│   └── some_translation_table.yaml
└── ...

Graph Operations

Create and manipulate knowledge graphs from existing KGX files:

# Join heterogeneous KGX files with automatic schema harmonization
koza join \
  --nodes genes.tsv proteins.jsonl pathways.parquet \
  --edges gene_protein.tsv protein_pathway.jsonl \
  --output unified_graph.duckdb \
  --schema-report

# Clean up graph integrity issues
koza prune \
  --database unified_graph.duckdb \
  --keep-singletons \
  --dry-run  # Preview changes before applying

# Incrementally add new data with schema evolution
koza append \
  --database unified_graph.duckdb \
  --nodes new_genes.tsv updated_pathways.jsonl \
  --deduplicate \
  --show-progress

# Export subsets with format conversion
koza split \
  --database unified_graph.duckdb \
  --split-on provided_by \
  --output-format parquet \
  --output-dir ./split_graphs
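
Conceptually, splitting on a field means grouping records by that field's value and writing each group to its own output. The following sketch shows only the grouping step, in memory, under the simplifying assumption that `provided_by` holds a single string per record. `split_by_field` and the sample records are hypothetical and unrelated to Koza's internals.

```python
from collections import defaultdict


def split_by_field(records, field):
    """Group records by the value of one field."""
    groups = defaultdict(list)
    for record in records:
        # Records missing the field are collected under "unknown".
        groups[record.get(field, "unknown")].append(record)
    return dict(groups)


edges = [
    {"subject": "A", "object": "B", "provided_by": "source_one"},
    {"subject": "B", "object": "C", "provided_by": "source_two"},
    {"subject": "A", "object": "C", "provided_by": "source_one"},
]
groups = split_by_field(edges, "provided_by")
```

Each key of `groups` would correspond to one output file in a split.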

Key Features

🔧 Multi-Format Support

  • Native support for TSV, JSONL, and Parquet KGX files
  • Automatic format detection and conversion
  • Mixed-format operations in single commands

🛡️ Schema Flexibility

  • Automatic schema harmonization across heterogeneous files
  • Schema evolution with backward compatibility
  • Comprehensive schema reporting and validation
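
One simple way to think about schema harmonization is as a union of columns, with missing values filled in as nulls. The sketch below implements that interpretation on plain dicts; it is an assumption about the general idea, not a description of how Koza harmonizes schemas, and `harmonize` is a hypothetical name.

```python
def harmonize(tables):
    """Merge row lists with different columns into one shared schema.

    Columns are collected in first-seen order; values missing from a row
    are filled with None.
    """
    columns = []
    for rows in tables:
        for row in rows:
            for key in row:
                if key not in columns:
                    columns.append(key)
    merged = [
        {col: row.get(col) for col in columns}
        for rows in tables
        for row in rows
    ]
    return columns, merged


# Hypothetical node tables with overlapping but different columns.
genes = [{"id": "GENE:1", "name": "g1"}]
pathways = [{"id": "PATH:1", "category": "biolink:Pathway"}]
columns, merged = harmonize([genes, pathways])
```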

⚡ High Performance

  • DuckDB-powered operations for fast bulk processing
  • Memory-efficient handling of large knowledge graphs
  • Parallel processing and streaming where possible

🔍 Rich CLI Experience

  • Progress indicators for long-running operations
  • Detailed statistics and operation summaries
  • Dry-run modes for safe operation preview

🧹 Data Integrity

  • Dangling edge detection and preservation
  • Duplicate detection and removal strategies
  • Non-destructive operations with data archiving
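
A non-destructive deduplication can be sketched as keeping the first record per identifier and setting the rest aside rather than discarding them. This is one possible strategy chosen for illustration; the strategies Koza actually offers are not specified here, and `deduplicate` is a hypothetical helper.

```python
def deduplicate(records, key="id"):
    """Keep the first record per key; return (unique, duplicates).

    Duplicates are returned rather than dropped so they can be archived.
    """
    seen = set()
    unique, duplicates = [], []
    for record in records:
        if record[key] in seen:
            duplicates.append(record)
        else:
            seen.add(record[key])
            unique.append(record)
    return unique, duplicates


nodes = [
    {"id": "GENE:1", "name": "first"},
    {"id": "GENE:2", "name": "other"},
    {"id": "GENE:1", "name": "repeat"},
]
unique, duplicates = deduplicate(nodes)
```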
