Skip to main content

Graph-native data management platform for population genomics

Project description

GraphMana

Graph-native data management platform for population genomics.

GraphMana stores VCF/GVCF data as a persistent, queryable graph database with packed genotype arrays on Variant nodes, pre-computed population statistics, incremental sample addition, integrated functional annotations, cohort management, reference genome liftover, annotation versioning, and multi-format export. Target scale: 100–50,000 samples on a single machine or HPC cluster node.

Key Features

  • Packed genotype arrays — 2-bit-per-sample storage (125× smaller than per-sample edges)
  • Pre-computed population statistics — allele counts, frequencies, heterozygosity per population
  • Two access paths — FAST PATH (pre-computed arrays, seconds) and FULL PATH (unpack genotypes, linear in N)
  • Incremental sample addition — add new samples without re-processing existing data
  • 17 export formats — VCF, PLINK 1.9/2.0, EIGENSTRAT, TreeMix, SFS (dadi/fastsimcoal2), BED, TSV, Beagle, STRUCTURE, Genepop, haplotype, BGEN, GDS, Zarr, JSON
  • Cohort management — define sample subsets as graph queries, not file extractions
  • Annotation versioning — VEP, ClinVar, CADD, gene constraint, GO terms, pathways, regulatory BED
  • Automatic provenance — every operation logged with parameters, timestamps, sample counts
  • Export manifests — each export generates a .manifest.json sidecar for reproducibility
  • Reference genome liftover — coordinate transformation across assemblies
  • 58 CLI commands — organized into 9 functional domains, no programming required
  • HPC cluster support — two-step CSV pipeline, user-space Neo4j, SLURM/PBS scripts
  • Species-agnostic — diploid, haploid, and mixed-ploidy chromosomes
  • No admin privileges needed — installs entirely in user space

Installation

Full installation guide: docs/INSTALL.md

Quick Install (no admin needed)

curl -sSL https://raw.githubusercontent.com/jfmao/GraphMana/main/install.sh | bash

This installs conda (if needed), Python, Java, Neo4j, and GraphMana in one step.

pip install

conda create -n graphmana -c conda-forge -c bioconda python=3.12 cyvcf2 openjdk=21 -y
conda activate graphmana
pip install graphmana
graphmana setup-neo4j --install-dir ~/neo4j --memory-auto

The Java procedures JAR is bundled with the Python package — no Maven build needed. The setup-neo4j command automatically deploys the JAR to the Neo4j plugins directory.

Docker

git clone https://github.com/jfmao/GraphMana.git
cd GraphMana
docker compose up --build

HPC Cluster

conda activate graphmana
graphmana setup-neo4j --install-dir $HOME/neo4j --install-java --memory-auto

The --install-java flag downloads Eclipse Temurin JDK 21 to user space (no admin needed). See Vignette 08: HPC Cluster Deployment for SLURM/PBS workflows.

From Source (development)

git clone https://github.com/jfmao/GraphMana.git
cd GraphMana
conda create -n graphmana -c conda-forge -c bioconda python=3.12 cyvcf2 openjdk=21 maven -y
conda activate graphmana

# Build Java procedures (optional — JAR is pre-built and bundled)
cd graphmana-procedures && mvn clean package -DskipTests && cd ..

# Install Python CLI
cd graphmana-cli && pip install -e ".[dev]" && cd ..

# Run tests (1,439 tests)
cd graphmana-cli && pytest -v && cd ..

# Setup Neo4j
graphmana setup-neo4j --install-dir ~/neo4j --memory-auto

Quick Start

# Start Neo4j
graphmana neo4j-start --neo4j-home ~/neo4j --wait

# Import VCF data
graphmana ingest \
    --input my_variants.vcf.gz \
    --population-map populations.tsv \
    --neo4j-home ~/neo4j \
    --reference GRCh38

# Check database status
graphmana status --detailed

# Export to TreeMix (FAST PATH — seconds at any sample count)
graphmana export --format treemix --output treemix.gz

# Export filtered VCF
graphmana export --format vcf --output filtered.vcf.gz \
    --populations POP_A POP_B --filter-maf-min 0.05

# Export PLINK for GWAS
graphmana export --format plink --output gwas_data \
    --filter-variant-type SNP --filter-min-call-rate 0.95

CLI Commands (58 total)

GraphMana provides 58 commands organized into 9 functional domains. See Command Reference for the full documentation.

Domain Key Commands
Data Import ingest, prepare-csv, load-csv, merge, liftover
Annotation annotate load, load-clinvar, load-cadd, load-go, load-bed
Export export (17 formats), list-formats
Sample & Cohort sample remove/restore/reassign, cohort define/list/show
Quality Control qc, ref-check, db validate
Provenance provenance list/show/search/summary
Database Admin snapshot create/restore, db info/check, diff, save-state
Status status, summary, version, config-show
Infrastructure setup-neo4j, neo4j-start/stop, check-filesystem, cluster

Export Formats

Format Access Path Target Tool
TreeMix FAST TreeMix
SFS (dadi) FAST dadi, moments
SFS (fsc) FAST fastsimcoal2
BED FAST bedtools, IGV
TSV FAST General analysis
JSON FAST Programmatic
VCF/BCF FULL bcftools, GATK
PLINK 1.9 FULL PLINK
PLINK 2.0 FULL PLINK2
EIGENSTRAT FULL smartPCA, AdmixTools
Beagle FULL Beagle
STRUCTURE FULL STRUCTURE
Genepop FULL Genepop
Haplotype FULL selscan
BGEN FULL UK Biobank tools
GDS FULL SeqArray/R
Zarr FULL sgkit/Python

Documentation

Architecture

GraphMana is built on graph database technology (currently Neo4j Community Edition, free and open source). The companion GraphPop engine provides graph-native analytical computation (population statistics, selection scans, annotation-conditioned queries) on the same persistent database.

Software Stack

  • Database: Neo4j Community 5.x (graph database)
  • Java plugin: Pre-built JAR bundled with Python package (31 KB)
  • Python CLI: Python 3.11+, cyvcf2, numpy, Click (21,267 lines)
  • Testing: 1,439 unit and integration tests (pytest)

License

MIT License. See LICENSE.

Citation

If you use GraphMana in your research, please cite:

Mao, J. GraphMana: graph-native data management for population genomics projects. bioRxiv (2026). [DOI pending]

Project details


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

graphmana-1.1.0.tar.gz (326.6 kB view details)

Uploaded Source

Built Distribution

If you're not sure about the file name format, learn more about wheel file names.

graphmana-1.1.0-py3-none-any.whl (245.3 kB view details)

Uploaded Python 3

File details

Details for the file graphmana-1.1.0.tar.gz.

File metadata

  • Download URL: graphmana-1.1.0.tar.gz
  • Upload date:
  • Size: 326.6 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.2.0 CPython/3.13.12

File hashes

Hashes for graphmana-1.1.0.tar.gz
Algorithm Hash digest
SHA256 acd2f1dd07b4618cff779cff8a69d46482cf21a1cbb2e2351ee9bc1be6b3e5a1
MD5 fb7070ac67f86abf96f942b84f84940d
BLAKE2b-256 05ec087a56add01a289563314d20a123cb8c4d5d37042cce6fd8a4050fbe65bd

See more details on using hashes here.

File details

Details for the file graphmana-1.1.0-py3-none-any.whl.

File metadata

  • Download URL: graphmana-1.1.0-py3-none-any.whl
  • Upload date:
  • Size: 245.3 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.2.0 CPython/3.13.12

File hashes

Hashes for graphmana-1.1.0-py3-none-any.whl
Algorithm Hash digest
SHA256 9ba837c66d2a326d71658e3d73ae628907a58d656d1dad580314cebae808177d
MD5 69d35234773bea389dcbbf49c460c481
BLAKE2b-256 68cebb2374a2d389e5c9a97a665acb3c69d987e4a02fbe8ed5ed153bc336256c

See more details on using hashes here.

Supported by

AWS Cloud computing and Security Sponsor Datadog Monitoring Depot Continuous Integration Fastly CDN Google Download Analytics Pingdom Monitoring Sentry Error logging StatusPage Status page