GraphMana
Graph-native data management platform for population genomics.
GraphMana stores VCF/GVCF data as a persistent, queryable graph database with packed genotype arrays on Variant nodes, pre-computed population statistics, incremental sample addition, integrated functional annotations, cohort management, reference genome liftover, annotation versioning, and multi-format export. Target scale: 100–50,000 samples on a single machine or HPC cluster node.
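The pre-computed statistics and incremental sample addition described above can be illustrated with a small sketch. This is not GraphMana's implementation: the `PopStats` class, its field names, and the genotype convention (number of ALT alleles per diploid sample, `None` for a missing call) are assumptions made for illustration. The point is only that running counts can absorb a new batch of samples without revisiting earlier ones.

```python
from dataclasses import dataclass


@dataclass
class PopStats:
    """Hypothetical per-population summary for one biallelic variant."""

    alt_count: int = 0      # alternate alleles observed (AC)
    allele_number: int = 0  # called alleles in total (AN)
    het_count: int = 0      # heterozygous samples

    def add_samples(self, genotypes):
        """Fold new diploid genotypes into the running counts."""
        for gt in genotypes:
            if gt is None:  # missing call contributes nothing
                continue
            self.alt_count += gt  # gt is the number of ALT alleles: 0, 1 or 2
            self.allele_number += 2
            self.het_count += int(gt == 1)

    @property
    def alt_frequency(self):
        return self.alt_count / self.allele_number if self.allele_number else 0.0

    @property
    def observed_heterozygosity(self):
        n_called = self.allele_number // 2
        return self.het_count / n_called if n_called else 0.0


stats = PopStats()
stats.add_samples([0, 1, 2, None])  # initial import
stats.add_samples([1, 1])           # later batch; earlier samples are not re-read
print(stats.alt_count, stats.allele_number, stats.alt_frequency)
```

Because only the counters are stored, reading a frequency afterwards costs the same regardless of how many samples were imported.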
Key Features
- Packed genotype arrays — 2-bit-per-sample storage (125× smaller than per-sample edges)
- Pre-computed population statistics — allele counts, frequencies, heterozygosity per population
- Two access paths — FAST PATH (pre-computed arrays, seconds) and FULL PATH (unpack genotypes, linear in the number of samples N)
- Incremental sample addition — add new samples without re-processing existing data
- 17 export formats — VCF, PLINK 1.9/2.0, EIGENSTRAT, TreeMix, SFS (dadi/fastsimcoal2), BED, TSV, Beagle, STRUCTURE, Genepop, haplotype, BGEN, GDS, Zarr, JSON
- Cohort management — define sample subsets as graph queries, not file extractions
- Annotation versioning — VEP, ClinVar, CADD, gene constraint, GO terms, pathways, regulatory BED
- Automatic provenance — every operation logged with parameters, timestamps, sample counts
- Export manifests — each export generates a .manifest.json sidecar for reproducibility
- Reference genome liftover — coordinate transformation across assemblies
- 58 CLI commands — organized into 9 functional domains, no programming required
- HPC cluster support — two-step CSV pipeline, user-space Neo4j, SLURM/PBS scripts
- Species-agnostic — diploid, haploid, and mixed-ploidy chromosomes
- No admin privileges needed — installs entirely in user space
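To make the 2-bit-per-sample storage concrete, here is a minimal pure-Python sketch of packing four genotype codes into each byte. The code values (0 = homozygous reference, 1 = heterozygous, 2 = homozygous alternate, 3 = missing) and the bit order are assumptions for illustration; GraphMana's actual encoding may differ.

```python
# Hypothetical 2-bit codes; GraphMana's actual encoding may differ.
HOM_REF, HET, HOM_ALT, MISSING = 0, 1, 2, 3
SAMPLES_PER_BYTE = 4  # 8 bits / 2 bits per sample


def pack_genotypes(codes):
    """Pack a sequence of 2-bit genotype codes into bytes, 4 samples per byte."""
    n_bytes = (len(codes) + SAMPLES_PER_BYTE - 1) // SAMPLES_PER_BYTE
    packed = bytearray(n_bytes)
    for i, code in enumerate(codes):
        if not 0 <= code <= 3:
            raise ValueError(f"genotype code out of range: {code}")
        # Sample i occupies bits 2*(i % 4) and 2*(i % 4) + 1 of byte i // 4.
        packed[i // SAMPLES_PER_BYTE] |= code << (2 * (i % SAMPLES_PER_BYTE))
    return bytes(packed)


def unpack_genotypes(packed, n_samples):
    """Recover n_samples genotype codes; the cost is linear in n_samples."""
    return [
        (packed[i // SAMPLES_PER_BYTE] >> (2 * (i % SAMPLES_PER_BYTE))) & 0b11
        for i in range(n_samples)
    ]


codes = [HOM_REF, HET, HOM_ALT, MISSING, HET, HOM_REF]
packed = pack_genotypes(codes)
print(len(packed))                                    # 2 bytes for 6 samples
print(unpack_genotypes(packed, len(codes)) == codes)  # True
```

Under this scheme a variant with 50,000 samples needs 12,500 bytes. Unpacking touches every sample, which is why the FULL PATH scales linearly with sample count.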
Installation
Full installation guide: docs/INSTALL.md
Quick Install (no admin needed)
```bash
curl -sSL https://raw.githubusercontent.com/jfmao/GraphMana/main/install.sh | bash
```
This installs conda (if needed), Python, Java, Neo4j, and GraphMana in one step.
pip install
```bash
conda create -n graphmana -c conda-forge -c bioconda python=3.12 cyvcf2 openjdk=21 -y
conda activate graphmana
pip install graphmana
graphmana setup-neo4j --install-dir ~/neo4j --memory-auto
```
The Java procedures JAR is bundled with the Python package — no Maven build needed.
The setup-neo4j command automatically deploys the JAR to the Neo4j plugins directory.
Docker
```bash
git clone https://github.com/jfmao/GraphMana.git
cd GraphMana
docker compose up --build
```
- Neo4j Browser: http://localhost:7474 (default login: neo4j / graphmana)
- Bolt endpoint: bolt://localhost:7687
HPC Cluster
```bash
conda activate graphmana
graphmana setup-neo4j --install-dir $HOME/neo4j --install-java --memory-auto
```
The --install-java flag downloads Eclipse Temurin JDK 21 to user space (no admin needed).
See Vignette 08: HPC Cluster Deployment for SLURM/PBS workflows.
From Source (development)
```bash
git clone https://github.com/jfmao/GraphMana.git
cd GraphMana
conda create -n graphmana -c conda-forge -c bioconda python=3.12 cyvcf2 openjdk=21 maven -y
conda activate graphmana

# Build Java procedures (optional; the JAR is pre-built and bundled)
cd graphmana-procedures && mvn clean package -DskipTests && cd ..

# Install Python CLI
cd graphmana-cli && pip install -e ".[dev]" && cd ..

# Run tests (1,439 tests)
cd graphmana-cli && pytest -v && cd ..

# Setup Neo4j
graphmana setup-neo4j --install-dir ~/neo4j --memory-auto
```
Quick Start
```bash
# Start Neo4j
graphmana neo4j-start --neo4j-home ~/neo4j --wait

# Import VCF data
graphmana ingest \
    --input my_variants.vcf.gz \
    --population-map populations.tsv \
    --neo4j-home ~/neo4j \
    --reference GRCh38

# Check database status
graphmana status --detailed

# Export to TreeMix (FAST PATH: seconds at any sample count)
graphmana export --format treemix --output treemix.gz

# Export filtered VCF
graphmana export --format vcf --output filtered.vcf.gz \
    --populations POP_A POP_B --filter-maf-min 0.05

# Export PLINK for GWAS
graphmana export --format plink --output gwas_data \
    --filter-variant-type SNP --filter-min-call-rate 0.95
```
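For scripted pipelines, the three export commands above could be driven from Python. The sketch below is one possible wrapper, not part of GraphMana: `export_command` and `run` are hypothetical helper names, and only flags that appear in the Quick Start are used. It defaults to a dry run that prints each command, so it can be tried without a running database.

```python
import shlex
import subprocess


def export_command(fmt, output, *extra):
    """Build a `graphmana export` argument list from flags shown in the Quick Start."""
    return ["graphmana", "export", "--format", fmt, "--output", output, *extra]


def run(cmd, dry_run=True):
    """Print the command; execute it only when dry_run is False."""
    print(shlex.join(cmd))
    if not dry_run:
        subprocess.run(cmd, check=True)  # raises CalledProcessError on failure


jobs = [
    export_command("treemix", "treemix.gz"),
    export_command("vcf", "filtered.vcf.gz",
                   "--populations", "POP_A", "POP_B", "--filter-maf-min", "0.05"),
    export_command("plink", "gwas_data",
                   "--filter-variant-type", "SNP", "--filter-min-call-rate", "0.95"),
]
for cmd in jobs:
    run(cmd)  # dry run: prints the command without executing it
```

Passing the arguments as a list avoids shell quoting problems with file names; set `dry_run=False` once Neo4j is running and the data has been ingested.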
CLI Commands (58 total)
GraphMana provides 58 commands organized into 9 functional domains. See Command Reference for the full documentation.
| Domain | Key Commands |
|---|---|
| Data Import | ingest, prepare-csv, load-csv, merge, liftover |
| Annotation | annotate load, load-clinvar, load-cadd, load-go, load-bed |
| Export | export (17 formats), list-formats |
| Sample & Cohort | sample remove/restore/reassign, cohort define/list/show |
| Quality Control | qc, ref-check, db validate |
| Provenance | provenance list/show/search/summary |
| Database Admin | snapshot create/restore, db info/check, diff, save-state |
| Status | status, summary, version, config-show |
| Infrastructure | setup-neo4j, neo4j-start/stop, check-filesystem, cluster |
Export Formats
| Format | Access Path | Target Tool |
|---|---|---|
| TreeMix | FAST | TreeMix |
| SFS (dadi) | FAST | dadi, moments |
| SFS (fsc) | FAST | fastsimcoal2 |
| BED | FAST | bedtools, IGV |
| TSV | FAST | General analysis |
| JSON | FAST | Programmatic |
| VCF/BCF | FULL | bcftools, GATK |
| PLINK 1.9 | FULL | PLINK |
| PLINK 2.0 | FULL | PLINK2 |
| EIGENSTRAT | FULL | smartPCA, AdmixTools |
| Beagle | FULL | Beagle |
| STRUCTURE | FULL | STRUCTURE |
| Genepop | FULL | Genepop |
| Haplotype | FULL | selscan |
| BGEN | FULL | UK Biobank tools |
| GDS | FULL | SeqArray/R |
| Zarr | FULL | sgkit/Python |
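The FAST formats in the table need only per-population allele counts, which is why they can skip genotype unpacking. As an illustration, the sketch below writes TreeMix input (a header line of population names, then one line per SNP with comma-separated allele counts for each population) from counts that are assumed to be already available. The `treemix_lines` helper and the `{population: (alt_count, allele_number)}` layout are hypothetical, not GraphMana's internal representation.

```python
import gzip
import io


def treemix_lines(populations, variants):
    """Yield TreeMix input lines from per-population (alt_count, allele_number) pairs."""
    yield " ".join(populations)
    for counts in variants:
        fields = []
        for pop in populations:
            alt, total = counts[pop]
            fields.append(f"{total - alt},{alt}")  # reference count, alternate count
        yield " ".join(fields)


# Hypothetical pre-computed counts for two variants: {population: (AC, AN)}
variants = [
    {"POP_A": (3, 10), "POP_B": (0, 8)},
    {"POP_A": (10, 10), "POP_B": (4, 8)},
]
text = "\n".join(treemix_lines(["POP_A", "POP_B"], variants)) + "\n"

buffer = io.BytesIO()  # stands in for a file such as treemix.gz
with gzip.open(buffer, "wt") as handle:  # TreeMix reads gzipped input
    handle.write(text)
print(text, end="")
```

The work per variant depends on the number of populations, not the number of samples, which matches the "seconds at any sample count" behaviour claimed for the FAST PATH.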
Documentation
- Installation Guide — 5 installation methods, no admin needed
- Command Reference — 65 command reference pages
- Vignettes — 11 tutorial vignettes
- Cluster Deployment — SLURM/PBS guide
Architecture
GraphMana is built on graph database technology (currently Neo4j Community Edition, free and open source). The companion GraphPop engine provides graph-native analytical computation (population statistics, selection scans, annotation-conditioned queries) on the same persistent database.
Software Stack
- Database: Neo4j Community 5.x (graph database)
- Java plugin: Pre-built JAR bundled with Python package (31 KB)
- Python CLI: Python 3.11+, cyvcf2, numpy, Click (21,267 lines)
- Testing: 1,439 unit and integration tests (pytest)
License
MIT License. See LICENSE.
Citation
If you use GraphMana in your research, please cite:
Mao, J. GraphMana: graph-native data management for population genomics projects. bioRxiv (2026). [DOI pending]