Multi-GWAS gene-overlap analysis, similarity heatmaps, Circos-style visualization, and reporting.

Tardigrade GWAS

Tardigrade GWAS is a lightweight Python library for comparing gene-level signals across multiple GWAS summary-statistics files. It provides significant-gene extraction, pairwise overlap analysis, Jaccard similarity, cross-dataset consensus scoring, report generation, similarity heatmaps, and Circos-style GWAS visualization.

Developed by Amir Izadi and Zahra Aghabeygi.

Tardigrade GWAS is a research utility. It is not a clinical diagnostic system and should not be used as the sole basis for medical decisions.

Features

Read CSV, TSV, TXT, .csv.gz, and .tsv.gz GWAS tables
Detect common GWAS column aliases case-insensitively
Filter associations by a configurable P-value threshold
Parse multi-gene fields separated by commas or semicolons
Count significant-gene occurrences per dataset
Calculate gene evidence scores using sum(-log10(P))
Calculate pairwise gene-set intersections
Calculate pairwise Jaccard similarity
Build a cross-dataset consensus table
Find genes shared across all datasets
Generate text reports and CSV result tables
Generate Jaccard similarity heatmaps
Generate Circos-style chromosome plots
Return standard Pandas DataFrames and Matplotlib Figures for further customization

Installation

pip install tardigrade-gwas

Upgrade to the latest available version:

pip install --upgrade tardigrade-gwas

Check the installed version:

import tardigrade_gwas as tg

print(tg.__version__)

Quick start

import tardigrade_gwas as tg

result = tg.analyze_files(
    ["disease_a.csv", "disease_b.tsv"],
    names=["Disease A", "Disease B"],
)

print(result.report_text)
print(result.similarity_table)
print(result.consensus_table.head())

Save the generated report and all result tables:

result.save_report("outputs/report.txt")
result.save_tables("outputs/tables")

Input data

Supported file types

.csv
.tsv
.txt
.csv.gz
.tsv.gz

For .txt files or files without a recognized CSV/TSV extension, Tardigrade attempts to detect comma or tab delimiters automatically.
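
As a rough illustration of how comma-or-tab detection can work, the sketch below counts both characters in the header line and picks the more frequent one. The helper name guess_delimiter is hypothetical and not part of the Tardigrade API; the library's actual detection logic may differ.

```python
def guess_delimiter(header_line: str) -> str:
    """Hypothetical helper: pick tab or comma, whichever occurs more often."""
    tabs = header_line.count("\t")
    commas = header_line.count(",")
    # Fall back to a comma when the line contains no tab at all.
    return "\t" if tabs >= commas and tabs > 0 else ","


print(repr(guess_delimiter("P-VALUE\tMAPPED_GENE\tCHR_ID")))  # '\t'
print(repr(guess_delimiter("P-VALUE,MAPPED_GENE")))            # ','
```

A header-only heuristic like this is cheap, but it can be misled by quoted fields that contain commas, which is one reason to prefer an explicit .csv or .tsv extension where possible.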

Minimum columns for overlap analysis

The default analysis columns are:

Purpose	Default column
P value	`P-VALUE`
Gene annotation	`MAPPED_GENE`

Example:

P-VALUE,MAPPED_GENE
1.2e-10,TP53
4.8e-9,BRCA1
2.3e-6,APOE
7.1e-12,"IL6;STAT3"

Only rows satisfying P-value <= p_threshold are included in the significant-gene analysis. The default threshold is 5e-8.
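
To make the threshold rule concrete, the following sketch applies it to the example table above using only the standard library. It illustrates the filtering step and is not the library's own code.

```python
import csv
import io

# Same rows as the example table above.
table = io.StringIO(
    "P-VALUE,MAPPED_GENE\n"
    "1.2e-10,TP53\n"
    "4.8e-9,BRCA1\n"
    "2.3e-6,APOE\n"
    '7.1e-12,"IL6;STAT3"\n'
)

p_threshold = 5e-8  # default threshold stated above

# Keep the gene field of every row whose P value passes the threshold.
significant = [
    row["MAPPED_GENE"]
    for row in csv.DictReader(table)
    if float(row["P-VALUE"]) <= p_threshold
]

print(significant)  # ['TP53', 'BRCA1', 'IL6;STAT3']
```

The APOE row (P = 2.3e-6) is dropped because it exceeds 5e-8; the multi-gene field is kept as a single string at this stage and split later.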

Columns required for Circos-style plots

Purpose	Default column
Chromosome	`CHR_ID`
Base-pair position	`CHR_POS`
P value	`P-VALUE`
Gene annotation	`MAPPED_GENE`

Example:

P-VALUE,MAPPED_GENE,CHR_ID,CHR_POS
1.2e-10,TP53,17,7676594
4.8e-9,BRCA1,17,43044295
3.1e-12,APOE,19,44908684

Recognized column aliases

Column matching is case-insensitive.

Data type	Recognized names
P value	`P-VALUE`, `P`, `PVALUE`, `P_VALUE`
Gene	`MAPPED_GENE`, `REPORTED GENE(S)`, `GENE`, `GENES`
Chromosome	`CHR_ID`, `CHR`, `CHROM`, `CHROMOSOME`
Position	`CHR_POS`, `BP`, `POS`, `POSITION`, `BASE_PAIR_LOCATION`

Custom column names can always be passed explicitly.
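
Case-insensitive alias matching can be pictured as follows. The helper resolve_column is hypothetical, and the lookup order (preferred name first, then aliases in the listed order) is an assumption; only the alias names themselves come from the table above.

```python
# Alias lists copied from the table above (P value and gene only).
P_ALIASES = ["P-VALUE", "P", "PVALUE", "P_VALUE"]
GENE_ALIASES = ["MAPPED_GENE", "REPORTED GENE(S)", "GENE", "GENES"]


def resolve_column(columns, preferred, aliases):
    """Hypothetical helper: return the actual column name, or None if absent."""
    by_upper = {name.strip().upper(): name for name in columns}
    # Assumption: the preferred name wins, then aliases in listed order.
    for candidate in [preferred, *aliases]:
        match = by_upper.get(candidate.upper())
        if match is not None:
            return match
    return None


columns = ["pvalue", "Gene", "chr"]
print(resolve_column(columns, "P-VALUE", P_ALIASES))         # pvalue
print(resolve_column(columns, "MAPPED_GENE", GENE_ALIASES))  # Gene
```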

Main analysis API

`analyze_files`

Use analyze_files when you have a list of file paths.

from pathlib import Path
import tardigrade_gwas as tg

result = tg.analyze_files(
    files=[
        Path("data/alzheimer.csv"),
        Path("data/parkinson.csv"),
        Path("data/multiple_sclerosis.tsv"),
    ],
    names=[
        "Alzheimer Disease",
        "Parkinson Disease",
        "Multiple Sclerosis",
    ],
    p_column="P-VALUE",
    gene_column="MAPPED_GENE",
    p_threshold=5e-8,
    top_n=20,
)

Parameters:

Parameter	Type	Default	Description
`files`	sequence of paths	required	GWAS files to analyze
`names`	sequence of strings or `None`	`None`	Display names for the datasets
`p_column`	string	`P-VALUE`	Preferred P-value column
`gene_column`	string	`MAPPED_GENE`	Preferred gene column
`p_threshold`	float	`5e-8`	Significance threshold in `(0, 1]`
`top_n`	integer	`20`	Number of top genes shown in report sections

When names is omitted, names are derived from filenames. Duplicate derived names receive suffixes such as _2, _3, and so on.
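
The suffixing behaviour can be sketched like this. The helper derive_names is hypothetical, and two details are assumptions rather than documented behaviour: that all extensions are stripped from the filename, and that the first occurrence of a name stays unsuffixed.

```python
from pathlib import Path


def derive_names(paths):
    """Hypothetical helper: name each dataset after its file, deduplicating."""
    seen = {}
    names = []
    for path in paths:
        # Assumption: strip every extension, so "ad.csv.gz" becomes "ad".
        base = Path(path).name.split(".")[0]
        seen[base] = seen.get(base, 0) + 1
        # Assumption: the first occurrence keeps the plain name.
        names.append(base if seen[base] == 1 else f"{base}_{seen[base]}")
    return names


print(derive_names(["run1/ad.csv", "run2/ad.csv", "pd.tsv.gz", "run3/ad.tsv"]))
# ['ad', 'ad_2', 'pd', 'ad_3']
```

Passing names explicitly avoids relying on any such derivation.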

`analyze_datasets`

Use analyze_datasets when each dataset already has an explicit display name.

import tardigrade_gwas as tg

result = tg.analyze_datasets(
    [
        ("Alzheimer Disease", "data/ad.csv"),
        ("Parkinson Disease", "data/pd.csv"),
    ],
    p_threshold=5e-8,
)

Dataset names must be non-empty and unique.

`run_multi_gwas`

run_multi_gwas is the lower-level analysis function. It returns a dictionary rather than a MultiGWASResult object.

import tardigrade_gwas as tg

raw_result = tg.run_multi_gwas(
    datasets=[
        ("Disease A", "a.csv"),
        ("Disease B", "b.csv"),
    ],
    p_column="P-VALUE",
    gene_column="MAPPED_GENE",
    p_threshold=5e-8,
    top_n=20,
)

print(raw_result["report_text"])

For normal use, analyze_files or analyze_datasets is recommended.

Understanding the result object

analyze_files and analyze_datasets return a MultiGWASResult object.

result = tg.analyze_files(["a.csv", "b.csv"])

The object contains:

result.dataset_names
result.report_text
result.similarity_table
result.consensus_table
result.all_common_table
result.pairwise_full_tables

`dataset_names`

Tuple containing dataset names in input order:

print(result.dataset_names)

`report_text`

A complete plain-text report containing:

original row counts
valid P-value counts
zero P-value counts
significant row counts
unique significant-gene counts
top genes by occurrence count
top genes by weighted evidence score
pairwise overlaps
Jaccard similarity values
cross-dataset consensus results
genes shared by all datasets

print(result.report_text)

Save it:

path = result.save_report("outputs/multi_gwas_report.txt")
print(path)

`similarity_table`

Pairwise Jaccard similarities between significant-gene sets:

print(result.similarity_table)

The Jaccard similarity is:

Jaccard(A, B) = |A intersection B| / |A union B|

Its value ranges from 0 to 1:

0: no shared significant genes
1: identical significant-gene sets

The table has the columns:

A, B, jaccard
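
For intuition, a table with these three columns can be produced from plain Python sets as shown below. This is an independent sketch, not the library's implementation; it assumes one row per unordered pair of datasets and uses made-up gene sets.

```python
from itertools import combinations


def jaccard(a: set, b: set) -> float:
    """Intersection size over union size; 0.0 when both sets are empty."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


# Toy significant-gene sets, invented for illustration.
gene_sets = {
    "Disease A": {"TP53", "APOE", "BRCA1"},
    "Disease B": {"TP53", "APOE", "IL6"},
    "Disease C": {"STAT3"},
}

# One row per unordered pair, in input order.
rows = [
    {"A": name_a, "B": name_b, "jaccard": jaccard(gene_sets[name_a], gene_sets[name_b])}
    for name_a, name_b in combinations(gene_sets, 2)
]

for row in rows:
    print(row)
```

Disease A and Disease B share two of four distinct genes, giving 0.5; both pairs involving Disease C score 0.0.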

`consensus_table`

A gene-level table combining evidence across all input datasets:

print(result.consensus_table.head(20))

For every dataset, it contains:

<dataset>_count: number of significant rows containing the gene
<dataset>_wscore: sum of -log10(P) for rows containing the gene

It also contains:

presence: number of datasets in which the gene is significant
total_count: total significant-row count across all datasets
total_wscore: total weighted score across all datasets

The table is sorted by presence, then total_wscore, then total_count, all in descending order.
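
The column definitions and sort order above can be reproduced on toy data with the standard library. The sketch below is illustrative only: the input rows are invented, and the real table is a Pandas DataFrame built by the library.

```python
import math
from collections import defaultdict

# Toy significant rows per dataset: (gene, P value). Values are made up.
datasets = {
    "A": [("TP53", 1e-10), ("TP53", 1e-9), ("APOE", 1e-12)],
    "B": [("TP53", 1e-8), ("IL6", 1e-20)],
}

# Per-gene, per-dataset counts and weighted scores.
consensus = defaultdict(lambda: defaultdict(float))
for name, rows in datasets.items():
    for gene, p in rows:
        consensus[gene][f"{name}_count"] += 1
        consensus[gene][f"{name}_wscore"] += -math.log10(p)

table = []
for gene, cols in consensus.items():
    counts = [cols[f"{name}_count"] for name in datasets]
    table.append({
        "gene": gene,
        "presence": sum(c > 0 for c in counts),
        "total_count": int(sum(counts)),
        "total_wscore": sum(cols[f"{name}_wscore"] for name in datasets),
    })

# Sort by presence, then total_wscore, then total_count, all descending.
table.sort(
    key=lambda r: (r["presence"], r["total_wscore"], r["total_count"]),
    reverse=True,
)

for row in table:
    print(row)
```

TP53 ranks first because it is the only gene present in both datasets; IL6 precedes APOE because, at equal presence, its weighted score is higher.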

Filter genes present in at least two datasets:

shared = result.consensus_table[
    result.consensus_table["presence"] >= 2
]

print(shared)

Get the ten genes with the strongest combined weighted evidence:

top_weighted = result.consensus_table.nlargest(10, "total_wscore")
print(top_weighted)

`all_common_table`

Genes found in every input dataset:

if result.all_common_table.empty:
    print("No gene is shared by all datasets.")
else:
    print(result.all_common_table)

This table is only applicable when at least two datasets are analyzed.

`pairwise_full_tables`

Dictionary of complete pairwise overlap tables:

for (name_a, name_b), table in result.pairwise_full_tables.items():
    print(f"{name_a} vs {name_b}")
    print(table.head())

Access one specific comparison:

pair_table = result.pairwise_full_tables[
    ("Disease A", "Disease B")
]

If the two datasets have no shared significant genes, the corresponding DataFrame is empty.

Convert the result to a dictionary

result_dict = result.as_dict()

The returned dictionary contains:

report_text
similarity_table
consensus_table
all_common_table
pairwise_full_tables

Save all result tables

written_files = result.save_tables("outputs/tables")

for path in written_files:
    print(path)

Output structure:

outputs/tables/
├── similarity.csv
├── consensus.csv
├── all_common.csv
└── pairwise/
    ├── Disease_A__Disease_B.csv
    └── Disease_A__Disease_C.csv

Similarity heatmap

High-level API

import matplotlib.pyplot as plt
import tardigrade_gwas as tg

result = tg.analyze_files(
    ["disease_a.csv", "disease_b.csv", "disease_c.csv"],
    names=["Disease A", "Disease B", "Disease C"],
)

figure = tg.plot_similarity_heatmap(
    result,
    title="Disease similarity based on significant genes",
    figsize=(9, 7),
)

plt.show()

The plot is not saved automatically. Save it only when needed:

figure.savefig(
    "outputs/similarity_heatmap.png",
    dpi=300,
    bbox_inches="tight",
)

Other formats supported by Matplotlib include PDF and SVG:

figure.savefig("outputs/heatmap.pdf", bbox_inches="tight")
figure.savefig("outputs/heatmap.svg", bbox_inches="tight")

You can pass a custom ordering or subset of names:

figure = tg.plot_similarity_heatmap(
    result,
    names=["Disease C", "Disease A", "Disease B"],
)

Build the similarity matrix without plotting

matrix = tg.build_similarity_matrix(
    result.similarity_table,
    result.dataset_names,
)

print(matrix)
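
Conceptually, this step pivots the long A, B, jaccard table into a square, symmetric matrix. The pure-Python sketch below shows the idea with a hypothetical helper, to_square_matrix; setting the diagonal to 1.0 is an assumption, and the library itself returns a Pandas object rather than nested lists.

```python
def to_square_matrix(rows, names):
    """Hypothetical helper: symmetric matrix from (A, B, jaccard) rows."""
    index = {name: i for i, name in enumerate(names)}
    size = len(names)
    # Assumption: each dataset is fully similar to itself (diagonal = 1.0).
    matrix = [[1.0 if i == j else 0.0 for j in range(size)] for i in range(size)]
    for row in rows:
        i, j = index[row["A"]], index[row["B"]]
        matrix[i][j] = matrix[j][i] = row["jaccard"]
    return matrix


# Invented similarity values for three datasets.
rows = [
    {"A": "Disease A", "B": "Disease B", "jaccard": 0.5},
    {"A": "Disease A", "B": "Disease C", "jaccard": 0.0},
    {"A": "Disease B", "B": "Disease C", "jaccard": 0.25},
]
names = ["Disease A", "Disease B", "Disease C"]

for name, line in zip(names, to_square_matrix(rows, names)):
    print(f"{name:10s}", line)
```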

Draw on an existing Matplotlib axis

import matplotlib.pyplot as plt
import tardigrade_gwas as tg

fig, ax = plt.subplots(figsize=(8, 6))

tg.draw_similarity_heatmap(
    ax,
    similarity_table=result.similarity_table,
    names=result.dataset_names,
    title="Disease similarity (Jaccard)",
)

fig.tight_layout()
plt.show()

Circos-style GWAS plot

High-level API

import matplotlib.pyplot as plt
import tardigrade_gwas as tg

figure = tg.plot_circos(
    "data/disease.csv",
    chromosome_column="CHR_ID",
    position_column="CHR_POS",
    p_column="P-VALUE",
    gene_column="MAPPED_GENE",
    p_threshold=5e-8,
    maximum_points=120_000,
    label_top_n=12,
    random_seed=2026,
    figsize=(10, 10),
)

plt.show()

Save the plot:

figure.savefig(
    "outputs/disease_circos.png",
    dpi=300,
    bbox_inches="tight",
)

Parameters:

Parameter	Default	Description
`chromosome_column`	`CHR_ID`	Chromosome column
`position_column`	`CHR_POS`	Base-pair position column
`p_column`	`P-VALUE`	P-value column
`gene_column`	`MAPPED_GENE`	Gene column used for labels
`p_threshold`	`5e-8`	Significance threshold ring
`maximum_points`	`120000`	Maximum plotted points after downsampling
`label_top_n`	`12`	Number of top unique genes prepared for annotation
`random_seed`	`2026`	Seed controlling deterministic schematic chromosome bands
`figsize`	`(9, 9)`	Figure size in inches

Set label_top_n=0 to disable gene annotations.

Draw on an existing polar axis

import matplotlib.pyplot as plt
import tardigrade_gwas as tg

fig, ax = plt.subplots(
    figsize=(10, 10),
    subplot_kw={"projection": "polar"},
)

label_artists = tg.draw_circular_on_ax(
    gwas_path="data/disease.csv",
    ax=ax,
    p_threshold=5e-8,
    maximum_points=120_000,
    label_top_n=12,
)

fig.tight_layout()
plt.show()

draw_circular_on_ax returns a list of GeneLabelArtist objects. Their annotations are initially hidden. To display all generated labels:

for artist in label_artists:
    artist.annotation.set_visible(True)

fig.canvas.draw_idle()

Circos data rules

Supported chromosomes: 1 through 22, X, and Y
Values such as chr1, chrX, and 1.0 are normalized
Positions must be numeric, greater than zero, and within the hg38 chromosome length
P values must be numeric and between 0 and 1
The plot uses hg38 chromosome lengths
The chromosome band pattern is schematic and is not a true cytoband ideogram
maximum_points reduces the number of plotted points, but the file is still read into memory before downsampling
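
The chromosome normalization rule in the list above can be sketched as follows. The helper normalize_chromosome is hypothetical and only mirrors the documented examples (chr1, chrX, 1.0); how the library treats other spellings is not specified here.

```python
# Supported chromosomes: 1 through 22, X, and Y.
VALID = {str(n) for n in range(1, 23)} | {"X", "Y"}


def normalize_chromosome(value):
    """Hypothetical helper: return '1'..'22', 'X', or 'Y', else None."""
    text = str(value).strip().upper()
    if text.startswith("CHR"):
        text = text[3:]
    # Values read as floats, such as 1.0, lose the trailing ".0".
    if text.endswith(".0"):
        text = text[:-2]
    return text if text in VALID else None


for raw in ["chr1", "chrX", 1.0, "17", "chrM", "23"]:
    print(repr(raw), "->", normalize_chromosome(raw))
```

In this sketch, unsupported values such as chrM or 23 map to None so that the caller can drop those rows before plotting.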

Read a GWAS table directly

import tardigrade_gwas as tg

df = tg.read_any_table("data/disease.tsv.gz")

print(df.head())
print(df.columns.tolist())
print(df.shape)

The function returns a Pandas DataFrame.

Gene parsing and utility functions

Split a gene field

import tardigrade_gwas as tg

genes = tg.split_genes("TP53; BRCA1, APOE")
print(genes)

Output:

["TP53", "BRCA1", "APOE"]

Commas and semicolons are supported. Duplicate genes within one cell are removed while original order is preserved.

The following tokens are treated as missing values:

"", -, ., NA, N/A, NAN, NONE, NULL, NR
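
The splitting rules described above can be approximated in a few lines. The function split_gene_field below is a hypothetical re-implementation for illustration, not the source of tg.split_genes; matching missing-value tokens case-insensitively is an assumption.

```python
import re

# Missing-value tokens listed above.
MISSING = {"", "-", ".", "NA", "N/A", "NAN", "NONE", "NULL", "NR"}


def split_gene_field(field):
    """Hypothetical re-implementation of the behaviour described above."""
    genes = []
    for token in re.split(r"[;,]", str(field)):
        token = token.strip()
        # Assumption: missing-value tokens are matched case-insensitively.
        if token.upper() in MISSING or token in genes:
            continue
        genes.append(token)
    return genes


print(split_gene_field("TP53; BRCA1, APOE"))  # ['TP53', 'BRCA1', 'APOE']
print(split_gene_field("IL6;NA;IL6, STAT3"))  # ['IL6', 'STAT3']
```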

Extract a unique gene set

gene_set = tg.extract_gene_set(df["MAPPED_GENE"])
print(len(gene_set))

Count gene occurrences

counts = tg.extract_gene_counts(df["MAPPED_GENE"])
print(counts.head(20))

Calculate weighted gene scores

scores = tg.extract_gene_weighted_scores(
    df,
    p_column="P-VALUE",
    gene_column="MAPPED_GENE",
)

print(scores.head(20))

For each valid row, the weight is:

-log10(P-value)

If a row contains multiple genes, the complete row weight is assigned to every listed gene; it is not divided among them.
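
A small worked example of this rule, using only the standard library and made-up rows, shows how a multi-gene row contributes its full weight to each gene:

```python
import math
from collections import defaultdict

# (P value, genes parsed from the row). Values are illustrative.
rows = [
    (1.2e-10, ["TP53"]),
    (7.1e-12, ["IL6", "STAT3"]),  # both genes receive the full row weight
    (4.8e-9, ["IL6"]),
]

scores = defaultdict(float)
for p_value, genes in rows:
    weight = -math.log10(p_value)
    for gene in genes:
        scores[gene] += weight  # not divided by len(genes)

for gene, score in sorted(scores.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{gene}: {score:.2f}")
```

IL6 accumulates the weight of two rows, while STAT3 receives exactly the weight of the shared row, the same amount IL6 receives from it.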

Calculate Jaccard similarity directly

set_a = {"TP53", "APOE", "BRCA1"}
set_b = {"TP53", "APOE", "IL6"}

score = tg.jaccard_similarity(set_a, set_b)
print(score)  # 0.5

If both sets are empty, the function returns 0.0.

Custom column names

Suppose your table uses:

PVALUE, GENE_SYMBOL, CHROMOSOME, BASE_PAIR

Analyze it with:

result = tg.analyze_files(
    ["custom.csv"],
    p_column="PVALUE",
    gene_column="GENE_SYMBOL",
)

Create a Circos-style plot with:

figure = tg.plot_circos(
    "custom.csv",
    p_column="PVALUE",
    gene_column="GENE_SYMBOL",
    chromosome_column="CHROMOSOME",
    position_column="BASE_PAIR",
)

Error handling

import tardigrade_gwas as tg

try:
    result = tg.analyze_files(["a.csv", "b.csv"])
except FileNotFoundError as error:
    print(f"File error: {error}")
except ValueError as error:
    print(f"Input error: {error}")

Common errors include:

file not found
empty input file
malformed CSV/TSV table
missing required column
duplicate explicit dataset name
number of supplied names not matching the number of files
invalid p_threshold
invalid top_n
no valid chromosome/position/P-value rows for Circos plotting

Complete workflow example

from pathlib import Path
import matplotlib.pyplot as plt
import tardigrade_gwas as tg

output_dir = Path("outputs")
output_dir.mkdir(parents=True, exist_ok=True)

files = [
    Path("data/alzheimer.csv"),
    Path("data/parkinson.csv"),
    Path("data/multiple_sclerosis.tsv"),
]

names = [
    "Alzheimer Disease",
    "Parkinson Disease",
    "Multiple Sclerosis",
]

result = tg.analyze_files(
    files,
    names=names,
    p_column="P-VALUE",
    gene_column="MAPPED_GENE",
    p_threshold=5e-8,
    top_n=20,
)

print(result.report_text)

result.save_report(output_dir / "multi_gwas_report.txt")
result.save_tables(output_dir / "tables")

heatmap = tg.plot_similarity_heatmap(
    result,
    title="GWAS gene-set similarity",
    figsize=(9, 7),
)
heatmap.savefig(
    output_dir / "similarity_heatmap.png",
    dpi=300,
    bbox_inches="tight",
)

# zip(..., strict=True) requires Python 3.10 or newer.
for file_path, dataset_name in zip(files, names, strict=True):
    safe_name = dataset_name.lower().replace(" ", "_")

    circos = tg.plot_circos(
        file_path,
        p_threshold=5e-8,
        maximum_points=120_000,
        label_top_n=12,
        figsize=(10, 10),
    )

    circos.savefig(
        output_dir / f"{safe_name}_circos.png",
        dpi=300,
        bbox_inches="tight",
    )

plt.show()

Public API summary

Analysis

tg.analyze_files
tg.analyze_datasets
tg.run_multi_gwas

Input

tg.read_any_table

Gene utilities

tg.split_genes
tg.extract_gene_set
tg.extract_gene_counts
tg.extract_gene_weighted_scores
tg.jaccard_similarity

Plotting

tg.plot_similarity_heatmap
tg.plot_circos
tg.build_similarity_matrix
tg.draw_similarity_heatmap
tg.draw_circular_on_ax

Defaults

tg.DEFAULT_P_THRESHOLD
tg.DEFAULT_P_COL
tg.DEFAULT_GENE_COL
tg.DEFAULT_CHR_COL
tg.DEFAULT_BP_COL
tg.DEFAULT_MAX_PLOT_POINTS

Scientific and technical limitations

Gene symbols are compared as exact strings.
Gene aliases are not harmonized automatically; for example, PARK2 and PRKN remain distinct.
Gene-symbol capitalization is not normalized; TP53 and tp53 may be treated as different genes.
Genome-build conversion is not performed.
Circos positions are validated against hg38 chromosome lengths.
Circos chromosome bands are schematic, not true cytogenetic bands.
Zero P values are replaced internally by the smallest positive floating-point value when calculating -log10(P); this can produce very large weighted scores.
Tables are loaded into memory with Pandas; very large GWAS files may require substantial RAM.
The package does not currently perform LD clumping, meta-analysis, pathway enrichment, gene ontology analysis, SNP annotation, liftover, QQ plots, or Manhattan plots.
Outputs should be interpreted in the context of the source datasets and their original study designs.
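
To give a sense of scale for the zero-P-value rule in the list above, the sketch below compares the weight of a row at the default threshold with the weight of a zero P value. It assumes that "smallest positive floating-point value" means the smallest normal double, sys.float_info.min; if the library instead uses the smallest subnormal value, the resulting weight would be somewhat larger still.

```python
import math
import sys

# Assumption: the replacement value is the smallest normal double
# (about 2.2e-308), exposed by Python as sys.float_info.min.
floor = sys.float_info.min


def row_weight(p_value: float) -> float:
    """-log10(P), with zero replaced by the assumed floor."""
    return -math.log10(p_value if p_value > 0 else floor)


print(round(row_weight(5e-8), 2))  # 7.3
print(round(row_weight(0.0), 2))   # 307.65
```

Under this assumption a single zero-P row outweighs about forty rows at the 5e-8 threshold, so genes with zero P values can dominate rankings based on total_wscore.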

Authors

Amir Izadi
Zahra Aghabeygi

License

Tardigrade GWAS is distributed under the MIT License.

Release history

1.0.1 (this version): Jun 14, 2026
1.0.0: Jun 14, 2026

Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

tardigrade_gwas-1.0.1.tar.gz (26.6 kB view details)

Uploaded Jun 14, 2026 Source

Built Distribution

If you're not sure about the file name format, learn more about wheel file names.

The dropdown lists show the available interpreters, ABIs, and platforms. Enable javascript to be able to filter the list of wheel files.

tardigrade_gwas-1.0.1-py3-none-any.whl (24.4 kB view details)

Uploaded Jun 14, 2026 Python 3

File details

Details for the file tardigrade_gwas-1.0.1.tar.gz.

File metadata

Download URL: tardigrade_gwas-1.0.1.tar.gz
Upload date: Jun 14, 2026
Size: 26.6 kB
Tags: Source
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.2.0 CPython/3.12.3

File hashes

Hashes for tardigrade_gwas-1.0.1.tar.gz
Algorithm	Hash digest
SHA256	`422efb5219de6b0004de724a72b7d3ce7230c0555ccf82b2228a8b3f7ba847f2`
MD5	`42c7cdf12e0be5abeb494f0f45f4043c`
BLAKE2b-256	`57f469bbf6b6bb005d65957a92368a71fb0a6f4db74708041cc2cab31d02ad9d`

See more details on using hashes here.

File details

Details for the file tardigrade_gwas-1.0.1-py3-none-any.whl.

File metadata

Download URL: tardigrade_gwas-1.0.1-py3-none-any.whl
Upload date: Jun 14, 2026
Size: 24.4 kB
Tags: Python 3
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.2.0 CPython/3.12.3

File hashes

Hashes for tardigrade_gwas-1.0.1-py3-none-any.whl
Algorithm	Hash digest
SHA256	`77984db65acea2e4b7d8ffb1a72ecd448ba62fb3638f00715dbadadee7babe26`
MD5	`e158f0b38eeb7a37f2e7aa3af231ed05`
BLAKE2b-256	`34e704e930e41251288d066bd3e6bf6f98509d17652a6c013031cf0e354394be`

See more details on using hashes here.

tardigrade-gwas 1.0.1

Navigation

Verified details

Maintainers

Unverified details

Meta

Classifiers

Project description

Tardigrade GWAS

Features

Installation

Quick start

Input data

Supported file types

Minimum columns for overlap analysis

Columns required for Circos-style plots

Recognized column aliases

Main analysis API

analyze_files

analyze_datasets

run_multi_gwas

Understanding the result object

dataset_names

report_text

similarity_table

consensus_table

all_common_table

pairwise_full_tables

Convert the result to a dictionary

Save all result tables

Similarity heatmap

High-level API

Build the similarity matrix without plotting

Draw on an existing Matplotlib axis

Circos-style GWAS plot

High-level API

Draw on an existing polar axis

Circos data rules

Read a GWAS table directly

Gene parsing and utility functions

Split a gene field

Extract a unique gene set

Count gene occurrences

Calculate weighted gene scores

Calculate Jaccard similarity directly

Custom column names

Error handling

Complete workflow example

Public API summary

Analysis

Input

Gene utilities

Plotting

Defaults

Scientific and technical limitations

Authors

License

Project details

Verified details

Maintainers

Unverified details

Meta

Classifiers

Release history Release notifications | RSS feed

Download files

Source Distribution

Built Distribution

File details

File metadata

File hashes

File details

File metadata

File hashes

`analyze_files`

`analyze_datasets`

`run_multi_gwas`

`dataset_names`

`report_text`

`similarity_table`

`consensus_table`

`all_common_table`

`pairwise_full_tables`