Multi-GWAS gene-overlap analysis, similarity heatmaps, Circos-style visualization, and reporting.
Project description
Tardigrade GWAS
Tardigrade GWAS is a lightweight Python library for comparing gene-level signals across multiple GWAS summary-statistics files. It provides significant-gene extraction, pairwise overlap analysis, Jaccard similarity, cross-dataset consensus scoring, report generation, similarity heatmaps, and Circos-style GWAS visualization.
Developed by Amir Izadi and Zahra Aghabeygi.
Tardigrade GWAS is a research utility. It is not a clinical diagnostic system and should not be used as the sole basis for medical decisions.
Features
- Read CSV, TSV, TXT,
.csv.gz, and.tsv.gzGWAS tables - Detect common GWAS column aliases case-insensitively
- Filter associations by a configurable P-value threshold
- Parse multi-gene fields separated by commas or semicolons
- Count significant-gene occurrences per dataset
- Calculate gene evidence scores using
sum(-log10(P)) - Calculate pairwise gene-set intersections
- Calculate pairwise Jaccard similarity
- Build a cross-dataset consensus table
- Find genes shared across all datasets
- Generate text reports and CSV result tables
- Generate Jaccard similarity heatmaps
- Generate Circos-style chromosome plots
- Return standard Pandas DataFrames and Matplotlib Figures for further customization
Installation
pip install tardigrade-gwas
Upgrade to the latest available version:
pip install --upgrade tardigrade-gwas
Check the installed version:
import tardigrade_gwas as tg
print(tg.__version__)
Quick start
import tardigrade_gwas as tg
result = tg.analyze_files(
["disease_a.csv", "disease_b.tsv"],
names=["Disease A", "Disease B"],
)
print(result.report_text)
print(result.similarity_table)
print(result.consensus_table.head())
Save the generated report and all result tables:
result.save_report("outputs/report.txt")
result.save_tables("outputs/tables")
Input data
Supported file types
.csv.tsv.txt.csv.gz.tsv.gz
For .txt files or files without a recognized CSV/TSV extension, Tardigrade attempts to detect comma or tab delimiters automatically.
Minimum columns for overlap analysis
The default analysis columns are:
| Purpose | Default column |
|---|---|
| P value | P-VALUE |
| Gene annotation | MAPPED_GENE |
Example:
P-VALUE,MAPPED_GENE
1.2e-10,TP53
4.8e-9,BRCA1
2.3e-6,APOE
7.1e-12,"IL6;STAT3"
Only rows satisfying P-value <= p_threshold are included in the significant-gene analysis. The default threshold is 5e-8.
Columns required for Circos-style plots
| Purpose | Default column |
|---|---|
| Chromosome | CHR_ID |
| Base-pair position | CHR_POS |
| P value | P-VALUE |
| Gene annotation | MAPPED_GENE |
Example:
P-VALUE,MAPPED_GENE,CHR_ID,CHR_POS
1.2e-10,TP53,17,7676594
4.8e-9,BRCA1,17,43044295
3.1e-12,APOE,19,44908684
Recognized column aliases
Column matching is case-insensitive.
| Data type | Recognized names |
|---|---|
| P value | P-VALUE, P, PVALUE, P_VALUE |
| Gene | MAPPED_GENE, REPORTED GENE(S), GENE, GENES |
| Chromosome | CHR_ID, CHR, CHROM, CHROMOSOME |
| Position | CHR_POS, BP, POS, POSITION, BASE_PAIR_LOCATION |
Custom column names can always be passed explicitly.
Main analysis API
analyze_files
Use analyze_files when you have a list of file paths.
from pathlib import Path
import tardigrade_gwas as tg
result = tg.analyze_files(
files=[
Path("data/alzheimer.csv"),
Path("data/parkinson.csv"),
Path("data/multiple_sclerosis.tsv"),
],
names=[
"Alzheimer Disease",
"Parkinson Disease",
"Multiple Sclerosis",
],
p_column="P-VALUE",
gene_column="MAPPED_GENE",
p_threshold=5e-8,
top_n=20,
)
Parameters:
| Parameter | Type | Default | Description |
|---|---|---|---|
files |
sequence of paths | required | GWAS files to analyze |
names |
sequence of strings or None |
None |
Display names for the datasets |
p_column |
string | P-VALUE |
Preferred P-value column |
gene_column |
string | MAPPED_GENE |
Preferred gene column |
p_threshold |
float | 5e-8 |
Significance threshold in (0, 1] |
top_n |
integer | 20 |
Number of top genes shown in report sections |
When names is omitted, names are derived from filenames. Duplicate derived names receive suffixes such as _2, _3, and so on.
analyze_datasets
Use analyze_datasets when each dataset already has an explicit display name.
import tardigrade_gwas as tg
result = tg.analyze_datasets(
[
("Alzheimer Disease", "data/ad.csv"),
("Parkinson Disease", "data/pd.csv"),
],
p_threshold=5e-8,
)
Dataset names must be non-empty and unique.
run_multi_gwas
run_multi_gwas is the lower-level analysis function. It returns a dictionary rather than a MultiGWASResult object.
raw_result = tg.run_multi_gwas(
datasets=[
("Disease A", "a.csv"),
("Disease B", "b.csv"),
],
p_column="P-VALUE",
gene_column="MAPPED_GENE",
p_threshold=5e-8,
top_n=20,
)
print(raw_result["report_text"])
For normal use, analyze_files or analyze_datasets is recommended.
Understanding the result object
analyze_files and analyze_datasets return a MultiGWASResult object.
result = tg.analyze_files(["a.csv", "b.csv"])
The object contains:
result.dataset_names
result.report_text
result.similarity_table
result.consensus_table
result.all_common_table
result.pairwise_full_tables
dataset_names
Tuple containing dataset names in input order:
print(result.dataset_names)
report_text
A complete plain-text report containing:
- original row counts
- valid P-value counts
- zero P-value counts
- significant row counts
- unique significant-gene counts
- top genes by occurrence count
- top genes by weighted evidence score
- pairwise overlaps
- Jaccard similarity values
- cross-dataset consensus results
- genes shared by all datasets
print(result.report_text)
Save it:
path = result.save_report("outputs/multi_gwas_report.txt")
print(path)
similarity_table
Pairwise Jaccard similarities between significant-gene sets:
print(result.similarity_table)
The Jaccard similarity is:
Jaccard(A, B) = |A intersection B| / |A union B|
Its value ranges from 0 to 1:
0: no shared significant genes1: identical significant-gene sets
The table has the columns:
A, B, jaccard
consensus_table
A gene-level table combining evidence across all input datasets:
print(result.consensus_table.head(20))
For every dataset, it contains:
<dataset>_count: number of significant rows containing the gene<dataset>_wscore: sum of-log10(P)for rows containing the gene
It also contains:
presence: number of datasets in which the gene is significanttotal_count: total significant-row count across all datasetstotal_wscore: total weighted score across all datasets
The table is sorted by presence, then total_wscore, then total_count, all in descending order.
Filter genes present in at least two datasets:
shared = result.consensus_table[
result.consensus_table["presence"] >= 2
]
print(shared)
Get the ten genes with the strongest combined weighted evidence:
top_weighted = result.consensus_table.nlargest(10, "total_wscore")
print(top_weighted)
all_common_table
Genes found in every input dataset:
if result.all_common_table.empty:
print("No gene is shared by all datasets.")
else:
print(result.all_common_table)
This table is only applicable when at least two datasets are analyzed.
pairwise_full_tables
Dictionary of complete pairwise overlap tables:
for (name_a, name_b), table in result.pairwise_full_tables.items():
print(f"{name_a} vs {name_b}")
print(table.head())
Access one specific comparison:
pair_table = result.pairwise_full_tables[
("Disease A", "Disease B")
]
If the two datasets have no shared significant genes, the corresponding DataFrame is empty.
Convert the result to a dictionary
result_dict = result.as_dict()
The returned dictionary contains:
report_text
similarity_table
consensus_table
all_common_table
pairwise_full_tables
Save all result tables
written_files = result.save_tables("outputs/tables")
for path in written_files:
print(path)
Output structure:
outputs/tables/
├── similarity.csv
├── consensus.csv
├── all_common.csv
└── pairwise/
├── Disease_A__Disease_B.csv
└── Disease_A__Disease_C.csv
Similarity heatmap
High-level API
import matplotlib.pyplot as plt
import tardigrade_gwas as tg
result = tg.analyze_files(
["disease_a.csv", "disease_b.csv", "disease_c.csv"],
names=["Disease A", "Disease B", "Disease C"],
)
figure = tg.plot_similarity_heatmap(
result,
title="Disease similarity based on significant genes",
figsize=(9, 7),
)
plt.show()
The plot is not saved automatically. Save it only when needed:
figure.savefig(
"outputs/similarity_heatmap.png",
dpi=300,
bbox_inches="tight",
)
Other formats supported by Matplotlib include PDF and SVG:
figure.savefig("outputs/heatmap.pdf", bbox_inches="tight")
figure.savefig("outputs/heatmap.svg", bbox_inches="tight")
You can pass a custom ordering or subset of names:
figure = tg.plot_similarity_heatmap(
result,
names=["Disease C", "Disease A", "Disease B"],
)
Build the similarity matrix without plotting
matrix = tg.build_similarity_matrix(
result.similarity_table,
result.dataset_names,
)
print(matrix)
Draw on an existing Matplotlib axis
import matplotlib.pyplot as plt
import tardigrade_gwas as tg
fig, ax = plt.subplots(figsize=(8, 6))
tg.draw_similarity_heatmap(
ax,
similarity_table=result.similarity_table,
names=result.dataset_names,
title="Disease similarity (Jaccard)",
)
fig.tight_layout()
plt.show()
Circos-style GWAS plot
High-level API
import matplotlib.pyplot as plt
import tardigrade_gwas as tg
figure = tg.plot_circos(
"data/disease.csv",
chromosome_column="CHR_ID",
position_column="CHR_POS",
p_column="P-VALUE",
gene_column="MAPPED_GENE",
p_threshold=5e-8,
maximum_points=120_000,
label_top_n=12,
random_seed=2026,
figsize=(10, 10),
)
plt.show()
Save the plot:
figure.savefig(
"outputs/disease_circos.png",
dpi=300,
bbox_inches="tight",
)
Parameters:
| Parameter | Default | Description |
|---|---|---|
chromosome_column |
CHR_ID |
Chromosome column |
position_column |
CHR_POS |
Base-pair position column |
p_column |
P-VALUE |
P-value column |
gene_column |
MAPPED_GENE |
Gene column used for labels |
p_threshold |
5e-8 |
Significance threshold ring |
maximum_points |
120000 |
Maximum plotted points after downsampling |
label_top_n |
12 |
Number of top unique genes prepared for annotation |
random_seed |
2026 |
Seed controlling deterministic schematic chromosome bands |
figsize |
(9, 9) |
Figure size in inches |
Set label_top_n=0 to disable gene annotations.
Draw on an existing polar axis
import matplotlib.pyplot as plt
import tardigrade_gwas as tg
fig, ax = plt.subplots(
figsize=(10, 10),
subplot_kw={"projection": "polar"},
)
label_artists = tg.draw_circular_on_ax(
gwas_path="data/disease.csv",
ax=ax,
p_threshold=5e-8,
maximum_points=120_000,
label_top_n=12,
)
fig.tight_layout()
plt.show()
draw_circular_on_ax returns a list of GeneLabelArtist objects. Their annotations are initially hidden. To display all generated labels:
for artist in label_artists:
artist.annotation.set_visible(True)
fig.canvas.draw_idle()
Circos data rules
- Supported chromosomes:
1through22,X, andY - Values such as
chr1,chrX, and1.0are normalized - Positions must be numeric, greater than zero, and within the hg38 chromosome length
- P values must be numeric and between
0and1 - The plot uses hg38 chromosome lengths
- The chromosome band pattern is schematic and is not a true cytoband ideogram
maximum_pointsreduces the number of plotted points, but the file is still read into memory before downsampling
Read a GWAS table directly
import tardigrade_gwas as tg
df = tg.read_any_table("data/disease.tsv.gz")
print(df.head())
print(df.columns.tolist())
print(df.shape)
The function returns a Pandas DataFrame.
Gene parsing and utility functions
Split a gene field
import tardigrade_gwas as tg
genes = tg.split_genes("TP53; BRCA1, APOE")
print(genes)
Output:
["TP53", "BRCA1", "APOE"]
Commas and semicolons are supported. Duplicate genes within one cell are removed while original order is preserved.
The following tokens are treated as missing values:
"", -, ., NA, N/A, NAN, NONE, NULL, NR
Extract a unique gene set
gene_set = tg.extract_gene_set(df["MAPPED_GENE"])
print(len(gene_set))
Count gene occurrences
counts = tg.extract_gene_counts(df["MAPPED_GENE"])
print(counts.head(20))
Calculate weighted gene scores
scores = tg.extract_gene_weighted_scores(
df,
p_column="P-VALUE",
gene_column="MAPPED_GENE",
)
print(scores.head(20))
For each valid row, the weight is:
-log10(P-value)
If a row contains multiple genes, the complete row weight is assigned to every listed gene; it is not divided among them.
Calculate Jaccard similarity directly
set_a = {"TP53", "APOE", "BRCA1"}
set_b = {"TP53", "APOE", "IL6"}
score = tg.jaccard_similarity(set_a, set_b)
print(score) # 0.5
If both sets are empty, the function returns 0.0.
Custom column names
Suppose your table uses:
PVALUE, GENE_SYMBOL, CHROMOSOME, BASE_PAIR
Analyze it with:
result = tg.analyze_files(
["custom.csv"],
p_column="PVALUE",
gene_column="GENE_SYMBOL",
)
Create a Circos-style plot with:
figure = tg.plot_circos(
"custom.csv",
p_column="PVALUE",
gene_column="GENE_SYMBOL",
chromosome_column="CHROMOSOME",
position_column="BASE_PAIR",
)
Error handling
import tardigrade_gwas as tg
try:
result = tg.analyze_files(["a.csv", "b.csv"])
except FileNotFoundError as error:
print(f"File error: {error}")
except ValueError as error:
print(f"Input error: {error}")
Common errors include:
- file not found
- empty input file
- malformed CSV/TSV table
- missing required column
- duplicate explicit dataset name
- number of supplied names not matching the number of files
- invalid
p_threshold - invalid
top_n - no valid chromosome/position/P-value rows for Circos plotting
Complete workflow example
from pathlib import Path
import matplotlib.pyplot as plt
import tardigrade_gwas as tg
output_dir = Path("outputs")
output_dir.mkdir(parents=True, exist_ok=True)
files = [
Path("data/alzheimer.csv"),
Path("data/parkinson.csv"),
Path("data/multiple_sclerosis.tsv"),
]
names = [
"Alzheimer Disease",
"Parkinson Disease",
"Multiple Sclerosis",
]
result = tg.analyze_files(
files,
names=names,
p_column="P-VALUE",
gene_column="MAPPED_GENE",
p_threshold=5e-8,
top_n=20,
)
print(result.report_text)
result.save_report(output_dir / "multi_gwas_report.txt")
result.save_tables(output_dir / "tables")
heatmap = tg.plot_similarity_heatmap(
result,
title="GWAS gene-set similarity",
figsize=(9, 7),
)
heatmap.savefig(
output_dir / "similarity_heatmap.png",
dpi=300,
bbox_inches="tight",
)
for file_path, dataset_name in zip(files, names, strict=True):
safe_name = dataset_name.lower().replace(" ", "_")
circos = tg.plot_circos(
file_path,
p_threshold=5e-8,
maximum_points=120_000,
label_top_n=12,
figsize=(10, 10),
)
circos.savefig(
output_dir / f"{safe_name}_circos.png",
dpi=300,
bbox_inches="tight",
)
plt.show()
Public API summary
Analysis
tg.analyze_files
tg.analyze_datasets
tg.run_multi_gwas
Input
tg.read_any_table
Gene utilities
tg.split_genes
tg.extract_gene_set
tg.extract_gene_counts
tg.extract_gene_weighted_scores
tg.jaccard_similarity
Plotting
tg.plot_similarity_heatmap
tg.plot_circos
tg.build_similarity_matrix
tg.draw_similarity_heatmap
tg.draw_circular_on_ax
Defaults
tg.DEFAULT_P_THRESHOLD
tg.DEFAULT_P_COL
tg.DEFAULT_GENE_COL
tg.DEFAULT_CHR_COL
tg.DEFAULT_BP_COL
tg.DEFAULT_MAX_PLOT_POINTS
Scientific and technical limitations
- Gene symbols are compared as exact strings.
- Gene aliases are not harmonized automatically; for example,
PARK2andPRKNremain distinct. - Gene-symbol capitalization is not normalized;
TP53andtp53may be treated as different genes. - Genome-build conversion is not performed.
- Circos positions are validated against hg38 chromosome lengths.
- Circos chromosome bands are schematic, not true cytogenetic bands.
- Zero P values are replaced internally by the smallest positive floating-point value when calculating
-log10(P); this can produce very large weighted scores. - Tables are loaded into memory with Pandas; very large GWAS files may require substantial RAM.
- The package does not currently perform LD clumping, meta-analysis, pathway enrichment, gene ontology analysis, SNP annotation, liftover, QQ plots, or Manhattan plots.
- Outputs should be interpreted in the context of the source datasets and their original study designs.
Authors
- Amir Izadi
- Zahra Aghabeygi
License
Tardigrade GWAS is distributed under the MIT License.
Project details
Release history Release notifications | RSS feed
Download files
Download the file for your platform. If you're not sure which to choose, learn more about installing packages.
Source Distribution
Built Distribution
Filter files by name, interpreter, ABI, and platform.
If you're not sure about the file name format, learn more about wheel file names.
Copy a direct link to the current filters
File details
Details for the file tardigrade_gwas-1.0.1.tar.gz.
File metadata
- Download URL: tardigrade_gwas-1.0.1.tar.gz
- Upload date:
- Size: 26.6 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.12.3
File hashes
| Algorithm | Hash digest | |
|---|---|---|
| SHA256 |
422efb5219de6b0004de724a72b7d3ce7230c0555ccf82b2228a8b3f7ba847f2
|
|
| MD5 |
42c7cdf12e0be5abeb494f0f45f4043c
|
|
| BLAKE2b-256 |
57f469bbf6b6bb005d65957a92368a71fb0a6f4db74708041cc2cab31d02ad9d
|
File details
Details for the file tardigrade_gwas-1.0.1-py3-none-any.whl.
File metadata
- Download URL: tardigrade_gwas-1.0.1-py3-none-any.whl
- Upload date:
- Size: 24.4 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.12.3
File hashes
| Algorithm | Hash digest | |
|---|---|---|
| SHA256 |
77984db65acea2e4b7d8ffb1a72ecd448ba62fb3638f00715dbadadee7babe26
|
|
| MD5 |
e158f0b38eeb7a37f2e7aa3af231ed05
|
|
| BLAKE2b-256 |
34e704e930e41251288d066bd3e6bf6f98509d17652a6c013031cf0e354394be
|