Analyze centromere variation from short-read data using rare k-mers in alpha satellite sequences

Maintainers

friend1ws

Maruyama cairn at Mt. Karamatsu

ascairn

ascairn (alpha-satellite cairn) is software for estimating centromere variation from short-read sequencing data using rare k-mers within centromere sequences. For each chromosome, ascairn identifies the most likely pair of active alpha satellite higher-order repeat (HOR) haplogroups and the nearest proxy haplotypes from a reference panel (Shiraishi et al., bioRxiv, 2025).

ascairn accepts BAM/CRAM files aligned to either GRCh38 (hg38) or T2T-CHM13 (chm13). Centromere region BED files for each alignment reference are included in the resource repository.

Background

Human centromeres are composed largely of chromosome-specific alpha satellite higher-order repeat (HOR) arrays. The active alpha satellite HOR arrays (aHOR arrays), which are associated with CENP-A and kinetochore formation, show extensive sequence and structural variation among individuals. Because aHOR arrays are long and highly repetitive, they have historically been difficult to analyze with conventional short-read sequencing.

Active alpha satellite HOR arrays and structural diversity among centromere haplotypes

aHOR-haps comprise the aHOR array and flanking regions, and vary in size, KAS position, HOR pattern, and structural variants.

Long-read assemblies have recently revealed many complete centromeric haplotypes. However, applying long-read sequencing to thousands of population-scale or clinical samples remains costly and often impractical. ascairn addresses this gap by using rare k-mers within centromeric alpha satellite arrays as short-read-detectable markers of centromere haplotype structure.

What is a centromere haplogroup?

In ascairn, a centromere haplogroup is a chromosome-specific cluster of active alpha satellite HOR haplotypes (aHOR-haps) that share similar rare k-mer profiles, which serve as lineage-specific markers. These haplogroups are inferred from a reference panel of assembled aHOR-haps (the reference aHOR-hap panel) and often correspond to evolutionarily related centromere lineages with distinct structural features, such as differences in HOR organization, array size, or large structural variants.

Centromere haplogroups are assigned separately for each chromosome, describing variation at each individual centromere.

Rare k-mer based classification of centromere haplotypes

For more background, see Centromere haplogroups.

What ascairn does

ascairn infers aHOR haplogroups (aHOR-HGs) from short-read whole-genome sequencing data. It extracts reads aligned to centromeric alpha satellite regions, counts a predefined set of rare k-mers to build a count profile, and uses a probabilistic model to identify the most likely aHOR-HG pair for each chromosome. It also reports the closest proxy aHOR-hap pair from the reference aHOR-hap panel.

The reference aHOR-hap panel was constructed in advance by extracting centromeric sequences from publicly available long-read assemblies (HPRC, HGSVC, T2T Consortium) and is distributed through the ascairn_resource repository.

The main outputs are:

the best-matching aHOR-HG (cluster) pair,
the nearest proxy aHOR-hap pair,
marker-level probability tables supporting the assignment.

ascairn does not assemble centromeres de novo. Its results depend on sequencing depth, read alignment quality, and the representation of related aHOR-haps in the supplied reference panel.
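The idea of picking the most likely pair can be pictured with a toy model. The sketch below is not ascairn's model (see the paper for that); it assumes, purely for illustration, that each rare k-mer count is Poisson-distributed with a mean equal to the haploid depth times the summed copy number of the two candidate haplogroups. The names `panel`, `pair_loglik`, `best_pair`, and the background rate `eps` are all hypothetical.

```python
import math
from itertools import combinations_with_replacement


def poisson_logpmf(k, lam):
    """Log-probability of observing count k under Poisson(lam)."""
    return k * math.log(lam) - lam - math.lgamma(k + 1)


def pair_loglik(counts, copies_a, copies_b, haploid_depth, eps=0.01):
    """Log-likelihood of the observed k-mer counts for one candidate pair.

    eps is an assumed small background rate, so a marker with zero
    expected copies does not lead to log(0).
    """
    total = 0.0
    for k, ca, cb in zip(counts, copies_a, copies_b):
        lam = haploid_depth * (ca + cb + eps)
        total += poisson_logpmf(k, lam)
    return total


def best_pair(counts, panel, haploid_depth):
    """Return (loglik, (name_a, name_b)) for the best-scoring unordered pair."""
    scored = []
    for a, b in combinations_with_replacement(sorted(panel), 2):
        scored.append((pair_loglik(counts, panel[a], panel[b], haploid_depth), (a, b)))
    return max(scored)


# Toy panel (made-up values): copy number of three marker k-mers per haplogroup.
panel = {
    "HG_A": [1, 0, 0],
    "HG_B": [0, 1, 0],
    "HG_C": [0, 0, 2],
}
counts = [16, 14, 0]  # made-up observed counts for the three markers
loglik, pair = best_pair(counts, panel, haploid_depth=15.0)
print(pair, round(loglik, 2))
```

Scoring unordered pairs (including a haplogroup paired with itself) mirrors the diploid setting described above; the single-haplotype case for male chrX would score single candidates instead.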

For more on the applications of ascairn, see Applications.

Overview of the ascairn workflow. Short-read WGS data are processed to count rare k-mers in centromeric regions; a probabilistic model is then used to infer the most likely aHOR-HG pair and the nearest proxy aHOR-hap pair from the reference panel.

Prerequisites

Software

Python packages

click
scipy
polars
boto3 (required only for accessing CRAM files in Amazon S3)

Installation

Install the prerequisite software and make sure each tool is accessible via your PATH.
Install ascairn from PyPI.

pip install ascairn

To enable access to CRAM files on Amazon S3, install with the s3 extra:

pip install ascairn[s3]

Or, to install from source (for development):

git clone https://github.com/friend1ws/ascairn.git
cd ascairn
pip install -e .

Download the ascairn resource files from the ascairn_resource repository. It contains the reference data required by ascairn, including:
- Rare k-mer list for alpha satellite sequences (rare_kmer_list.fa)
- Centromere region BED files for hg38 and chm13
- Per-chromosome k-mer information (kmer_info/)
- Per-chromosome haplotype-to-cluster mapping (hap_info/)

git clone https://github.com/friend1ws/ascairn_resource.git

After installation, your directory structure should look like this:

ascairn/
ascairn_resource/
└── resource/
    ├── common/
    │   ├── cen_region_curated_margin_hg38.bed
    │   ├── cen_region_curated_margin_chm13.bed
    │   ├── chr22_long_arm_hg38.bed
    │   ├── chr22_long_arm_chm13.bed
    │   ├── chrX_short_arm_hg38.bed
    │   └── chrX_short_arm_chm13.bed
    └── panel/
        └── ascairn_paper_2025/
            ├── rare_kmer_list.fa
            ├── kmer_info/
            │   ├── chr1.kmer_info.txt.gz
            │   ├── ...
            │   └── chrX.kmer_info.txt.gz
            └── hap_info/
                ├── chr1.hap_info.txt
                ├── ...
                └── chrX.hap_info.txt
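Before running anything, it can help to confirm that the cloned resource directory matches the layout above. The following is a minimal sketch rather than part of ascairn: the helper names are hypothetical, and it assumes the elided entries in the tree are one file per chromosome for chr1 through chr22 plus chrX.

```python
from pathlib import Path

# Assumption: the "..." in the tree stands for chr2..chr22.
CHROMS = [f"chr{i}" for i in range(1, 23)] + ["chrX"]


def expected_resource_files(resource_root, reference="hg38", panel="ascairn_paper_2025"):
    """List the files the documented layout implies for one reference build."""
    root = Path(resource_root)
    common = root / "common"
    panel_dir = root / "panel" / panel
    files = [
        common / f"cen_region_curated_margin_{reference}.bed",
        common / f"chr22_long_arm_{reference}.bed",
        common / f"chrX_short_arm_{reference}.bed",
        panel_dir / "rare_kmer_list.fa",
    ]
    for chrom in CHROMS:
        files.append(panel_dir / "kmer_info" / f"{chrom}.kmer_info.txt.gz")
        files.append(panel_dir / "hap_info" / f"{chrom}.hap_info.txt")
    return files


def missing_resource_files(resource_root, reference="hg38"):
    """Return the expected files that are not present on disk."""
    return [p for p in expected_resource_files(resource_root, reference) if not p.exists()]


if __name__ == "__main__":
    for path in missing_resource_files("ascairn_resource/resource", reference="hg38"):
        print("missing:", path)
```

An empty result means every file implied by the tree is in place for the chosen reference build.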

Quick Start

1. Prepare the sequence data

We use NA12877 from the 1000 Genomes Project, whose GRCh38-aligned CRAM is publicly available on AWS S3 and FTP.

Option 1: Direct S3 path (no download required)

If samtools is properly installed with S3 support, ascairn can read CRAM files directly from S3 without downloading. The file is:

s3://1000genomes/1000G_2504_high_coverage/additional_698_related/data/ERR3989340/NA12877.final.cram

Option 2: Download locally

Either via AWS CLI (public access, no AWS credentials required):

aws s3 cp --no-sign-request s3://1000genomes/1000G_2504_high_coverage/additional_698_related/data/ERR3989340/NA12877.final.cram seq_data/
aws s3 cp --no-sign-request s3://1000genomes/1000G_2504_high_coverage/additional_698_related/data/ERR3989340/NA12877.final.cram.crai seq_data/

Or via FTP:

wget ftp://ftp.sra.ebi.ac.uk/vol1/run/ERR398/ERR3989340/NA12877.final.cram -P seq_data/
wget ftp://ftp.sra.ebi.ac.uk/vol1/run/ERR398/ERR3989340/NA12877.final.cram.crai -P seq_data/

2. Run the workflow

The simplest way to run ascairn is via type_all, which executes all steps (check_depth, kmer_count, cen_type) for chr1-22, chrX, and chrY (males only) in a single command.

Using Option 1 (direct S3 path):

ascairn type_all \
    s3://1000genomes/1000G_2504_high_coverage/additional_698_related/data/ERR3989340/NA12877.final.cram \
    -o output/NA12877 \
    --resource_dir ascairn_resource/resource/panel/ascairn_paper_2025 \
    --reference hg38 \
    -t 8

Using Option 2 (downloaded CRAM):

ascairn type_all \
    seq_data/NA12877.final.cram \
    -o output/NA12877 \
    --resource_dir ascairn_resource/resource/panel/ascairn_paper_2025 \
    --reference hg38 \
    -t 8

CRAM with a local reference fasta:

If the reference named by the UR/M5 (MD5) tags in the CRAM header cannot be resolved (network-isolated nodes, locally produced CRAMs, etc.), pass the reference FASTA with -r/--reference_fasta. The option is forwarded to samtools/mosdepth and works for check_depth, kmer_count, and type_all:

ascairn type_all \
    seq_data/NA12877.final.cram \
    -o output/NA12877 \
    --resource_dir ascairn_resource/resource/panel/ascairn_paper_2025 \
    --reference hg38 \
    -r /path/to/GRCh38.fasta \
    -t 8

3. Check the results

After successful execution, the main output file is:

output/NA12877.cen_type_all.txt

See Output Format for how to interpret the results.

Running with CHM13-aligned data

For users with CHM13-aligned data, here is an equivalent example using NA12878, whose CHM13-aligned CRAM is publicly available at the DDBJ mirror.

Option 1: Direct HTTPS URL (no download required)

If samtools is properly installed with libcurl support, ascairn can read CRAM files directly from HTTPS URLs:

ascairn type_all \
    https://ddbj.nig.ac.jp/public/public-human-genomes/CHM13/1000Genomes/CRAM/NA12878/NA12878.cram \
    -o output/NA12878 \
    --resource_dir ascairn_resource/resource/panel/ascairn_paper_2025 \
    --reference chm13 \
    -t 8

Option 2: Download locally

wget https://ddbj.nig.ac.jp/public/public-human-genomes/CHM13/1000Genomes/CRAM/NA12878/NA12878.cram -P seq_data/
wget https://ddbj.nig.ac.jp/public/public-human-genomes/CHM13/1000Genomes/CRAM/NA12878/NA12878.cram.crai -P seq_data/

ascairn type_all \
    seq_data/NA12878.cram \
    -o output/NA12878 \
    --resource_dir ascairn_resource/resource/panel/ascairn_paper_2025 \
    --reference chm13 \
    -t 8

The only differences from the GRCh38 workflow are (i) the input CRAM aligned to CHM13 and (ii) --reference chm13 in the command.

Commands

`type_all`

Runs the full workflow (check_depth → kmer_count → cen_type for all chromosomes) in a single command. Automatically determines biological sex and applies single-haplotype mode for chrX in males.

ascairn type_all \
    seq_data/NA12877.final.cram \
    -o output/NA12877 \
    --resource_dir ascairn_resource/resource/panel/ascairn_paper_2025 \
    --reference hg38 \
    -t 8

Option	Description	Required
`BAM_FILE`	Path to BAM or CRAM file (positional argument)	Yes
`-o`	Output path prefix	Yes
`--resource_dir`	Path to panel resource directory	Yes
`--reference`	Reference genome build (`hg38` or `chm13`)	Yes
`-t`	Number of threads	No (default: 8)
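When processing many samples, it may be convenient to assemble the type_all command line from Python. The sketch below only uses the options documented in this README (`-o`, `--resource_dir`, `--reference`, `-t`, and `-r`); the function names `build_type_all_command` and `run_type_all` are hypothetical wrappers, not part of the ascairn package.

```python
import shlex
import subprocess


def build_type_all_command(alignment, output_prefix, resource_dir, reference,
                           threads=8, reference_fasta=None):
    """Build the argument list for `ascairn type_all`."""
    if reference not in ("hg38", "chm13"):
        raise ValueError(f"reference must be 'hg38' or 'chm13', got {reference!r}")
    cmd = [
        "ascairn", "type_all", str(alignment),
        "-o", str(output_prefix),
        "--resource_dir", str(resource_dir),
        "--reference", reference,
        "-t", str(threads),
    ]
    if reference_fasta is not None:
        # Only needed when the CRAM reference cannot be resolved automatically.
        cmd += ["-r", str(reference_fasta)]
    return cmd


def run_type_all(**kwargs):
    """Run the command and raise CalledProcessError if ascairn exits non-zero."""
    cmd = build_type_all_command(**kwargs)
    print("Running:", shlex.join(cmd))
    subprocess.run(cmd, check=True)


# Show the command for the Quick Start sample without executing it.
print(shlex.join(build_type_all_command(
    "seq_data/NA12877.final.cram",
    "output/NA12877",
    "ascairn_resource/resource/panel/ascairn_paper_2025",
    "hg38",
)))
```

Passing the arguments as a list (rather than a shell string) avoids quoting problems with paths that contain spaces.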

Individual commands

The workflow consists of three commands that are normally run in order. You can run them individually for more control.

`check_depth`

Measures sequence coverage in a baseline region (chr22 long arm) and determines biological sex by comparing chrX short-arm coverage with that baseline.

ascairn check_depth \
    seq_data/NA12877.final.cram \
    -o output/NA12877.depth.txt \
    --baseline_region ascairn_resource/resource/common/chr22_long_arm_hg38.bed \
    --x_region ascairn_resource/resource/common/chrX_short_arm_hg38.bed \
    -t 8

Output (output/NA12877.depth.txt):

Coverage: 37.24
Baseline region file: ascairn_resource/resource/common/chr22_long_arm_hg38.bed
Sex: male
ChrX region file: ascairn_resource/resource/common/chrX_short_arm_hg38.bed
ChrX coverage: 18.0
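The depth report is a short list of `key: value` lines, so it is easy to read back in a script. The parser below is a minimal sketch that assumes the layout shown above; the function names are hypothetical, and the cutoff ascairn uses to call sex is not stated here, so the code only computes the chrX-to-baseline ratio.

```python
def parse_depth_report(text):
    """Turn 'Key: value' lines into a dict of strings."""
    report = {}
    for line in text.splitlines():
        if ":" not in line:
            continue  # ignore blank or unexpected lines
        key, value = line.split(":", 1)
        report[key.strip()] = value.strip()
    return report


def chrx_ratio(report):
    """Ratio of chrX coverage to baseline coverage."""
    return float(report["ChrX coverage"]) / float(report["Coverage"])


example = """Coverage: 37.24
Baseline region file: ascairn_resource/resource/common/chr22_long_arm_hg38.bed
Sex: male
ChrX region file: ascairn_resource/resource/common/chrX_short_arm_hg38.bed
ChrX coverage: 18.0
"""

report = parse_depth_report(example)
print(report["Sex"], round(chrx_ratio(report), 2))
```

For the example above the ratio is close to one half, which is what a single X chromosome against a diploid autosomal baseline would give.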

`kmer_count`

Extracts reads aligned to alpha satellite regions and counts occurrences of predefined rare k-mers.

ascairn kmer_count \
    seq_data/NA12877.final.cram \
    -o output/NA12877.kmer_count.txt \
    --kmer_file ascairn_resource/resource/panel/ascairn_paper_2025/rare_kmer_list.fa \
    --cen_region ascairn_resource/resource/common/cen_region_curated_margin_hg38.bed \
    -t 8

Output (output/NA12877.kmer_count.txt): a two-column TSV with k-mer sequences and their counts.
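A quick sanity check on the count file is to see how many marker k-mers were observed at all. The reader below is a sketch under stated assumptions: it takes the file to be tab-separated with the k-mer in the first column and an integer count in the second, and it skips any row whose second column is not an integer (for example, a header line, if one is present). The function names and the toy k-mers are illustrative only.

```python
import csv
import io


def read_kmer_counts(handle):
    """Read a two-column TSV (k-mer, count) into a dict."""
    counts = {}
    for row in csv.reader(handle, delimiter="\t"):
        if len(row) < 2 or not row[1].strip().isdigit():
            continue  # blank line or possible header row
        counts[row[0]] = int(row[1])
    return counts


def summarize(counts):
    """Number of markers, markers seen at least once, and total count."""
    observed = sum(1 for c in counts.values() if c > 0)
    return {"markers": len(counts), "observed": observed, "total": sum(counts.values())}


# Toy input with made-up short k-mers; real markers come from rare_kmer_list.fa.
toy = io.StringIO("ACGTACGT\t12\nTTGACCAA\t0\nGGCATTCA\t7\n")
print(summarize(read_kmer_counts(toy)))

# With a real file:
# with open("output/NA12877.kmer_count.txt") as fh:
#     print(summarize(read_kmer_counts(fh)))
```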

`cen_type`

Identifies the most likely centromere cluster pair and nearest haplotype pair for a given chromosome. Requires the depth file produced by check_depth.

ascairn cen_type \
    output/NA12877.kmer_count.txt \
    -o output/NA12877.chr22 \
    --kmer_info ascairn_resource/resource/panel/ascairn_paper_2025/kmer_info/chr22.kmer_info.txt.gz \
    --hap_info ascairn_resource/resource/panel/ascairn_paper_2025/hap_info/chr22.hap_info.txt \
    --depth_file output/NA12877.depth.txt

For male samples on chrX, add the --single_hap option (type_all handles this automatically based on check_depth output).

Output files (using output/NA12877.chr22 as the output prefix):

File	Description
`*.cen_type.txt`	Best cluster pair, haplotype pair, and hap_info annotations (e.g., Contig_len)
`*.cluster.hap_pair.txt`	Ranked cluster pairs with log-likelihoods. The first data row is the best pair.
`*.haplotype.hap_pair.txt`	Ranked haplotype pairs with log-likelihoods. The first data row is the best pair.
`*.cluster.marker_prob.txt`	Per-marker copy number probabilities used for cluster assignment
`*.haplotype.marker_prob.txt`	Per-marker copy number probabilities used for haplotype assignment

Output Format

`cen_type_all.txt`

The main result file (generated by type_all) is a TSV with one row per chromosome:

Column	Description
Chr	Chromosome (chr1-chr22, chrX, chrY)
Cluster_1	Best-matching cluster ID for the first allele
Cluster_2	Best-matching cluster ID for the second allele
Haplotype_1	Nearest haplotype ID for the first allele (e.g., `HG00268.mat`)
Haplotype_2	Nearest haplotype ID for the second allele
additional columns	Any extra columns from `hap_info.txt` are automatically appended with `_1`/`_2` suffixes (e.g., `Contig_len_1`, `Contig_len_2`)

For male chrX, _2 columns are NA.
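The table can be loaded with the standard library for downstream summaries. This sketch assumes the file has a header row with the column names listed above and uses the literal string `NA` for missing values; the helper names are hypothetical, and the cluster and haplotype values in the example are made up (only `HG00268.mat` appears in this README).

```python
import csv
import io


def read_cen_type_all(handle):
    """Map chromosome name to its row (a dict keyed by column name)."""
    return {row["Chr"]: row for row in csv.DictReader(handle, delimiter="\t")}


def is_single_haplotype(row):
    """True when the second allele is absent, as for male chrX."""
    return row["Cluster_2"] in ("NA", "")


def is_homozygous_cluster(row):
    """True when both alleles are assigned to the same cluster."""
    return not is_single_haplotype(row) and row["Cluster_1"] == row["Cluster_2"]


example = "\n".join([
    "Chr\tCluster_1\tCluster_2\tHaplotype_1\tHaplotype_2",
    "chr1\t2\t5\tHG00268.mat\tSAMPLE_B.pat",
    "chr2\t3\t3\tSAMPLE_C.mat\tSAMPLE_D.mat",
    "chrX\t1\tNA\tSAMPLE_E.mat\tNA",
])

rows = read_cen_type_all(io.StringIO(example))
for chrom, row in rows.items():
    print(chrom, row["Cluster_1"], row["Cluster_2"],
          "single" if is_single_haplotype(row) else "pair")
```

Using `csv.DictReader` keeps any extra `hap_info` columns (such as `Contig_len_1`) available under their own names without changes to the code.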

Per-chromosome `cen_type.txt`

Each cen_type run produces a *.cen_type.txt file with the same columns as above (without the Chr column). The cen_type_all.txt file is the concatenation of these per-chromosome files.

Cluster and haplotype pair files

These files list all candidate pairs ranked by log-likelihood (higher = better fit):

Column	Description
Cluster1 / Haplotype1	First member of the pair
Cluster2 / Haplotype2	Second member of the pair
Loglikelihood	Log-likelihood of the pair given the observed k-mer counts
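One way to gauge how decisive an assignment is would be to compare the best pair with the runner-up. The sketch below assumes a tab-separated file with a header row named exactly as in the table above (`Cluster1`, `Cluster2`, `Loglikelihood`, or the `Haplotype` equivalents) and rows already sorted best first; the function names and the numbers in the example are made up. The margin is an informal indicator, not a statistic defined by ascairn.

```python
import csv
import io


def read_ranked_pairs(handle, kind="Cluster"):
    """Read (first, second, loglikelihood) tuples in file order.

    kind is "Cluster" or "Haplotype", matching the column name prefix.
    """
    pairs = []
    for row in csv.DictReader(handle, delimiter="\t"):
        pairs.append((row[f"{kind}1"], row[f"{kind}2"], float(row["Loglikelihood"])))
    return pairs


def top_margin(pairs):
    """Log-likelihood gap between the best and second-best pair, or None."""
    if len(pairs) < 2:
        return None
    return pairs[0][2] - pairs[1][2]


example = "\n".join([
    "Cluster1\tCluster2\tLoglikelihood",
    "2\t5\t-100.5",
    "2\t4\t-112.0",
    "1\t5\t-130.25",
])

pairs = read_ranked_pairs(io.StringIO(example))
print(pairs[0], top_margin(pairs))
```

A small margin suggests that two candidate pairs explain the k-mer counts almost equally well, so the top call deserves a closer look.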

Performance

We evaluated ascairn using leave-one-individual-out cross-validation on the reference panel: for each individual, both parental aHOR-haps were removed from the panel, and the resulting model was used to infer the haplogroup pair from the individual's short-read WGS data. To assess robustness to sequencing depth, each sample was downsampled to 1–30x coverage.
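The hold-out step of this evaluation can be sketched as follows. This is an illustration of leave-one-individual-out splitting in general, not the evaluation code used for the paper. It assumes haplotype IDs have the form `<individual>.<suffix>`, as in the `HG00268.mat` example; the `.pat` suffix and the other sample names are hypothetical.

```python
def individual_of(hap_id):
    """Strip the final '.suffix' to get the individual's name."""
    return hap_id.rsplit(".", 1)[0]


def leave_one_individual_out(hap_ids):
    """Yield (held_out_individual, remaining_haplotypes) for each individual.

    All haplotypes of the held-out individual are removed together, so the
    panel used for inference never contains that individual's own sequences.
    """
    individuals = sorted({individual_of(h) for h in hap_ids})
    for held_out in individuals:
        training = [h for h in hap_ids if individual_of(h) != held_out]
        yield held_out, training


panel_ids = ["HG00268.mat", "HG00268.pat", "SAMPLE_B.mat", "SAMPLE_B.pat", "SAMPLE_C.mat"]
for held_out, training in leave_one_individual_out(panel_ids):
    print(held_out, "->", training)
```

Removing both parental haplotypes at once is what separates this scheme from leave-one-haplotype-out, where the other haplotype of the same individual would stay in the panel.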

The figures show haplogroup pair assignment accuracy for each chromosome at various downsampled coverages, for short-read WGS data aligned to GRCh38 (upper) and CHM13 v2.0 (lower).

[Figure: assignment accuracy by chromosome and coverage, GRCh38-aligned data]

[Figure: assignment accuracy by chromosome and coverage, CHM13-aligned data]

Accuracy is generally high (>90%) for most chromosomes at coverage ≥ 5x, with comparable performance between GRCh38- and CHM13-aligned data. See Shiraishi et al., bioRxiv, 2025 for details.

Notes

chrY: chrY is processed only for male samples (determined automatically by check_depth from the chrX coverage ratio).
Test data: The Quick Start example uses NA12877 from the 1000 Genomes Project high-coverage dataset. This sample was chosen because it is publicly accessible via both AWS S3 and FTP, enabling reproducible testing without restricted data access.

Citation

Shiraishi Y, Ochi Y, Sugawa M, Sakamoto Y, Kimura K, Tsujimura T, Okada A, Okuda R, Namba S, Miyauchi T, Mateos RN, Suzuki H, Chiba K, Ito Y, Nakamura W, Ohka F, Motomura K, Yamamoto T, Kawai Y, Okada Y, Suzuki H, Kato M, Saito R, Garrison E, Logsdon GA, Ogawa S. Rare k-mers reveal centromere haplogroups underlying human diversity and cancer translocations. bioRxiv. 2025. doi: 10.1101/2025.07.26.666712
