De Novo Decomposition of Satellite DNA Arrays into Monomers within Telomere-to-Telomere Assemblies
ArraySplitter: De Novo Decomposition of Satellite DNA Arrays
ArraySplitter decomposes satellite DNA arrays into monomers within telomere-to-telomere (T2T) assemblies. It is well suited to analyzing centromeric and pericentromeric regions at the monomer level.
Status: Production ready. Successfully handles arrays from kilobase to megabase scale.
Key Features:
- De novo monomer identification without prior knowledge
- Autocorrelation-based period detection for robust periodicity analysis
- Automatic orientation to canonical form (A>T, C>G)
- Deterministic output sorted by genomic coordinates
- Multi-threaded processing
Performance: CHM13v2.0 assembly (~1300 alpha satellite arrays) processes in ~3.5 minutes (16 threads)
Installation
pip install arraysplitter
Or build from source:
cd src/rust/arraysplitter
cargo build --release
Quick Start
# Basic decomposition
arraysplitter -i arrays.fa -o output_prefix -t 16
# With predefined cut sequences
arraysplitter -i arrays.fa -o output_prefix -c ATG,CGCG -t 16
# Show version
arraysplitter --version
Output Files
All output is deterministically sorted by chromosome and genomic position (chr1 → chr22 → chrX → chrY → chrM).
| File | Description |
|---|---|
| .decomposed.fasta | Monomers with orientation info in headers |
| .hors.tsv | HOR-level decomposition (16 columns) |
| .monomers.tsv | Base-level monomers from recursive decomposition (17 columns) |
| .summary.tsv | One-row-per-array summary with HOR and monomer statistics (23 columns) |
| .lengths | Fragment lengths for each array |
Summary TSV Columns (.summary.tsv)
One row per array combining HOR-level and monomer-level statistics. Useful for overview analysis.
| Column | Description |
|---|---|
| array_id | Array identifier (chr_start_end_len_period_type) |
| array_length | Total array length in bp |
| orientation | fwd or rev (reverse complemented to canonical) |
| method | Detection method used (autocorr, classic) |
| HOR-level stats | |
| hor_period | Detected HOR period in bp |
| hor_autocorr | Autocorrelation at HOR period |
| hor_n_monomers | Number of HOR-level monomers |
| hor_mean_ed_tmpl | Mean edit distance to HOR consensus |
| hor_mean_ed_prev | Mean edit distance between adjacent HORs |
| hor_cv | Coefficient of variation for HOR lengths |
| hor_consensus | Consensus sequence at HOR level |
| hor_iupac | IUPAC ambiguity codes (bases ≥20% frequency) |
| hor_quality | Per-position support (digit 0-9, 9=90-100%) |
| Monomer-level stats | |
| mono_period | Median base monomer period |
| mono_autocorr | Mean autocorrelation at monomer level |
| mono_n_monomers | Total number of base monomers |
| mono_mean_ed_tmpl | Mean edit distance to monomer consensus |
| mono_mean_ed_prev | Mean edit distance between adjacent monomers |
| mono_cv | Mean coefficient of variation |
| mono_consensus | Consensus sequence at monomer level |
| mono_iupac | IUPAC ambiguity codes |
| mono_quality | Per-position support |
| cut_sequence | Anchor k-mer used for splitting |
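The consensus, IUPAC, and quality columns can be illustrated with a small sketch. It assumes the monomers are already aligned, have equal length, and contain only A, C, G, and T. The function name `column_summary` is hypothetical, and the digit mapping (tenths of the top-base frequency, capped at 9) is an assumption inferred from the "9=90-100%" note above; ArraySplitter's actual rules may differ.

```python
from collections import Counter

# Standard IUPAC nucleotide codes, keyed by the sorted set of bases they stand for.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "AC": "M", "AG": "R", "AT": "W", "CG": "S", "CT": "Y", "GT": "K",
    "ACG": "V", "ACT": "H", "AGT": "D", "CGT": "B", "ACGT": "N",
}

def column_summary(monomers, min_freq=0.20):
    """Return (consensus, iupac, quality) strings for equal-length monomers."""
    consensus, iupac, quality = [], [], []
    for column in zip(*monomers):
        counts = Counter(column)
        total = len(column)
        # Most frequent base; ties are broken alphabetically for determinism.
        top_base, top_count = min(counts.items(), key=lambda kv: (-kv[1], kv[0]))
        consensus.append(top_base)
        # Keep every base that reaches the frequency threshold (20% by default).
        frequent = "".join(sorted(b for b, c in counts.items() if c / total >= min_freq))
        iupac.append(IUPAC[frequent])
        # Support digit: tenths of the top-base frequency, capped at 9 (assumed mapping).
        quality.append(str(min(9, (10 * top_count) // total)))
    return "".join(consensus), "".join(iupac), "".join(quality)
```

With five monomers in which one copy differs at a position, the minority base sits exactly at the 20% threshold and therefore appears in the IUPAC code for that position.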
HORs TSV Columns (.hors.tsv)
Contains the primary decomposition into HOR (Higher Order Repeat) monomers. Multiple rows per array.
Row types (in order):
- pred_array - Array-level prediction/header row
- flank - Terminal fragments <70% of period
- monomer - Full HOR monomers (sorted by idx)
- array - Summary statistics row
- consensus - Consensus sequence row
| Column | Description |
|---|---|
| array_id | Array identifier (chr_start_end_len_period_type) |
| type | pred_array, monomer, flank, array, consensus |
| idx | Monomer index within array (0-based) |
| length | Sequence length in bp |
| source | Detection method: anchor, split_2x, split_3x, left_flank, right_flank |
| ed_tmpl | Edit distance to consensus template |
| ed_prev | Edit distance to previous monomer |
| ed_next | Edit distance to next monomer |
| period | Detected repeat period in bp |
| autocorr | Autocorrelation value at detected period |
| n_expected | Expected count of monomers (array_len / period) |
| ed_per_bp | Normalized edit distance (ed / length) |
| cv | Coefficient of variation for lengths |
| cut_sequence | Anchor sequence used for splitting |
| orientation | fwd or rev (reverse complemented) |
| sequence | Actual DNA sequence (or - for pred_array/array rows) |
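The derived columns above follow simple formulas, sketched here for reference. This is an illustration, not ArraySplitter's code: the helper names are hypothetical, the edit distance is assumed to be plain Levenshtein distance, the coefficient of variation is assumed to use the population standard deviation, and `n_expected` is left unrounded.

```python
from statistics import mean, pstdev

def edit_distance(a, b):
    """Levenshtein distance using a rolling one-row dynamic programming table."""
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            current.append(min(
                previous[j] + 1,               # deletion from a
                current[j - 1] + 1,            # insertion into a
                previous[j - 1] + (ca != cb),  # substitution (0 if bases match)
            ))
        previous = current
    return previous[-1]

def ed_per_bp(ed, length):
    """Normalized edit distance: ed / length."""
    return ed / length

def length_cv(lengths):
    """Coefficient of variation of monomer lengths (population std / mean)."""
    return pstdev(lengths) / mean(lengths)

def n_expected(array_len, period):
    """Expected monomer count: array_len / period."""
    return array_len / period
```

For the example array used later in this document, `n_expected(5120, 512)` gives 10.0, matching the ten HOR monomers reported there.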
Monomers TSV Columns (.monomers.tsv)
Contains base-level monomers after recursive HOR decomposition. Unified format matching .hors.tsv plus parent_idx.
Each HOR is recursively decomposed until:
- No further periodicity detected (autocorrelation ≤ 0.5)
- Minimum length (5bp) reached
Row types (in order):
- pred_array - Array-level summary row
- base_monomer - Base-level monomers from recursive decomposition
- monomer - Non-decomposable monomers (e.g., telomeres)
| Column | Description |
|---|---|
| array_id | Array identifier |
| type | pred_array, base_monomer, monomer |
| idx | Global index within array (0-based) |
| length | Sequence length in bp |
| source | recursive_anchor, recursive_split, base, recursive_flank |
| ed_tmpl | Edit distance to submonomer consensus |
| ed_prev | Edit distance to previous base monomer |
| ed_next | Edit distance to next base monomer |
| period | Detected period at this level (0 if base) |
| autocorr | Autocorrelation value |
| n_expected | Always 1 for individual monomers |
| ed_per_bp | Normalized edit distance |
| cv | Coefficient of variation within parent group |
| cut_sequence | Inherited anchor sequence |
| orientation | Inherited from array (fwd/rev) |
| parent_idx | Index of parent HOR from .hors.tsv |
| sequence | Actual DNA sequence |
Example: α-satellite HOR Decomposition
For a typical α-satellite HOR (512bp → 3×171bp monomers):
.hors.tsv - 10 HOR monomers (~512bp each):
array_id type idx length period ...
chr1_centromere pred_array 10 5120 512 ...
chr1_centromere monomer 0 512 512 ...
chr1_centromere monomer 1 512 512 ...
...
chr1_centromere array 10 5120 512 ...
chr1_centromere consensus 10 512 512 ... [consensus seq]
.monomers.tsv - 30 base monomers (~171bp each):
array_id type idx length parent_idx ...
chr1_centromere pred_array 30 5120 - ...
chr1_centromere base_monomer 0 171 0 ...
chr1_centromere base_monomer 1 171 0 ...
chr1_centromere base_monomer 2 170 0 ...
chr1_centromere base_monomer 3 171 1 ...
...
.summary.tsv - Single row with both levels:
array_id length hor_period hor_n_monomers mono_period mono_n_monomers ...
chr1_centromere 5120 512 10 171 30 ...
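To relate the two levels, base monomers can be grouped by their parent HOR using `parent_idx`. The sketch below assumes that `.monomers.tsv` is tab-separated and starts with a header row naming the columns documented above; the function name is hypothetical.

```python
import csv
from collections import defaultdict

def base_monomer_lengths_by_hor(lines):
    """Map (array_id, parent_idx) to the list of base monomer lengths, in file order.

    `lines` is any iterable of text lines, such as an open .monomers.tsv file.
    """
    groups = defaultdict(list)
    for row in csv.DictReader(lines, delimiter="\t"):
        # Skip pred_array and non-decomposable monomer rows.
        if row["type"] != "base_monomer":
            continue
        key = (row["array_id"], int(row["parent_idx"]))
        groups[key].append(int(row["length"]))
    return dict(groups)
```

Usage would look like `with open("output_prefix.monomers.tsv") as fh: groups = base_monomer_lengths_by_hor(fh)`. For the example above, HOR 0 maps to the lengths 171, 171, and 170, which sum to the 512 bp HOR length.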
Algorithm
ArraySplitter employs an autocorrelation-based algorithm for detecting repeat periods and decomposing satellite DNA arrays.
1. Canonical Orientation
Arrays are oriented to canonical form:
- Primary rule: A > T (more A's than T's)
- Secondary rule: C > G (if A=T)
- Non-canonical arrays are reverse complemented
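The orientation rules can be sketched in a few lines of Python. This is an illustration under stated assumptions rather than the tool's implementation: the function names are hypothetical, counting is case-insensitive, and an array with A=T and C=G is left in forward orientation, a tie case the rules above do not specify.

```python
# Translation table for complementing A/C/G/T in either case.
_COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")

def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence."""
    return seq.translate(_COMPLEMENT)[::-1]

def canonical_orientation(seq):
    """Return (oriented_sequence, 'fwd' or 'rev') following the A>T, C>G rules."""
    upper = seq.upper()
    a, t = upper.count("A"), upper.count("T")
    c, g = upper.count("C"), upper.count("G")
    # Primary rule: more A than T. Secondary rule (only when A == T): more C than G.
    # Assumption: a full tie (A == T and C == G) keeps the forward strand.
    if a > t or (a == t and c >= g):
        return seq, "fwd"
    return reverse_complement(seq), "rev"
```

Reverse complementing swaps the A and T counts and the C and G counts, so an array that fails the rules always satisfies them after the flip.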
2. Period Detection via Autocorrelation
The algorithm computes sequence autocorrelation to detect periodicity:
autocorr(offset) = matches / comparisons
Where matches counts identical nucleotides at positions i and i + offset.
Key innovations:
- Random expectation correction: Subtracts expected random match rate based on nucleotide composition
- Refined period search: Uses FFT-like peak detection to find true period vs harmonics
- Confidence scoring: Autocorrelation excess over random indicates detection confidence
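A minimal sketch of the autocorrelation formula and the random expectation correction follows. The function names, the default search range, and the tie-breaking are assumptions for illustration. The random match rate is taken as the sum of squared base frequencies, and harmonics are handled only by preferring the smallest offset among equal scores, which is much simpler than the refined period search described above.

```python
from collections import Counter

def autocorr(seq, offset):
    """matches / comparisons for positions i and i + offset."""
    comparisons = len(seq) - offset
    if offset <= 0 or comparisons <= 0:
        raise ValueError("offset must be between 1 and len(seq) - 1")
    matches = sum(1 for i in range(comparisons) if seq[i] == seq[i + offset])
    return matches / comparisons

def random_match_rate(seq):
    """Probability that two independently drawn bases match, given the composition."""
    n = len(seq)
    return sum((count / n) ** 2 for count in Counter(seq).values())

def best_period(seq, min_period=2, max_period=None):
    """Return (period, excess) with the highest autocorrelation excess over random.

    Ties keep the smallest offset, so an exact repeat reports its base period
    rather than a multiple of it. Returns None if the search range is empty.
    """
    if max_period is None:
        max_period = len(seq) // 2
    baseline = random_match_rate(seq)
    best = None
    for offset in range(min_period, max_period + 1):
        excess = autocorr(seq, offset) - baseline
        if best is None or excess > best[1]:
            best = (offset, excess)
    return best
```

On real, diverged arrays the scores at multiples of the period are rarely exactly equal, so a production implementation needs a more careful comparison of harmonics than this tie-break provides.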
3. Anchor Selection
For the detected period, the algorithm finds an optimal anchor (cut sequence) using:
- K-mer enumeration: Extract all k-mers (k=10 by default) from the sequence
- Position analysis: For each k-mer, record all occurrence positions
- Scoring metrics:
  - Uniqueness: Fraction of occurrences exactly `period` apart
  - Regularity: How evenly spaced the occurrences are
- Combined score: `uniqueness × regularity`
- Deterministic selection: K-mers sorted lexicographically for reproducible tie-breaking
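The anchor scoring can be sketched as follows. The exact formulas are not given here, so two assumptions are made and should be read as such: uniqueness is the fraction of gaps between consecutive occurrences that equal the period, and regularity is `1 / (1 + CV)` of those gaps. The function name `select_anchor` is hypothetical.

```python
from collections import defaultdict
from statistics import mean, pstdev

def select_anchor(seq, period, k=10):
    """Return (kmer, score) for the best-scoring k-mer, or None if none repeats."""
    # K-mer enumeration with occurrence positions.
    positions = defaultdict(list)
    for i in range(len(seq) - k + 1):
        positions[seq[i:i + k]].append(i)

    best = None
    # Lexicographic iteration plus a strict '>' makes tie-breaking reproducible.
    for kmer in sorted(positions):
        pos = positions[kmer]
        if len(pos) < 2:
            continue
        gaps = [b - a for a, b in zip(pos, pos[1:])]
        # Assumed definition: share of consecutive gaps equal to the period.
        uniqueness = sum(g == period for g in gaps) / len(gaps)
        # Assumed definition: 1 for perfectly even spacing, lower as spacing varies.
        regularity = 1.0 / (1.0 + pstdev(gaps) / mean(gaps))
        score = uniqueness * regularity
        if best is None or score > best[1]:
            best = (kmer, score)
    return best
```

In a perfect tandem repeat every k-mer scores 1.0, so the lexicographically smallest one is returned; in real arrays, k-mers that also occur inside the monomer or that are frequently mutated score lower.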
4. Array Decomposition
Using the selected anchor:
- Split array at all anchor occurrences
- First fragment → left flank (if < 70% of period)
- Middle fragments → monomers
- Last fragment → right flank (if < 70% of period)
- Apply heuristics for multiplet splitting (doublets, triplets, etc.)
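The splitting and flank rules can be sketched like this. It is a simplified illustration: the function name and return format are hypothetical, and the multiplet-splitting heuristics (doublets, triplets) are deliberately left out.

```python
def split_at_anchor(seq, anchor, period, flank_fraction=0.70):
    """Split seq at every anchor occurrence and label the fragments.

    Returns a list of (label, fragment) tuples where label is 'left_flank',
    'monomer', or 'right_flank'. Multiplet splitting is not attempted.
    """
    # Collect all anchor start positions (overlapping occurrences included).
    cuts = []
    start = seq.find(anchor)
    while start != -1:
        cuts.append(start)
        start = seq.find(anchor, start + 1)
    if not cuts:
        return [("monomer", seq)]

    # Fragment boundaries: sequence start, each cut, sequence end.
    bounds = [0] + [c for c in cuts if c > 0] + [len(seq)]
    fragments = [seq[a:b] for a, b in zip(bounds, bounds[1:])]

    threshold = flank_fraction * period
    last = len(fragments) - 1
    labeled = []
    for i, fragment in enumerate(fragments):
        if i == 0 and len(fragment) < threshold:
            labeled.append(("left_flank", fragment))
        elif i == last and len(fragment) < threshold:
            labeled.append(("right_flank", fragment))
        else:
            labeled.append(("monomer", fragment))
    return labeled
```

Because the cuts partition the sequence, concatenating the fragments in order always reproduces the input array.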
5. Output Generation
Results are:
- Sorted by chromosome (natural order: 1, 2, ..., 22, X, Y, M)
- Within chromosome, sorted by start position
- Fully deterministic across runs
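A natural chromosome ordering like the one described can be expressed as a sort key. The sketch below is an assumption about how such a key might look, not the tool's code: the function name is hypothetical, and names that are neither numeric nor X, Y, M (for example unplaced scaffolds) are placed last in alphabetical order.

```python
import re

# Rank of the non-numeric chromosomes, after all numbered ones.
_SPECIAL = {"X": 1, "Y": 2, "M": 3, "MT": 3}

def chrom_sort_key(name, start=0):
    """Sort key giving 1, 2, ..., 22, X, Y, M order, then start position."""
    label = re.sub(r"^chr", "", name, flags=re.IGNORECASE)
    if label.isdigit():
        return (0, int(label), "", start)
    rank = _SPECIAL.get(label.upper())
    if rank is not None:
        return (1, rank, "", start)
    # Assumption: anything else sorts last, alphabetically.
    return (2, 0, label, start)
```

A list of `(chromosome, start)` records can then be ordered with `sorted(records, key=lambda r: chrom_sort_key(*r))`, which places chr2 before chr10, unlike a plain string sort.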
Methods
autocorr (Default)
Uses autocorrelation for period detection. Best for:
- Regular tandem repeats
- Alpha satellite arrays
- HOR (Higher Order Repeat) structures
classic
Uses a frequency suffix tree approach. Better for:
- Irregular or degenerate repeats
- Very short arrays
- Arrays with high mutation rates
both
Tries autocorrelation first and falls back to classic if autocorrelation fails.
Command Line Options
arraysplitter --help
Options:
-i, --input <FILE> Input FASTA file
-o, --output <PREFIX> Output prefix
-t, --threads <N> Number of threads [default: all cores]
-c, --cuts <SEQ,SEQ> Predefined cut sequences (comma-separated)
-d, --depth <N> Max depth for cut search [default: 100]
--method <METHOD> Detection method: autocorr, classic, both [default: autocorr]
--max-ed-len <N> Max monomer length for edit distance [default: 10000]
--stats Print detailed statistics
--top-outliers <N> Number of outliers to show [default: 10]
-V, --version Print version
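For pipelines that drive the tool from Python, the command line can be assembled programmatically. The wrapper below is a hypothetical convenience, not part of ArraySplitter; it uses only the `-i`, `-o`, `-t`, `-c`, and `--method` options listed above and assumes the `arraysplitter` executable is on `PATH`.

```python
import subprocess

def build_command(input_fasta, output_prefix, threads=None, cuts=None, method=None):
    """Build the argv list for an arraysplitter run from the documented options."""
    cmd = ["arraysplitter", "-i", str(input_fasta), "-o", str(output_prefix)]
    if threads is not None:
        cmd += ["-t", str(threads)]
    if cuts:
        # Predefined cut sequences are passed comma-separated.
        cmd += ["-c", ",".join(cuts)]
    if method is not None:
        cmd += ["--method", method]
    return cmd

def run_arraysplitter(input_fasta, output_prefix, **options):
    """Run arraysplitter and raise CalledProcessError on a non-zero exit code."""
    return subprocess.run(build_command(input_fasta, output_prefix, **options), check=True)
```

Passing the arguments as a list avoids shell quoting issues with file names that contain spaces.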
Citation
If you use ArraySplitter in your research, please cite: [Publication pending]
Contact
For questions or support: ad3002@gmail.com