De Novo Decomposition of Satellite DNA Arrays into Monomers within Telomere-to-Telomere Assemblies

ArraySplitter: De Novo Decomposition of Satellite DNA Arrays

Decomposes satellite DNA arrays into monomers within telomere-to-telomere (T2T) assemblies. Well suited to analyzing centromeric and pericentromeric regions at the monomer level.

Status: Production ready. Successfully handles arrays from kilobase to megabase scale.

Key Features:

  • De novo monomer identification without prior knowledge
  • Autocorrelation-based period detection for robust periodicity analysis
  • Automatic orientation to canonical form (A>T, C>G)
  • Deterministic output sorted by genomic coordinates
  • Multi-threaded processing

Performance: the CHM13v2.0 assembly (~1300 alpha satellite arrays) is processed in ~3.5 minutes on 16 threads.

Installation

pip install arraysplitter

Or build from source:

cd src/rust/arraysplitter
cargo build --release

Quick Start

# Basic decomposition
arraysplitter -i arrays.fa -o output_prefix -t 16

# With predefined cut sequences
arraysplitter -i arrays.fa -o output_prefix -c ATG,CGCG -t 16

# Show version
arraysplitter --version

Output Files

All output is deterministically sorted by chromosome and genomic position (chr1 → chr22 → chrX → chrY → chrM).

File               Description
.decomposed.fasta  Monomers with orientation info in headers
.hors.tsv          HOR-level decomposition (16 columns)
.monomers.tsv      Base-level monomers from recursive decomposition (17 columns)
.summary.tsv       One-row-per-array summary with HOR and monomer statistics (23 columns)
.lengths           Fragment lengths for each array
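
For scripted runs, the sketch below builds the command line from the flags shown in Quick Start and lists the files a run should produce. It assumes each output file is named by appending the listed suffix to the output prefix; the helper names `build_command` and `expected_outputs` are hypothetical and not part of ArraySplitter.

```python
from pathlib import Path

# Suffixes taken from the Output Files table.
OUTPUT_SUFFIXES = (
    ".decomposed.fasta",
    ".hors.tsv",
    ".monomers.tsv",
    ".summary.tsv",
    ".lengths",
)


def build_command(fasta, prefix, threads=16, cuts=None, method=None):
    """Build an argv list using only flags documented in this README."""
    cmd = ["arraysplitter", "-i", str(fasta), "-o", str(prefix), "-t", str(threads)]
    if cuts:
        # -c takes comma-separated cut sequences, e.g. ATG,CGCG
        cmd += ["-c", ",".join(cuts)]
    if method:
        cmd += ["--method", method]
    return cmd


def expected_outputs(prefix):
    """Assumption: output paths are the prefix with each suffix appended."""
    return [Path(str(prefix) + suffix) for suffix in OUTPUT_SUFFIXES]
```

The resulting list can be passed to `subprocess.run(cmd, check=True)`, after which the paths from `expected_outputs` can be checked for existence.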

Summary TSV Columns (.summary.tsv)

One row per array combining HOR-level and monomer-level statistics. Useful for overview analysis.

Column             Description
array_id           Array identifier (chr_start_end_len_period_type)
array_length       Total array length in bp
orientation        fwd or rev (reverse complemented to canonical)
method             Detection method used (autocorr, classic)

HOR-level stats
hor_period         Detected HOR period in bp
hor_autocorr       Autocorrelation at HOR period
hor_n_monomers     Number of HOR-level monomers
hor_mean_ed_tmpl   Mean edit distance to HOR consensus
hor_mean_ed_prev   Mean edit distance between adjacent HORs
hor_cv             Coefficient of variation for HOR lengths
hor_consensus      Consensus sequence at HOR level
hor_iupac          IUPAC ambiguity codes (bases ≥20% frequency)
hor_quality        Per-position support (digit 0-9, 9=90-100%)

Monomer-level stats
mono_period        Median base monomer period
mono_autocorr      Mean autocorrelation at monomer level
mono_n_monomers    Total number of base monomers
mono_mean_ed_tmpl  Mean edit distance to monomer consensus
mono_mean_ed_prev  Mean edit distance between adjacent monomers
mono_cv            Mean coefficient of variation
mono_consensus     Consensus sequence at monomer level
mono_iupac         IUPAC ambiguity codes
mono_quality       Per-position support

cut_sequence       Anchor k-mer used for splitting
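
To make the `hor_iupac` and `hor_quality` columns concrete, here is a minimal sketch of how such strings could be derived from a stack of equal-length monomers. It is an illustration under assumptions, not ArraySplitter's implementation: monomers are taken to be already aligned and to contain only A, C, G and T, "support" is taken to be the frequency of the most common base in a column, and the function names are hypothetical.

```python
from collections import Counter

# Standard IUPAC nucleotide codes, keyed by the sorted set of bases.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "AC": "M", "AG": "R", "AT": "W", "CG": "S", "CT": "Y", "GT": "K",
    "ACG": "V", "ACT": "H", "AGT": "D", "CGT": "B", "ACGT": "N",
}


def column_codes(column, min_freq=0.20):
    """Return (IUPAC code, support digit) for one alignment column."""
    counts = Counter(column)
    total = len(column)
    # Keep every base that reaches the 20% frequency threshold.
    kept = "".join(sorted(b for b, n in counts.items() if n / total >= min_freq))
    # Digit 0-9 for the top base: 9 covers 90-100%, 8 covers 80-90%, ...
    top_count = counts.most_common(1)[0][1]
    digit = min(9, top_count * 10 // total)
    return IUPAC[kept], digit


def iupac_and_quality(monomers):
    """Build per-position IUPAC and quality strings from equal-length monomers."""
    codes = [column_codes("".join(col)) for col in zip(*monomers)]
    iupac = "".join(code for code, _ in codes)
    quality = "".join(str(digit) for _, digit in codes)
    return iupac, quality
```

With five 4 bp monomers `["ACGT", "ACGA", "ACTA", "ACGT", "AGGT"]`, the second position is 80% C and 20% G, so both bases pass the threshold and the position is reported as `S` with support digit 8.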

HORs TSV Columns (.hors.tsv)

Contains the primary decomposition into HOR (Higher Order Repeat) monomers. Multiple rows per array.

Row types (in order):

  1. pred_array - Array-level prediction/header row
  2. flank - Terminal fragments <70% of period
  3. monomer - Full HOR monomers (sorted by idx)
  4. array - Summary statistics row
  5. consensus - Consensus sequence row

Column        Description
array_id      Array identifier (chr_start_end_len_period_type)
type          pred_array, monomer, flank, array, consensus
idx           Monomer index within array (0-based)
length        Sequence length in bp
source        Detection method: anchor, split_2x, split_3x, left_flank, right_flank
ed_tmpl       Edit distance to consensus template
ed_prev       Edit distance to previous monomer
ed_next       Edit distance to next monomer
period        Detected repeat period in bp
autocorr      Autocorrelation value at detected period
n_expected    Expected count of monomers (array_len / period)
ed_per_bp     Normalized edit distance (ed / length)
cv            Coefficient of variation for lengths
cut_sequence  Anchor sequence used for splitting
orientation   fwd or rev (reverse complemented)
sequence      Actual DNA sequence (or - for pred_array/array rows)
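
A short sketch of reading `.hors.tsv` and recomputing the length statistics for the `monomer` rows. It assumes the file is tab-separated with a header row that uses the column names above; the example text is cut down to the four columns the code needs. Whether ArraySplitter uses the population or the sample standard deviation for `cv` is not stated here, so the population form is an assumption.

```python
import csv
import io
import statistics


def monomer_length_stats(tsv_text):
    """Return {array_id: (n_monomers, mean_length, cv)} for rows of type 'monomer'."""
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    lengths = {}
    for row in reader:
        # Skip pred_array, flank, array and consensus rows.
        if row["type"] == "monomer":
            lengths.setdefault(row["array_id"], []).append(int(row["length"]))
    stats = {}
    for array_id, values in lengths.items():
        mean = statistics.fmean(values)
        # Coefficient of variation = standard deviation / mean.
        cv = statistics.pstdev(values) / mean if mean else 0.0
        stats[array_id] = (len(values), mean, cv)
    return stats


# Hypothetical rows, reduced to the columns used above.
example_hors = "\n".join(
    "\t".join(fields)
    for fields in [
        ("array_id", "type", "idx", "length"),
        ("chr1_centromere", "pred_array", "3", "1580"),
        ("chr1_centromere", "flank", "0", "44"),
        ("chr1_centromere", "monomer", "0", "510"),
        ("chr1_centromere", "monomer", "1", "512"),
        ("chr1_centromere", "monomer", "2", "514"),
    ]
)
```

To run this on a real file, pass `open(path).read()` instead of `example_hors`.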

Monomers TSV Columns (.monomers.tsv)

Contains base-level monomers after recursive HOR decomposition. Unified format matching .hors.tsv plus parent_idx.

Each HOR is recursively decomposed until:

  • No further periodicity detected (autocorrelation ≤ 0.5)
  • Minimum length (5bp) reached
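
The two stopping rules can be sketched as a small recursive function. This is a simplified illustration, not the tool's algorithm: it cuts each level into fixed-length chunks of one period instead of splitting at anchors, uses raw autocorrelation without the random-expectation correction, and treats the 5 bp minimum as the smallest period searched. All function names are hypothetical.

```python
def autocorr(seq, offset):
    """Fraction of positions i where seq[i] == seq[i + offset]."""
    n = len(seq) - offset
    return sum(seq[i] == seq[i + offset] for i in range(n)) / n


def best_period(seq, min_len=5):
    """Smallest offset with the highest autocorrelation, or (0, 0.0) if none fits."""
    best = (0, 0.0)
    for offset in range(min_len, len(seq) // 2 + 1):
        score = autocorr(seq, offset)
        if score > best[1]:  # strict '>' keeps the smallest offset on ties
            best = (offset, score)
    return best


def decompose(seq, min_len=5, threshold=0.5):
    """Recursively split seq until no period with autocorrelation > threshold remains."""
    period, score = best_period(seq, min_len)
    if period == 0 or score <= threshold:
        return [seq]  # base case: too short, or no further periodicity
    units = []
    for start in range(0, len(seq), period):
        units.extend(decompose(seq[start:start + period], min_len, threshold))
    return units
```

For a perfect 7 bp repeat such as `"ACGTTAG" * 6`, the first level finds period 7, and each 7 bp chunk is then too short to contain a period of 5 bp or more, so recursion stops there.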

Row types (in order):

  1. pred_array - Array-level summary row
  2. base_monomer - Base-level monomers from recursive decomposition
  3. monomer - Non-decomposable monomers (e.g., telomeres)

Column        Description
array_id      Array identifier
type          pred_array, base_monomer, monomer
idx           Global index within array (0-based)
length        Sequence length in bp
source        recursive_anchor, recursive_split, base, recursive_flank
ed_tmpl       Edit distance to submonomer consensus
ed_prev       Edit distance to previous base monomer
ed_next       Edit distance to next base monomer
period        Detected period at this level (0 if base)
autocorr      Autocorrelation value
n_expected    Always 1 for individual monomers
ed_per_bp     Normalized edit distance
cv            Coefficient of variation within parent group
cut_sequence  Inherited anchor sequence
orientation   Inherited from array (fwd/rev)
parent_idx    Index of parent HOR from .hors.tsv
sequence      Actual DNA sequence
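
The `parent_idx` column links each base monomer back to its HOR. The sketch below groups base-monomer lengths by parent, assuming a tab-separated file with a header row using the column names above and a non-numeric placeholder such as `-` in `parent_idx` on `pred_array` rows. The function name and the reduced example columns are illustrative only.

```python
import csv
import io
from collections import defaultdict


def monomers_per_hor(tsv_text):
    """Return {(array_id, parent_idx): [base monomer lengths in file order]}."""
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    groups = defaultdict(list)
    for row in reader:
        # Only base_monomer rows are read for a numeric parent_idx here.
        if row["type"] != "base_monomer":
            continue
        key = (row["array_id"], int(row["parent_idx"]))
        groups[key].append(int(row["length"]))
    return dict(groups)


# Rows modeled on the alpha-satellite example in this README (reduced columns).
example_monomers = "\n".join(
    "\t".join(fields)
    for fields in [
        ("array_id", "type", "idx", "length", "parent_idx"),
        ("chr1_centromere", "pred_array", "30", "5120", "-"),
        ("chr1_centromere", "base_monomer", "0", "171", "0"),
        ("chr1_centromere", "base_monomer", "1", "171", "0"),
        ("chr1_centromere", "base_monomer", "2", "170", "0"),
        ("chr1_centromere", "base_monomer", "3", "171", "1"),
    ]
)
```

Summing the lengths in one group should give the length of the parent HOR row; in the example, 171 + 171 + 170 = 512.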

Example: α-satellite HOR Decomposition

For a typical α-satellite HOR (512 bp → 3 monomers of ~171 bp each):

.hors.tsv - 10 HOR monomers (~512bp each):

array_id                type        idx  length  period  ...
chr1_centromere         pred_array  10   5120    512     ...
chr1_centromere         monomer     0    512     512     ...
chr1_centromere         monomer     1    512     512     ...
...
chr1_centromere         array       10   5120    512     ...
chr1_centromere         consensus   10   512     512     ... [consensus seq]

.monomers.tsv - 30 base monomers (~171bp each):

array_id                type          idx  length  parent_idx  ...
chr1_centromere         pred_array    30   5120    -           ...
chr1_centromere         base_monomer  0    171     0           ...
chr1_centromere         base_monomer  1    171     0           ...
chr1_centromere         base_monomer  2    170     0           ...
chr1_centromere         base_monomer  3    171     1           ...
...

.summary.tsv - Single row with both levels:

array_id         length  hor_period  hor_n_monomers  mono_period  mono_n_monomers  ...
chr1_centromere  5120    512         10              171          30               ...

Algorithm

ArraySplitter employs an autocorrelation-based algorithm for detecting repeat periods and decomposing satellite DNA arrays.

1. Canonical Orientation

Arrays are oriented to canonical form:

  • Primary rule: A > T (more A's than T's)
  • Secondary rule: C > G (if A=T)
  • Non-canonical arrays are reverse complemented
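
The orientation rules above can be sketched in a few lines. This is an illustrative version, not the tool's code: the behavior when both A=T and C=G (kept as forward here) is an assumption, and the function names are hypothetical.

```python
# Translation table for complementing A/C/G/T in either case.
_COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def reverse_complement(seq):
    """Complement each base, then reverse the sequence."""
    return seq.translate(_COMPLEMENT)[::-1]


def orient_canonical(seq):
    """Return (sequence, 'fwd' or 'rev') following the A > T, then C > G rules."""
    upper = seq.upper()
    a, t = upper.count("A"), upper.count("T")
    c, g = upper.count("C"), upper.count("G")
    # Primary rule: more A than T. Secondary rule (only if A == T): C vs G.
    # Assumption: a full tie (A == T and C == G) stays in forward orientation.
    if a > t or (a == t and c >= g):
        return seq, "fwd"
    return reverse_complement(seq), "rev"
```

For example, `TTTAC` has more T than A, so it is reverse complemented to `GTAAA` and reported as `rev`.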

2. Period Detection via Autocorrelation

The algorithm computes sequence autocorrelation to detect periodicity:

autocorr(offset) = matches / comparisons

Here, matches counts the positions i at which the nucleotides at i and i + offset are identical, and comparisons is the number of position pairs (i, i + offset) compared.

Key innovations:

  • Random expectation correction: Subtracts expected random match rate based on nucleotide composition
  • Refined period search: Uses FFT-like peak detection to find true period vs harmonics
  • Confidence scoring: Autocorrelation excess over random indicates detection confidence
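
A minimal sketch of the autocorrelation formula with a random-expectation correction. The correction used here, the sum of squared base frequencies (the chance that two independently drawn bases match), is an assumed stand-in for whatever ArraySplitter computes. Picking the smallest offset with the highest excess is a crude substitute for the refined period search, and the function names are hypothetical.

```python
from collections import Counter


def autocorr(seq, offset):
    """matches / comparisons for positions i and i + offset."""
    comparisons = len(seq) - offset
    if comparisons <= 0:
        raise ValueError("offset must be smaller than the sequence length")
    matches = sum(1 for i in range(comparisons) if seq[i] == seq[i + offset])
    return matches / comparisons


def random_expectation(seq):
    """Expected match rate for random sequence with the same base composition."""
    total = len(seq)
    return sum((count / total) ** 2 for count in Counter(seq).values())


def excess_autocorr(seq, offset):
    """Autocorrelation above the random baseline, read as detection confidence."""
    return autocorr(seq, offset) - random_expectation(seq)


def detect_period(seq, min_period=2):
    """Return (offset, excess) for the smallest offset with the highest excess."""
    best_offset, best_excess = 0, 0.0
    for offset in range(min_period, len(seq) // 2 + 1):
        excess = excess_autocorr(seq, offset)
        if excess > best_excess:  # strict '>' prefers the period over its multiples
            best_offset, best_excess = offset, excess
    return best_offset, best_excess
```

For `"ACGT" * 10`, offsets 4, 8, 12, ... all have autocorrelation 1.0. The random baseline is 0.25, so the excess is 0.75, and the smallest such offset, 4, is reported as the period.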

3. Anchor Selection

For the detected period, the best anchor (cut sequence) is chosen as follows:

  1. K-mer enumeration: Extract all k-mers (k=10 by default) from the sequence
  2. Position analysis: For each k-mer, record all occurrence positions
  3. Scoring metrics:
    • Uniqueness: Fraction of occurrences exactly period apart
    • Regularity: How evenly spaced the occurrences are
  4. Combined score: uniqueness × regularity
  5. Deterministic selection: K-mers sorted lexicographically for reproducible tie-breaking
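
The scoring steps above can be sketched as follows. The exact formulas are not given in this README, so both are assumptions: uniqueness is taken as the fraction of gaps between consecutive occurrences that equal the period, and regularity as 1 / (1 + CV of the gaps). Function names are hypothetical.

```python
import statistics
from collections import defaultdict


def kmer_positions(seq, k):
    """Map every k-mer to the sorted list of its start positions."""
    positions = defaultdict(list)
    for i in range(len(seq) - k + 1):
        positions[seq[i:i + k]].append(i)
    return positions


def score_kmer(positions, period):
    """Combined score = uniqueness x regularity (both formulas are assumptions)."""
    gaps = [b - a for a, b in zip(positions, positions[1:])]
    if not gaps:
        return 0.0  # a k-mer seen once cannot define a period
    uniqueness = sum(1 for g in gaps if g == period) / len(gaps)
    spread = statistics.pstdev(gaps) / statistics.fmean(gaps)
    regularity = 1.0 / (1.0 + spread)
    return uniqueness * regularity


def select_anchor(seq, period, k=10):
    """Pick the best-scoring k-mer; lexicographic order breaks ties."""
    positions = kmer_positions(seq, k)
    best_kmer, best_score = None, 0.0
    for kmer in sorted(positions):  # sorted iteration makes the result reproducible
        score = score_kmer(positions[kmer], period)
        if score > best_score:
            best_kmer, best_score = kmer, score
    return best_kmer, best_score
```

On a perfect repeat every k-mer scores 1.0, so the lexicographically smallest one is returned. This shows why a fixed iteration order matters for reproducible output.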

4. Array Decomposition

Using the selected anchor:

  1. Split array at all anchor occurrences
  2. First fragment → left flank (if < 70% of period)
  3. Middle fragments → monomers
  4. Last fragment → right flank (if < 70% of period)
  5. Apply heuristics for multiplet splitting (doublets, triplets, etc.)
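
Steps 1 to 4 can be sketched as below. Step 5 (multiplet splitting) is left out, and the labels reuse the `left_flank` and `right_flank` source values from the `.hors.tsv` table. This is an illustration under those assumptions, with hypothetical function names, not the tool's implementation.

```python
def find_all(seq, anchor):
    """Start positions of every occurrence of anchor in seq."""
    positions = []
    start = seq.find(anchor)
    while start != -1:
        positions.append(start)
        start = seq.find(anchor, start + 1)
    return positions


def split_at_anchor(seq, anchor, period, flank_fraction=0.70):
    """Cut seq before each anchor occurrence and label the fragments."""
    cuts = find_all(seq, anchor)
    if not cuts:
        return [("monomer", seq)]
    # Fragment boundaries: array start, every cut inside the array, array end.
    bounds = [0] + [c for c in cuts if c > 0] + [len(seq)]
    fragments = [seq[a:b] for a, b in zip(bounds, bounds[1:])]
    labeled = []
    last = len(fragments) - 1
    for i, fragment in enumerate(fragments):
        is_short = len(fragment) < flank_fraction * period
        if i == 0 and is_short:
            labeled.append(("left_flank", fragment))
        elif i == last and is_short:
            labeled.append(("right_flank", fragment))
        else:
            labeled.append(("monomer", fragment))
    return labeled
```

Because the fragments are contiguous slices, concatenating them in order reproduces the input array, which is a useful sanity check on any decomposition.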

5. Output Generation

Results are:

  • Sorted by chromosome (natural order: 1, 2, ..., 22, X, Y, M)
  • Within chromosome, sorted by start position
  • Fully deterministic across runs
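
The ordering can be reproduced with a sort key like the one below. The handling of names outside 1-22, X, Y, M (placed last and ordered by name) is an assumption, and `chrom_rank` and `sort_key` are hypothetical names.

```python
def chrom_rank(chrom):
    """Rank chromosomes in natural order: 1, 2, ..., 22, X, Y, M."""
    name = chrom[3:] if chrom.lower().startswith("chr") else chrom
    if name.isdigit():
        return (0, int(name), "")
    special = {"X": 1, "Y": 2, "M": 3, "MT": 3}
    # Assumption: unrecognized names sort after M, ordered by name.
    return (special.get(name.upper(), 4), 0, name)


def sort_key(record):
    """record is a (chromosome, start) pair."""
    chrom, start = record
    return (chrom_rank(chrom), start)


records = [
    ("chrX", 5), ("chr10", 100), ("chr2", 50), ("chrM", 1),
    ("chr2", 10), ("chrY", 7), ("chr1", 3),
]
ordered = sorted(records, key=sort_key)
```

Comparing chromosome numbers as integers, not strings, is what places chr2 before chr10.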

Methods

autocorr (Default)

Uses autocorrelation for period detection. Best for:

  • Regular tandem repeats
  • Alpha satellite arrays
  • HOR (Higher Order Repeat) structures

classic

Uses a frequency suffix tree approach. Better for:

  • Irregular or degenerate repeats
  • Very short arrays
  • Arrays with high mutation rates

both

Tries autocorrelation first and falls back to classic if autocorrelation fails.

Command Line Options

arraysplitter --help

Options:
  -i, --input <FILE>       Input FASTA file
  -o, --output <PREFIX>    Output prefix
  -t, --threads <N>        Number of threads [default: all cores]
  -c, --cuts <SEQ,SEQ>     Predefined cut sequences (comma-separated)
  -d, --depth <N>          Max depth for cut search [default: 100]
  --method <METHOD>        Detection method: autocorr, classic, both [default: autocorr]
  --max-ed-len <N>         Max monomer length for edit distance [default: 10000]
  --stats                  Print detailed statistics
  --top-outliers <N>       Number of outliers to show [default: 10]
  -V, --version            Print version

Citation

If you use ArraySplitter in your research, please cite: [Publication pending]

Contact

For questions or support: ad3002@gmail.com
