Skip to main content

High-performance biological sequence analysis library with GPU acceleration

Project description

Seqcore

Seqcore Logo

High-performance biological sequence analysis library for Python.

A unified, GPU-accelerated library for genomics, proteomics, structural biology, and drug design.

Installation

pip install seqcore

With GPU support:

pip install seqcore[gpu]

With all optional dependencies:

pip install seqcore[full]

Quick Start

import seqcore as sc

# DNA sequences - efficient 2-bit encoding
dna = sc.DNAArray("ACGTACGTACGT" * 1_000_000)

# Batch operations
sequences = sc.DNAArray([
    "ACGTACGT",
    "TGCATGCA",
    "GGGGCCCC",
])

# Vectorized operations
gc = sc.gc_content(sequences)
lengths = sc.length(sequences)
rev_comp = sc.reverse_complement(sequences)

# Translation
proteins = sc.translate(sequences)

Features

Sequence Operations

# GC content, molecular weight, length
gc = sc.gc_content(dna)
mw = sc.molecular_weight(protein)

# Transcription and translation
rna = sc.transcribe(dna)
protein = sc.translate(dna, frame=0)

# K-mer operations
kmers = sc.extract_kmers(sequences, k=21)
kmer_counts = sc.count_kmers(sequences, k=21)

Sequence Alignment

# Pairwise alignment
result = sc.align(query, reference)
print(result.score, result.identity, result.cigar)

# Distance matrices
dm = sc.pairwise_distance(sequences, metric="edit")

# Pattern matching
matches = sc.find_pattern(sequences, "ATG[ACGT]{30,100}TAA")

File I/O

# Auto-detect format
data = sc.read("sequences.fasta")
data = sc.read("structure.pdb")
data = sc.read("reads.fastq.gz")

# Streaming for large files
for batch in sc.read_stream("huge.fastq.gz", batch_size=100_000):
    results = process(batch)

# Database fetching
seq = sc.fetch("NP_000509")      # NCBI/UniProt
structure = sc.fetch("1ABC")     # PDB

Structural Biology

# Load structure
structure = sc.read("protein.pdb")

# Access data
print(structure.chains)      # ['A', 'B']
print(structure.n_residues)  # 265

# Distance matrix
dm = sc.distance_matrix(structure, selection="CA")

# Find contacts
contacts = sc.find_contacts(structure, cutoff=4.0)

# RMSD calculation
rmsd = sc.rmsd(structure1, structure2, align=True)

# Surface analysis
sasa = sc.sasa(structure)
surface = sc.surface_residues(structure, threshold=25.0)

# Binding pockets
pockets = sc.find_pockets(structure)

Drug Design

# Small molecules
mol = sc.Molecule.from_smiles("CCO")

# Molecular properties
mw = sc.molecular_weight(molecules)
logp = sc.logp(molecules)
hbd = sc.h_bond_donors(molecules)

# ADMET filters
passes_lipinski = sc.lipinski_filter(molecules)
bbb_permeable = sc.bbb_filter(molecules)

# Fingerprints and similarity
fps = sc.morgan_fingerprint(molecules, radius=2)
similarity = sc.tanimoto_similarity(fps)

# Substructure search
matches = sc.substructure_search(molecules, "c1ccccc1")

Phylogenetics

# Tree construction
tree = sc.neighbor_joining(sequences)
tree = sc.upgma(sequences)

# Tree operations
print(tree.newick())
dist = tree.distance("Species_A", "Species_B")
subtree = tree.prune(["A", "B", "C"])

Population Genetics

# Variant analysis
variants = sc.read("variants.vcf")
af = sc.allele_frequency(variants)
maf = sc.minor_allele_frequency(variants)

# Population statistics
fst = sc.fst(pop1, pop2)
pi = sc.nucleotide_diversity(sequences)
d = sc.tajimas_d(sequences)

# Linkage disequilibrium
ld = sc.linkage_disequilibrium(variants)

GPU Acceleration

# Check GPU availability
if sc.gpu_available():
    print(sc.gpu_info())

# Device context
with sc.device("cuda:0"):
    result = sc.align(sequences, reference)

# Memory management
sc.set_memory_limit("8GB")
sc.clear_gpu_cache()

# Timing
with sc.timer() as t:
    result = sc.align(sequences, reference)
print(f"Completed in {t.elapsed:.2f}s")

Interoperability

# NumPy
arr = sequences.to_numpy()
sequences = sc.DNAArray.from_numpy(arr)

# pandas
df = sequences.to_dataframe()
df = structure.to_dataframe()

# Biopython
bio_seq = sequences[0].to_biopython()
sc_seq = sc.DNAArray.from_biopython(bio_seq)

# RDKit
rdkit_mol = molecule.to_rdkit()
sc_mol = sc.Molecule.from_rdkit(rdkit_mol)

Performance

Seqcore provides significant speedups over traditional libraries:

Operation Biopython Seqcore Speedup
GC Content (1M seqs) 45s 0.8s 56x
Reverse Complement 12s 0.1s 120x
Translation 38s 0.5s 76x
K-mer Counting 89s 1.2s 74x

Benchmarks on AMD Ryzen 9 5900X, 32GB RAM. GPU benchmarks show additional 10-50x speedup.

Requirements

  • Python 3.9+
  • NumPy 1.21+

Optional:

  • CuPy (GPU acceleration)
  • Biopython (interoperability)
  • RDKit (molecular operations)
  • MDAnalysis (structure analysis)

Contributing

Contributions welcome. See CONTRIBUTING.md.

License

MIT License. See LICENSE.

Author

Dr. Pritam Kumar Panda Stanford University Email: pritam@stanford.edu

Citation

If you use Seqcore in your research, please cite:

@software{seqcore,
  author = {Panda, Pritam Kumar},
  title = {Seqcore: High-performance biological sequence analysis},
  url = {https://github.com/pritampanda15/seqcore},
  version = {0.1.0},
  year = {2025},
  institution = {Stanford University}
}

Project details


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

seqcore-0.2.0.tar.gz (43.8 kB view details)

Uploaded Source

Built Distribution

If you're not sure about the file name format, learn more about wheel file names.

seqcore-0.2.0-py3-none-any.whl (49.4 kB view details)

Uploaded Python 3

File details

Details for the file seqcore-0.2.0.tar.gz.

File metadata

  • Download URL: seqcore-0.2.0.tar.gz
  • Upload date:
  • Size: 43.8 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? Yes
  • Uploaded via: twine/6.1.0 CPython/3.13.7

File hashes

Hashes for seqcore-0.2.0.tar.gz
Algorithm Hash digest
SHA256 8fbf14d53fb84cb3d223d29dccaca223e73d0c487824d77e9f69e1e446d78c8f
MD5 b5a1a64b69cc6725f46e168b21b5f5aa
BLAKE2b-256 093ac1700b74dfa93a288176913fd23d027267adf1ee9092a61fe2f496db7e2b

See more details on using hashes here.

Provenance

The following attestation bundles were made for seqcore-0.2.0.tar.gz:

Publisher: publish.yml on pritampanda15/Seqcore

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

File details

Details for the file seqcore-0.2.0-py3-none-any.whl.

File metadata

  • Download URL: seqcore-0.2.0-py3-none-any.whl
  • Upload date:
  • Size: 49.4 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? Yes
  • Uploaded via: twine/6.1.0 CPython/3.13.7

File hashes

Hashes for seqcore-0.2.0-py3-none-any.whl
Algorithm Hash digest
SHA256 3455e88947568e3f3fe7ba413568c445dc8dd97ec867978a386179b7bad92d19
MD5 fc648cc5fd67cbb198890f018ac06906
BLAKE2b-256 e3fba74b576d59a0788b77cf13a2df09ce991a1e4b4ecd46c58d67662a3c365b

See more details on using hashes here.

Provenance

The following attestation bundles were made for seqcore-0.2.0-py3-none-any.whl:

Publisher: publish.yml on pritampanda15/Seqcore

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

Supported by

AWS Cloud computing and Security Sponsor Datadog Monitoring Depot Continuous Integration Fastly CDN Google Download Analytics Pingdom Monitoring Sentry Error logging StatusPage Status page