High-performance biological sequence analysis library with GPU acceleration

Seqcore

High-performance biological sequence analysis library for Python.

A unified, GPU-accelerated library for genomics, proteomics, structural biology, and drug design.

Installation

pip install seqcore

With GPU support:

pip install seqcore[gpu]

With all optional dependencies:

pip install seqcore[full]

Quick Start

import seqcore as sc

# DNA sequences - efficient 2-bit encoding
dna = sc.DNAArray("ACGTACGTACGT" * 1_000_000)

# Batch operations
sequences = sc.DNAArray([
    "ACGTACGT",
    "TGCATGCA",
    "GGGGCCCC",
])

# Vectorized operations
gc = sc.gc_content(sequences)
lengths = sc.length(sequences)
rev_comp = sc.reverse_complement(sequences)

# Translation
proteins = sc.translate(sequences)
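
The "2-bit encoding" mentioned for DNAArray can be pictured with a small pure-Python sketch. This only illustrates the general idea (four bases packed into each byte) and is not Seqcore's internal layout. The names pack_2bit and unpack_2bit are hypothetical, and ambiguous bases such as N are not handled.

```python
# Illustrative 2-bit packing of DNA; an assumption-based sketch, not Seqcore internals.
_CODE = {"A": 0, "C": 1, "G": 2, "T": 3}
_BASE = "ACGT"


def pack_2bit(seq: str) -> bytes:
    """Pack an ACGT string into bytes, four bases per byte (lowest bits first)."""
    packed = bytearray((len(seq) + 3) // 4)
    for i, base in enumerate(seq.upper()):
        packed[i // 4] |= _CODE[base] << (2 * (i % 4))
    return bytes(packed)


def unpack_2bit(packed: bytes, length: int) -> str:
    """Recover the first `length` bases from a buffer produced by pack_2bit."""
    return "".join(
        _BASE[(packed[i // 4] >> (2 * (i % 4))) & 0b11] for i in range(length)
    )
```

The original length has to be stored alongside the packed bytes, because the last byte may be only partly filled.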

Features

Sequence Operations

# Transcription and translation
rna = sc.transcribe(dna)
protein = sc.translate(dna, frame=0)

# GC content, length, molecular weight
gc = sc.gc_content(dna)
n = sc.length(dna)
mw = sc.molecular_weight(protein)  # uses the protein translated above

# K-mer operations
kmers = sc.extract_kmers(sequences, k=21)
kmer_counts = sc.count_kmers(sequences, k=21)
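
For readers who want to see what these operations compute, here is a plain-Python reference sketch for single sequences. It is not Seqcore's implementation and makes no claim about its return types. It assumes the standard genetic code (NCBI translation table 1), writes unknown codons as X, and drops a trailing partial codon.

```python
from collections import Counter

# Standard genetic code, codons enumerated with bases in TCAG order.
_BASES = "TCAG"
_AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
_CODON_TABLE = {
    a + b + c: _AMINO[16 * i + 4 * j + k]
    for i, a in enumerate(_BASES)
    for j, b in enumerate(_BASES)
    for k, c in enumerate(_BASES)
}


def gc_content(seq: str) -> float:
    """Fraction of bases that are G or C (0.0 for an empty sequence)."""
    seq = seq.upper()
    return (seq.count("G") + seq.count("C")) / len(seq) if seq else 0.0


def translate(seq: str, frame: int = 0) -> str:
    """Translate DNA starting at offset `frame`; '*' marks stop codons."""
    seq = seq.upper()
    codons = (seq[i:i + 3] for i in range(frame, len(seq) - 2, 3))
    return "".join(_CODON_TABLE.get(codon, "X") for codon in codons)


def count_kmers(seq: str, k: int) -> Counter:
    """Count every overlapping substring of length k."""
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))
```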

Sequence Alignment

# Pairwise alignment
result = sc.align(query, reference)
print(result.score, result.identity, result.cigar)

# Distance matrices
dm = sc.pairwise_distance(sequences, metric="edit")

# Pattern matching
matches = sc.find_pattern(sequences, "ATG[ACGT]{30,100}TAA")
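
The "edit" metric and the pattern string above can be illustrated with the standard library. This sketch assumes that "edit" means Levenshtein distance and that patterns follow regular-expression syntax, which the example pattern suggests. The helper names are hypothetical and this is not how Seqcore itself computes them.

```python
import re


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance via dynamic programming, keeping two rows."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(
                prev[j] + 1,               # deletion from a
                curr[j - 1] + 1,           # insertion into a
                prev[j - 1] + (ca != cb),  # match or substitution
            ))
        prev = curr
    return prev[-1]


def pairwise_edit_distance(seqs):
    """Symmetric matrix of edit distances as a list of lists."""
    return [[edit_distance(a, b) for b in seqs] for a in seqs]


def find_pattern(seq: str, pattern: str):
    """Return (start, end) spans of non-overlapping regex matches."""
    return [m.span() for m in re.finditer(pattern, seq)]
```

Note that re.finditer reports non-overlapping matches only, so nested open reading frames would need a lookahead-based pattern.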

File I/O

# Auto-detect format
data = sc.read("sequences.fasta")
data = sc.read("structure.pdb")
data = sc.read("reads.fastq.gz")

# Streaming for large files
for batch in sc.read_stream("huge.fastq.gz", batch_size=100_000):
    results = process(batch)

# Database fetching
seq = sc.fetch("NP_000509")      # NCBI/UniProt
structure = sc.fetch("1ABC")     # PDB
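
Batch streaming keeps memory use bounded by reading a fixed number of records at a time. The generator below sketches that pattern with the standard library. It assumes strict four-line FASTQ records, and read_fastq_batches is a hypothetical name rather than part of Seqcore.

```python
import gzip
from itertools import islice


def read_fastq_batches(path, batch_size):
    """Yield lists of (name, sequence, quality) tuples, batch_size records at a time."""
    opener = gzip.open if str(path).endswith(".gz") else open
    with opener(path, "rt") as handle:
        while True:
            # Each FASTQ record spans exactly four lines in this sketch.
            lines = list(islice(handle, 4 * batch_size))
            if not lines:
                return
            yield [
                (lines[i][1:].strip(), lines[i + 1].strip(), lines[i + 3].strip())
                for i in range(0, len(lines) - 3, 4)
            ]
```

Only one batch is held in memory at any time, so the file size does not limit what can be processed.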

Structural Biology

# Load structure
structure = sc.read("protein.pdb")

# Access data
print(structure.chains)      # ['A', 'B']
print(structure.n_residues)  # 265

# Distance matrix
dm = sc.distance_matrix(structure, selection="CA")

# Find contacts
contacts = sc.find_contacts(structure, cutoff=4.0)

# RMSD calculation
rmsd = sc.rmsd(structure1, structure2, align=True)

# Surface analysis
sasa = sc.sasa(structure)
surface = sc.surface_residues(structure, threshold=25.0)

# Binding pockets
pockets = sc.find_pockets(structure)
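
The geometric quantities above reduce to simple formulas on atomic coordinates. The following sketch works on plain (x, y, z) tuples and is meant only to show the definitions. It is not Seqcore's code, and the rmsd here does not superimpose the structures first, which align=True presumably does.

```python
import math


def distance_matrix(coords):
    """All-against-all Euclidean distances between points."""
    return [[math.dist(a, b) for b in coords] for a in coords]


def find_contacts(coords, cutoff):
    """Index pairs (i, j) with i < j whose distance is within cutoff."""
    return [
        (i, j)
        for i in range(len(coords))
        for j in range(i + 1, len(coords))
        if math.dist(coords[i], coords[j]) <= cutoff
    ]


def rmsd(coords_a, coords_b):
    """Root-mean-square deviation between paired points, without superposition."""
    if len(coords_a) != len(coords_b):
        raise ValueError("coordinate sets must have the same length")
    squared = [math.dist(a, b) ** 2 for a, b in zip(coords_a, coords_b)]
    return math.sqrt(sum(squared) / len(squared))
```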

Drug Design

# Small molecules
mol = sc.Molecule.from_smiles("CCO")

# Molecular properties
mw = sc.molecular_weight(molecules)
logp = sc.logp(molecules)
hbd = sc.h_bond_donors(molecules)

# ADMET filters
passes_lipinski = sc.lipinski_filter(molecules)
bbb_permeable = sc.bbb_filter(molecules)

# Fingerprints and similarity
fps = sc.morgan_fingerprint(molecules, radius=2)
similarity = sc.tanimoto_similarity(fps)

# Substructure search
matches = sc.substructure_search(molecules, "c1ccccc1")
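
Two of the ideas above have compact definitions that are easy to show without a cheminformatics toolkit. The sketch below applies Lipinski's rule of five to descriptor values entered by hand and computes Tanimoto similarity on sets of "on" fingerprint bits. The Descriptors class, the function names, and the one-violation allowance are assumptions for illustration, and the filter Seqcore applies may differ.

```python
from dataclasses import dataclass


@dataclass
class Descriptors:
    molecular_weight: float  # daltons
    logp: float              # octanol-water partition coefficient
    h_bond_donors: int
    h_bond_acceptors: int


def lipinski_violations(d: Descriptors) -> int:
    """Count how many rule-of-five limits are exceeded."""
    return sum([
        d.molecular_weight > 500,
        d.logp > 5,
        d.h_bond_donors > 5,
        d.h_bond_acceptors > 10,
    ])


def passes_lipinski(d: Descriptors, max_violations: int = 1) -> bool:
    """Common reading of the rule: at most one violation is tolerated."""
    return lipinski_violations(d) <= max_violations


def tanimoto(bits_a: set, bits_b: set) -> float:
    """Shared bits divided by total distinct bits (0.0 if both sets are empty)."""
    union = len(bits_a | bits_b)
    return len(bits_a & bits_b) / union if union else 0.0
```

For ethanol ("CCO"), with approximate values of 46.07 Da, logP near -0.3, one donor and one acceptor, no limit is exceeded.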

Phylogenetics

# Tree construction
tree = sc.neighbor_joining(sequences)
tree = sc.upgma(sequences)

# Tree operations
print(tree.newick())
dist = tree.distance("Species_A", "Species_B")
subtree = tree.prune(["A", "B", "C"])
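
As a rough picture of what UPGMA does, the sketch below clusters taxa from a precomputed distance matrix and returns a Newick string. It is a minimal textbook version under stated assumptions (symmetric matrix, ties broken by first occurrence) and not Seqcore's implementation, which takes sequences rather than a matrix. The function signature is hypothetical.

```python
from itertools import combinations


def upgma(names, matrix):
    """Build a UPGMA tree from a symmetric distance matrix; return Newick text."""
    # Each cluster stores: Newick label, number of leaves, height of its root.
    clusters = {i: (name, 1, 0.0) for i, name in enumerate(names)}
    dist = {(i, j): matrix[i][j] for i, j in combinations(range(len(names)), 2)}
    next_id = len(names)
    while len(clusters) > 1:
        # Merge the closest pair of clusters.
        (a, b), d = min(dist.items(), key=lambda item: item[1])
        (label_a, n_a, h_a), (label_b, n_b, h_b) = clusters.pop(a), clusters.pop(b)
        height = d / 2
        label = f"({label_a}:{height - h_a:g},{label_b}:{height - h_b:g})"
        # Distance to the new cluster is the size-weighted average.
        updated = {}
        for c in clusters:
            d_ac = dist[tuple(sorted((a, c)))]
            d_bc = dist[tuple(sorted((b, c)))]
            updated[(c, next_id)] = (n_a * d_ac + n_b * d_bc) / (n_a + n_b)
        dist = {k: v for k, v in dist.items() if a not in k and b not in k}
        dist.update(updated)
        clusters[next_id] = (label, n_a + n_b, height)
        next_id += 1
    label, _, _ = next(iter(clusters.values()))
    return label + ";"
```

Neighbor joining differs in that it corrects for unequal rates between lineages, so it does not assume a molecular clock the way UPGMA does.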

Population Genetics

# Variant analysis
variants = sc.read("variants.vcf")
af = sc.allele_frequency(variants)
maf = sc.minor_allele_frequency(variants)

# Population statistics
fst = sc.fst(pop1, pop2)
pi = sc.nucleotide_diversity(sequences)
d = sc.tajimas_d(sequences)

# Linkage disequilibrium
ld = sc.linkage_disequilibrium(variants)
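
The basic statistics above have short definitions, sketched here in plain Python. The inputs are assumptions chosen for illustration: diploid genotypes coded as alternate-allele counts (0, 1 or 2) for one site, and equal-length aligned sequences for nucleotide diversity. These helpers are not Seqcore's functions and do not read VCF files.

```python
from itertools import combinations


def allele_frequency(genotypes):
    """Alternate-allele frequency at one site from diploid genotype counts."""
    return sum(genotypes) / (2 * len(genotypes))


def minor_allele_frequency(genotypes):
    """Frequency of whichever allele is rarer at the site."""
    af = allele_frequency(genotypes)
    return min(af, 1 - af)


def nucleotide_diversity(sequences):
    """Average number of pairwise differences per site (pi)."""
    pairs = list(combinations(sequences, 2))
    differences = sum(
        sum(x != y for x, y in zip(a, b)) for a, b in pairs
    )
    return differences / (len(pairs) * len(sequences[0]))
```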

GPU Acceleration

# Check GPU availability
if sc.gpu_available():
    print(sc.gpu_info())

# Device context
with sc.device("cuda:0"):
    result = sc.align(sequences, reference)

# Memory management
sc.set_memory_limit("8GB")
sc.clear_gpu_cache()

# Timing
with sc.timer() as t:
    result = sc.align(sequences, reference)
print(f"Completed in {t.elapsed:.2f}s")
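
A timing context manager of the kind used above is straightforward to write with the standard library. The class below is a hypothetical stand-in that exposes the same elapsed attribute. It measures wall-clock time only and makes no attempt to synchronize with a GPU, which a real GPU-aware timer would likely need to do.

```python
import time


class Timer:
    """Context manager that records wall-clock seconds in `elapsed`."""

    def __enter__(self):
        self.elapsed = 0.0
        self._start = time.perf_counter()
        return self

    def __exit__(self, exc_type, exc, traceback):
        self.elapsed = time.perf_counter() - self._start
        return False  # do not suppress exceptions raised inside the block


with Timer() as t:
    total = sum(range(10_000))
print(f"Completed in {t.elapsed:.6f}s")
```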

Interoperability

# NumPy
arr = sequences.to_numpy()
sequences = sc.DNAArray.from_numpy(arr)

# pandas
df = sequences.to_dataframe()
df = structure.to_dataframe()

# Biopython
bio_seq = sequences[0].to_biopython()
sc_seq = sc.DNAArray.from_biopython(bio_seq)

# RDKit
rdkit_mol = molecule.to_rdkit()
sc_mol = sc.Molecule.from_rdkit(rdkit_mol)

Performance

Seqcore reports speedups of 56x to 120x over Biopython on common sequence operations:

Operation             Biopython  Seqcore  Speedup
GC Content (1M seqs)  45s        0.8s     56x
Reverse Complement    12s        0.1s     120x
Translation           38s        0.5s     76x
K-mer Counting        89s        1.2s     74x

Benchmarks on AMD Ryzen 9 5900X, 32GB RAM. GPU benchmarks show additional 10-50x speedup.
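
The Speedup figures appear to be the Biopython time divided by the Seqcore time, rounded to the nearest integer. The short check below recomputes them from the timings reported above; the dictionary layout is just a convenience for this example.

```python
# Reported timings in seconds: (Biopython, Seqcore).
timings = {
    "GC Content (1M seqs)": (45.0, 0.8),
    "Reverse Complement": (12.0, 0.1),
    "Translation": (38.0, 0.5),
    "K-mer Counting": (89.0, 1.2),
}

# Speedup = baseline time / Seqcore time, rounded to a whole number.
speedups = {
    operation: round(biopython / seqcore)
    for operation, (biopython, seqcore) in timings.items()
}

for operation, factor in speedups.items():
    print(f"{operation}: {factor}x")
```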

Requirements

  • Python 3.9+
  • NumPy 1.21+

Optional:

  • CuPy (GPU acceleration)
  • Biopython (interoperability)
  • RDKit (molecular operations)
  • MDAnalysis (structure analysis)
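
To see which optional dependencies are present in the current environment, a check like the following can help. It uses the usual import names for these packages (cupy, Bio, rdkit, MDAnalysis) and only tests whether they can be found. It does not verify versions or a working GPU, and it is independent of Seqcore.

```python
from importlib.util import find_spec

# Display name -> top-level module name used at import time.
OPTIONAL_MODULES = {
    "CuPy": "cupy",
    "Biopython": "Bio",
    "RDKit": "rdkit",
    "MDAnalysis": "MDAnalysis",
}


def available_optional_deps():
    """Map each optional package to True if its module can be located."""
    return {
        name: find_spec(module) is not None
        for name, module in OPTIONAL_MODULES.items()
    }


for name, present in available_optional_deps().items():
    print(f"{name}: {'installed' if present else 'missing'}")
```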

Contributing

Contributions welcome. See CONTRIBUTING.md.

License

MIT License. See LICENSE.

Author

Dr. Pritam Kumar Panda
Stanford University
Email: pritam@stanford.edu

Citation

If you use Seqcore in your research, please cite:

@software{seqcore,
  author = {Panda, Pritam Kumar},
  title = {Seqcore: High-performance biological sequence analysis},
  url = {https://github.com/pritampanda15/seqcore},
  version = {0.1.0},
  year = {2025},
  institution = {Stanford University}
}
