High-performance biological sequence analysis library with GPU acceleration

Seqcore

High-performance biological sequence analysis library for Python.

A unified, GPU-accelerated library for genomics, proteomics, structural biology, and drug design.

Installation

pip install seqcore

With GPU support:

pip install seqcore[gpu]

With all optional dependencies:

pip install seqcore[full]

Quick Start

import seqcore as sc

# DNA sequences - efficient 2-bit encoding
dna = sc.DNAArray("ACGTACGTACGT" * 1_000_000)

# Batch operations
sequences = sc.DNAArray([
    "ACGTACGT",
    "TGCATGCA",
    "GGGGCCCC",
])

# Vectorized operations
gc = sc.gc_content(sequences)
lengths = sc.length(sequences)
rev_comp = sc.reverse_complement(sequences)

# Translation
proteins = sc.translate(sequences)
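
The "2-bit encoding" mentioned in the first comment above can be illustrated in plain Python. This is a minimal sketch of the general technique (four bases packed into each byte), not Seqcore's actual implementation. The names pack_2bit and unpack_2bit are hypothetical, and ambiguous bases such as N are not handled.

```python
# Illustrative sketch only: generic 2-bit DNA packing, not Seqcore internals.
CODES = {"A": 0, "C": 1, "G": 2, "T": 3}
BASES = "ACGT"


def pack_2bit(seq: str) -> bytes:
    """Pack an ACGT string into bytes, four bases per byte."""
    out = bytearray((len(seq) + 3) // 4)
    for i, base in enumerate(seq):
        # Each base takes 2 bits; its slot within the byte is i % 4.
        out[i // 4] |= CODES[base] << (2 * (i % 4))
    return bytes(out)


def unpack_2bit(packed: bytes, length: int) -> str:
    """Recover the string; length is required because the last byte may be padded."""
    return "".join(
        BASES[(packed[i // 4] >> (2 * (i % 4))) & 0b11] for i in range(length)
    )


packed = pack_2bit("ACGTACGTAC")   # 10 bases fit in 3 bytes
restored = unpack_2bit(packed, 10)
```

Compared with one byte per base, this layout uses a quarter of the memory, which is presumably the motivation for the encoding in a library aimed at very long sequences.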

Features

Sequence Operations

# Transcription and translation
rna = sc.transcribe(dna)
protein = sc.translate(dna, frame=0)

# GC content, molecular weight, length
gc = sc.gc_content(dna)
mw = sc.molecular_weight(protein)  # uses the protein translated above
n = sc.length(dna)

# K-mer operations
kmers = sc.extract_kmers(sequences, k=21)
kmer_counts = sc.count_kmers(sequences, k=21)
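
For readers who want to see what these operations compute, here is a plain-Python reference sketch of GC content, reverse complement, and k-mer counting. It is not Seqcore code: the function names are hypothetical, only unambiguous A/C/G/T input is assumed, and whether Seqcore reports GC content as a fraction or a percentage is not stated in this README (the sketch returns a fraction).

```python
from collections import Counter

# Translation table mapping each base to its complement.
_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def gc_fraction(seq: str) -> float:
    """Fraction of bases that are G or C (0.0 for an empty sequence)."""
    return (seq.count("G") + seq.count("C")) / len(seq) if seq else 0.0


def reverse_complement(seq: str) -> str:
    """Complement every base, then reverse the string."""
    return seq.translate(_COMPLEMENT)[::-1]


def count_kmers(seq: str, k: int) -> Counter:
    """Count every overlapping substring of length k."""
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))


counts = count_kmers("ACGTACGT", 3)
```

A vectorized library would typically run the same logic over packed arrays instead of Python strings, which is where the speedup comes from.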

Sequence Alignment

# Pairwise alignment
result = sc.align(query, reference)
print(result.score, result.identity, result.cigar)

# Distance matrices
dm = sc.pairwise_distance(sequences, metric="edit")

# Pattern matching
matches = sc.find_pattern(sequences, "ATG[ACGT]{30,100}TAA")
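
The metric="edit" option above presumably refers to edit (Levenshtein) distance. As a hedged illustration of that metric, the sketch below computes it with the standard dynamic-programming recurrence and builds a symmetric matrix. The function names are hypothetical and this is not how Seqcore computes it internally.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance using two rolling rows of the DP table."""
    prev = list(range(len(b) + 1))  # distance from "" to each prefix of b
    for i, ca in enumerate(a, start=1):
        curr = [i]  # distance from a[:i] to ""
        for j, cb in enumerate(b, start=1):
            curr.append(min(
                prev[j] + 1,               # deletion
                curr[j - 1] + 1,           # insertion
                prev[j - 1] + (ca != cb),  # match or substitution
            ))
        prev = curr
    return prev[-1]


def pairwise_edit_distance(seqs):
    """Symmetric matrix of edit distances with zeros on the diagonal."""
    n = len(seqs)
    dm = [[0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            dm[i][j] = dm[j][i] = edit_distance(seqs[i], seqs[j])
    return dm


dm = pairwise_edit_distance(["ACGT", "ACGA", "AGGA"])
```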

File I/O

# Auto-detect format
data = sc.read("sequences.fasta")
data = sc.read("structure.pdb")
data = sc.read("reads.fastq.gz")

# Streaming for large files
for batch in sc.read_stream("huge.fastq.gz", batch_size=100_000):
    results = process(batch)  # process() stands in for your own function

# Database fetching
seq = sc.fetch("NP_000509")      # NCBI/UniProt
structure = sc.fetch("1ABC")     # PDB
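
The streaming pattern above keeps memory bounded by handing over one batch at a time. A minimal standard-library sketch of the same idea follows. It assumes strict four-line FASTQ records, and the names fastq_records, batched, and stream_fastq are hypothetical helpers, not part of Seqcore.

```python
import gzip
from itertools import islice


def fastq_records(lines):
    """Yield (name, sequence, quality) from an iterable of FASTQ lines.

    Assumes four lines per record; a trailing partial record is dropped.
    """
    it = iter(lines)
    # zip over the same iterator consumes four consecutive lines per step.
    for header, seq, _plus, qual in zip(it, it, it, it):
        yield header.strip()[1:], seq.strip(), qual.strip()


def batched(records, batch_size):
    """Group any iterable into lists of at most batch_size items."""
    it = iter(records)
    while batch := list(islice(it, batch_size)):
        yield batch


def stream_fastq(path, batch_size):
    """Lazily read a plain or gzip-compressed FASTQ file in batches."""
    opener = gzip.open if path.endswith(".gz") else open
    with opener(path, "rt") as handle:
        yield from batched(fastq_records(handle), batch_size)


# In-memory demonstration, so the sketch runs without a file on disk.
demo_lines = [
    "@r1\n", "ACGT\n", "+\n", "IIII\n",
    "@r2\n", "GGCC\n", "+\n", "JJJJ\n",
    "@r3\n", "TT\n", "+\n", "II\n",
]
batches = list(batched(fastq_records(demo_lines), 2))
```

Because every stage is a generator, only one batch is held in memory at a time regardless of file size.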

Structural Biology

# Load structure
structure = sc.read("protein.pdb")

# Access data
print(structure.chains)      # ['A', 'B']
print(structure.n_residues)  # 265

# Distance matrix
dm = sc.distance_matrix(structure, selection="CA")

# Find contacts
contacts = sc.find_contacts(structure, cutoff=4.0)

# RMSD calculation
rmsd = sc.rmsd(structure1, structure2, align=True)

# Surface analysis
sasa = sc.sasa(structure)
surface = sc.surface_residues(structure, threshold=25.0)

# Binding pockets
pockets = sc.find_pockets(structure)
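
To make the contact and RMSD calls above concrete, here is a small sketch on raw coordinates. It is an illustration under stated assumptions, not Seqcore's method: coordinates are plain (x, y, z) tuples in ångströms, contacts are index pairs within the cutoff, and the RMSD is computed without superposition (the align=True option above presumably superimposes the structures first, which this sketch does not do).

```python
import math


def find_contacts(coords, cutoff):
    """Return index pairs (i, j), i < j, whose distance is within cutoff."""
    pairs = []
    for i in range(len(coords)):
        for j in range(i + 1, len(coords)):
            if math.dist(coords[i], coords[j]) <= cutoff:
                pairs.append((i, j))
    return pairs


def rmsd(a, b):
    """Root-mean-square deviation between two equally sized coordinate lists."""
    if len(a) != len(b) or not a:
        raise ValueError("coordinate lists must be non-empty and equally long")
    squared = sum(math.dist(p, q) ** 2 for p, q in zip(a, b))
    return math.sqrt(squared / len(a))


atoms = [(0.0, 0.0, 0.0), (3.0, 0.0, 0.0), (10.0, 0.0, 0.0)]
contacts = find_contacts(atoms, cutoff=4.0)
```

The all-pairs loop is quadratic in the number of atoms; production code usually relies on a spatial index or GPU kernels for large structures.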

Drug Design

# Small molecules
mol = sc.Molecule.from_smiles("CCO")

# Molecular properties
mw = sc.molecular_weight(molecules)
logp = sc.logp(molecules)
hbd = sc.h_bond_donors(molecules)

# ADMET filters
passes_lipinski = sc.lipinski_filter(molecules)
bbb_permeable = sc.bbb_filter(molecules)

# Fingerprints and similarity
fps = sc.morgan_fingerprint(molecules, radius=2)
similarity = sc.tanimoto_similarity(fps)

# Substructure search
matches = sc.substructure_search(molecules, "c1ccccc1")
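
Two of the ideas above are simple enough to sketch without a chemistry toolkit. The Tanimoto coefficient of two fingerprints is the size of their intersection divided by the size of their union, and Lipinski's rule of five checks molecular weight ≤ 500 Da, logP ≤ 5, at most 5 hydrogen-bond donors, and at most 10 acceptors. The code below is an illustration, not Seqcore's API: fingerprints are modeled as sets of "on" bit indices, the one-violation allowance is a common convention assumed here, and the property values passed in are made up for the example.

```python
def tanimoto(a: set, b: set) -> float:
    """Tanimoto (Jaccard) similarity of two sets of on-bit indices."""
    union = len(a | b)
    # Two empty fingerprints are treated as dissimilar here (a chosen convention).
    return len(a & b) / union if union else 0.0


def passes_lipinski(mw, logp, hbd, hba, max_violations=1):
    """Rule-of-five check; max_violations=1 is an assumed, commonly used allowance."""
    violations = sum([
        mw > 500,    # molecular weight in daltons
        logp > 5,    # octanol-water partition coefficient
        hbd > 5,     # hydrogen-bond donors
        hba > 10,    # hydrogen-bond acceptors
    ])
    return violations <= max_violations


similarity = tanimoto({1, 2, 3}, {2, 3, 4})
small_ok = passes_lipinski(mw=46.07, logp=-0.3, hbd=1, hba=1)   # illustrative values
large_ok = passes_lipinski(mw=800.0, logp=7.0, hbd=6, hba=12)   # illustrative values
```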

Phylogenetics

# Tree construction
tree = sc.neighbor_joining(sequences)
tree = sc.upgma(sequences)

# Tree operations
print(tree.newick())
dist = tree.distance("Species_A", "Species_B")
subtree = tree.prune(["A", "B", "C"])
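
As background for the UPGMA call above, the following sketch shows the clustering idea on a precomputed distance matrix: repeatedly merge the two closest clusters and average their distances to the rest, weighted by cluster size. It is a teaching sketch with hypothetical names (upgma, to_newick) that returns topology only, without branch lengths, and it makes no claim about how Seqcore builds or represents trees.

```python
def upgma(names, dm):
    """Build a nested-tuple tree from a symmetric distance matrix via UPGMA."""
    clusters = {i: (name, 1) for i, name in enumerate(names)}  # id -> (tree, size)
    dist = {
        (i, j): dm[i][j]
        for i in range(len(names))
        for j in range(i + 1, len(names))
    }
    next_id = len(names)
    while len(clusters) > 1:
        # Pick the closest pair; keys are always stored as (smaller, larger).
        a, b = min(dist, key=dist.get)
        (tree_a, n_a), (tree_b, n_b) = clusters.pop(a), clusters.pop(b)
        del dist[(a, b)]
        for c in clusters:
            d_ac = dist.pop((min(a, c), max(a, c)))
            d_bc = dist.pop((min(b, c), max(b, c)))
            # Size-weighted average keeps every leaf equally weighted.
            dist[(c, next_id)] = (n_a * d_ac + n_b * d_bc) / (n_a + n_b)
        clusters[next_id] = ((tree_a, tree_b), n_a + n_b)
        next_id += 1
    (tree, _size), = clusters.values()
    return tree


def to_newick(tree):
    """Serialize the nested tuples as a Newick topology string (no lengths)."""
    if isinstance(tree, str):
        return tree
    return "(" + ",".join(to_newick(child) for child in tree) + ")"


matrix = [[0, 2, 6], [2, 0, 6], [6, 6, 0]]
newick = to_newick(upgma(["A", "B", "C"], matrix)) + ";"
```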

Population Genetics

# Variant analysis
variants = sc.read("variants.vcf")
af = sc.allele_frequency(variants)
maf = sc.minor_allele_frequency(variants)

# Population statistics
fst = sc.fst(pop1, pop2)
pi = sc.nucleotide_diversity(sequences)
d = sc.tajimas_d(sequences)

# Linkage disequilibrium
ld = sc.linkage_disequilibrium(variants)
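
The statistics above have short textbook definitions, sketched here in plain Python. The assumptions are mine, not Seqcore's: genotypes are diploid and coded as alternate-allele counts (0, 1, 2, with None for missing calls), sequences are pre-aligned and of equal length, and nucleotide diversity is the mean number of pairwise differences per site. The function names mirror the calls above only for readability.

```python
from itertools import combinations


def allele_frequency(genotypes):
    """Alternate-allele frequency from diploid genotypes coded 0, 1, 2."""
    called = [g for g in genotypes if g is not None]  # skip missing calls
    if not called:
        return float("nan")
    return sum(called) / (2 * len(called))


def minor_allele_frequency(genotypes):
    """Frequency of whichever allele is rarer."""
    af = allele_frequency(genotypes)
    return min(af, 1 - af)


def nucleotide_diversity(seqs):
    """Mean pairwise differences per site across aligned sequences."""
    pairs = list(combinations(seqs, 2))
    if not pairs or not seqs[0]:
        raise ValueError("need at least two non-empty aligned sequences")
    diffs = sum(sum(x != y for x, y in zip(a, b)) for a, b in pairs)
    return diffs / (len(pairs) * len(seqs[0]))


site = [0, 0, 1, 2, 2, 2, None]
af = allele_frequency(site)
maf = minor_allele_frequency(site)
pi_hat = nucleotide_diversity(["ACGT", "ACGA", "ACGT"])
```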

GPU Acceleration

# Check GPU availability
if sc.gpu_available():
    print(sc.gpu_info())

# Device context
with sc.device("cuda:0"):
    result = sc.align(sequences, reference)

# Memory management
sc.set_memory_limit("8GB")
sc.clear_gpu_cache()

# Timing
with sc.timer() as t:
    result = sc.align(sequences, reference)
print(f"Completed in {t.elapsed:.2f}s")
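
The timing pattern above is an ordinary context manager exposing an elapsed attribute. A minimal sketch of how such a timer can be written with the standard library follows. The Timer class is hypothetical and is not Seqcore's implementation; for GPU work, a real timer would typically also need to wait for pending device operations before reading the clock, which this sketch does not attempt.

```python
import time


class Timer:
    """Context manager that records wall-clock time spent inside the block."""

    def __enter__(self):
        self.elapsed = 0.0
        self._start = time.perf_counter()
        return self

    def __exit__(self, exc_type, exc, tb):
        self.elapsed = time.perf_counter() - self._start
        return False  # do not suppress exceptions raised inside the block


with Timer() as t:
    total = sum(range(100_000))
print(f"Completed in {t.elapsed:.4f}s")
```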

Interoperability

# NumPy
arr = sequences.to_numpy()
sequences = sc.DNAArray.from_numpy(arr)

# pandas
df = sequences.to_dataframe()
df = structure.to_dataframe()

# Biopython
bio_seq = sequences[0].to_biopython()
sc_seq = sc.DNAArray.from_biopython(bio_seq)

# RDKit
rdkit_mol = molecule.to_rdkit()
sc_mol = sc.Molecule.from_rdkit(rdkit_mol)

Performance

Reported CPU timings for Seqcore compared with Biopython:

Operation               Biopython   Seqcore   Speedup
GC Content (1M seqs)    45s         0.8s      56x
Reverse Complement      12s         0.1s      120x
Translation             38s         0.5s      76x
K-mer Counting          89s         1.2s      74x

Benchmarks were run on an AMD Ryzen 9 5900X with 32 GB RAM. GPU benchmarks show an additional 10-50x speedup.

Requirements

  • Python 3.9+
  • NumPy 1.21+

Optional:

  • CuPy (GPU acceleration)
  • Biopython (interoperability)
  • RDKit (molecular operations)
  • MDAnalysis (structure analysis)

Contributing

Contributions welcome. See CONTRIBUTING.md.

License

MIT License. See LICENSE.

Author

Dr. Pritam Kumar Panda
Stanford University
Email: pritam@stanford.edu

Citation

If you use Seqcore in your research, please cite:

@software{seqcore,
  author = {Panda, Pritam Kumar},
  title = {Seqcore: High-performance biological sequence analysis},
  url = {https://github.com/pritampanda15/seqcore},
  version = {0.3.0},
  year = {2025},
  institution = {Stanford University}
}
