Asynchronous genomic comparison and visualization toolkit that runs on local devices (parsers, loaders, matchers, visualizers).
PharmaSight
An open-source asynchronous proteomic and molecular comparison toolkit built by Biotronics AI. It supports multi-format parsing, efficient memory management with lazy-loading, and local similarity matching for pharma, drug repurposing/repositioning and molecular analysis workflows.
Documentation & Resources:
- GitHub Repository: Biotronics-Ai/PharmaSight
- Research Paper: PharmaSight Preprint
What it delivers in practice
- Multi-format molecular comparison within seconds: Parse 14+ file formats (PDB, MMCIF, FASTA, MZML, SMILES, InChI, and more), stream in batches, eliminate weak candidates fast, then score the top matches with kernel-based similarity search.
- Memory-efficient batch processing: Lazy-load sequences from memory-mapped files, process in configurable batches (default 25), and maintain stable RAM usage even with 20,000+ samples.
- Unified extraction pipeline: Single-pass extract_all() method across all parsers normalizes sequence and metadata extraction, reducing code duplication and improving performance.
- Multi-format molecular handling: Normalize heterogeneous sequence formats (genomic, proteomic, cheminformatic) to be compatible and directly comparable.
Features
Supported Data File Formats:
- Sequence: FASTA, FASTQ, GenBank formats.
- Structure: PDB (Protein Data Bank), MMCIF (mmCIF).
- Mass Spectrometry: MZML, MZXML, MZIdentML, MZTab.
- Proteomics: PepXML.
- Chemistry: SMILES, InChI, SDF, MOL files.
- Network: EdgeList, BioPAX RDF.
Core Features:
- MolSample (lazy-loading with memmap read-only mode)
- Unified extract_all() extraction across 14 parser classes
- Batch processing with memory monitoring (psutil integration)
- KernelMatrix for fast similarity search and best-match discovery
- Async/concurrent file parsing with garbage collection between batches
- Flexible metadata extraction with normalized output format
Memory-aware architecture:
- Lazy-loading: Store only file paths initially, load sequences on-demand
- Memory monitoring: Track RSS memory before/after each batch
- Batch processing: Process 25 files concurrently, then gc.collect()
- Memmap buffers: Read-only access (mmap_mode='r') for minimal overhead
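The lazy-loading pattern described above can be sketched in a few lines. `LazySample` here is a hypothetical stand-in for `MolSample`, showing only the path-first, load-on-access idea, not the library's actual class:

```python
import os
import tempfile
import numpy as np

class LazySample:
    """Hypothetical sketch: store only a memmap path, load the array on demand."""

    def __init__(self, memmap_path):
        self.memmap_path = memmap_path
        self._sequence = None  # nothing loaded yet

    @property
    def sequence(self):
        if self._sequence is None:
            # mmap_mode='r' maps the file read-only; pages are fetched on demand
            self._sequence = np.load(self.memmap_path, mmap_mode="r")
        return self._sequence

    def clear_cache(self):
        # Drop the memmap handle so the OS can reclaim the mapped pages
        self._sequence = None

# Demo: write a vector to disk, then lazy-load it
path = os.path.join(tempfile.mkdtemp(), "sample.npy")
np.save(path, np.arange(1000, dtype=np.float32))
s = LazySample(path)
print(s.sequence.shape)  # (1000,)
```

The key property is that constructing thousands of `LazySample` objects costs almost nothing; disk I/O happens only for samples whose `sequence` is actually read.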
Install
python -m venv .venv
source .venv/bin/activate
pip install pharmasight
Dependencies are defined in pyproject.toml / requirements.txt.
Project layout
- models.py — MolSample class, lazy-loading with memmap support.
- parsers.py — 14 file-format parsers with unified extract_all() method.
- kernel.py — KernelMatrix for similarity search and matching.
- base.py — base components and utility helpers.
Quick start
1) Unified extraction across multiple formats
All parsers implement the same extract_all() interface for consistent, single-pass extraction:
from pharmasight.parsers import FASTAParser, PDBStructureParser
# FASTA example
fasta_parser = FASTAParser()
result = fasta_parser.extract_all(raw_fasta_data)
sequence = result["sequence"]
metadata = result["metadata"]
# PDB example
pdb_parser = PDBStructureParser()
result = pdb_parser.extract_all(raw_pdb_data)
structure_vectors = result["sequence"]
pdb_metadata = result["metadata"]
Real life: Parse genomic sequences (FASTA), protein structures (PDB), and mass spectrometry data (MZML) in a single, unified pipeline without format-specific extraction logic.
2) Batch processing with memory monitoring
Load and parse large sample collections efficiently with automatic memory management:
from pharmasight.extract import AsyncExtractor
from pharmasight.models import MolSample
extractor = AsyncExtractor(
batch_size=25,
memmap_dir="mem_map",
logs_dir="logs"
)
samples = await extractor.extract_from_directory(
directory_path="data_samples/trcc_pdb",
recursive=True
)
# Samples are lazy-loaded; sequences stored as memmap file paths
for sample in samples:
# Access sequence only when needed (lazy-load)
seq = sample.sequence
print(f"{sample.sample_id}: {seq.shape}")
Real life: Process 19,000+ molecular structures without exhausting RAM; batch operations with memory delta logging ensure stable performance.
3) Two-directory similarity workflow
Build a reference kernel from one directory, then search against samples in another:
from pharmasight.kernel import KernelMatrix
from pharmasight.extract import AsyncExtractor
# Phase 1: Build kernel from reference directory
extractor = AsyncExtractor(batch_size=25, memmap_dir="mem_map")
kernel_samples = await extractor.extract_from_directory("data_samples/trcc_pdb")
kernel = KernelMatrix(kernel_samples, memmap_dir="mem_map", logs_dir="logs")
# Phase 2: Search target samples in kernel
target_samples = await extractor.extract_from_directory("data_samples/trcc")
results = {}
for target in target_samples:
best_matches = kernel.best_match(target.sample_id, top_k=5)
results[target.sample_id] = best_matches
Real life: Compare query molecules against a pharmaceutical database; find top-5 closest structural matches within seconds.
4) Kernel matrix with lazy-loaded sequences
Use KernelMatrix for fast similarity search across large sample pools without loading everything into RAM:
from pharmasight.kernel import KernelMatrix
kernel = KernelMatrix(
samples=samples,
memmap_dir="mem_map",
logs_dir="logs"
)
# Find top-5 most similar samples to a query
best_matches = kernel.best_match("query_sample_id", top_k=5)
print(f"Top matches: {best_matches}")
kernel.cleanup()
⚠️ Memory note: Kernel matrix build time and memory scale with sequence length and sample count. Set memmap_dir for disk buffering and monitor memory usage via psutil. Ensure sequences are normalized to consistent dimensions (padded/truncated to base_length).
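The pad/truncate rule mentioned in the note can be sketched as follows; `normalize_length` is an illustrative helper, not PharmaSight's API:

```python
import numpy as np

def normalize_length(vec, base_length):
    """Illustrative sketch: force a sequence vector to exactly base_length."""
    vec = np.asarray(vec, dtype=np.float32)
    if vec.shape[0] >= base_length:
        return vec[:base_length]             # truncate long sequences
    padded = np.zeros(base_length, dtype=np.float32)
    padded[:vec.shape[0]] = vec              # zero-pad short sequences
    return padded

print(normalize_length([1.0, 2.0, 3.0], 5))  # [1. 2. 3. 0. 0.]
```

Applying this before kernel construction guarantees every sample has identical dimensionality, which the kernel requires.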
5) Direct MolSample creation with lazy-loading
Create molecular samples manually for custom workflows:
import numpy as np
from pharmasight.models import MolSample
# From memmap file
sample = MolSample(
sample_id="PDB_1ABC",
memmap_path="mem_map/AF-Q4CKA0-F1-model_v6.npy", # lazy-load on access
metadata={"format": "pdb", "organism": "human"}
)
# Sequence loads from disk only when accessed
vec = sample.sequence # Loads memmap here
print(f"Shape: {vec.shape}, Size: {vec.nbytes/1024/1024:.2f} MB")
# Clear cache if needed
sample.clear_cache()
Real life: Build custom pipelines where sequences are accessed on-demand; avoid unnecessary disk I/O and memory consumption.
6) Fast iteration: memmap-only kernel building
Skip expensive parsing and build kernel directly from pre-parsed memmaps:
from pharmasight import test_memmap_kernel
# Assumes mem_map/ directory contains .npy files
result = test_memmap_kernel.main()
# Loads existing memmaps, builds kernel, saves results in seconds
Real life: Iterate rapidly on similarity search logic without re-parsing 19,000+ files (saves 30+ minutes per test cycle).
7) Mixed format handling with automatic normalization
Compare samples across different file formats seamlessly:
from pharmasight.extract import AsyncExtractor
extractor = AsyncExtractor(batch_size=25, memmap_dir="mem_map")
# Mix of FASTA, PDB, MZML, SMILES formats
mixed_samples = await extractor.extract_from_directory(
"data_samples/mixed",
recursive=True
)
# All sequences normalized to vectors; ready for kernel/comparison
print(f"Loaded {len(mixed_samples)} samples across mixed formats")
for s in mixed_samples[:3]:
print(f" {s.sample_id} ({s.metadata.get('format')}): {s.sequence.shape}")
Real life: Pharma research combining genomic (FASTA), structural (PDB), and proteomic (MS) data in one analysis pipeline.
Key components
| Component | Purpose | Key Features |
|---|---|---|
| MolSample | Sample container | Lazy-loading, memmap support, metadata, cache management |
| 14 Parsers | File format extraction | Unified extract_all(), async-friendly, normalized output |
| AsyncExtractor | Batch processing | Batch size 25, memory monitoring, gc.collect() between batches |
| KernelMatrix | Similarity search | O(n²) kernel, best_match(), top_k filtering |
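The contributing section below names cosine as the current similarity metric. A minimal, independent sketch of a dense cosine kernel with top-k lookup looks like this (`build_kernel` and this `best_match` are illustrative names mirroring the API shape above, not the library's implementation):

```python
import numpy as np

def build_kernel(X):
    """Cosine kernel over X (n_samples, dim): row-normalized Gram matrix, O(n^2) memory."""
    norms = np.linalg.norm(X, axis=1, keepdims=True)
    Xn = X / np.clip(norms, 1e-12, None)   # guard against zero vectors
    return Xn @ Xn.T

def best_match(kernel, ids, query_id, top_k=5):
    """Return the top_k (id, similarity) pairs for query_id, excluding itself."""
    i = ids.index(query_id)
    order = np.argsort(-kernel[i])               # most similar first
    order = [j for j in order if j != i][:top_k]  # drop the self-match
    return [(ids[j], float(kernel[i, j])) for j in order]

X = np.array([[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]])
ids = ["A", "B", "C"]
K = build_kernel(X)
print(best_match(K, ids, "A", top_k=2))
```

Since the whole kernel is materialized, the n² footprint noted under Known limitations applies directly to this sketch too.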
Memory & performance tips
- Always use
memmap_dir: Offload sequence vectors to disk for 19,000+ samples. - Batch size tuning: Default 25 files per batch; reduce if RAM-constrained, increase for better throughput.
- Lazy-loading by default: Sequences load on-demand; avoid premature full loads.
- Memory monitoring: Logs show memory deltas per batch; watch for unexpected growth.
- Cleanup explicitly: Call kernel.cleanup() and sample.clear_cache() when done.
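The batch-with-cleanup loop behind these tips can be sketched generically. PharmaSight reports RSS deltas via psutil; this sketch uses the stdlib tracemalloc instead so it runs anywhere, and the per-item "work" is a placeholder:

```python
import gc
import tracemalloc

def process_in_batches(items, batch_size=25):
    """Illustrative batch loop: process, collect garbage, log a memory delta."""
    results = []
    tracemalloc.start()
    for start in range(0, len(items), batch_size):
        before, _ = tracemalloc.get_traced_memory()
        batch = items[start:start + batch_size]
        results.extend(x * 2 for x in batch)   # placeholder per-item work
        gc.collect()                           # reclaim between batches
        after, _ = tracemalloc.get_traced_memory()
        print(f"batch {start // batch_size}: delta {(after - before) / 1024:.1f} KiB")
    tracemalloc.stop()
    return results

process_in_batches(list(range(60)), batch_size=25)
```

Watching the per-batch delta is what reveals the "unexpected growth" the tip above warns about: a healthy pipeline's deltas stay roughly flat across batches.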
Testing
- test.py — Full pipeline: parse all data, build kernel, search targets.
- test2.py — Two-directory workflow: kernel_dir (reference) + target_dir (queries).
- test_memmap_kernel.py — Fast iteration: build kernel from existing memmaps only.
Contributing
We value contributions! Areas of interest:
- New file format parsers (return {"sequence": ..., "metadata": ...} from extract_all()).
- Performance optimizations for KernelMatrix (vectorization, distributed compute).
- Additional similarity metrics beyond cosine.
- Visualization tools for similarity results.
- Documentation and usage examples.
- Additional testing for mass-spectrometry formats.
- More efficient feature vectorization in the format-specific parsers.
Known limitations
- Variable-length sequences padded/truncated to base_length (first sample's length).
- KernelMatrix requires sequences of identical dimensionality.
- Memory usage is O(n²) for n samples in the kernel; disk buffering via memmap mitigates this but does not eliminate it.
- Lazy-loading depends on memmap file availability; memmaps are read-only.
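The O(n²) limitation is worth quantifying with a back-of-envelope check: a dense float64 kernel over n samples needs n × n × 8 bytes (the helper below is just this arithmetic, not a library function):

```python
def kernel_bytes(n_samples, itemsize=8):
    """Bytes needed for a dense n x n kernel of the given element size (float64 = 8)."""
    return n_samples * n_samples * itemsize

# Footprint at two sample counts mentioned in this README
for n in (1_000, 19_000):
    print(f"n={n}: {kernel_bytes(n) / 1024**3:.2f} GiB")
```

At 19,000 samples the dense kernel alone is close to 3 GiB, which is why disk buffering via memmap matters at that scale.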
File details
Details for the file pharmasight-0.1.1.tar.gz.
File metadata
- Download URL: pharmasight-0.1.1.tar.gz
- Upload date:
- Size: 27.5 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.14.0
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 50764557f62b9d25dde7af9bbbca90c8c10c507bb301bd34a88c7d6dcaf09220 |
| MD5 | cdb4ef7af83d35e7e27e1a11caea1505 |
| BLAKE2b-256 | b206e050ad3e3450d712232643f748c8c510c8bfebf3a984fcf71a43d1441263 |
File details
Details for the file pharmasight-0.1.1-py3-none-any.whl.
File metadata
- Download URL: pharmasight-0.1.1-py3-none-any.whl
- Upload date:
- Size: 28.9 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.14.0
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | d94447d647379e37d23a422deabf3372884e658bb01b92f2e1af37c73e2ba11a |
| MD5 | 4fee29c5319b4ac29eccdc42d35d6407 |
| BLAKE2b-256 | f501933d6ad6fa87b9daba0f15adc1663726f4bf2e3fd0c68b18e787c57f4e0e |