Asynchronous genomic comparison and visualization toolkit that runs on local devices (parsers, loaders, matchers, visualizers).
PharmaSight
An open-source asynchronous proteomic and molecular comparison toolkit built by Biotronics AI. It supports multi-format parsing, efficient memory management with lazy-loading, and local similarity matching for pharma, drug repurposing/repositioning and molecular analysis workflows.
Documentation & Resources:
- GitHub Repository: Biotronics-Ai/PharmaSight
- Research Paper: PharmaSight Preprint
What it delivers in practice
- Multi-format molecular comparison within seconds: Parse 14+ file formats (PDB, MMCIF, FASTA, MZML, SMILES, InChI, and more), stream in batches, eliminate weak candidates fast, then score the top matches with kernel-based similarity search.
- Memory-efficient batch processing: Lazy-load sequences from memory-mapped files, process in configurable batches (default 25), and maintain stable RAM usage even with 20,000+ samples.
- Unified extraction pipeline: Single-pass extract_all() method across all parsers normalizes sequence and metadata extraction, reducing code duplication and improving performance.
- Multi-format molecular handling: Normalize heterogeneous sequence formats (genomic, proteomic, cheminformatic) to be compatible and directly comparable.
Features
Supported Data File Formats:
- Sequence: FASTA, FASTQ, GenBank formats.
- Structure: PDB (Protein Data Bank), MMCIF (mmCIF).
- Mass Spectrometry: MZML, MZXML, MZIdentML, MZTab.
- Proteomics: PepXML.
- Chemistry: SMILES, InChI, SDF, MOL files.
- Network: EdgeList, BioPAX RDF.
Core Features:
- MolSample (lazy-loading with memmap read-only mode)
- Unified extract_all() extraction across 14 parser classes
- Batch processing with memory monitoring (psutil integration)
- KernelMatrix for fast similarity search and best-match discovery
- Async/concurrent file parsing with garbage collection between batches
- Flexible metadata extraction with normalized output format
Memory-aware architecture:
- Lazy-loading: Store only file paths initially, load sequences on-demand
- Memory monitoring: Track RSS memory before/after each batch
- Batch processing: Process 25 files concurrently, then gc.collect()
- Memmap buffers: Read-only access (mmap_mode='r') for minimal overhead
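The lazy-loading pattern described above can be sketched in a few lines. `LazySample` here is a hypothetical stand-in for `MolSample`, showing only the path-first, load-on-access idea, not the library's actual class:

```python
import os
import tempfile
import numpy as np

class LazySample:
    """Hypothetical sketch: store only a memmap path, load the array on demand."""

    def __init__(self, memmap_path):
        self.memmap_path = memmap_path
        self._sequence = None  # nothing loaded yet

    @property
    def sequence(self):
        if self._sequence is None:
            # mmap_mode='r' maps the file read-only; pages are fetched on demand
            self._sequence = np.load(self.memmap_path, mmap_mode="r")
        return self._sequence

    def clear_cache(self):
        # Drop the memmap handle so the OS can reclaim the mapped pages
        self._sequence = None

# Demo: write a vector to disk, then lazy-load it
path = os.path.join(tempfile.mkdtemp(), "sample.npy")
np.save(path, np.arange(1000, dtype=np.float32))
s = LazySample(path)
print(s.sequence.shape)  # (1000,)
```

The key property is that constructing thousands of `LazySample` objects costs almost nothing; disk I/O happens only for samples whose `sequence` is actually read.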
Install
python -m venv .venv
source .venv/bin/activate
pip install pharmasight
Dependencies are defined in pyproject.toml / requirements.txt.
Project layout
- models.py — MolSample class, lazy-loading with memmap support.
- parsers.py — 14 file-format parsers with unified extract_all() method.
- kernel.py — KernelMatrix for similarity search and matching.
- base.py — base components and utility helpers.
Quick start
1) Unified extraction across multiple formats
All parsers implement the same extract_all() interface for consistent, single-pass extraction:
from pharmasight.parsers import FASTAParser, PDBStructureParser
# FASTA example
fasta_parser = FASTAParser()
result = fasta_parser.extract_all(raw_fasta_data)
sequence = result["sequence"]
metadata = result["metadata"]
# PDB example
pdb_parser = PDBStructureParser()
result = pdb_parser.extract_all(raw_pdb_data)
structure_vectors = result["sequence"]
pdb_metadata = result["metadata"]
Real life: Parse genomic sequences (FASTA), protein structures (PDB), and mass spectrometry data (MZML) in a single, unified pipeline without format-specific extraction logic.
2) Batch processing with memory monitoring
Load and parse large sample collections efficiently with automatic memory management:
from pharmasight.extract import AsyncExtractor
from pharmasight.models import MolSample
extractor = AsyncExtractor(
batch_size=25,
memmap_dir="mem_map",
logs_dir="logs"
)
samples = await extractor.extract_from_directory(
directory_path="data_samples/trcc_pdb",
recursive=True
)
# Samples are lazy-loaded; sequences stored as memmap file paths
for sample in samples:
# Access sequence only when needed (lazy-load)
seq = sample.sequence
print(f"{sample.sample_id}: {seq.shape}")
Real life: Process 19,000+ molecular structures without exhausting RAM; batch operations with memory delta logging ensure stable performance.
3) Two-directory similarity workflow
Build a reference kernel from one directory, then search against samples in another:
from pharmasight.kernel import KernelMatrix
from pharmasight.extract import AsyncExtractor
# Phase 1: Build kernel from reference directory
extractor = AsyncExtractor(batch_size=25, memmap_dir="mem_map")
kernel_samples = await extractor.extract_from_directory("data_samples/trcc_pdb")
kernel = KernelMatrix(kernel_samples, memmap_dir="mem_map", logs_dir="logs")
# Phase 2: Search target samples in kernel
target_samples = await extractor.extract_from_directory("data_samples/trcc")
results = {}
for target in target_samples:
best_matches = kernel.best_match(target.sample_id, top_k=5)
results[target.sample_id] = best_matches
Real life: Compare query molecules against a pharmaceutical database; find top-5 closest structural matches within seconds.
4) Kernel matrix with lazy-loaded sequences
Use KernelMatrix for fast similarity search across large sample pools without loading everything into RAM:
from pharmasight.kernel import KernelMatrix
kernel = KernelMatrix(
samples=samples,
memmap_dir="mem_map",
logs_dir="logs"
)
# Find top-5 most similar samples to a query
best_matches = kernel.best_match("query_sample_id", top_k=5)
print(f"Top matches: {best_matches}")
kernel.cleanup()
⚠️ Memory note: Kernel matrix build time and memory scale with sequence length and sample count. Set memmap_dir for disk buffering and monitor memory usage via psutil. Ensure sequences are normalized to consistent dimensions (padded/truncated to base_length).
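The pad/truncate rule mentioned in the note can be sketched as follows; `normalize_length` is an illustrative helper, not PharmaSight's API:

```python
import numpy as np

def normalize_length(vec, base_length):
    """Illustrative sketch: force a sequence vector to exactly base_length."""
    vec = np.asarray(vec, dtype=np.float32)
    if vec.shape[0] >= base_length:
        return vec[:base_length]             # truncate long sequences
    padded = np.zeros(base_length, dtype=np.float32)
    padded[:vec.shape[0]] = vec              # zero-pad short sequences
    return padded

print(normalize_length([1.0, 2.0, 3.0], 5))  # [1. 2. 3. 0. 0.]
```

Applying this before kernel construction guarantees every sample has identical dimensionality, which the kernel requires.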
5) Direct MolSample creation with lazy-loading
Create molecular samples manually for custom workflows:
import numpy as np
from pharmasight.models import MolSample
# From memmap file
sample = MolSample(
sample_id="PDB_1ABC",
memmap_path="mem_map/AF-Q4CKA0-F1-model_v6.npy", # lazy-load on access
metadata={"format": "pdb", "organism": "human"}
)
# Sequence loads from disk only when accessed
vec = sample.sequence # Loads memmap here
print(f"Shape: {vec.shape}, Size: {vec.nbytes/1024/1024:.2f} MB")
# Clear cache if needed
sample.clear_cache()
Real life: Build custom pipelines where sequences are accessed on-demand; avoid unnecessary disk I/O and memory consumption.
6) Fast iteration: memmap-only kernel building
Skip expensive parsing and build kernel directly from pre-parsed memmaps:
from pharmasight import test_memmap_kernel
# Assumes mem_map/ directory contains .npy files
result = test_memmap_kernel.main()
# Loads existing memmaps, builds kernel, saves results in seconds
Real life: Iterate rapidly on similarity search logic without re-parsing 19,000+ files (saves 30+ minutes per test cycle).
7) Mixed format handling with automatic normalization
Compare samples across different file formats seamlessly:
from pharmasight.extract import AsyncExtractor
extractor = AsyncExtractor(batch_size=25, memmap_dir="mem_map")
# Mix of FASTA, PDB, MZML, SMILES formats
mixed_samples = await extractor.extract_from_directory(
"data_samples/mixed",
recursive=True
)
# All sequences normalized to vectors; ready for kernel/comparison
print(f"Loaded {len(mixed_samples)} samples across mixed formats")
for s in mixed_samples[:3]:
print(f" {s.sample_id} ({s.metadata.get('format')}): {s.sequence.shape}")
Real life: Pharma research combining genomic (FASTA), structural (PDB), and proteomic (MS) data in one analysis pipeline.
Key components
| Component | Purpose | Key Features |
|---|---|---|
| MolSample | Sample container | Lazy-loading, memmap support, metadata, cache management |
| 14 Parsers | File format extraction | Unified extract_all(), async-friendly, normalized output |
| AsyncExtractor | Batch processing | Batch size 25, memory monitoring, gc.collect() between batches |
| KernelMatrix | Similarity search | O(n²) kernel, best_match(), top_k filtering |
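The contributing section below names cosine as the current similarity metric. A minimal, independent sketch of a dense cosine kernel with top-k lookup looks like this (`build_kernel` and this `best_match` are illustrative names mirroring the API shape above, not the library's implementation):

```python
import numpy as np

def build_kernel(X):
    """Cosine kernel over X (n_samples, dim): row-normalized Gram matrix, O(n^2) memory."""
    norms = np.linalg.norm(X, axis=1, keepdims=True)
    Xn = X / np.clip(norms, 1e-12, None)   # guard against zero vectors
    return Xn @ Xn.T

def best_match(kernel, ids, query_id, top_k=5):
    """Return the top_k (id, similarity) pairs for query_id, excluding itself."""
    i = ids.index(query_id)
    order = np.argsort(-kernel[i])               # most similar first
    order = [j for j in order if j != i][:top_k]  # drop the self-match
    return [(ids[j], float(kernel[i, j])) for j in order]

X = np.array([[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]])
ids = ["A", "B", "C"]
K = build_kernel(X)
print(best_match(K, ids, "A", top_k=2))
```

Since the whole kernel is materialized, the n² footprint noted under Known limitations applies directly to this sketch too.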
Memory & performance tips
- Always use
memmap_dir: Offload sequence vectors to disk for 19,000+ samples. - Batch size tuning: Default 25 files per batch; reduce if RAM-constrained, increase for better throughput.
- Lazy-loading by default: Sequences load on-demand; avoid premature full loads.
- Memory monitoring: Logs show memory deltas per batch; watch for unexpected growth.
- Cleanup explicitly: Call kernel.cleanup() and sample.clear_cache() when done.
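The batch-with-cleanup loop behind these tips can be sketched generically. PharmaSight reports RSS deltas via psutil; this sketch uses the stdlib tracemalloc instead so it runs anywhere, and the per-item "work" is a placeholder:

```python
import gc
import tracemalloc

def process_in_batches(items, batch_size=25):
    """Illustrative batch loop: process, collect garbage, log a memory delta."""
    results = []
    tracemalloc.start()
    for start in range(0, len(items), batch_size):
        before, _ = tracemalloc.get_traced_memory()
        batch = items[start:start + batch_size]
        results.extend(x * 2 for x in batch)   # placeholder per-item work
        gc.collect()                           # reclaim between batches
        after, _ = tracemalloc.get_traced_memory()
        print(f"batch {start // batch_size}: delta {(after - before) / 1024:.1f} KiB")
    tracemalloc.stop()
    return results

process_in_batches(list(range(60)), batch_size=25)
```

Watching the per-batch delta is what reveals the "unexpected growth" the tip above warns about: a healthy pipeline's deltas stay roughly flat across batches.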
Testing
- test.py — Full pipeline: parse all data, build kernel, search targets.
- test2.py — Two-directory workflow: kernel_dir (reference) + target_dir (queries).
- test_memmap_kernel.py — Fast iteration: build kernel from existing memmaps only.
Contributing
We value contributions! Areas of interest:
- New file format parsers (return {"sequence": ..., "metadata": ...} from extract_all()).
- Performance optimizations for KernelMatrix (vectorization, distributed compute).
- Additional similarity metrics beyond cosine.
- Visualization tools for similarity results.
- Documentation and usage examples.
- Additional testing for mass-spectrometry formats.
- More efficient feature vectorization in the format-specific parsers.
Known limitations
- Variable-length sequences padded/truncated to base_length (first sample's length).
- KernelMatrix requires sequences of identical dimensionality.
- Memory usage is O(n²) for n samples in the kernel; disk buffering via memmap mitigates this but does not eliminate it.
- Lazy-loading depends on memmap file availability; memmaps are read-only.
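The O(n²) limitation is worth quantifying with a back-of-envelope check: a dense float64 kernel over n samples needs n × n × 8 bytes (the helper below is just this arithmetic, not a library function):

```python
def kernel_bytes(n_samples, itemsize=8):
    """Bytes needed for a dense n x n kernel of the given element size (float64 = 8)."""
    return n_samples * n_samples * itemsize

# Footprint at two sample counts mentioned in this README
for n in (1_000, 19_000):
    print(f"n={n}: {kernel_bytes(n) / 1024**3:.2f} GiB")
```

At 19,000 samples the dense kernel alone is close to 3 GiB, which is why disk buffering via memmap matters at that scale.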
File details
Details for the file pharmasight-0.1.1.tar.gz.
File metadata
- Download URL: pharmasight-0.1.1.tar.gz
- Upload date:
- Size: 27.5 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.14.0
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 50764557f62b9d25dde7af9bbbca90c8c10c507bb301bd34a88c7d6dcaf09220 |
| MD5 | cdb4ef7af83d35e7e27e1a11caea1505 |
| BLAKE2b-256 | b206e050ad3e3450d712232643f748c8c510c8bfebf3a984fcf71a43d1441263 |
File details
Details for the file pharmasight-0.1.1-py3-none-any.whl.
File metadata
- Download URL: pharmasight-0.1.1-py3-none-any.whl
- Upload date:
- Size: 28.9 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.14.0
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | d94447d647379e37d23a422deabf3372884e658bb01b92f2e1af37c73e2ba11a |
| MD5 | 4fee29c5319b4ac29eccdc42d35d6407 |
| BLAKE2b-256 | f501933d6ad6fa87b9daba0f15adc1663726f4bf2e3fd0c68b18e787c57f4e0e |