Privacy-first command-line tool for biotech devs and researchers to analyse sequence data.

bioai-seq

bioai-seq is a lightweight, developer-friendly command-line tool for basic biological sequence analysis.
It’s part of my journey toward becoming a Bio AI Software Engineer — combining software engineering, biology, and artificial intelligence into practical, accessible tools.

With bioai-seq, you can:

  • Run simple analyses on protein or nucleotide sequences from the command line.
  • Automatically generate embeddings using ESM-1b.
  • Compare sequences against a local Chroma vector database.
  • Retrieve biological metadata from public sources.
  • Summarize results with a local LLM to produce human-readable insights.
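As an illustration of the first item, here is a minimal sketch of a "simple analysis": reading a FASTA-style input, guessing whether it is a nucleotide or protein sequence, and reporting its length. The function names and the alphabet heuristic are assumptions made for this example and are not taken from the bioai-seq source. The heuristic is deliberately crude: a short peptide made up only of A, C, G, T, U or N would be misread as nucleotide.

```python
from collections import Counter

# Assumption: a sequence using only these letters is treated as nucleotide.
NUCLEOTIDE_LETTERS = set("ACGTUN")


def read_sequence(text: str) -> str:
    """Drop FASTA header lines (starting with '>') and join the rest, upper-cased."""
    parts = []
    for line in text.splitlines():
        line = line.strip()
        if line and not line.startswith(">"):
            parts.append(line)
    return "".join(parts).upper()


def guess_sequence_type(seq: str) -> str:
    """Return 'nucleotide' if every letter is in the nucleotide alphabet, else 'protein'."""
    if not seq:
        raise ValueError("empty sequence")
    return "nucleotide" if set(seq) <= NUCLEOTIDE_LETTERS else "protein"


def describe(text: str) -> dict:
    """Collect a few basic facts about the sequence."""
    seq = read_sequence(text)
    return {
        "type": guess_sequence_type(seq),
        "length": len(seq),
        "most_common": Counter(seq).most_common(3),
    }


report = describe(">demo\nMFVFLVLLPLVSSQ\n")
print(report["type"], report["length"])
```

A real tool would likely validate against the full IUPAC alphabets and handle multi-record FASTA files; this sketch skips both.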

Who is it for?

  • Students & learners in bioinformatics who want a gentle entry point into sequence analysis without setting up heavy pipelines.
  • Software engineers curious about biology, wanting to bridge coding and life sciences.
  • AI & ML enthusiasts exploring how embeddings, vector search, and LLMs can be applied to biological problems.
  • Researchers who need a lightweight side tool for quick sequence checks.

How it helps

  • 🔎 Fast exploration - check what a sequence might be and what it’s related to in seconds.
  • 🧠 Contextual insights — every result comes with a human-readable LLM summary.
  • 📦 Local-first design — downloads embeddings DB + LLM once, then works offline.
  • 🧩 Educational bridge — shows how AI techniques (embeddings, vector DBs, LLMs) can be directly applied to biology.
  • 🌍 Open & extensible - Apache 2.0-licensed, free to adapt for your own research or learning.
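The local-first idea (download once, reuse afterwards) can be sketched as a small cache check. Everything here is hypothetical: the resource names, the `ensure_resources` helper and the `fake_fetch` stand-in are illustrative and do not reflect where or how bioai-seq stores its files.

```python
import tempfile
from pathlib import Path
from typing import Callable, Iterable, List

# Hypothetical resource names, used only for this example.
RESOURCES = ["chroma_db", "local_llm"]


def ensure_resources(
    cache_dir: Path,
    names: Iterable[str],
    fetch: Callable[[str, Path], None],
) -> List[str]:
    """Fetch only the resources missing from cache_dir; return the names fetched."""
    cache_dir.mkdir(parents=True, exist_ok=True)
    fetched = []
    for name in names:
        target = cache_dir / name
        if not target.exists():
            fetch(name, target)
            fetched.append(name)
    return fetched


def fake_fetch(name: str, target: Path) -> None:
    """Stand-in for a real download: writes a small placeholder file."""
    target.write_text(f"placeholder for {name}")


with tempfile.TemporaryDirectory() as tmp:
    first_run = ensure_resources(Path(tmp), RESOURCES, fake_fetch)
    second_run = ensure_resources(Path(tmp), RESOURCES, fake_fetch)

print(first_run, second_run)
```

Passing the fetch function in as a parameter keeps the cache logic testable without network access.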

How to install

1. Create and activate a virtual environment

python3 -m venv venv
source venv/bin/activate  # On Windows: venv\Scripts\activate

2. Install bioai-seq

pip install --upgrade bioai-seq
bioseq  # Start the CLI

Flow Chart

flowchart TD
  subgraph CLI[CLI Tool]
    A[User Command: analyze] --> B{Check local resources}
    B -->|Missing| C[Download Embedding Chroma DB & Local LLM]
    B -->|Available| D[Proceed]
    C --> D
    D --> E[ESM-1b API: Create Embedding]
    E --> F[Metadata API: Search Metadata]
    F --> G[Chroma DB: Store & Compare Embeddings]
    G --> H[Local LLM: Generate Summary]
    H --> I[Display Results to User]
  end
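The flow above can be read as an ordered list of steps that share one context object. The sketch below mirrors that order with placeholder steps; the class, function and step names are assumptions for illustration, and none of the steps calls ESM-1b, Chroma, a metadata API or an LLM.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Tuple


@dataclass
class AnalysisContext:
    """State passed from one pipeline step to the next."""
    sequence: str
    embedding: List[float] = field(default_factory=list)
    metadata: Dict[str, str] = field(default_factory=dict)
    matches: List[str] = field(default_factory=list)
    summary: str = ""
    completed: List[str] = field(default_factory=list)


Step = Callable[[AnalysisContext], None]


def embed(ctx: AnalysisContext) -> None:
    # Placeholder: a real step would request an embedding from a model.
    ctx.embedding = [float(len(ctx.sequence))]


def fetch_metadata(ctx: AnalysisContext) -> None:
    # Placeholder: a real step would query a public metadata source.
    ctx.metadata = {"source": "placeholder"}


def compare(ctx: AnalysisContext) -> None:
    # Placeholder: a real step would query the vector database.
    ctx.matches = ["placeholder-match"]


def summarize(ctx: AnalysisContext) -> None:
    # Placeholder: a real step would prompt the local LLM.
    ctx.summary = f"{len(ctx.sequence)} residues, {len(ctx.matches)} match(es)"


PIPELINE: List[Tuple[str, Step]] = [
    ("embed", embed),
    ("metadata", fetch_metadata),
    ("compare", compare),
    ("summarize", summarize),
]


def run_pipeline(sequence: str, steps: List[Tuple[str, Step]] = PIPELINE) -> AnalysisContext:
    """Run each step in order and record which ones finished."""
    ctx = AnalysisContext(sequence=sequence)
    for name, step in steps:
        step(ctx)
        ctx.completed.append(name)
    return ctx


result = run_pipeline("MFVFLVLLPLVSSQ")
print(result.completed, result.summary)
```

Keeping the steps in a plain list makes it easy to swap a placeholder for a real implementation one stage at a time.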

🧪 Planned Example Output

✅ Sequence loaded: 1273 amino acids
🧬 Detected: SARS-CoV-2 spike glycoprotein (likely variant: Omicron)

🔍 Running ESM-1b embeddings...
📦 Comparing against 1000 proteins in vector database...
📚 Top similar sequences:
 - UniProt P0DTC2 (99.8%) — SARS-CoV-2 spike glycoprotein
 - UniProt A0A6H2L9T9 (98.9%) — Bat coronavirus spike protein
 - UniProt A0A2X1VPJ6 (97.5%) — Pangolin coronavirus S protein

------------------------------------------------------------

🔬 Matched Protein Metadata: P0DTC2
🌍 Organism: SARS-CoV-2
🧬 Gene names: S, spike
🧫 Host organisms: Human, Bat
📖 Description: Spike glycoprotein mediates viral entry via ACE2
🏷️ Keywords: Receptor-binding, Glycoprotein, Fusion protein
🔎 Protein evidence: Evidence at protein level

🧩 Features:
 - Signal peptide: 1–13
 - Transmembrane region: 1213–1237
 - RBD domain: 319–541

🔗 External references:
 - [PDB: 6VSB](https://www.rcsb.org/structure/6VSB)
 - [RefSeq: YP_009724390.1](https://www.ncbi.nlm.nih.gov/protein/YP_009724390.1)
 - [Pfam: PF01601](https://www.ebi.ac.uk/interpro/entry/pfam/PF01601)
 - [AlphaFold model](https://alphafold.ebi.ac.uk/entry/P0DTC2)
 - [UniProt entry](https://www.uniprot.org/uniprotkb/P0DTC2)

------------------------------------------------------------

🧠 Summary:
"This sequence matches the SARS-CoV-2 spike glycoprotein. It binds to the ACE2 receptor to mediate viral entry. The receptor binding domain (RBD) spans residues 319–541 and contains key mutations in Omicron variants. The listed host organisms are humans and bats."
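The "top similar sequences" part of the output comes down to ranking stored embeddings by similarity to the query embedding. The sketch below shows that idea with cosine similarity over toy 3-dimensional vectors. It is a stand-in for what a vector database does, not a description of how bioai-seq or Chroma scores matches; the metric, the vector values and the names `seq_a`, `seq_b`, `seq_c` are all assumptions.

```python
import math
from typing import Dict, List, Tuple


def cosine_similarity(a: List[float], b: List[float]) -> float:
    """Cosine of the angle between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_matches(
    query: List[float],
    database: Dict[str, List[float]],
    k: int = 3,
) -> List[Tuple[str, float]]:
    """Return the k stored entries most similar to the query, best first."""
    scored = [(name, cosine_similarity(query, vec)) for name, vec in database.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


# Toy data: real protein embeddings have hundreds or thousands of dimensions.
toy_database = {
    "seq_a": [1.0, 0.0, 0.0],
    "seq_b": [1.0, 1.0, 0.0],
    "seq_c": [0.0, 1.0, 0.0],
}

ranking = top_matches([1.0, 0.0, 0.0], toy_database)
for name, score in ranking:
    print(f" - {name} ({score:.1%})")
```

A brute-force scan like this is fine for a few thousand vectors; a vector database adds indexing and persistence on top of the same ranking idea.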

Deploying to PyPI (Production)

1. Clean previous builds

rm -rf dist build *.egg-info

2. Build the package

pip install --upgrade build  # "python3 -m build" needs the PyPA build package
python3 -m build

3. Upload to PyPI

pip install --upgrade twine
twine upload dist/*

Follow the Journey


License

Apache 2.0 - free to use and improve.
