Privacy-first command-line tool for biotech devs and researchers to analyse sequence data.
bioai-seq
bioai-seq is a lightweight, developer-friendly command-line tool for basic biological sequence analysis.
It’s part of my journey toward becoming a Bio AI Software Engineer — combining software engineering, biology, and artificial intelligence into practical, accessible tools.
With bioai-seq, you can:
- Run simple analyses on protein or nucleotide sequences from the command line.
- Automatically generate embeddings using ESM-1b.
- Compare sequences against a local Chroma vector database.
- Retrieve biological metadata from public sources.
- Summarize results with a local LLM for human-readable insights.
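To make the comparison step more concrete, here is a minimal sketch of a nearest-neighbour search over embedding vectors. It is a pure-Python stand-in, not bioai-seq's code and not the Chroma API: the `cosine_similarity` and `top_matches` helpers, the identifiers, and the tiny three-dimensional vectors are all made up for illustration (real ESM-1b embeddings have 1280 dimensions).

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # treat a zero vector as similar to nothing
    return dot / (norm_a * norm_b)


def top_matches(query, database, k=3):
    """Return the k (identifier, score) pairs most similar to `query`."""
    scored = [(name, cosine_similarity(query, vec)) for name, vec in database.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


# Toy vectors with hypothetical identifiers, not real embeddings.
database = {
    "protein_a": [1.0, 0.0, 0.0],
    "protein_b": [0.9, 0.1, 0.0],
    "protein_c": [0.0, 1.0, 0.0],
}

for name, score in top_matches([1.0, 0.05, 0.0], database, k=2):
    print(f"{name}: {score:.3f}")
```

A vector database does the same job at scale, with indexing so that not every stored vector has to be scored for each query.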
Who is it for?
- Students & learners in bioinformatics who want a gentle entry point into sequence analysis without setting up heavy pipelines.
- Software engineers curious about biology, wanting to bridge coding and life sciences.
- AI & ML enthusiasts exploring how embeddings, vector search, and LLMs can be applied to biological problems.
- Researchers who need a lightweight side tool for quick sequence checks.
How it helps
- 🔎 Fast exploration: check what a sequence might be and what it’s related to in seconds.
- 🧠 Contextual insights: every result comes with a human-readable LLM summary.
- 📦 Local-first design: downloads the embeddings DB and LLM once, then works offline.
- 🧩 Educational bridge: shows how AI techniques (embeddings, vector DBs, LLMs) can be applied directly to biology.
- 🌍 Open & extensible: Apache 2.0-licensed, free to adapt for your own research or learning.
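The "download once, then reuse" idea behind the local-first design can be sketched in a few lines. This is only an illustration of the pattern, assuming a cache directory on disk; `ensure_resource`, `fake_fetch`, and the file name `embeddings.db` are hypothetical and say nothing about how bioai-seq actually stores its resources.

```python
from pathlib import Path
import tempfile


def ensure_resource(cache_dir, name, fetch):
    """Return the cached path for `name`, calling `fetch` only if it is missing."""
    cache_dir = Path(cache_dir)
    cache_dir.mkdir(parents=True, exist_ok=True)
    target = cache_dir / name
    if not target.exists():
        # First run: fetch the bytes (in a real tool, a network download).
        target.write_bytes(fetch(name))
    return target


calls = []


def fake_fetch(name):
    """Stand-in for a download; records each call so reuse is visible."""
    calls.append(name)
    return b"placeholder contents for " + name.encode()


with tempfile.TemporaryDirectory() as tmp:
    first = ensure_resource(tmp, "embeddings.db", fake_fetch)
    second = ensure_resource(tmp, "embeddings.db", fake_fetch)
    # Same path both times, and the fetch function ran only once.
    print(first == second, len(calls))
```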
How to install
1. Create and activate a virtual environment
python3 -m venv venv
source venv/bin/activate # On Windows: venv\Scripts\activate
2. Install bioai-seq
pip install --upgrade bioai-seq
3. Run the tool
bioseq
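If the command is not found after installing, a quick check from Python can show whether the package and the `bioseq` command are visible in the active environment. This is a small sketch using only the standard library; `check_install` is a hypothetical helper, not part of bioai-seq.

```python
import shutil
from importlib import metadata


def check_install(package="bioai-seq", command="bioseq"):
    """Report the installed package version and where the command lives, if found."""
    try:
        version = metadata.version(package)
    except metadata.PackageNotFoundError:
        version = None  # package is not installed in this environment
    return {"version": version, "command_path": shutil.which(command)}


print(check_install())
```

A `None` for either value usually means the virtual environment is not activated or the install went into a different interpreter.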
Flow Chart
flowchart TD
subgraph CLI[CLI Tool]
A[User Command: analyze] --> B{Check local resources}
B -->|Missing| C[Download Embedding Chroma DB & Local LLM]
B -->|Available| D[Proceed]
C --> D
D --> E[ESM-1b API: Create Embedding]
E --> F[Metadata API: Search Metadata]
F --> G[Chroma DB: Store & Compare Embeddings]
G --> H[Local LLM: Generate Summary]
H --> I[Display Results to User]
end
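Read as code, the flow chart is a short sequence of calls. The sketch below mirrors that order with stub functions so it runs without any model or database. Every name here (`Steps`, `analyze`, the stub lambdas) is hypothetical and only illustrates the control flow, not the tool's real internals.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Steps:
    """Callables standing in for each box in the flow chart (all hypothetical)."""
    resources_available: Callable[[], bool]
    download_resources: Callable[[], None]
    embed: Callable[[str], list]
    fetch_metadata: Callable[[str], dict]
    compare: Callable[[list], list]
    summarize: Callable[[dict, list], str]


def analyze(sequence, steps):
    """Run the steps in the order the flow chart draws them."""
    if not steps.resources_available():
        steps.download_resources()
    embedding = steps.embed(sequence)
    metadata = steps.fetch_metadata(sequence)
    matches = steps.compare(embedding)
    summary = steps.summarize(metadata, matches)
    return {"metadata": metadata, "matches": matches, "summary": summary}


log = []
stub_steps = Steps(
    resources_available=lambda: False,
    download_resources=lambda: log.append("download"),
    embed=lambda seq: [float(len(seq))],
    fetch_metadata=lambda seq: {"length": len(seq)},
    compare=lambda emb: [("example_match", 1.0)],
    summarize=lambda meta, matches: (
        f"{len(matches)} match(es) for a sequence of length {meta['length']}"
    ),
)

result = analyze("MKT", stub_steps)
print(log, result["summary"])
```

Passing the steps in as callables keeps each stage replaceable, which is handy for testing the flow without network access.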
🧪 Planned Example Output
✅ Sequence loaded: 1273 amino acids
🧬 Detected: SARS-CoV-2 spike glycoprotein (likely variant: Omicron)
🔍 Running ESM-1b embeddings...
📦 Comparing against 1000 proteins in vector database...
📚 Top similar sequences:
- UniProt P0DTC2 (99.8%) — SARS-CoV-2 spike glycoprotein
- UniProt A0A6H2L9T9 (98.9%) — Bat coronavirus spike protein
- UniProt A0A2X1VPJ6 (97.5%) — Pangolin coronavirus S protein
------------------------------------------------------------
🔬 Matched Protein Metadata: P0DTC2
🌍 Organism: SARS-CoV-2
🧬 Gene names: S, spike
🧫 Host organisms: Human, Bat
📖 Description: Spike glycoprotein mediates viral entry via ACE2
🏷️ Keywords: Receptor-binding, Glycoprotein, Fusion protein
🔎 Protein evidence: Evidence at protein level
🧩 Features:
- Signal peptide: 1–13
- Transmembrane region: 1213–1237
- RBD domain: 319–541
🔗 External references:
- [PDB: 6VSB](https://www.rcsb.org/structure/6VSB)
- [RefSeq: YP_009724390.1](https://www.ncbi.nlm.nih.gov/protein/YP_009724390.1)
- [Pfam: PF01601](https://www.ebi.ac.uk/interpro/entry/pfam/PF01601)
- [AlphaFold model](https://alphafold.ebi.ac.uk/entry/P0DTC2)
- [UniProt entry](https://www.uniprot.org/uniprotkb/P0DTC2)
------------------------------------------------------------
🧠 Summary:
"This sequence matches the SARS-CoV-2 spike glycoprotein. It binds to the ACE2 receptor to mediate viral entry. The receptor binding domain (RBD) spans residues 319–541 and contains key mutations in Omicron variants. The protein is expressed in humans and bats."
Deploying to PyPI (Production)
1. Clean previous builds
rm -rf dist build *.egg-info
2. Build the package
pip install --upgrade build
python3 -m build
3. Upload to PyPI
pip install --upgrade twine
twine upload dist/*
- Username: __token__
- Password: your API token from https://pypi.org/manage/account/token/
Follow the Journey
- 🌍 Blog: https://bioaisoftware.engineer
- 🧑‍💻 GitHub: https://github.com/babilonczyk
- 💼 LinkedIn: https://www.linkedin.com/in/jan-piotrzkowski/
License
Apache 2.0: free to use and improve.
File details
Details for the file bioai_seq-0.0.4.tar.gz.
File metadata
- Download URL: bioai_seq-0.0.4.tar.gz
- Upload date:
- Size: 14.5 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.13.7
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | be034187e603ac10b3f327c4a1c4808fae8e23e68d30a5154f49cb17e5e3c3e0 |
| MD5 | 39f825cdd063abd930612adb0949e11f |
| BLAKE2b-256 | b208a498858a366f9fafe92f0e7588ab749a70af6bf17e077d03df6d7e2ea437 |
File details
Details for the file bioai_seq-0.0.4-py3-none-any.whl.
File metadata
- Download URL: bioai_seq-0.0.4-py3-none-any.whl
- Upload date:
- Size: 14.4 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.13.7
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | c6b178f08abf0ef1c4b03653d3d5e086ace98659fdd3698eca813b8998814162 |
| MD5 | 1384f4d8dadd3e09da85d7e73f8ba087 |
| BLAKE2b-256 | e4a3006d4ccdee58adf828126fed6300eb9d3b5a10634991b2de47ec00915d94 |
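The published digests can be used to check a downloaded file before installing it. Below is a minimal sketch with Python's standard `hashlib`; the helper names `sha256_of_file` and `matches_published` are made up for this example, and the file path in the usage comment assumes the archive sits in the current directory.

```python
import hashlib


def sha256_of_file(path, chunk_size=65536):
    """Compute the SHA256 hex digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_published(path, expected_hex):
    """True if the file's SHA256 equals the published digest (case-insensitive)."""
    return sha256_of_file(path) == expected_hex.strip().lower()


# Usage, assuming the source archive was downloaded to the current directory:
# matches_published(
#     "bioai_seq-0.0.4.tar.gz",
#     "be034187e603ac10b3f327c4a1c4808fae8e23e68d30a5154f49cb17e5e3c3e0",
# )
```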