Indexing assemblies with autoencoders and FCGR
panspace
Related article: PanSpace: Fast and Scalable Indexing for Massive Bacterial Databases
panspace is a library for creating and querying vector-based indexes of bacterial genome (draft) assemblies.
The panspace query pipeline works as follows:
- First, each genome is represented by the Frequency matrix of its Chaos Game Representation of DNA (FCGR).
- Then, the FCGR is mapped to an n-dimensional vector, the embedding, using a Convolutional Neural Network called CNNFCGR, the Encoder.
- Finally, the embedding (the compressed representation of the input genome) is used to query an index of these vectors representing a bacterial pangenome.
The library is based on tensorflow and faiss.
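To make the first step concrete, here is a minimal pure-Python sketch of how an FCGR can be built from a sequence. It is not the code panspace runs (the pipelines rely on KMC3 and ComplexCGR); the function names and the assignment of nucleotides to corners are assumptions chosen for illustration, and libraries differ in that convention.

```python
# Assumed corner assignment: each base contributes one (x_bit, y_bit) pair.
# Real libraries may place the nucleotides on different corners.
CORNER_BITS = {"A": (0, 0), "C": (0, 1), "G": (1, 1), "T": (1, 0)}


def kmer_to_cell(kmer):
    """Map a k-mer to a (row, col) cell of a 2^k x 2^k grid."""
    row = col = 0
    # In the chaos game the last base decides the coarsest quadrant,
    # so it becomes the most significant bit.
    for base in reversed(kmer):
        x_bit, y_bit = CORNER_BITS[base]
        col = (col << 1) | x_bit
        row = (row << 1) | y_bit
    return row, col


def fcgr(sequence, k):
    """Count every k-mer of `sequence` into its FCGR cell."""
    size = 2 ** k
    matrix = [[0] * size for _ in range(size)]
    sequence = sequence.upper()
    for start in range(len(sequence) - k + 1):
        kmer = sequence[start:start + k]
        # Skip k-mers containing N or any other ambiguous symbol.
        if all(base in CORNER_BITS for base in kmer):
            row, col = kmer_to_cell(kmer)
            matrix[row][col] += 1
    return matrix


if __name__ == "__main__":
    for matrix_row in fcgr("ACGTACGTNACGT", k=2):
        print(matrix_row)
```

With k = 8, as in the recommended index, the matrix has 256 x 256 cells, and this matrix is the input the encoder compresses into an embedding.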
Available indexes
Download the index and encoder
Inside each .zip file you will find
- Encoder: checkpoints/<name-model>.keras
- Index: index/panspace.index, and in the same folder some json files with metadata (labels)
| k-mer size | Embedding size | Index file |
|---|---|---|
| 8 | 256 | triplet_semihard_loss-ranger-0.5-hq-256-CNNFCGR_Levels-level1-clip80.zip (Best) |
| 6,7,8 | 128,256,512 | check others... |
We provide a snakemake pipeline to query a collection of genomes from a folder. If the environment was installed with conda from the .yml file, snakemake is already installed.
After decompressing the .zip you will find two folders, checkpoints and index, containing the encoder (.keras), the FAISS index (.index) and label metadata (.json).
You need the path to the .keras file and to panspace.index
.
├── checkpoints
│   └── weights-CNNFCGR_Levels.keras
└── index
    ├── embeddings.npy
    ├── id_embeddings.json
    ├── labels.json
    └── panspace.index
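If you script your queries, a small helper can collect the two paths from the decompressed folder. This is only a sketch based on the layout shown above; the function name `locate_panspace_files` is hypothetical and not part of panspace.

```python
from pathlib import Path


def locate_panspace_files(root):
    """Return (encoder_path, index_path) found under a decompressed archive.

    Assumes the layout shown above: checkpoints/*.keras and index/panspace.index.
    """
    root = Path(root)
    encoders = sorted((root / "checkpoints").glob("*.keras"))
    index_path = root / "index" / "panspace.index"
    if not encoders:
        raise FileNotFoundError(f"no .keras file under {root / 'checkpoints'}")
    if not index_path.is_file():
        raise FileNotFoundError(f"missing {index_path}")
    # If several encoders are present, the first one in sorted order is returned.
    return encoders[0], index_path
```

The two returned paths are the values expected by `--path-encoder` and `--path-index` in the query command.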
Try panspace queries for single files
Clone the repository
git clone https://github.com/pg-space/panspace.git
cd panspace
with CPU support
conda env create -f envs/cpu.yml
conda activate panspace-cpu
or with GPU support
conda env create -f envs/gpu.yml
conda activate panspace-gpu
Then run the streamlit app
panspace app
NOTE: the environments also install the workflow manager snakemake, which is needed to run queries efficiently, as we will see next.
Query index from a folder of files
We can query the index with
panspace query-smk \
--dir-sequences "<path/to/folder>" \
--path-encoder "<path/to/checkpoints/weights.keras>" \
--path-index "<path/to/panspace.index>"
Note that this command is just a wrapper around a snakemake pipeline.
If the FCGR extension to KMC3 is installed, we can use the flag --fast-version to speed up the creation of FCGRs.
For more options, check
panspace query-smk --help
Using snakemake directly, we first need to
- set parameters in scripts/config_query.yml:
  - directory with sequences (accepted extensions .fa.gz, .fa, .fna)
  - output directory to save query results
  - gpu or cpu usage
  - path to the encoder (<path/to/encoder>.keras)
  - path to the index (<path/to/panspace-index>.index)
- and run
snakemake -s scripts/query.smk --cores 8 --use-conda
Optional: faster queries, recommended if you have hundreds or thousands of assemblies to query.
First install the FCGR extension to KMC3 and put the path to the installed fcgr binary in the scripts/config_fcgr.yml file. Then run
snakemake -s scripts/query_fast.smk --cores 8 --use-conda
or pass the path directly on the command line
snakemake -s scripts/query_fast.smk --cores 8 --use-conda --config fcgr_bin=<path/to/fcgr>
NOTES
- Change the number of cores (--cores <NUM_CORES>) if you have more available; this allows KMC3 to count k-mers from the assemblies in parallel (by default kmc_threads: 2, see scripts/config.yml).
- This extension constructs FCGR representations with a C++ tool that extends the KMC3 output. The default version parses the output of KMC as a dictionary of k-mer counts and then uses the python library ComplexCGR to construct the FCGR.
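As a rough illustration of the "dictionary of k-mer counts" step in the default version, the sketch below parses a plain-text dump into a dict. It assumes one k-mer and its count per line, separated by whitespace; the actual file format and parsing code used by the pipeline may differ, and `parse_kmer_counts` is a hypothetical name.

```python
def parse_kmer_counts(lines):
    """Parse '<kmer> <count>' text lines into a dict of k-mer counts."""
    counts = {}
    for line in lines:
        fields = line.split()
        if len(fields) != 2:
            continue  # skip blank or malformed lines
        kmer, count = fields
        kmer = kmer.upper()
        # Accumulate in case the same k-mer appears on more than one line.
        counts[kmer] = counts.get(kmer, 0) + int(count)
    return counts


if __name__ == "__main__":
    print(parse_kmer_counts(["ACG\t3", "", "cgt 2", "ACG 1"]))
```

Each entry of the resulting dict is then added to the FCGR cell that corresponds to its k-mer.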
Create your own encoder and index
NOTE: you can skip creating the encoder [step 2] and use the one trained by us (.keras). In that case,
you can index your dataset directly [step 3], although you still need to create the FCGRs [step 1].
Install the package
panspace requires python >= 3.9, < 3.11.
with CPU support
pip install "panspace[cpu] @ git+https://github.com/pg-space/panspace.git"
with GPU support
pip install "panspace[gpu] @ git+https://github.com/pg-space/panspace.git"
Install from conda environment (suggested)
with CPU support
conda env create -f envs/cpu.yml
conda activate panspace-cpu
with GPU support
conda env create -f envs/gpu.yml
conda activate panspace-gpu
This will also install snakemake.
Step-by-step guide
CLI
The CLI provides commands for
- creating FCGRs from k-mer counts,
- training an encoder using metric learning (if labels are available) or an autoencoder,
- creating and querying an index of embeddings.
>> panspace --help
Usage: panspace [OPTIONS] COMMAND [ARGS]...
🐱 Welcome to panspace (version 0.2.0), a tool for Indexing and Querying a bacterial
pan-genome based on embeddings
╭─ Options ───────────────────────────────────────────────────────────────────────────────╮
│ --install-completion Install completion for the current shell. │
│ --show-completion Show completion for the current shell, to copy it or │
│ customize the installation. │
│ --help Show this message and exit. │
╰─────────────────────────────────────────────────────────────────────────────────────────╯
╭─ Commands ──────────────────────────────────────────────────────────────────────────────╮
│ app Run streamlit app │
│ data-curation Find outliers and mislabaled samples. │
│ docs Open documentation webpage. │
│ fcgr Create FCGRs from fasta file or from txt file with kmers and counts. │
│ index Create and query index. Utilities to test index. │
│ query-smk Run the Snakemake pipeline with the specified configuration. │
│ stats-assembly N50, number of contigs, avg length, total length. │
│ trainer Train Autoencoder/Metric Learning. Utilities. │
│ utils Extract info from text or log files │
│ what-to-do 🐱 If you are new here, check this step-by-step guide │
╰─────────────────────────────────────────────────────────────────────────────────────────╯
1. Create FCGR of assemblies
You can use the following command to create an FCGR (.npy file) from a fasta file (and more):
panspace fcgr --help
Usage: panspace fcgr [OPTIONS] COMMAND [ARGS]...
Create FCGRs from fasta file or from txt file with kmers and counts.
╭─ Options ───────────────────────────────────────────────────────────────────────────────╮
│ --help Show this message and exit. │
╰─────────────────────────────────────────────────────────────────────────────────────────╯
╭─ Commands ──────────────────────────────────────────────────────────────────────────────╮
│ from-fasta Create the Frequency matrix of CGR (FCGR) from a fasta file. │
│ from-kmer-counts Create the Frequency matrix of CGR (FCGR) from k-mer counts. │
│ to-image Save FCGR as image from npy file. │
╰─────────────────────────────────────────────────────────────────────────────────────────╯
For large datasets such as AllTheBacteria, however, we suggest relying on specialized k-mer counters such as KMC3 or Jellyfish.
We provide snakemake pipelines to create FCGRs (see scripts/) from:
- a folder containing .fa.gz files
- a folder containing .fa files
- the AllTheBacteria dataset
The pipelines rely on KMC3 for k-mer counting, and on an extension of it to create FCGRs: fcgr. The latter needs to be installed manually before using the snakemake pipelines. You do not need to worry about installing KMC3; the snakemake pipelines handle that.
2. Train an encoder to create the vector representations
- Split dataset into train, validation and test sets
panspace trainer split-dataset --help
- Train
Options
- Do you have labels for each assembly?
  - Use metric learning with the triplet loss
  - Or metric learning with the contrastive loss
- If you do not have labels, then use unsupervised learning with the AutoencoderFCGR architecture
In all of them, the CNNFCGR architecture can be used.
panspace trainer metric-learning --help # triplet loss
panspace trainer one-shot --help # contrastive loss
panspace trainer autoencoder --help
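For intuition about the triplet-loss option, the sketch below computes the loss for a single (anchor, positive, negative) triplet of embeddings and checks whether the negative is "semi-hard". It is a plain-Python illustration, not the training code used by panspace; the margin value of 0.5 and the use of unsquared Euclidean distance are assumptions.

```python
import math


def triplet_loss(anchor, positive, negative, margin=0.5):
    """max(d(anchor, positive) - d(anchor, negative) + margin, 0)."""
    d_pos = math.dist(anchor, positive)
    d_neg = math.dist(anchor, negative)
    return max(d_pos - d_neg + margin, 0.0)


def is_semihard(anchor, positive, negative, margin=0.5):
    """A semi-hard negative is farther than the positive, but by less than the margin."""
    d_pos = math.dist(anchor, positive)
    d_neg = math.dist(anchor, negative)
    return d_pos < d_neg < d_pos + margin


if __name__ == "__main__":
    anchor, positive = [0.0, 0.0], [0.0, 1.0]
    print(triplet_loss(anchor, positive, [0.0, 3.0]))  # negative already far away
    print(triplet_loss(anchor, positive, [0.0, 1.2]))  # negative still too close
```

The loss is zero once the negative is at least `margin` farther from the anchor than the positive, which is what pushes assemblies with the same label together in the embedding space.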
Get the Encoder
- If using the triplet loss, the output model is the encoder.
- If using the contrastive loss, you can get the encoder with panspace trainer extract-backbone-one-shot
- If using the autoencoder, you can get the encoder with panspace trainer split-autoencoder
3. Create and query an index
- Create Index
panspace index create --help
- Query Index
If querying is done from FCGR in numpy format, then use
panspace index query --help
but if you want to query the index directly from assemblies, we encourage you to use the snakemake pipelines provided above.
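Conceptually, querying the index means finding the stored embeddings closest to the query embedding and reporting their labels. The sketch below does this by exhaustive search in plain Python. It is a stand-in for illustration only: panspace uses a FAISS index for this step, and the function name, the in-memory lists and the example labels are hypothetical.

```python
import math


def query_nearest(index_vectors, labels, query, k=3):
    """Return the k closest (label, distance) pairs by Euclidean distance."""
    scored = [
        (math.dist(vector, query), label)
        for vector, label in zip(index_vectors, labels)
    ]
    scored.sort(key=lambda pair: pair[0])  # closest first
    return [(label, distance) for distance, label in scored[:k]]


if __name__ == "__main__":
    # Toy 2-dimensional embeddings; real ones have e.g. 256 dimensions.
    vectors = [[0.0, 0.0], [1.0, 0.0], [5.0, 5.0]]
    names = ["species_a", "species_b", "species_c"]
    print(query_nearest(vectors, names, [0.9, 0.0], k=2))
```

Exhaustive search scales linearly with the number of indexed genomes, which is why a dedicated vector index matters for large collections.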
Author
panspace is developed by Jorge Avila Cartes