Indexing assemblies with autoencoders and FCGR

panspace

Related article: PanSpace: Fast and Scalable Indexing for Massive Bacterial Databases

panspace is a library for creating and querying vector-based indexes of (draft) bacterial genome assemblies.

The panspace query pipeline works as follows:

  1. First, each genome is represented by the Frequency matrix of its Chaos Game Representation (FCGR).
  2. Then, the FCGR is mapped to an n-dimensional vector, the embedding, by the Encoder, a Convolutional Neural Network called CNNFCGR.
  3. Finally, the embedding (the compressed representation of the input genome) is used to query an index of such vectors representing a bacterial pangenome.
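To make step 1 concrete, here is a minimal standard-library sketch of how an FCGR can be built from a sequence. It is not the panspace implementation: the corner assignment in `CORNERS` and the helper names `kmer_to_cell` and `fcgr` are assumptions for illustration, and the convention used by panspace (or ComplexCGR) may differ.

```python
from collections import Counter

# Assumed corner assignment; one common CGR convention, not necessarily panspace's.
CORNERS = {"A": (0, 0), "C": (0, 1), "G": (1, 1), "T": (1, 0)}


def kmer_to_cell(kmer):
    """Return the (row, col) cell of a k-mer in a 2^k x 2^k grid."""
    row = col = 0
    # Read the k-mer backwards so its last base picks the coarsest quadrant,
    # mirroring how the chaos game moves toward the most recent nucleotide.
    for base in reversed(kmer):
        x, y = CORNERS[base]
        row = 2 * row + y
        col = 2 * col + x
    return row, col


def fcgr(sequence, k):
    """Count the k-mers of `sequence` into a 2^k x 2^k matrix (list of lists)."""
    size = 2 ** k
    matrix = [[0] * size for _ in range(size)]
    seq = sequence.upper()
    counts = Counter(seq[i:i + k] for i in range(len(seq) - k + 1))
    for kmer, count in counts.items():
        if set(kmer) <= set(CORNERS):  # skip k-mers containing N or other symbols
            row, col = kmer_to_cell(kmer)
            matrix[row][col] += count
    return matrix


if __name__ == "__main__":
    for row in fcgr("ACGTACGTTTGA", k=2):
        print(row)
```

With k = 8, as in the recommended index, the matrix would have 256 x 256 cells, one per possible 8-mer.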

The library is built on TensorFlow and FAISS.

Available indexes

Download the index and encoder

Inside each .zip file you will find

  • Encoder: checkpoints/<name-model>.keras
  • Index: index/panspace.index, and in the same folder some json files with metadata (labels)
Kmer     Embedding Size   File index
8        256              triplet_semihard_loss-ranger-0.5-hq-256-CNNFCGR_Levels-level1-clip80.zip (Best)
6,7,8    128,256,512      check others...

We provide a snakemake pipeline to query a collection of genomes from a folder. If the environment was installed with conda from the .yml file, snakemake is already installed.

After decompressing the .zip you will find two folders, checkpoints and index, containing the encoder (.keras), the FAISS index (.index) and label metadata (.json). You need the path to the .keras file and to panspace.index.

.
├── checkpoints
│   └── weights-CNNFCGR_Levels.keras
└── index
    ├── embeddings.npy
    ├── id_embeddings.json
    ├── labels.json
    └── panspace.index
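As a convenience, the two paths can be located programmatically. The following is a minimal sketch that assumes the folder layout shown above; `find_artifacts` is a hypothetical helper, not part of panspace.

```python
from pathlib import Path


def find_artifacts(root):
    """Return (encoder_path, index_path) found under an extracted .zip folder."""
    root = Path(root)
    # The encoder file name varies between releases, so match on the extension.
    encoders = sorted((root / "checkpoints").glob("*.keras"))
    index_path = root / "index" / "panspace.index"
    if not encoders:
        raise FileNotFoundError(f"no .keras file under {root / 'checkpoints'}")
    if not index_path.is_file():
        raise FileNotFoundError(f"missing {index_path}")
    return encoders[0], index_path


if __name__ == "__main__":
    import sys

    # Usage: pass the extracted folder as the first argument (defaults to ".").
    encoder, index = find_artifacts(sys.argv[1] if len(sys.argv) > 1 else ".")
    print("encoder:", encoder)
    print("index:  ", index)
```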

Try panspace queries for single files

panspace app demo

Clone the repository

git clone https://github.com/pg-space/panspace.git
cd panspace

with CPU support

conda env create -f envs/cpu.yml
conda activate panspace-cpu

or with GPU support

conda env create -f envs/gpu.yml
conda activate panspace-gpu

Then run the streamlit app

panspace app

NOTE: these environments also install the workflow manager snakemake, which is needed to run queries efficiently, as we will see next.

Query index from a folder of files


We can query the index with

panspace query-smk \
    --dir-sequences "<path/to/folder>" \
    --path-encoder "<path/to/checkpoints/weights.keras>" \
    --path-index "<path/to/panspace.index>"

Note that this command is just a wrapper around a snakemake pipeline.

If the FCGR extension to KMC3 is installed, we can use the flag --fast-version to speed up the creation of FCGRs.

For more options, check

panspace query-smk --help
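If you want to launch the query from a Python script, the command above can be assembled and run with `subprocess`. This is a sketch under the assumption that the `panspace` executable is on your PATH; `build_query_command` and `run_query` are hypothetical helpers, and only the flags shown in this section are used.

```python
import subprocess


def build_query_command(dir_sequences, path_encoder, path_index, fast=False):
    """Assemble the `panspace query-smk` call as an argument list."""
    cmd = [
        "panspace", "query-smk",
        "--dir-sequences", str(dir_sequences),
        "--path-encoder", str(path_encoder),
        "--path-index", str(path_index),
    ]
    if fast:
        # Only meaningful if the FCGR extension to KMC3 is installed.
        cmd.append("--fast-version")
    return cmd


def run_query(dir_sequences, path_encoder, path_index, fast=False):
    """Run the query and raise CalledProcessError if the pipeline fails."""
    cmd = build_query_command(dir_sequences, path_encoder, path_index, fast)
    return subprocess.run(cmd, check=True)
```

Passing an argument list rather than a shell string avoids quoting problems when paths contain spaces.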

Using snakemake directly, we first need to

  1. set parameters in scripts/config_query.yml,

    • directory with sequences (accepted extensions .fa.gz, .fa, .fna)
    • define an output directory to save query results
    • gpu or cpu usage
    • path to the encoder (<path/to/encoder>.keras)
    • path to the index (<path/to/panspace-index>.index)
  2. and run

snakemake -s scripts/query.smk --cores 8 --use-conda
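Before running the pipeline, it can help to check which files in the sequence directory carry one of the accepted extensions. The sketch below is only an illustration: `list_sequence_files` is a hypothetical helper, and it assumes a non-recursive scan, which may not match how the pipeline itself discovers files.

```python
from pathlib import Path

# Extensions listed as accepted by the query pipeline.
ACCEPTED_EXTENSIONS = (".fa.gz", ".fa", ".fna")


def list_sequence_files(directory):
    """Return the files in `directory` whose names end with an accepted extension."""
    return sorted(
        path
        for path in Path(directory).iterdir()
        if path.is_file() and path.name.endswith(ACCEPTED_EXTENSIONS)
    )


if __name__ == "__main__":
    import sys

    for path in list_sequence_files(sys.argv[1] if len(sys.argv) > 1 else "."):
        print(path)
```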

Optional: faster queries, recommended if you have hundreds or thousands of assemblies to query

First install the FCGR extension to KMC3 and put the path to the installed bin of the fcgr tool in the scripts/config_fcgr.yml file. Then run,

snakemake -s scripts/query_fast.smk --cores 8 --use-conda

or pass the path directly on the command line

snakemake -s scripts/query_fast.smk --cores 8 --use-conda --config fcgr_bin=<path/to/fcgr>

NOTES

  • Change the number of cores (--cores <NUM_CORES>) if you have more available. This allows KMC3 to count k-mers from several assemblies in parallel (by default kmc_threads: 2, see scripts/config.yml).
  • The extension constructs FCGR representations with a C++ tool that extends the KMC3 output. The default version parses the output of KMC as a dictionary of k-mer counts and then uses the python library ComplexCGR to construct the FCGR.
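The default route described above, reading k-mer counts into a dictionary, can be sketched as follows. This assumes a text dump with one k-mer and its count per line, separated by whitespace; `parse_kmer_counts` is a hypothetical helper and not the parser used inside panspace.

```python
def parse_kmer_counts(lines):
    """Parse 'KMER<whitespace>COUNT' lines into a {kmer: count} dictionary."""
    counts = {}
    for line in lines:
        fields = line.split()
        if len(fields) != 2:
            continue  # skip blank or malformed lines
        kmer, count = fields
        counts[kmer.upper()] = int(count)
    return counts


if __name__ == "__main__":
    example = ["ACGTACGT\t12", "TTGACCAA\t3"]  # toy input, not real KMC output
    print(parse_kmer_counts(example))
```

A file can be parsed the same way by passing an open file handle, since iterating over it yields lines.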

Create your own encoder and index

NOTE: you can skip creating the encoder [step 2] and use the one trained by us (.keras). In that case, you can index your own dataset [step 3], although you still need to create the FCGRs [step 1].

Install the package

panspace requires python >= 3.9, < 3.11.
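A quick way to check whether an interpreter falls in that range before installing is sketched below; `is_supported` is a hypothetical helper written for this example.

```python
import sys


def is_supported(version_info=sys.version_info):
    """Return True if the (major, minor) version satisfies >= 3.9 and < 3.11."""
    return (3, 9) <= tuple(version_info[:2]) < (3, 11)


if __name__ == "__main__":
    status = "supported" if is_supported() else "not supported"
    print(f"Python {sys.version_info.major}.{sys.version_info.minor} is {status}")
```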

with CPU support

pip install "panspace[cpu] @ git+https://github.com/pg-space/panspace.git"

with GPU support

pip install "panspace[gpu] @ git+https://github.com/pg-space/panspace.git"

Install from conda environment (suggested)

with CPU support

conda env create -f envs/cpu.yml
conda activate panspace-cpu

with GPU support

conda env create -f envs/gpu.yml
conda activate panspace-gpu

This will also install snakemake.

step-by-step guide

CLI

The CLI provides commands for

  • creating FCGRs from k-mer counts,
  • training an encoder using metric learning (if labels are available) or an autoencoder,
  • creating and querying an index of embeddings.
>> panspace --help                                             
                                                                                           
 Usage: panspace [OPTIONS] COMMAND [ARGS]...                                               
                                                                                           
 🐱 Welcome to panspace (version 0.2.0), a tool for Indexing and Querying a bacterial      
 pan-genome based on embeddings                                                            
                                                                                           
╭─ Options ───────────────────────────────────────────────────────────────────────────────╮
│ --install-completion          Install completion for the current shell.                 │
│ --show-completion             Show completion for the current shell, to copy it or      │
│                               customize the installation.                               │
│ --help                        Show this message and exit.                               │
╰─────────────────────────────────────────────────────────────────────────────────────────╯
╭─ Commands ──────────────────────────────────────────────────────────────────────────────╮
│ app              Run streamlit app                                                      │
│ data-curation    Find outliers and mislabaled samples.                                  │
│ docs             Open documentation webpage.                                            │
│ fcgr             Create FCGRs from fasta file or from txt file with kmers and counts.   │
│ index            Create and query index. Utilities to test index.                       │
│ query-smk        Run the Snakemake pipeline with the specified configuration.           │
│ stats-assembly   N50, number of contigs, avg length, total length.                      │
│ trainer          Train Autoencoder/Metric Learning. Utilities.                          │
│ utils            Extract info from text or log files                                    │
│ what-to-do       🐱 If you are new here, check this step-by-step guide                  │
╰─────────────────────────────────────────────────────────────────────────────────────────╯

1. Create FCGR of assemblies

You can use the following command to create an FCGR (.npy file) from a fasta file (and more):

panspace fcgr --help
                                                                                           
 Usage: panspace fcgr [OPTIONS] COMMAND [ARGS]...                                          
                                                                                           
 Create FCGRs from fasta file or from txt file with kmers and counts.                      
                                                                                           
╭─ Options ───────────────────────────────────────────────────────────────────────────────╮
│ --help          Show this message and exit.                                             │
╰─────────────────────────────────────────────────────────────────────────────────────────╯
╭─ Commands ──────────────────────────────────────────────────────────────────────────────╮
│ from-fasta         Create the Frequency matrix of CGR (FCGR) from a fasta file.         │
│ from-kmer-counts   Create the Frequency matrix of CGR (FCGR) from k-mer counts.         │
│ to-image           Save FCGR as image from npy file.                                    │
╰─────────────────────────────────────────────────────────────────────────────────────────╯

For large datasets, such as AllTheBacteria, we suggest relying on specialized k-mer counters, such as KMC3 or Jellyfish.

We provide snakemake pipelines to create FCGRs (see scripts/), from:

The pipelines rely on KMC3 for k-mer counting, and on an extension of it to create FCGRs: fcgr. The latter needs to be installed manually before using the snakemake pipelines. You do not need to install KMC3 yourself; the snakemake pipelines handle that.

2. Train an encoder to create the vector representations

  1. Split dataset into train, validation and test sets
panspace trainer split-dataset --help
  2. Train

Options

  • Do you have labels for each assembly?
    • Use metric learning with the triplet loss
    • Or metric learning with the contrastive loss
  • If you do not have labels, then use unsupervised learning with the AutoencoderFCGR architecture.

The CNNFCGR architecture can be used in all of these settings.
panspace trainer metric-learning --help # triplet loss
panspace trainer one-shot --help        # contrastive loss
panspace trainer autoencoder --help     
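For intuition about the metric-learning option, the triplet loss for a single (anchor, positive, negative) triple can be written in a few lines. This is a plain-Python sketch of the generic formula, not the training code of panspace: the Euclidean distance and the `margin` value are assumptions, and strategies for selecting triplets (such as semi-hard mining) are not shown.

```python
import math


def euclidean(u, v):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def triplet_loss(anchor, positive, negative, margin=0.5):
    """max(d(anchor, positive) - d(anchor, negative) + margin, 0).

    The loss is zero once the negative is farther from the anchor than the
    positive by at least `margin` (a hypothetical value here).
    """
    return max(euclidean(anchor, positive) - euclidean(anchor, negative) + margin, 0.0)


if __name__ == "__main__":
    # Toy 2-dimensional embeddings; real embeddings would have e.g. 256 dimensions.
    print(triplet_loss([0.0, 0.0], [0.0, 1.0], [3.0, 4.0]))
```

Minimizing this quantity pulls embeddings with the same label together and pushes embeddings with different labels apart, which is what makes nearest-neighbor queries on the index meaningful.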

Get the Encoder

  • If using the triplet loss, the output model is the encoder.
  • If using the contrastive loss, you can get the encoder with panspace trainer extract-backbone-one-shot
  • If using the autoencoder, you can get the encoder with panspace trainer split-autoencoder

3. Create and query an index

  1. Create Index
panspace index create --help
  2. Query Index

If querying is done from FCGR in numpy format, then use

panspace index query --help

but if you want to query the index directly from assemblies, we encourage you to use the snakemake pipelines provided above.
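Conceptually, querying the index means finding the stored embeddings closest to the query embedding and reading off their labels. The sketch below illustrates this with a brute-force search in plain Python as a stand-in for FAISS; the helper names and the majority vote used to pick a label are assumptions for illustration, not a description of how panspace assigns labels.

```python
import math
from collections import Counter


def nearest_neighbors(query, embeddings, n_neighbors=3):
    """Return the n closest (distance, position) pairs by Euclidean distance."""
    distances = [(math.dist(query, vec), pos) for pos, vec in enumerate(embeddings)]
    return sorted(distances)[:n_neighbors]


def predict_label(query, embeddings, labels, n_neighbors=3):
    """Assign the most frequent label among the nearest neighbors."""
    hits = nearest_neighbors(query, embeddings, n_neighbors)
    votes = Counter(labels[pos] for _, pos in hits)
    return votes.most_common(1)[0][0]


if __name__ == "__main__":
    # Toy 2-dimensional embeddings with made-up labels.
    embeddings = [[0.0, 0.0], [0.0, 1.0], [5.0, 5.0], [5.0, 6.0]]
    labels = ["species_a", "species_a", "species_b", "species_b"]
    print(predict_label([0.2, 0.1], embeddings, labels))
```

A brute-force scan like this is linear in the number of stored vectors; a dedicated vector index is what keeps queries fast on large collections.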


Author

panspace is developed by Jorge Avila Cartes
