Indexing assemblies with autoencoders and FCGR


`panspace`

panspace is a library for creating and querying vector-based indexes of bacterial genome (draft) assemblies.

The panspace query pipeline works as follows:

First, each genome is represented by its FCGR, the frequency matrix of the Chaos Game Representation of DNA.
Then, the FCGR is mapped to an n-dimensional vector, the embedding, by the Encoder, a Convolutional Neural Network called CNNFCGR.
Finally, the embedding (the compressed representation of the input genome) is used to query an index of such vectors representing a bacterial pangenome.
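
To make the first step concrete, here is a minimal sketch of how an FCGR can be built from a sequence. It assumes one particular assignment of the four nucleotides to the quadrants of the square; the names `QUADRANT`, `kmer_position` and `fcgr` are illustrative and not part of panspace, and the library's own matrix orientation may differ.

```python
from collections import Counter

# Assumed corner convention (hypothetical; panspace's layout may differ):
# every nucleotide selects one quadrant, given as (row bit, column bit).
QUADRANT = {"A": (0, 0), "C": (0, 1), "G": (1, 1), "T": (1, 0)}


def kmer_position(kmer):
    """Return the (row, col) cell of a k-mer in a 2^k x 2^k matrix."""
    row = col = 0
    for base in kmer:
        row_bit, col_bit = QUADRANT[base]
        row = 2 * row + row_bit
        col = 2 * col + col_bit
    return row, col


def fcgr(sequence, k):
    """Count the k-mers of `sequence` and store each count in its cell."""
    sequence = sequence.upper()
    size = 2 ** k
    matrix = [[0] * size for _ in range(size)]
    counts = Counter(sequence[i:i + k] for i in range(len(sequence) - k + 1))
    for kmer, count in counts.items():
        # Skip k-mers containing N or any other non-ACGT symbol.
        if set(kmer) <= set(QUADRANT):
            row, col = kmer_position(kmer)
            matrix[row][col] = count
    return matrix
```

Every k-mer maps to exactly one cell, so for k = 8 the matrix has 256 x 256 entries regardless of genome length. This fixed size is what lets a convolutional encoder take it as input.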

The library is built on TensorFlow and FAISS.

Available indexes

Download the index and encoder

Inside each .zip file you will find:

Encoder: checkpoints/<name-model>.keras
Index: index/panspace.index, plus JSON files with metadata (labels) in the same folder

Kmer	Embedding Size	File index
8	256	`triplet_semihard_loss-ranger-0.5-hq-256-CNNFCGR_Levels-level1-clip80.zip` (Best)
6,7,8	128,256,512	check others...

We provide a snakemake pipeline to query a collection of genomes from a folder. If the environment was created with conda from the .yml file, snakemake is already installed.

After decompressing the .zip you will find two folders, checkpoints and index, containing the encoder (.keras), the FAISS index (.index) and the label metadata (.json). You need the path to the .keras file and the path to panspace.index.

.
├── checkpoints
│   └── weights-CNNFCGR_Levels.keras
└── index
    ├── embeddings.npy
    ├── id_embeddings.json
    ├── labels.json
    └── panspace.index
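
The two paths needed for querying can be collected from the decompressed folder with a few lines of Python. This is only a sketch based on the layout shown above; `locate_artifacts` is a hypothetical helper, not a panspace function, and it assumes a single .keras file under checkpoints.

```python
from pathlib import Path


def locate_artifacts(root):
    """Find the encoder (.keras) and the FAISS index in a decompressed folder."""
    root = Path(root)
    encoders = sorted((root / "checkpoints").glob("*.keras"))
    index_path = root / "index" / "panspace.index"
    if not encoders:
        raise FileNotFoundError(f"no .keras file under {root / 'checkpoints'}")
    if not index_path.is_file():
        raise FileNotFoundError(f"missing {index_path}")
    # If several encoders are present, the first one in sorted order is returned.
    return encoders[0], index_path
```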

Try `panspace` queries for single files

Clone the repository

git clone https://github.com/pg-space/panspace.git
cd panspace

with CPU support

conda env create -f envs/cpu.yml
conda activate panspace-cpu

or with GPU support

conda env create -f envs/gpu.yml
conda activate panspace-gpu

Then run the streamlit app

panspace app

NOTE: the environments also install the workflow manager snakemake, which is needed to run queries efficiently, as described next.

Query `index` from a folder of files

We can query the index with

panspace query-smk \
    --dir-sequences "<path/to/folder>" \
    --path-encoder "<path/to/checkpoints/weights.keras>" \
    --path-index "<path/to/panspace.index>"

Note that this command is just a wrapper around a snakemake pipeline.

If the FCGR extension to KMC3 is installed, we can use the flag --fast-version to speed up the creation of FCGRs.

For more options, check

panspace query-smk --help
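
When queries are launched from a script, the same call can be assembled as an argument list. The sketch below uses only the flags shown in this section; `build_query_command` is a hypothetical helper and the paths in the example are placeholders.

```python
import shlex


def build_query_command(dir_sequences, path_encoder, path_index, fast=False):
    """Assemble the `panspace query-smk` call as an argument list."""
    command = [
        "panspace", "query-smk",
        "--dir-sequences", str(dir_sequences),
        "--path-encoder", str(path_encoder),
        "--path-index", str(path_index),
    ]
    if fast:
        # Only useful when the FCGR extension to KMC3 is installed.
        command.append("--fast-version")
    return command


# Placeholder paths, for illustration only.
command = build_query_command(
    "path/to/folder",
    "path/to/checkpoints/weights.keras",
    "path/to/panspace.index",
)
print(shlex.join(command))
```

The resulting list can be handed to `subprocess.run(command, check=True)` from inside the activated conda environment.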

Using snakemake directly, we first need to

set parameters in scripts/config_query.yml,
- directory with sequences (accepted extensions .fa.gz, .fa, .fna)
- define an output directory to save query results
- gpu or cpu usage
- path to the encoder (<path/to/encoder>.keras)
- path to the index (<path/to/panspace-index>.index)
and run

snakemake -s scripts/query.smk --cores 8 --use-conda
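
Before running the pipeline, it can be useful to check which files in the sequences directory match the accepted extensions listed above. This sketch assumes a flat folder that is not searched recursively; `list_assemblies` is a hypothetical helper, not a panspace function.

```python
from pathlib import Path

# Extensions accepted by the query pipeline, as listed above.
ACCEPTED_EXTENSIONS = (".fa.gz", ".fa", ".fna")


def list_assemblies(folder):
    """Return the files in `folder` whose names end with an accepted extension."""
    return sorted(
        path for path in Path(folder).iterdir()
        if path.is_file() and path.name.endswith(ACCEPTED_EXTENSIONS)
    )
```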

Optional: faster queries, recommended if you have hundreds or thousands of assemblies to query

First install the FCGR extension to KMC3 and put the path to the installed bin of the fcgr tool in the scripts/config_fcgr.yml file. Then run,

snakemake -s scripts/query_fast.smk --cores 8 --use-conda

or pass it directly on the command line

snakemake -s scripts/query_fast.smk --cores 8 --use-conda --config fcgr_bin=<path/to/fcgr>

NOTES

Change the number of cores (--cores <NUM_CORES>) if you have more available; this allows the k-mer counting done by KMC3 to be parallelized across assemblies (by default kmc_threads: 2, see scripts/config.yml).
This extension constructs FCGR representations with C++ code that extends the KMC3 output. The default version parses the output of KMC as a dictionary of k-mer counts and then uses the Python library ComplexCGR to construct the FCGR.
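
The default path described in these notes, reading k-mer counts into a dictionary, can be sketched as follows. It assumes a plain-text dump with one k-mer and its count per line, separated by whitespace; the exact format panspace reads is not stated here, and `parse_kmer_counts` is a hypothetical name.

```python
def parse_kmer_counts(lines):
    """Parse 'KMER<whitespace>COUNT' lines into a dictionary of k-mer counts."""
    counts = {}
    for line in lines:
        line = line.strip()
        if not line:
            continue  # ignore blank lines
        kmer, count = line.split()
        kmer = kmer.upper()
        # Add up counts in case the same k-mer appears on several lines.
        counts[kmer] = counts.get(kmer, 0) + int(count)
    return counts
```

The function accepts any iterable of lines, so an open text file can be passed directly.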

Create your own `encoder` and `index`

NOTE: you can skip creating the encoder [step 2] and use the one we trained (.keras). In that case, you can index your own dataset [step 3], but you still need to create the FCGRs [step 1].

Install the package

panspace requires python >= 3.9, < 3.11.

with CPU support

pip install "panspace[cpu] @ git+https://github.com/pg-space/panspace.git"

with GPU support

pip install "panspace[gpu] @ git+https://github.com/pg-space/panspace.git"

Install from conda environment (suggested)

with CPU support

conda env create -f envs/cpu.yml
conda activate panspace-cpu

with GPU support

conda env create -f envs/gpu.yml
conda activate panspace-gpu

This will also install snakemake.

step-by-step guide

CLI

It provides commands for

creating FCGRs from k-mer counts,
training an encoder using metric learning (if labels are available) or an autoencoder,
creating and querying an index of embeddings.

>> panspace --help                                             
                                                                                           
 Usage: panspace [OPTIONS] COMMAND [ARGS]...                                               
                                                                                           
 🐱 Welcome to panspace (version 0.2.0), a tool for Indexing and Querying a bacterial      
 pan-genome based on embeddings                                                            
                                                                                           
╭─ Options ───────────────────────────────────────────────────────────────────────────────╮
│ --install-completion          Install completion for the current shell.                 │
│ --show-completion             Show completion for the current shell, to copy it or      │
│                               customize the installation.                               │
│ --help                        Show this message and exit.                               │
╰─────────────────────────────────────────────────────────────────────────────────────────╯
╭─ Commands ──────────────────────────────────────────────────────────────────────────────╮
│ app              Run streamlit app                                                      │
│ data-curation    Find outliers and mislabaled samples.                                  │
│ docs             Open documentation webpage.                                            │
│ fcgr             Create FCGRs from fasta file or from txt file with kmers and counts.   │
│ index            Create and query index. Utilities to test index.                       │
│ query-smk        Run the Snakemake pipeline with the specified configuration.           │
│ stats-assembly   N50, number of contigs, avg length, total length.                      │
│ trainer          Train Autoencoder/Metric Learning. Utilities.                          │
│ utils            Extract info from text or log files                                    │
│ what-to-do       🐱 If you are new here, check this step-by-step guide                  │
╰─────────────────────────────────────────────────────────────────────────────────────────╯
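
Among the commands listed, stats-assembly reports the N50 of an assembly. As a reminder of what that statistic measures, here is a small sketch using the standard definition; it is not panspace's implementation, and `n50` is an illustrative name.

```python
def n50(contig_lengths):
    """Length of the contig at which the running total, taken from the
    longest contig to the shortest, first reaches half the assembly length."""
    total = sum(contig_lengths)
    running = 0
    for length in sorted(contig_lengths, reverse=True):
        running += length
        if 2 * running >= total:
            return length
    raise ValueError("no contigs given")
```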

1. Create FCGR of assemblies

Even though you can use the following command to create an FCGR (.npy file) from a fasta file (and more)

panspace fcgr --help
                                                                                           
 Usage: panspace fcgr [OPTIONS] COMMAND [ARGS]...                                          
                                                                                           
 Create FCGRs from fasta file or from txt file with kmers and counts.                      
                                                                                           
╭─ Options ───────────────────────────────────────────────────────────────────────────────╮
│ --help          Show this message and exit.                                             │
╰─────────────────────────────────────────────────────────────────────────────────────────╯
╭─ Commands ──────────────────────────────────────────────────────────────────────────────╮
│ from-fasta         Create the Frequency matrix of CGR (FCGR) from a fasta file.         │
│ from-kmer-counts   Create the Frequency matrix of CGR (FCGR) from k-mer counts.         │
│ to-image           Save FCGR as image from npy file.                                    │
╰─────────────────────────────────────────────────────────────────────────────────────────╯

we suggest that, for large datasets such as AllTheBacteria, it is better to rely on specialized k-mer counters such as KMC3 or Jellyfish.

We provide snakemake pipelines (see scripts/) to create FCGRs from:

a folder containing .fa.gz files
a folder containing .fa files
the AllTheBacteria dataset

The pipelines rely on KMC3 for k-mer counting and on an extension of it, fcgr, to create FCGRs. The latter must be installed manually before using the snakemake pipelines. You do not need to install KMC3 yourself; the snakemake pipelines handle that.

2. Train an encoder to create the vector representations

Split dataset into train, validation and test sets

panspace trainer split-dataset --help
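
The idea behind this step can be illustrated with a plain shuffled split. The fractions, the seed, and the function name `split_dataset` below are assumptions made for the example; the options of the actual command are listed by its --help output.

```python
import random


def split_dataset(items, train_frac=0.8, val_frac=0.1, seed=42):
    """Shuffle `items` reproducibly and split into train, validation and test."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)
    n_train = int(len(shuffled) * train_frac)
    n_val = int(len(shuffled) * val_frac)
    train = shuffled[:n_train]
    val = shuffled[n_train:n_train + n_val]
    test = shuffled[n_train + n_val:]  # whatever remains
    return train, val, test
```

A fixed seed keeps the split reproducible across runs. When labels are available, a split stratified by label is usually preferable so that rare species appear in every subset.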

Train

Options

Do you have labels for each assembly?
- Use metric learning with the triplet loss
- Or metric learning with the contrastive loss
If you do not have labels, use unsupervised learning with the AutoencoderFCGR architecture. In all cases, the CNNFCGR architecture can be used.

panspace trainer metric-learning --help # triplet loss
panspace trainer one-shot --help        # contrastive loss
panspace trainer autoencoder --help
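
For intuition about the triplet loss option, here is a minimal sketch of its standard form on a single triplet of embeddings. The margin value and the use of non-squared Euclidean distance are assumptions; panspace's training code may differ, for example in how triplets are mined within a batch.

```python
import math


def triplet_loss(anchor, positive, negative, margin=1.0):
    """max(d(anchor, positive) - d(anchor, negative) + margin, 0).

    `positive` shares the anchor's label and `negative` does not.
    """
    d_pos = math.dist(anchor, positive)
    d_neg = math.dist(anchor, negative)
    return max(d_pos - d_neg + margin, 0.0)
```

The loss is zero once the negative is farther from the anchor than the positive by at least the margin. Minimizing it therefore pulls embeddings with the same label together and pushes different labels apart.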

Get the Encoder

If using the triplet loss, the output model is the encoder.
If using the contrastive loss, you can get the encoder with panspace trainer extract-backbone-one-shot
If using the autoencoder, you can get the encoder with panspace trainer split-autoencoder

3. Create and query an index

Create Index

panspace index create --help

Query Index

If querying is done from FCGR in numpy format, then use

panspace index query --help

but if you want to query the index directly from assemblies, we encourage you to use the snakemake pipelines provided above.
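
Conceptually, querying returns the stored embeddings closest to the query embedding, together with their labels. The sketch below does this by exhaustive Euclidean search in pure Python as a stand-in for the FAISS index; the function name `query` and the toy vectors in the example are hypothetical, and the index type panspace builds is not specified here.

```python
import math


def query(embedding, index_vectors, labels, k=1):
    """Return the k (label, distance) pairs closest to `embedding`."""
    scored = sorted(
        (math.dist(embedding, vector), label)
        for vector, label in zip(index_vectors, labels)
    )
    return [(label, distance) for distance, label in scored[:k]]


# Toy two-dimensional example; real embeddings have e.g. 256 dimensions.
index_vectors = [[0.0, 0.0], [1.0, 0.0], [5.0, 5.0]]
labels = ["species_a", "species_b", "species_c"]
neighbors = query([0.9, 0.0], index_vectors, labels, k=2)
```

Exhaustive search scales linearly with the number of indexed genomes, which is why a dedicated library such as FAISS is used for large collections.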

Author

panspace is developed by Jorge Avila Cartes

Release history

This version: 0.2.0 (Dec 5, 2025)

Download files

Source Distribution

panspace-0.2.0.tar.gz (106.5 kB)

Uploaded Dec 5, 2025 Source

Built Distribution

panspace-0.2.0-py3-none-any.whl (101.3 kB)

Uploaded Dec 5, 2025 Python 3

File details

Details for the file panspace-0.2.0.tar.gz.

File metadata

Download URL: panspace-0.2.0.tar.gz
Upload date: Dec 5, 2025
Size: 106.5 kB
Tags: Source
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.2.0 CPython/3.14.1

File hashes

Hashes for panspace-0.2.0.tar.gz
Algorithm	Hash digest
SHA256	`98a6049b4de91e53eafe9b8e9b00c964f5cc2e3c40d66148bf3793d762efccba`
MD5	`76336a13ae189a41d2373d4ed06d8151`
BLAKE2b-256	`f1f16a88966d91a33760ccf85aa76831b6516f462aab514bbc087c9e9dbfc741`

File details

Details for the file panspace-0.2.0-py3-none-any.whl.

File metadata

Download URL: panspace-0.2.0-py3-none-any.whl
Upload date: Dec 5, 2025
Size: 101.3 kB
Tags: Python 3
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.2.0 CPython/3.14.1

File hashes

Hashes for panspace-0.2.0-py3-none-any.whl
Algorithm	Hash digest
SHA256	`bcf98313fccd0ca24e8201ed1081c521f242defd4008c514959cd5d062f51753`
MD5	`4bd9fd42ac6b7e99663555025a621692`
BLAKE2b-256	`ca1bb4a9e5096d5d7fd48c85b17bc47939255cafe62ac7d6545da3e2ba781e58`
