A package which provides various ways to analyze NGS data from phage display campaigns

Welcome to ExpoSeq

ExpoSeq is a pipeline for processing and analyzing FASTQ files from high-throughput sequencing of samples from display techniques such as phage display. It uses MiXCR to align and assemble the data, which you can subsequently analyze in multiple plots. The pipeline focuses on analyzing the identity between samples, but it also applies various clustering techniques to analyze the relation between the sequences. In addition, you can add binding data to relate the clusters to affinity.

Installation

Make sure Python is installed on your system. You can then install ExpoSeq from the terminal with:

pip install ExpoSeq

Ensure that you have Python > 3.9 installed.

To get started, please download MiXCR and follow the instructions in its official documentation. You can also use the test version of ExpoSeq without installing MiXCR.

Importing the Plotting Tool

To access the plotting tool, you will need to import it into your console by running the following command:

from ExpoSeq.pipeline import PlotManager

The PlotManager is the main interface for creating various plots using your FASTQ data. You can create an instance of the PlotManager by running the following command:

plot = PlotManager()

After that, you will be guided automatically through the data processing and preparation. As soon as this has finished, the pipeline automatically creates an analysis of your data and stores the plots in:

📦my_experiments
 ┣ 📂YOUR_EXPERIMENT_NAME
 ┃ ┣ 📂plots
 ┃ ┃ ┗ 📂aaSeqCDR3

If you would like to know more about ExpoSeq's generated data structure, please have a look here.
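To check which plots were written after the automatic analysis, the folder structure above can be walked with a few lines of Python. This is a minimal sketch: the helper name `list_plot_files` is hypothetical and not part of ExpoSeq, and it assumes that `my_experiments` sits in your current working directory.

```python
from pathlib import Path


def list_plot_files(experiment_name, base_dir="my_experiments", region="aaSeqCDR3"):
    """Return the sorted files found under base_dir/experiment_name/plots/region.

    Hypothetical helper; the folder names mirror the tree shown above.
    """
    plot_dir = Path(base_dir) / experiment_name / "plots" / region
    if not plot_dir.is_dir():
        # Nothing has been generated yet, or base_dir is located elsewhere.
        return []
    return sorted(p for p in plot_dir.rglob("*") if p.is_file())


if __name__ == "__main__":
    # Replace YOUR_EXPERIMENT_NAME with the name you chose during setup.
    for plot_file in list_plot_files("YOUR_EXPERIMENT_NAME"):
        print(plot_file)
```

If the list is empty, the experiment folder may live in a different location than assumed here.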

Generate a dashboard (optional)

The pipeline tries to generate a dashboard of your report on the initial launch. You can also generate it manually by typing the following command:

plot.dashboard()

The pipeline takes your data and creates an .html file in your default directory. You can open it with any web browser to get a summary of your data and additional visualizations such as word clouds and much more!
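If you are unsure which .html file was created, a small script can look for the most recently modified one and open it. This is only a sketch: the function names are hypothetical, and it assumes the dashboard was written to the directory you pass in (the current working directory by default), which may differ from your setup.

```python
import webbrowser
from pathlib import Path


def newest_html_file(directory="."):
    """Return the most recently modified .html file in `directory`, or None."""
    html_files = [p for p in Path(directory).glob("*.html") if p.is_file()]
    if not html_files:
        return None
    return max(html_files, key=lambda p: p.stat().st_mtime)


def open_in_browser(html_path):
    """Open a local HTML file in the system's default web browser."""
    # as_uri() needs an absolute path, hence resolve().
    return webbrowser.open(Path(html_path).resolve().as_uri())


# Example usage (assumes the dashboard is in the current directory):
# dashboard = newest_html_file()
# if dashboard is not None:
#     open_in_browser(dashboard)
```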

Automatically generate a report of your whole analysis (optional)

The pipeline can use Quarto to generate a whole report automatically, based on the results generated by the pipeline. You can have a look at a generated example here by downloading and opening the html file. Before you can use this functionality, you need to install Quarto on your system. I used it in VS Code. After you have installed Quarto, you can run:

plot.create_report()

The pipeline automatically checks whether Quarto exists on your device. You can use the Quarto file generated by the pipeline to add your findings and individualize the report with Markdown. If you have made changes to the file, you can render it with Quarto by running:

quarto render YOUR/PATH/TO/FILE.qmd

NOTE: Do not change the default structure of the files in the folder with the plots which is generated by the pipeline. Otherwise, this functionality might not work. Furthermore, the report should contain the binding data if you have uploaded it and created the default plots with it.
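If you prefer to re-render an edited report from Python rather than from the terminal, the Quarto check and the render command above can be wrapped as follows. This is a sketch with hypothetical function names, not ExpoSeq's own implementation; it only assumes that the `quarto` executable is on your PATH.

```python
import shutil
import subprocess


def build_render_command(qmd_path, quarto_executable="quarto"):
    """Return the argument list for `quarto render <file>`."""
    return [quarto_executable, "render", str(qmd_path)]


def render_report(qmd_path):
    """Render a .qmd file with Quarto, failing early if Quarto is missing."""
    if shutil.which("quarto") is None:
        raise RuntimeError("Quarto was not found on PATH; please install it first.")
    # check=True raises CalledProcessError if rendering fails.
    subprocess.run(build_render_command(qmd_path), check=True)


# Example usage:
# render_report("YOUR/PATH/TO/FILE.qmd")
```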

Individualize Plots by using the PlotManager (optional)


If you want to create some plots by yourself, please take a look at the [Jupyter script](ExpoSeq_handsOn.ipynb).

The workflow of the pipeline after the initial call is shown in the following. There, blue boxes indicate your input, gray boxes are optional inputs, and black and red boxes are processing steps and output, respectively.

If you just want to test the pipeline and see its functions, you can call:

plot = PlotManager(test_version = True)

If you would like details about the inputs and functions of the PlotManager, call:

help(plot)

You can also call for specific plots, for instance:

help(plot.jaccard)
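For orientation, the Jaccard index that such a plot is named after measures the overlap between two sets as the size of their intersection divided by the size of their union. The sketch below illustrates the general formula on CDR3 sequences; whether `plot.jaccard` computes it in exactly this way is an assumption, and the sequence `CARDYW` is made up for the example.

```python
def jaccard_index(sample_a, sample_b):
    """Return |A ∩ B| / |A ∪ B| for two collections of sequences."""
    set_a, set_b = set(sample_a), set(sample_b)
    union = set_a | set_b
    if not union:
        # Two empty samples share nothing; define their similarity as 0.
        return 0.0
    return len(set_a & set_b) / len(union)


# Two illustrative samples sharing two out of four distinct sequences.
sample_1 = ["AIEAAAC", "AEMNW", "PEICEES"]
sample_2 = ["AEMNW", "PEICEES", "CARDYW"]
similarity = jaccard_index(sample_1, sample_2)  # 2 shared / 4 distinct = 0.5
```

A value of 1 means both samples contain the same set of sequences, while 0 means they share none.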

Upload binding data (optional)

If you have conducted DELFIA or other techniques to obtain binding data for certain sequences (usually Sanger sequenced), you can upload these in a certain format and use them for clustering to potentially find other suitable sequences with high binding. The table has to have the following format and can be created in Excel.

aaSeqCDR3   Antigen 1   Antigen 2   Antigen 3
AIEAAAC     10000       30294       0
AEMNW       1000        0           0
PEICEES     0           1929        100000

You can upload the table as a csv or xlsx file, but make sure that the first column's name is aaSeqCDR3 and that it is in row 1. Besides the given example, you can have a look at an example file. If you decide to work with that file, make sure to delete the first column, which contains the row number.
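If you would rather create the table with a script than in a spreadsheet, the format described above can be written with Python's standard `csv` module. This is a minimal sketch: the function names are hypothetical, a comma-separated file is assumed, and no row-number column is written, so nothing has to be deleted afterwards.

```python
import csv

REQUIRED_FIRST_COLUMN = "aaSeqCDR3"


def write_binding_table(path, antigens, rows):
    """Write a binding table with aaSeqCDR3 as the first column in row 1.

    `rows` maps each CDR3 sequence to one binding value per antigen.
    """
    with open(path, "w", newline="") as handle:
        writer = csv.writer(handle)
        writer.writerow([REQUIRED_FIRST_COLUMN, *antigens])
        for sequence, values in rows.items():
            if len(values) != len(antigens):
                raise ValueError(f"{sequence}: expected {len(antigens)} values")
            writer.writerow([sequence, *values])


def first_column_is_valid(path):
    """Check that the header row starts with the required column name."""
    with open(path, newline="") as handle:
        header = next(csv.reader(handle), [])
    return bool(header) and header[0] == REQUIRED_FIRST_COLUMN


# Example usage with the values from the table above:
# write_binding_table(
#     "binding_data.csv",
#     ["Antigen 1", "Antigen 2", "Antigen 3"],
#     {"AIEAAAC": [10000, 30294, 0], "AEMNW": [1000, 0, 0], "PEICEES": [0, 1929, 100000]},
# )
```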

If you have prepared your data you can upload it with:

plot.add_binding_data()

Note: If you decide to add more binding data to your analysis, you can use the same command and choose the new file with the file chooser; it will be added to the existing data. This can also be useful if you cannot manage to merge multiple antigens on the sequences in Excel. In that case, you can upload the binding data for each antigen separately.

Data processing on a Cluster (optional)

First, clone the repository with the processing scripts into your working directory:

git clone https://github.com/nilshof01/ExpoSeq

I have prepared an example job script for working on an LSF cluster. You can have a look at it with:

cd ExpoSeq/bash_processing
nano example_LSF_cluster.sh

To run your script interactively you can call:

python ~/ExpoSeq/bash_processing/mixcr_cl.py $PATH_TO_MIXCR $YOUR_EXPERIMENT_NAME $PATH_TO_FORWARD_FILES 

NOTE: You need to have MiXCR installed in your working directory to be able to start the processing. To use multithreading and increase the RAM allocation, have a look at the following parameters you can define:

  • --path_to_mixcr: The file path to the mixcr.jar file.
  • --experiment_name: A string which is the name of your experiment.
  • --path_to_forward: The directory of the fastq files with the forward reads.
  • --path_to_backward: The directory of the fastq files with the backward reads.
  • --threads: The number of threads you would like to use for the processing.
  • --method: The MiXCR method you would like to use to align and assemble the reads. Default is milab-human-tcr-dna-multiplex-cdr3.
  • --java_heap_size: Memory for processing in MB.

NOTE: If you only want to process forward reads, you do not need to add the path to the directory with the backward reads. Furthermore, if you would like to analyze paired-end sequencing data, please make sure that the forward and backward fastq files are in separate folders.
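Before submitting a paired-end job, it can help to verify that the two folders are set up as described in the note above. The sketch below is not part of ExpoSeq: the function names and the accepted file suffixes are assumptions, and it further assumes that every forward file has exactly one backward counterpart.

```python
from pathlib import Path

# Assumed file endings; adjust them if your files are named differently.
FASTQ_SUFFIXES = (".fastq", ".fq", ".fastq.gz", ".fq.gz")


def list_fastq(directory):
    """Return the sorted names of the fastq files directly inside `directory`."""
    return sorted(
        p.name
        for p in Path(directory).iterdir()
        if p.is_file() and p.name.endswith(FASTQ_SUFFIXES)
    )


def check_paired_dirs(forward_dir, backward_dir):
    """Raise ValueError if the folders are identical or hold unequal file counts."""
    if Path(forward_dir).resolve() == Path(backward_dir).resolve():
        raise ValueError("Forward and backward reads must be in separate folders.")
    forward = list_fastq(forward_dir)
    backward = list_fastq(backward_dir)
    if len(forward) != len(backward):
        raise ValueError(
            f"Found {len(forward)} forward but {len(backward)} backward fastq files."
        )
    return forward, backward
```

The check only compares file counts, so it does not confirm that the files actually belong to the same samples.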

After the processing has finished, you can import the folder with the processed files into the PlotManager. You can find the corresponding folder under

📦my_experiments
 ┣ 📂YOUR_EXPERIMENT_NAME

As soon as you call the PlotManager, you need to press 2 to upload the directory with the files.

Talk with your data (optional)

I implemented pandasai in the pipeline to give you the option to investigate your data and even create plots without any further knowledge of programming in Python. You can call the following command to start the chat with pandasai:

plot.chat()

The pipeline will prompt you to enter an API key. This is necessary because the engine needs to connect to a large language model. You can obtain the API key from OpenAI.

NOTE: Conversations with large language models, such as GPT-3, are not only highly energy demanding and thus costly, but also have a high environmental footprint. Please do your research in advance and use this option with the environment in mind.

References

[1] Dmitriy A. Bolotin, Stanislav Poslavsky, Igor Mitrophanov, Mikhail Shugay, Ilgar Z. Mamedov, Ekaterina V. Putintseva, and Dmitriy M. Chudakov. "MiXCR: software for comprehensive adaptive immunity profiling." Nature methods 12, no. 5 (2015): 380-381.

[2] Dmitriy A. Bolotin, Stanislav Poslavsky, Alexey N. Davydov, Felix E. Frenkel, Lorenzo Fanchi, Olga I. Zolotareva, Saskia Hemmers, Ekaterina V. Putintseva, Anna S. Obraztsova, Mikhail Shugay, Ravshan I. Ataullakhanov, Alexander Y. Rudensky, Ton N. Schumacher & Dmitriy M. Chudakov. "Antigen receptor repertoire profiling from RNA-seq data." Nature Biotechnology 35, 908–911 (2017)

[3] Tareen A, Kinney JB (2019) Logomaker: beautiful sequence logos in Python. Bioinformatics btz921. bioRxiv doi:10.1101/635029.

[4] M.A. Larkin and others, Clustal W and Clustal X version 2.0, Bioinformatics, Volume 23, Issue 21, November 2007, Pages 2947–2948, https://doi.org/10.1093/bioinformatics/btm404
