A package that provides various ways to analyze NGS data from phage display campaigns
Welcome to ExpoSeq
ExpoSeq is a powerful pipeline for processing and analyzing FASTQ files from high-throughput sequencing of samples from display techniques such as phage display. It uses MiXCR to align and assemble the data, which you can subsequently analyze in multiple plots. The pipeline focuses on analyzing the identity between samples, but it also applies various clustering techniques to analyze the relations between the sequences. In addition, you can add binding data to relate the clusters to affinity.
Installation
Make sure Python is installed on your system. You can then install ExpoSeq from the terminal with:
pip install ExpoSeq
Ensure that you have Python > 3.9 installed.
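If you are unsure which interpreter your terminal uses, a quick check like the following can help. This is only a sketch: the helper name `python_is_supported` is made up for illustration, and reading "Python > 3.9" as "Python 3.10 or newer" is an assumption.

```python
import sys


def python_is_supported(version_info=sys.version_info):
    """Return True if the interpreter is newer than the 3.9 series."""
    # Assumption: "Python > 3.9" is read as Python 3.10 or later.
    return tuple(version_info[:2]) > (3, 9)


if python_is_supported():
    print("Python version looks fine for ExpoSeq.")
else:
    print("Please upgrade Python before running: pip install ExpoSeq")
```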
To get started, please download MiXCR and follow the instructions in its official documentation. You can also use just the test version of ExpoSeq without installing MiXCR.
Importing the Plotting Tool
To access the plotting tool, you will need to import it into your console by running the following command:
from ExpoSeq.pipeline import PlotManager
The PlotManager is the main interface for creating various plots using your FASTQ data. You can create an instance of the PlotManager by running the following command:
plot = PlotManager()
After that, you will be guided automatically through the data processing and preparation. As soon as this has finished, the pipeline will automatically create an analysis of your data and store the plots in:
📦my_experiments
┣ 📂YOUR_EXPERIMENT_NAME
┃ ┣ 📂plots
┃ ┃ ┗ 📂aaSeqCDR3
If you would like to know more about ExpoSeq's generated data structure, please have a look here.
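To see which plot files were written, you can walk the folder shown above. The following is a minimal sketch, assuming that `my_experiments` sits in your current working directory; the helper `list_plot_files` is hypothetical and not part of ExpoSeq.

```python
from pathlib import Path


def list_plot_files(experiment_name, base_dir="my_experiments", region="aaSeqCDR3"):
    """Return the sorted files found in <base_dir>/<experiment_name>/plots/<region>."""
    plot_dir = Path(base_dir) / experiment_name / "plots" / region
    if not plot_dir.is_dir():
        # Nothing has been generated yet, or the location assumption is wrong.
        return []
    return sorted(p for p in plot_dir.iterdir() if p.is_file())


# Replace the placeholder with the name you gave your experiment.
for plot_file in list_plot_files("YOUR_EXPERIMENT_NAME"):
    print(plot_file.name)
```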
Generate a dashboard (optional)
The pipeline tries to generate a dashboard of your report on the initial launch. You can also generate it manually with the following command:
plot.dashboard()
The pipeline will take your data and create an .html file in your default directory. You can open it with any web browser to get a summary of your data and new visualizations in the form of word clouds and much more!
Automatically generate a report of your whole analysis (optional)
The pipeline can generate a complete report automatically, based on the results it has produced. You can have a look at a generated example here by downloading and opening the HTML file. Before you can use this functionality, you need to install Quarto on your system. I used it in VS Code. After you have installed Quarto, you can run:
plot.create_report()
The pipeline will automatically check whether Quarto exists on your device. You can use the Quarto file generated by the pipeline to add your findings and individualize the report using the strengths of Markdown. If you have made changes to the file, you can render it with Quarto by running:
quarto render YOUR/PATH/TO/FILE.qmd
NOTE: Do not change the default structure of the files in the plots folder generated by the pipeline. Otherwise, this functionality might not work. Furthermore, the pipeline should contain the binding data if you have uploaded it and created the default plots with it.
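If you prefer to re-render the report from Python instead of the terminal, the Quarto check and the render command can be sketched as below. This is not how ExpoSeq checks for Quarto internally; the helper names are made up for illustration, and the only external call is the `quarto render` command already shown.

```python
import shutil
import subprocess


def quarto_render_command(qmd_path):
    """Build the argument list for `quarto render <file>`."""
    return ["quarto", "render", str(qmd_path)]


def render_report(qmd_path):
    """Render the report if the quarto executable is on PATH; return True on success."""
    if shutil.which("quarto") is None:
        print("Quarto was not found on PATH; please install it first.")
        return False
    result = subprocess.run(quarto_render_command(qmd_path))
    return result.returncode == 0


# Example call (the path is a placeholder, as in the command above):
# render_report("YOUR/PATH/TO/FILE.qmd")
```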
Individualize Plots by using the PlotManager (optional)
If you want to create some plots by yourself, please take a look at the [Jupyter script](ExpoSeq_handsOn.ipynb).
The following gives an insight into the workflow of the pipeline after the initial call. The blue boxes indicate your input, gray boxes are optional inputs, and black and red boxes are processing steps and output, respectively.
If you just want to test the pipeline and see its functions, you can call:
plot = PlotManager(test_version = True)
If you would like details about the inputs and functions of the PlotManager, call:
help(plot)
You can also call for specific plots, for instance:
help(plot.jaccard)
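For orientation, the name `plot.jaccard` suggests the Jaccard index, a common measure of identity between two samples. The sketch below shows the textbook definition on toy data; it is not ExpoSeq's implementation, so please check `help(plot.jaccard)` for what the pipeline actually computes. The function name and the sequences are made up for illustration.

```python
def jaccard_similarity(sample_a, sample_b):
    """Jaccard index: shared unique sequences divided by all unique sequences."""
    set_a, set_b = set(sample_a), set(sample_b)
    union = set_a | set_b
    if not union:
        return 0.0  # convention chosen here for two empty samples
    return len(set_a & set_b) / len(union)


# Toy CDR3 amino acid sequences, made up for illustration.
library_1 = ["AIEAAAC", "AEMNW", "PEICEES"]
library_2 = ["AEMNW", "PEICEES", "CARDYW"]
print(jaccard_similarity(library_1, library_2))  # 2 shared / 4 unique = 0.5
```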
Upload binding data (optional)
If you have conducted DELFIA or other techniques to obtain binding data for certain sequences (usually Sanger-sequenced), you can upload these in a certain format and use them for clustering to potentially find other suitable sequences with high binding. The table has to have the following format and can be created in Excel.
aaSeqCDR3 | Antigen 1 | Antigen 2 | Antigen 3 |
---|---|---|---|
AIEAAAC | 10000 | 30294 | 0 |
AEMNW | 1000 | 0 | 0 |
PEICEES | 0 | 1929 | 100000 |
You can upload the table as a CSV or XLSX file, but make sure that the first column's name is aaSeqCDR3 and that it is in row 1. Besides the given example, you can have a look at an example file. If you decide to work with that file, make sure to delete the first column, which contains the row number.
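Before uploading, you may want to check that your file follows this layout. The sketch below does so for a comma-separated file using only the standard library. The helper `check_binding_table` is hypothetical, and both the comma delimiter and the requirement that binding values are numeric are assumptions, not documented ExpoSeq behavior.

```python
import csv
import io


def check_binding_table(csv_text):
    """Return (header, rows) if the table matches the layout described above."""
    rows = list(csv.reader(io.StringIO(csv_text)))
    if not rows or not rows[0] or rows[0][0] != "aaSeqCDR3":
        raise ValueError("The first column in row 1 must be named aaSeqCDR3.")
    header, body = rows[0], rows[1:]
    for row in body:
        if len(row) != len(header):
            raise ValueError(f"Row {row} does not have {len(header)} columns.")
        for value in row[1:]:
            float(value)  # raises ValueError if a binding value is not numeric
    return header, body


# The example table from above, written as CSV text.
example = """aaSeqCDR3,Antigen 1,Antigen 2,Antigen 3
AIEAAAC,10000,30294,0
AEMNW,1000,0,0
PEICEES,0,1929,100000
"""
header, body = check_binding_table(example)
print(header)
print(len(body), "sequences")
```

A file that still contains a leading row-number column would fail the first check, which is the case the paragraph above warns about.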
If you have prepared your data you can upload it with:
plot.add_binding_data()
Note: If you decide to add more binding data to your analysis, you can use the same command and choose the new file with the file chooser; it will be added to the existing data. This can also be useful if you cannot manage to merge multiple antigens on the sequences in Excel: you can then upload the binding data for each antigen separately.
Data processing on a Cluster (optional)
First, clone the repository with the processing scripts into your working directory:
git clone https://github.com/nilshof01/ExpoSeq
I have prepared an example job script for working on an LSF cluster. You can have a look at it with:
cd ExpoSeq/bash_processing
nano example_LSF_cluster.sh
To run your script interactively you can call:
python ~/ExpoSeq/bash_processing/mixcr_cl.py $PATH_TO_MIXCR $YOUR_EXPERIMENT_NAME $PATH_TO_FORWARD_FILES
NOTE: You need to have MiXCR installed in your working directory to be able to start the processing. To use multithreading or increase the RAM allocation, have a look at the following parameters you can define:
--path_to_mixcr
: The file path to the mixcr.jar file.
--experiment_name
: A string which is the name of your experiment.
--path_to_forward
: The directory of the FASTQ files with the forward reads.
--path_to_backward
: The directory of the FASTQ files with the backward reads.
--threads
: The number of threads you would like to use for the processing.
--method
: The MiXCR method you would like to use to align and assemble the reads. Default is milab-human-tcr-dna-multiplex-cdr3.
--java_heap_size
: Memory for processing in MB.
NOTE: If you only want to process forward reads, you do not need to add the path to the directory with the backward reads. Furthermore, if you would like to analyze paired-end sequencing data, please make sure that the forward and backward FASTQ files are in separate folders.
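If you launch the processing from a Python job script, the call can be assembled as sketched below. This rests on an assumption: the example command above passes the first three values positionally, while the parameter list names them as flags, so the sketch keeps the positional form and appends the optional settings as `--flag value` pairs. Please check `mixcr_cl.py` itself for the form it accepts. The helper `build_mixcr_cl_command` and all example values are hypothetical.

```python
import os
import shlex


def build_mixcr_cl_command(path_to_mixcr, experiment_name, path_to_forward,
                           path_to_backward=None, threads=None, java_heap_size=None,
                           script="~/ExpoSeq/bash_processing/mixcr_cl.py"):
    """Assemble the argument list for mixcr_cl.py without running it."""
    command = ["python", os.path.expanduser(script),
               path_to_mixcr, experiment_name, path_to_forward]
    # Assumption: optional settings follow the positional arguments as flag/value pairs.
    optional = {
        "--path_to_backward": path_to_backward,  # leave as None for forward reads only
        "--threads": threads,
        "--java_heap_size": java_heap_size,  # in MB
    }
    for flag, value in optional.items():
        if value is not None:
            command += [flag, str(value)]
    return command


# Hypothetical values for illustration only.
command = build_mixcr_cl_command("mixcr.jar", "my_experiment", "fastq/forward",
                                 threads=8, java_heap_size=4000)
print(shlex.join(command))
```

Building the command as a list lets you pass it directly to `subprocess.run(command)` without worrying about quoting paths that contain spaces.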
After the processing has finished, you can import the folder with the processed files into the PlotManager. You can find the corresponding folder under
📦my_experiments
┣ 📂YOUR_EXPERIMENT_NAME
As soon as you call the PlotManager, you need to press 2 to upload the directory with the files.
Talk with your data (optional)
I implemented pandasai in the pipeline to give you the option to investigate your data and even create plots without any further knowledge of programming in Python. You can call the following command to start the chat with pandasai:
plot.chat()
The pipeline will prompt you to enter an API key. This is necessary, since you need to connect the engine to a large language model. You can obtain the API key from OpenAI.
NOTE: All conversations with large language models, such as GPT-3, are not only highly energy demanding and thus costly but also have a high environmental footprint. So, please do your research in advance and use this option with the environment in mind.
References
[1] Dmitriy A. Bolotin, Stanislav Poslavsky, Igor Mitrophanov, Mikhail Shugay, Ilgar Z. Mamedov, Ekaterina V. Putintseva, and Dmitriy M. Chudakov. "MiXCR: software for comprehensive adaptive immunity profiling." Nature methods 12, no. 5 (2015): 380-381.
[2] Dmitriy A. Bolotin, Stanislav Poslavsky, Alexey N. Davydov, Felix E. Frenkel, Lorenzo Fanchi, Olga I. Zolotareva, Saskia Hemmers, Ekaterina V. Putintseva, Anna S. Obraztsova, Mikhail Shugay, Ravshan I. Ataullakhanov, Alexander Y. Rudensky, Ton N. Schumacher & Dmitriy M. Chudakov. "Antigen receptor repertoire profiling from RNA-seq data." Nature Biotechnology 35, 908–911 (2017)
[3] Tareen A, Kinney JB (2019) Logomaker: beautiful sequence logos in Python. Bioinformatics btz921. bioRxiv doi:10.1101/635029.
[4] M.A. Larkin and others, Clustal W and Clustal X version 2.0, Bioinformatics, Volume 23, Issue 21, November 2007, Pages 2947–2948, https://doi.org/10.1093/bioinformatics/btm404