NucleoSeeker - Precision filtering of RNA databases to curate high-quality datasets
Note: NucleoSeeker currently supports Unix-based systems, including macOS.
Dependencies
NucleoSeeker relies on a few external command-line tools. Before running the software, ensure these tools are properly installed on your system.
Dependency | Minimum Version | Installation Guide |
---|---|---|
Clustal Omega | 1.2.4 | Clustal Omega Setup Instructions |
Infernal | 1.1.5 | Infernal Setup Instructions |
Emboss | 6.6.0 | Emboss Setup Instructions (Optional) |
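Before running NucleoSeeker, it can help to confirm that the required tools are actually on your PATH. The following is a minimal sketch, not part of NucleoSeeker itself; it assumes the executables are named clustalo (Clustal Omega) and cmscan / cmpress (Infernal), and the helper name find_missing_tools is hypothetical.

```python
import shutil

# Assumed executable names: clustalo (Clustal Omega), cmscan and cmpress (Infernal).
REQUIRED_TOOLS = ("clustalo", "cmscan", "cmpress")


def find_missing_tools(tools=REQUIRED_TOOLS):
    """Return the names of the tools that cannot be found on PATH."""
    return [name for name in tools if shutil.which(name) is None]


if __name__ == "__main__":
    missing = find_missing_tools()
    if missing:
        print("Missing tools:", ", ".join(missing))
    else:
        print("All required tools were found on PATH.")
```

If a tool was installed with a custom --prefix, its bin directory has to be added to PATH before this check will find it.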
Quick Installation Instructions
Install Clustal Omega
- Clustal Omega version supported: 1.2.4
wget http://www.clustal.org/omega/clustal-omega-1.2.4.tar.gz
tar zxf clustal-omega-1.2.4.tar.gz
cd clustal-omega-1.2.4
./configure --prefix /your/install/path
make
make check # optional: run automated tests
make install # install the Clustal Omega binary into the chosen prefix
# or use this
sudo apt-get install clustalo
Install Emboss (Optional)
- NOTE: Emboss is very slow. Unless you are experimenting, we don't recommend using it; Clustal Omega should be sufficient for most use cases.
- Emboss version supported: 6.6.0
Install Infernal
- Infernal version supported: 1.1.5
wget http://eddylab.org/software/infernal/infernal.tar.gz
tar zxf infernal.tar.gz
cd infernal-1.1.5
./configure --prefix /your/install/path
make
make check # optional: run automated tests
make install # optional: install Infernal programs, man pages
# or use this
sudo apt-get install infernal infernal-doc
Get Rfam.cm file ready
- To use this tool, you need to provide the Rfam covariance model, which is available for download at https://ftp.ebi.ac.uk/pub/databases/Rfam/CURRENT/Rfam.cm.gz. The model file must also be compressed and indexed with the cmpress command from the Infernal tool (installed above). If you don't have the file yet, use the commands below:
mkdir -p rfam
cd rfam
wget https://ftp.ebi.ac.uk/pub/databases/Rfam/CURRENT/Rfam.cm.gz -O Rfam.cm.gz
gunzip Rfam.cm.gz
cmpress Rfam.cm
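To confirm that cmpress ran successfully, you can check for the index files it writes next to the model. This is a minimal sketch under the assumption that cmpress produces the four files Rfam.cm.i1f, Rfam.cm.i1i, Rfam.cm.i1m and Rfam.cm.i1p; the helper name missing_cmpress_files is hypothetical and not part of NucleoSeeker or Infernal.

```python
from pathlib import Path

# Suffixes of the index files that cmpress is expected to write beside the model.
CMPRESS_SUFFIXES = (".i1f", ".i1i", ".i1m", ".i1p")


def missing_cmpress_files(cm_path):
    """Return the names of the model/index files that do not exist yet."""
    cm_path = Path(cm_path)
    expected = [cm_path] + [Path(str(cm_path) + suffix) for suffix in CMPRESS_SUFFIXES]
    return [path.name for path in expected if not path.is_file()]


if __name__ == "__main__":
    missing = missing_cmpress_files("rfam/Rfam.cm")
    print("Missing:", missing if missing else "nothing, the model looks ready")
```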
Generating new dataset
After the preliminary steps, install NucleoSeeker as follows:
# We recommend setting up a virtual env when using this tool
python3 -m venv nucleoseeker_env
source nucleoseeker_env/bin/activate
pip install nucleoseeker
After you have prepared the environment, you can generate a dataset with the following commands:
export DATA_PATH=/your/desired/path/to/save/the/dataset
nucleoseeker \
--dataset_name test_dataset \
--rfam_cm_path your/rfam/path \
--exptl_method "X-RAY DIFFRACTION" \
--resolution 3.6 \
--year_range 2019 \
--dend 500 \
--save 1
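If you prefer to launch the run from a Python script, the same command can be assembled as an argument list, which avoids shell quoting problems with values such as "X-RAY DIFFRACTION". This is only a sketch: it uses exactly the flags shown in the command above, the function name build_nucleoseeker_command is hypothetical, and it only builds and prints the command rather than running it.

```python
import shlex


def build_nucleoseeker_command(dataset_name, rfam_cm_path,
                               exptl_method="X-RAY DIFFRACTION",
                               resolution=3.6, year_range=2019,
                               dend=500, save=1):
    """Build the nucleoseeker command line as a list of arguments."""
    return [
        "nucleoseeker",
        "--dataset_name", str(dataset_name),
        "--rfam_cm_path", str(rfam_cm_path),
        "--exptl_method", str(exptl_method),
        "--resolution", str(resolution),
        "--year_range", str(year_range),
        "--dend", str(dend),
        "--save", str(save),
    ]


cmd = build_nucleoseeker_command("test_dataset", "your/rfam/path")
# shlex.join shows the command as it would be typed in a shell, with quoting applied.
print(shlex.join(cmd))
```

To execute it, pass cmd to subprocess.run(cmd, check=True) with DATA_PATH set in the environment.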
After running the nucleoseeker command, a directory named test_dataset will be created in the DATA_PATH directory with the following structure:
├── DATA_PATH
    ├── pdb_files
    ├── test_dataset
        ├── files
        ├── sequences
        ├── clean_tblout.tblout
        ├── cmscan.out
        ├── combined.fasta
        ├── fam_pdb_chain.csv
        ├── final.fasta
        ├── raw_experimental_RNA_0_500.csv
        ├── sequence_identity_mat_clustal.csv
        ├── tblout.tblout
- raw_experimental_RNA_0_500.csv: Data for first 500 results from the PDB database.
- combined.fasta: Sequences used in sequence identity calculation by Clustal Omega and Emboss, obtained after applying StructureLevelFilter and PDBFilter on the raw data.
- sequence_identity_mat_clustal.csv: Sequence identity matrix obtained from Clustal Omega and Emboss tools.
- final.fasta: Final sequences in fasta format; the output if family analysis is not required.
- cmscan.out, tblout.tblout, clean_tblout.tblout: Output files of the Infernal tool.
- fam_pdb_chain.csv: Mapping of family and PDB chain, obtained after family search by Infernal; the final output for family analysis.
- test_dataset/files: Directory containing dataframes and lists for structures at each filter level.
- test_dataset/sequences: Directory containing sequences for all final structures in individual fasta files.
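As a quick sanity check on the outputs listed above, final.fasta can be loaded with a few lines of plain Python. The sketch below is a generic FASTA reader, not NucleoSeeker's own API; the function name read_fasta is hypothetical, and the exact location of final.fasta inside DATA_PATH is an assumption you should adjust to your run.

```python
def read_fasta(path):
    """Parse a FASTA file into a dict mapping each header to its sequence."""
    records = {}
    header = None
    with open(path) as handle:
        for line in handle:
            line = line.strip()
            if not line:
                continue  # skip blank lines
            if line.startswith(">"):
                header = line[1:]
                records[header] = []  # a repeated header replaces the earlier record
            elif header is not None:
                records[header].append(line)
    # Join wrapped sequence lines into a single string per record.
    return {name: "".join(parts) for name, parts in records.items()}


# Example usage (path is an assumption):
# sequences = read_fasta("/your/desired/path/test_dataset/final.fasta")
# print(len(sequences), "sequences in the final dataset")
```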
For some simple examples, see the Jupyter notebook in the examples directory of the GitHub repository.