Precision filtering of RNA databases to curate high-quality datasets

NucleoSeeker - Precision filtering of RNA databases to curate high-quality datasets

Note: NucleoSeeker currently supports Unix-based systems, including macOS.

Dependencies

NucleoSeeker relies on a few external command-line tools. Before running the software, ensure these tools are properly installed on your system.

Dependency      Minimum Version   Installation Guide
Clustal Omega   1.2.4             Clustal Omega Setup Instructions
Infernal        1.1.5             Infernal Setup Instructions
Emboss          6.6.0             Emboss Setup Instructions (Optional)
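
Before going further, it can help to confirm that the required tools are actually reachable from your shell. The following is a minimal sketch, not part of NucleoSeeker itself. The executable names are assumptions: clustalo is the usual name of the Clustal Omega binary, and cmscan and cmpress ship with Infernal. The helper name missing_tools is hypothetical.

```python
import shutil

# Executable names are assumptions: "clustalo" is the usual Clustal Omega
# binary, and "cmscan"/"cmpress" are programs shipped with Infernal.
REQUIRED_TOOLS = ["clustalo", "cmscan", "cmpress"]


def missing_tools(tools=REQUIRED_TOOLS):
    """Return the entries of `tools` that cannot be found on PATH."""
    return [tool for tool in tools if shutil.which(tool) is None]


if __name__ == "__main__":
    missing = missing_tools()
    if missing:
        print("Missing from PATH:", ", ".join(missing))
    else:
        print("All required tools were found.")
```

If a tool was installed under a custom --prefix, its bin directory has to be added to PATH before this check will find it.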

Quick Installation Instructions

Install Clustal Omega

  • Supported Clustal Omega version: 1.2.4
  wget http://www.clustal.org/omega/clustal-omega-1.2.4.tar.gz
  tar zxf clustal-omega-1.2.4.tar.gz
  cd clustal-omega-1.2.4
  ./configure --prefix /your/install/path
  make
  make check                 # optional: run automated tests
  make install               # optional: install Clustal Omega programs

  # or use this

  sudo apt-get install clustalo

Install Emboss (Optional)

  • NOTE: Emboss is very slow. We don't recommend using it unless you are experimenting; Clustal Omega should be sufficient for most use cases.
  • Supported Emboss version: 6.6.0

Install Infernal

  • Supported Infernal version: 1.1.5
  wget http://eddylab.org/software/infernal/infernal.tar.gz
  tar zxf infernal.tar.gz
  cd infernal-1.1.5
  ./configure --prefix /your/install/path
  make
  make check                 # optional: run automated tests
  make install               # optional: install Infernal programs, man pages

  # or use this

  sudo apt-get install infernal infernal-doc

Get Rfam.cm file ready

  • To use this tool, you need the Rfam covariance model file, which is available for download at https://ftp.ebi.ac.uk/pub/databases/Rfam/CURRENT/Rfam.cm.gz. The file must also be prepared for searching with the cmpress command from Infernal (installed above). If you don't have it yet, use the commands below:
  mkdir -p rfam
  cd rfam
  wget https://ftp.ebi.ac.uk/pub/databases/Rfam/CURRENT/Rfam.cm.gz -O Rfam.cm.gz
  gunzip Rfam.cm.gz
  cmpress Rfam.cm
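
To confirm that cmpress ran successfully, you can check for the index files it writes next to Rfam.cm. This is a minimal sketch, assuming the four suffixes used by Infernal 1.1 (.i1m, .i1i, .i1f, .i1p). The helper name missing_cmpress_files is hypothetical and not part of NucleoSeeker.

```python
from pathlib import Path

# Index files that Infernal's cmpress writes next to the .cm file
# (suffixes as used by Infernal 1.1).
CMPRESS_SUFFIXES = (".i1m", ".i1i", ".i1f", ".i1p")


def missing_cmpress_files(cm_path):
    """Return the names of expected cmpress index files that do not exist."""
    cm_path = Path(cm_path)
    expected = [Path(str(cm_path) + suffix) for suffix in CMPRESS_SUFFIXES]
    return [path.name for path in expected if not path.is_file()]
```

For example, missing_cmpress_files("rfam/Rfam.cm") should return an empty list once the model has been pressed.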

Generating new dataset

Once the dependencies and the Rfam model are in place, install NucleoSeeker as follows:

# We recommend setting up a virtual env when using this tool

python3 -m venv nucleoseeker_env
source nucleoseeker_env/bin/activate
pip install nucleoseeker

After you have prepared the environment, you can generate a dataset with the following command:

export DATA_PATH=/your/desired/path/to/save/the/dataset
nucleoseeker \
        --dataset_name test_dataset \
        --rfam_cm_path your/rfam/path \
        --exptl_method "X-RAY DIFFRACTION" \
        --resolution 3.6 \
        --year_range 2019 \
        --dend 500 \
        --save 1

After running this command, a directory named test_dataset will be created in the DATA_PATH directory with the following structure:

      DATA_PATH
      ├── pdb_files
      └── test_dataset
          ├── files
          ├── sequences
          ├── clean_tblout.tblout
          ├── cmscan.out
          ├── combined.fasta
          ├── fam_pdb_chain.csv
          ├── final.fasta
          ├── raw_experimental_RNA_0_500.csv
          ├── sequence_identity_mat_clustal.csv
          └── tblout.tblout

  • raw_experimental_RNA_0_500.csv: Data for the first 500 results from the PDB database.

  • combined.fasta: Sequences used in sequence identity calculation by Clustal Omega and Emboss, obtained after applying StructureLevelFilter and PDBFilter on the raw data.

  • sequence_identity_mat_clustal.csv: Sequence identity matrix obtained from Clustal Omega and Emboss tools.

  • final.fasta: Final sequences in fasta format; the output if family analysis is not required.

  • cmscan.out, tblout.tblout, clean_tblout.tblout: Output files of the Infernal tool.

  • fam_pdb_chain.csv: Mapping of family and PDB chain, obtained after family search by Infernal; the final output for family analysis.

  • test_dataset/files: Directory containing dataframes and lists for structures at each filter level.

  • test_dataset/sequences: Directory containing sequences for all final structures in individual fasta files.
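
After a run, a quick way to see whether the pipeline finished is to check that the entries listed above exist. The sketch below is an illustration under assumptions, not part of NucleoSeeker. It checks only the fixed names from the listing and skips the raw_experimental_RNA_*.csv file, whose name appears to depend on the query options. The helper name missing_outputs is hypothetical.

```python
from pathlib import Path

# Fixed entry names taken from the directory listing. The
# raw_experimental_RNA_*.csv file is left out because its name
# appears to depend on the query options (an assumption).
EXPECTED_ENTRIES = (
    "files",
    "sequences",
    "clean_tblout.tblout",
    "cmscan.out",
    "combined.fasta",
    "fam_pdb_chain.csv",
    "final.fasta",
    "sequence_identity_mat_clustal.csv",
    "tblout.tblout",
)


def missing_outputs(data_path, dataset_name):
    """Return expected entries that are absent from data_path/dataset_name."""
    dataset_dir = Path(data_path) / dataset_name
    return [name for name in EXPECTED_ENTRIES if not (dataset_dir / name).exists()]
```

For the command shown earlier, missing_outputs(os.environ["DATA_PATH"], "test_dataset") (after import os) would list anything that was not produced. Some files may legitimately be absent if a step such as family analysis was not run.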

For some simple examples, see the Jupyter notebook in the examples directory of the GitHub repository.
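
As a small illustration of working with the results in Python, final.fasta can be loaded with a few lines of standard-library code. This is a minimal sketch that assumes an ordinary FASTA layout (a ">" header line followed by one or more sequence lines). The function name read_fasta is hypothetical and is not taken from the NucleoSeeker API.

```python
def read_fasta(path):
    """Parse a FASTA file into a dict mapping header -> sequence.

    If the same header occurs twice, the later record replaces the earlier one.
    """
    records = {}
    header = None
    with open(path) as handle:
        for line in handle:
            line = line.strip()
            if not line:
                continue  # skip blank lines
            if line.startswith(">"):
                header = line[1:]
                records[header] = []
            elif header is not None:
                # Sequence lines may be wrapped, so collect the pieces.
                records[header].append(line)
    return {name: "".join(parts) for name, parts in records.items()}
```

Calling read_fasta on DATA_PATH/test_dataset/final.fasta returns one entry per final sequence, so len() of the result gives the dataset size.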
