NucleoSeeker - Precision filtering of RNA databases to curate high-quality datasets

Note: NucleoSeeker currently supports Unix-based systems, including macOS.

Dependencies

NucleoSeeker relies on a few external command-line tools. Before running the software, ensure these tools are properly installed on your system.

Dependency      Minimum Version   Installation Guide
Clustal Omega   1.2.4             Clustal Omega Setup Instructions
Infernal        1.1.5             Infernal Setup Instructions
Emboss          6.6.0             Emboss Setup Instructions (Optional)
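
Before running NucleoSeeker it can help to confirm that the external tools are reachable on your PATH. The snippet below is a minimal sketch, not part of NucleoSeeker itself. The executable names are assumptions: clustalo for Clustal Omega, cmscan and cmpress for Infernal, and needle for Emboss (optional). The helper names find_tools and missing_tools are hypothetical.

```python
import shutil

# Assumed executable names; adjust them if your installation differs.
REQUIRED_TOOLS = ("clustalo", "cmscan", "cmpress")  # Clustal Omega, Infernal
OPTIONAL_TOOLS = ("needle",)                        # Emboss (optional)


def find_tools(names):
    """Map each executable name to its full path, or None if it is not on PATH."""
    return {name: shutil.which(name) for name in names}


def missing_tools(names):
    """Return the executable names that could not be found on PATH."""
    return [name for name, path in find_tools(names).items() if path is None]
```

Calling missing_tools(REQUIRED_TOOLS) returns an empty list when every required tool was found. Note that this only checks presence, not the minimum versions listed above.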

Quick Installation Instructions

Install Clustal Omega

  • Supported Clustal Omega version: 1.2.4
  wget http://www.clustal.org/omega/clustal-omega-1.2.4.tar.gz
  tar zxf clustal-omega-1.2.4.tar.gz
  cd clustal-omega-1.2.4
  ./configure --prefix /your/install/path
  make
  make check                 # optional: run automated tests
  make install               # optional: install Clustal Omega programs

  # or, on Debian/Ubuntu, install it with the package manager

  sudo apt-get install clustalo

Install Emboss (Optional)

  • NOTE: Emboss is very slow. Unless you are experimenting, we don't recommend using it; Clustal Omega should be sufficient for most use cases.
  • Supported Emboss version: 6.6.0

Install Infernal

  • Supported Infernal version: 1.1.5
  wget http://eddylab.org/software/infernal/infernal.tar.gz
  tar zxf infernal.tar.gz
  cd infernal-1.1.5
  ./configure --prefix /your/install/path
  make
  make check                 # optional: run automated tests
  make install               # optional: install Infernal programs, man pages

  # or, on Debian/Ubuntu, install it with the package manager

  sudo apt-get install infernal infernal-doc

Prepare the Rfam.cm file

  • To use this tool, you need to provide the Rfam covariance model file, which is available for download at https://ftp.ebi.ac.uk/pub/databases/Rfam/CURRENT/Rfam.cm.gz. The file must also be indexed with the cmpress command from Infernal (see above). If you don't have it yet, use the commands below:
  mkdir -p rfam
  cd rfam
  wget https://ftp.ebi.ac.uk/pub/databases/Rfam/CURRENT/Rfam.cm.gz -O Rfam.cm.gz
  gunzip Rfam.cm.gz
  cmpress Rfam.cm
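
To check that cmpress finished successfully, you can look for the binary index files it writes next to the covariance model. The sketch below assumes the four suffixes Infernal uses for a pressed CM database (.i1f, .i1i, .i1m, .i1p). The function name missing_cmpress_indexes is hypothetical and not part of NucleoSeeker.

```python
from pathlib import Path

# Index files that cmpress writes alongside the .cm file.
CMPRESS_SUFFIXES = (".i1f", ".i1i", ".i1m", ".i1p")


def missing_cmpress_indexes(cm_path):
    """Return the names of cmpress index files that are absent for cm_path."""
    cm_path = Path(cm_path)
    return [
        cm_path.name + suffix
        for suffix in CMPRESS_SUFFIXES
        if not Path(str(cm_path) + suffix).exists()
    ]
```

For example, missing_cmpress_indexes("rfam/Rfam.cm") returns an empty list once all four index files are present.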

Generating new dataset

After completing the preliminary steps, install NucleoSeeker as follows:

# We recommend setting up a virtual env when using this tool

python3 -m venv nucleoseeker_env
source nucleoseeker_env/bin/activate
pip install nucleoseeker
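
If you want to confirm that the package landed in the active virtual environment, the standard library can report the installed version. This is a small sketch using importlib.metadata (Python 3.8+). The function name installed_version is hypothetical.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_version(package="nucleoseeker"):
    """Return the installed version string of a package, or None if absent."""
    try:
        return version(package)
    except PackageNotFoundError:
        return None
```

A return value of None from installed_version() suggests the virtual environment is not activated or the install step did not complete.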

Once the environment is ready, you can generate a dataset with the following command:

export DATA_PATH=/your/desired/path/to/save/the/dataset
nucleoseeker \
        --dataset_name test_dataset \
        --rfam_cm_path your/rfam/path \
        --exptl_method "X-RAY DIFFRACTION" \
        --resolution 3.6 \
        --year_range 2019 \
        --dend 500 \
        --save 1
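
If you prefer to launch the run from a Python script, for example to loop over several settings, the same command can be assembled as an argument list, which avoids shell line-continuation mistakes. This is only a sketch: it uses exactly the flags shown above, with those values as defaults, and the helper names build_command and run_nucleoseeker are hypothetical rather than part of the NucleoSeeker API.

```python
import os
import subprocess


def build_command(dataset_name, rfam_cm_path,
                  exptl_method="X-RAY DIFFRACTION", resolution=3.6,
                  year_range=2019, dend=500, save=1):
    """Assemble the nucleoseeker command line as a list of strings."""
    return [
        "nucleoseeker",
        "--dataset_name", str(dataset_name),
        "--rfam_cm_path", str(rfam_cm_path),
        "--exptl_method", str(exptl_method),
        "--resolution", str(resolution),
        "--year_range", str(year_range),
        "--dend", str(dend),
        "--save", str(save),
    ]


def run_nucleoseeker(data_path, **options):
    """Run nucleoseeker with DATA_PATH set in the child process environment."""
    env = dict(os.environ, DATA_PATH=str(data_path))
    return subprocess.run(build_command(**options), env=env, check=True)
```

A call such as run_nucleoseeker("/your/desired/path", dataset_name="test_dataset", rfam_cm_path="your/rfam/path") is then equivalent to the shell command above.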

After running this command, a directory named test_dataset will be created in the DATA_PATH directory with the following structure:

      DATA_PATH
      ├── pdb_files
      └── test_dataset
          ├── files
          ├── sequences
          ├── clean_tblout.tblout
          ├── cmscan.out
          ├── combined.fasta
          ├── fam_pdb_chain.csv
          ├── final.fasta
          ├── raw_experimental_RNA_0_500.csv
          ├── sequence_identity_mat_clustal.csv
          └── tblout.tblout

  • raw_experimental_RNA_0_500.csv: Data for first 500 results from the PDB database.

  • combined.fasta: Sequences used in sequence identity calculation by Clustal Omega and Emboss, obtained after applying StructureLevelFilter and PDBFilter on the raw data.

  • sequence_identity_mat_clustal.csv: Sequence identity matrix obtained from Clustal Omega and Emboss tools.

  • final.fasta: Final sequences in fasta format; the output if family analysis is not required.

  • cmscan.out, tblout.tblout, clean_tblout.tblout: Output files of the Infernal tool.

  • fam_pdb_chain.csv: Mapping of family and PDB chain, obtained after family search by Infernal; the final output for family analysis.

  • test_dataset/files: Directory containing dataframes and lists for structures at each filter level.

  • test_dataset/sequences: Directory containing sequences for all final structures in individual fasta files.
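
After a run, the outputs listed above can be checked and loaded with a few lines of Python. The sketch below is an illustration, not NucleoSeeker code: missing_outputs and read_fasta are hypothetical helpers, and the FASTA reader is a plain-text parser that assumes standard ">" header lines. The raw_experimental_RNA_0_500.csv file is left out of the check because its name appears to depend on the run parameters.

```python
from pathlib import Path

# Output files named in the directory listing above.
EXPECTED_FILES = (
    "clean_tblout.tblout",
    "cmscan.out",
    "combined.fasta",
    "fam_pdb_chain.csv",
    "final.fasta",
    "sequence_identity_mat_clustal.csv",
    "tblout.tblout",
)


def missing_outputs(dataset_dir):
    """Return the expected output files that are absent from dataset_dir."""
    dataset_dir = Path(dataset_dir)
    return [name for name in EXPECTED_FILES if not (dataset_dir / name).is_file()]


def read_fasta(path):
    """Parse a FASTA file into a dict mapping each header to its sequence."""
    records = {}
    header = None
    with open(path) as handle:
        for line in handle:
            line = line.strip()
            if not line:
                continue
            if line.startswith(">"):
                header = line[1:]          # drop the leading ">"
                records[header] = []
            elif header is not None:
                records[header].append(line)  # sequences may span several lines
    return {name: "".join(parts) for name, parts in records.items()}
```

For example, read_fasta(Path(DATA_PATH) / "test_dataset" / "final.fasta") gives the final sequences keyed by their FASTA headers, assuming DATA_PATH holds the path exported before the run.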

For some simple examples, please take a look at the Jupyter notebook in the examples directory of the GitHub repository.
