NucleoSeeker - Precision filtering of RNA databases to curate high-quality datasets

Note: NucleoSeeker currently supports Unix-based systems, including macOS.

Dependencies

NucleoSeeker relies on a few external command-line tools. Before running the software, ensure these tools are properly installed on your system.

Dependency      Minimum Version   Installation Guide
Clustal Omega   1.2.4             Clustal Omega Setup Instructions
Infernal        1.1.5             Infernal Setup Instructions
Emboss          6.6.0             Emboss Setup Instructions (Optional)
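
Before going further, it can help to confirm that the required tools are visible on your PATH. The following is a minimal sketch, not part of NucleoSeeker: have_tool is a hypothetical helper, and the executable names clustalo, cmscan, and cmpress are the ones shipped by Clustal Omega and Infernal; exactly which of them NucleoSeeker calls is an assumption.

```shell
# Report whether a command is available on PATH (exit status 0 if found).
have_tool() {
    command -v "$1" >/dev/null 2>&1
}

# clustalo comes with Clustal Omega; cmscan and cmpress come with Infernal.
# Which executables NucleoSeeker actually invokes is an assumption here.
for tool in clustalo cmscan cmpress; do
    if have_tool "$tool"; then
        echo "found:   $tool"
    else
        echo "missing: $tool"
    fi
done
```

Any tool reported as missing needs to be installed, or its install directory added to PATH, using the instructions in the next section.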

Quick Installation Instructions

Install Clustal Omega

  • Instructions for setting up Clustal Omega can be found here.
  • Supported Clustal Omega version: 1.2.4
  wget http://www.clustal.org/omega/clustal-omega-1.2.4.tar.gz
  tar zxf clustal-omega-1.2.4.tar.gz
  cd clustal-omega-1.2.4
  ./configure --prefix /your/install/path
  make
  make check                 # optional: run automated tests
  make install               # optional: install the Clustal Omega programs

  # or, on Debian/Ubuntu, install the packaged version

  sudo apt-get install clustalo

Install Emboss (Optional)

  • NOTE: Emboss is very slow. Unless you are experimenting, we don't recommend using it; Clustal Omega should be sufficient for most use cases.
  • For setting up Emboss, please read here.
  • Supported Emboss version: 6.6.0

Install Infernal

  • To set up Infernal, follow the instructions here.
  • Supported Infernal version: 1.1.5
  wget http://eddylab.org/software/infernal/infernal.tar.gz
  tar zxf infernal.tar.gz
  cd infernal-1.1.5
  ./configure --prefix /your/install/path
  make
  make check                 # optional: run automated tests
  make install               # optional: install Infernal programs, man pages

  # or, on Debian/Ubuntu, install the packaged version

  sudo apt-get install infernal infernal-doc
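
Packaged versions can lag behind the minimum versions listed in the Dependencies table, so it may be worth comparing them. The sketch below is an illustration only: version_ge is a hypothetical helper, and it relies on sort -V (version sort), which GNU coreutils provides and which may not exist on older systems.

```shell
# Succeed when version $1 is greater than or equal to version $2.
# Uses `sort -V` to order the two strings as versions, not as plain text.
version_ge() {
    lowest=$(printf '%s\n%s\n' "$1" "$2" | sort -V | head -n 1)
    [ "$lowest" = "$2" ]
}

# Example with a literal version string. In practice the first argument
# would be taken from the tool's own version output.
if version_ge "1.2.4" "1.2.4"; then
    echo "version OK"
else
    echo "version too old"
fi
```

The function compares only the two strings it is given; extracting the version string from each tool's output is left to the caller.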

Get Rfam.cm file ready

  • To use this tool, you need the Rfam covariance model file, which is available for download at https://ftp.ebi.ac.uk/pub/databases/Rfam/CURRENT/Rfam.cm.gz. The file must also be indexed with the cmpress command from Infernal (installed above). If you don't have it yet, use the commands below:
  cd nucleoseeker
  mkdir -p rfam
  cd rfam                    # download into the rfam directory created above
  wget https://ftp.ebi.ac.uk/pub/databases/Rfam/CURRENT/Rfam.cm.gz -O Rfam.cm.gz
  gunzip Rfam.cm.gz
  cmpress Rfam.cm
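
cmpress writes four binary index files next to the covariance model (with the suffixes .i1f, .i1i, .i1m, and .i1p). A quick way to confirm that the step completed is to look for them. This is a minimal sketch; cm_is_pressed is a hypothetical helper and the path rfam/Rfam.cm is an assumption about where you placed the file.

```shell
# Succeed when all four index files written by cmpress sit next to the CM file.
cm_is_pressed() {
    for ext in i1f i1i i1m i1p; do
        [ -f "$1.$ext" ] || return 1
    done
}

cm_file="rfam/Rfam.cm"   # hypothetical location; adjust to your setup
if cm_is_pressed "$cm_file"; then
    echo "$cm_file is indexed"
else
    echo "$cm_file is not indexed yet; run: cmpress $cm_file"
fi
```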
  

Generating new dataset

After the preliminary steps, install NucleoSeeker as follows:

# We recommend setting up a virtual environment when using this tool

python3 -m venv nucleoseeker_env
source nucleoseeker_env/bin/activate
pip install nucleoseeker

After you have prepared the environment, you can generate a dataset with the following command:

export DATA_PATH=/your/desired/path/to/save/the/dataset
nucleoseeker \
        --dataset_name test_dataset \
        --rfam_cm_path your/rfam/path \
        --exptl_method "X-RAY DIFFRACTION" \
        --resolution 3.6 \
        --year_range 2019 \
        --dend 500 \
        --save 1

After running this command, a directory named test_dataset will be created in the DATA_PATH directory with the following structure:

      DATA_PATH
      ├── pdb_files
      └── test_dataset
          ├── files
          ├── sequences
          ├── clean_tblout.tblout
          ├── cmscan.out
          ├── combined.fasta
          ├── fam_pdb_chain.csv
          ├── final.fasta
          ├── raw_experimental_RNA_0_500.csv
          ├── sequence_identity_mat_clustal.csv
          └── tblout.tblout

  • raw_experimental_RNA_0_500.csv: Data for the first 500 results from the PDB database.

  • combined.fasta: Sequences used in sequence identity calculation by Clustal Omega and Emboss, obtained after applying StructureLevelFilter and PDBFilter on the raw data.

  • sequence_identity_mat_clustal.csv: Sequence identity matrix obtained from Clustal Omega and Emboss tools.

  • final.fasta: Final sequences in fasta format; the output if family analysis is not required.

  • cmscan.out, tblout.tblout, clean_tblout.tblout: Output files of the Infernal tool.

  • fam_pdb_chain.csv: Mapping of family and PDB chain, obtained after family search by Infernal; the final output for family analysis.

  • test_dataset/files: Directory containing dataframes and lists for structures at each filter level.

  • test_dataset/sequences: Directory containing sequences for all final structures in individual fasta files.
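
After a run, the output files described above can be checked with a few lines of shell. The sketch below is only an illustration: missing_outputs and count_fasta_records are hypothetical helpers, the file list is copied from the descriptions above, and the assumption that the dataset lives in $DATA_PATH/test_dataset follows the example command.

```shell
# Print the name of every expected output file that is absent from the
# dataset directory given as $1.
missing_outputs() {
    for name in combined.fasta final.fasta sequence_identity_mat_clustal.csv \
                cmscan.out tblout.tblout clean_tblout.tblout fam_pdb_chain.csv; do
        [ -f "$1/$name" ] || echo "$name"
    done
}

# Count FASTA records by counting header lines, which start with '>'.
count_fasta_records() {
    grep -c '^>' "$1"
}

dataset_dir="${DATA_PATH:-.}/test_dataset"   # assumed location of the dataset
if [ -d "$dataset_dir" ]; then
    missing_outputs "$dataset_dir"
    if [ -f "$dataset_dir/final.fasta" ]; then
        count_fasta_records "$dataset_dir/final.fasta"
    fi
fi
```

If family analysis was not requested, the Infernal outputs and fam_pdb_chain.csv may legitimately be reported as missing.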

For some simple examples, please take a look at this notebook.
