
A package to generate prior gene regulatory networks.


SPONGE - Simple Prior Omics Network GEnerator

The SPONGE package generates human prior gene regulatory networks and protein-protein interaction networks for the involved transcription factors.

General Information

This repository contains the SPONGE package, which generates human prior gene regulatory networks based mainly on data from the JASPAR database. It also uses NCBI to find the human homologs of vertebrate transcription factors, Ensembl to collect all the promoter regions in the human genome, UniProt for symbol matching, and STRING to retrieve protein-protein interactions between transcription factors. Because it accesses these databases on the fly, it requires internet access.

Prior gene regulatory networks are useful mainly as an input for tools that incorporate additional sources of information to refine them. The prior networks generated by SPONGE are designed to be compatible with PANDA and related NetZoo tools.

The purpose of this project is to let people generate prior gene regulatory networks without having to run the genome-wide motif search themselves, while still being able to change the parameters that were used to generate publicly available prior gene regulatory networks. It is also designed to make it easy to include new information from database updates in the prior networks.

If you just want to use the prior networks generated by the stable version of SPONGE with the default settings, they are available on Zenodo.

Features

The features already available are:

  • Generation of prior gene regulatory network
  • Generation of prior protein-protein interaction network for transcription factors
  • Automatic download of required files during setup
  • Parallelised motif filtering
  • Command line interface

Setup

The requirements are provided in a requirements.txt file.

SPONGE can be installed via pip:

pip install netzoopy-sponge

Alternatively, it can be installed by cloning this repository and then installing with pip (here in editable mode, via the -e flag):

git clone https://github.com/ladislav-hovan/sponge.git
cd sponge
pip install -e .

Usage

SPONGE comes with a netzoopy-sponge command line script:

# Get information about the available options
netzoopy-sponge --help
# Run the pipeline
netzoopy-sponge

The script has many options, but the defaults are designed to be sensible, so users only need to change them if they want different behaviour.

Within Python, the default workflow can be invoked as follows:

# Import the class definition
from sponge.sponge import Sponge
# Run the default workflow
sponge_obj = Sponge(run_default=True)

Much like the command line script, the Sponge class accepts many parameters that give control over the process, and they can be changed from their defaults. For more information, you can run help(Sponge) after the import.

In case one needs more control over the individual steps, the workflow in Python would be as follows:

# Import the class definition
from sponge.sponge import Sponge
# Create the SPONGE object
sponge_obj = Sponge()
# Select the vertebrate transcription factors from JASPAR
sponge_obj.select_tfs()
# Find human homologs for the TFs if possible
sponge_obj.find_human_homologs()
# Filter the matches of the JASPAR bigbed file to the ones in the
# promoters of human transcripts
sponge_obj.filter_matches()
# Aggregate the filtered matches on promoters to genes
sponge_obj.aggregate_matches()
# Write the final motif prior to a file
sponge_obj.write_motif_prior()
# Retrieve the protein-protein interactions between the transcription
# factors from the STRING database
sponge_obj.retrieve_ppi()
# Write the PPI prior to a file
sponge_obj.write_ppi_prior()

SPONGE will attempt to download the files it needs into a temporary directory (.sponge_temp by default). Paths can be provided if these files were downloaded in advance. The JASPAR bigbed file required for filtering is huge (> 100 GB), so the download might take some time. Make sure you're running SPONGE somewhere that has enough space!

As an alternative to the bigbed file download, SPONGE can download tracks for individual TFs on the fly and filter them individually. This way of processing is slower than using the bigbed file when all TFs in the database are considered, but it becomes competitive when only a subset is used. It also needs far less disk space. The option is enabled with on_the_fly_processing=True.
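As a rough sketch, the option could be combined with the default workflow as shown here. This assumes that on_the_fly_processing is accepted as a keyword argument by the Sponge class in the same way as run_default; check help(Sponge) for the exact signature. The wrapper function name is made up for this example.

```python
def run_sponge_on_the_fly():
    """Run the default SPONGE workflow using per-TF track downloads."""
    # Imported inside the function so that defining it does not require
    # SPONGE to be installed
    from sponge.sponge import Sponge

    # Assumption: on_the_fly_processing is a keyword argument of Sponge,
    # passed alongside run_default
    return Sponge(run_default=True, on_the_fly_processing=True)
```

Calling run_sponge_on_the_fly() requires internet access, just like the regular workflow.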

File formats

Users are free to provide their own files for the list of regions of interest (key name promoter, default file name promoters.bed), the mapping of transcripts to genes (key name ensembl, default file name ensembl.tsv) and the list of predicted TF binding sites (key name jaspar_bigbed, default file name JASPAR.bb). If the paths are not provided through the keyword paths_to_files, SPONGE attempts to locate these files in the temporary folder under their default names. If it fails to do so, it will proceed to download them.
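The lookup order described above can be illustrated with a small standalone sketch. It is not SPONGE's own implementation: the function name resolve_input_files is hypothetical, and treating a provided path that does not exist as missing is an assumption. Only the key names, default file names and the default temporary folder come from this README.

```python
from pathlib import Path

# Key names and default file names as listed above
DEFAULT_NAMES = {
    "promoter": "promoters.bed",
    "ensembl": "ensembl.tsv",
    "jaspar_bigbed": "JASPAR.bb",
}


def resolve_input_files(paths_to_files=None, temp_folder=".sponge_temp"):
    """Return (found, missing): located paths by key, and keys to download."""
    provided = dict(paths_to_files or {})
    found, missing = {}, []
    for key, default_name in DEFAULT_NAMES.items():
        # A user-provided path takes precedence over the temporary folder
        candidate = Path(provided.get(key, Path(temp_folder) / default_name))
        if candidate.is_file():
            found[key] = candidate
        else:
            # Assumption: anything not found on disk would be downloaded
            missing.append(key)
    return found, missing


found, missing = resolve_input_files({"promoter": "my_regions.bed"})
print("Found:", sorted(found))
print("Would be downloaded:", missing)
```

A dictionary of the same shape, with paths that do exist, is what would be passed through paths_to_files.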

The list of regions of interest is expected as a bed file in the 6 column format without a header, for example:

chr1    11119   12119   ENST00000456328 0       +
chr1    11260   12260   ENST00000450305 0       +
chr1    17186   18186   ENST00000619216 0       -
chr1    24636   25636   ENST00000488147 0       -
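Before passing a custom file of regions, it can be worth checking that it matches this layout. The following is a minimal sketch of such a check using only the standard library. The names Region and parse_bed6 are made up for this example, and the specific checks (six columns, a valid strand, start before end) are assumptions about what a well-formed file looks like rather than SPONGE's own validation.

```python
from typing import NamedTuple


class Region(NamedTuple):
    chrom: str
    start: int
    end: int
    name: str  # transcript ID in the example above
    score: str
    strand: str


def parse_bed6(lines):
    """Parse headerless 6-column bed lines into Region tuples."""
    regions = []
    for number, line in enumerate(lines, start=1):
        if not line.strip():
            continue  # skip blank lines
        fields = line.split()
        if len(fields) != 6:
            raise ValueError(
                f"line {number}: expected 6 columns, got {len(fields)}"
            )
        chrom, start, end, name, score, strand = fields
        if strand not in {"+", "-", "."}:
            raise ValueError(f"line {number}: invalid strand {strand!r}")
        start, end = int(start), int(end)
        if start >= end:
            raise ValueError(f"line {number}: start must be less than end")
        regions.append(Region(chrom, start, end, name, score, strand))
    return regions


example = [
    "chr1\t11119\t12119\tENST00000456328\t0\t+",
    "chr1\t17186\t18186\tENST00000619216\t0\t-",
]
for region in parse_bed6(example):
    print(region.name, region.strand, region.end - region.start)
```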

The mapping of transcripts to genes is expected as a four column tsv file with a defined header, for example:

Transcript stable ID    Gene stable ID  Gene name       Gene type
ENST00000387314 ENSG00000210049 MT-TF   Mt_tRNA
ENST00000389680 ENSG00000211459 MT-RNR1 Mt_rRNA
ENST00000387342 ENSG00000210077 MT-TV   Mt_tRNA
ENST00000387347 ENSG00000210082 MT-RNR2 Mt_rRNA
ENST00000386347 ENSG00000209082 MT-TL1  Mt_tRNA
ENST00000361390 ENSG00000198888 MT-ND1  protein_coding
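A file in this format can be read with the standard csv module. The sketch below checks the header and builds a lookup from transcript ID to gene; the function name load_transcript_to_gene is hypothetical, and returning both the gene ID and the gene name is a choice made for illustration, not a statement about which column SPONGE uses internally.

```python
import csv
import io

REQUIRED_COLUMNS = {
    "Transcript stable ID",
    "Gene stable ID",
    "Gene name",
    "Gene type",
}


def load_transcript_to_gene(handle):
    """Map each transcript ID to a (gene ID, gene name) tuple."""
    reader = csv.DictReader(handle, delimiter="\t")
    absent = REQUIRED_COLUMNS - set(reader.fieldnames or [])
    if absent:
        raise ValueError(f"missing columns: {sorted(absent)}")
    return {
        row["Transcript stable ID"]: (row["Gene stable ID"], row["Gene name"])
        for row in reader
    }


# Two rows taken from the example above
example = io.StringIO(
    "Transcript stable ID\tGene stable ID\tGene name\tGene type\n"
    "ENST00000387314\tENSG00000210049\tMT-TF\tMt_tRNA\n"
    "ENST00000361390\tENSG00000198888\tMT-ND1\tprotein_coding\n"
)
mapping = load_transcript_to_gene(example)
print(mapping["ENST00000361390"])
```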

The Transcript stable ID field will be used to match regions of interest. Finally, the predicted TF binding sites are expected in a binary bigbed file, with the following format when decoded:

chrom   start      end      name  score strand TFName
chr1   10000    10006  MA0467.3    276      -    Crx
chr1   10000    10006  MA0648.2    233      +    GSC
chr1   10000    10006  MA0682.3    231      +  PITX1
chr1   10000    10006  MA0711.2    198      +   OTX1
chr1   10000    10006  MA0714.2    246      +  PITX3

Effectively, it is an extended bed format with a header, which uses the name column to provide the JASPAR matrix ID and the TFName column to provide the actual name of the transcription factor. However, SPONGE currently expects a bigbed file and will not work with a plain bed file.
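To show how the three inputs relate to each other, here is a toy sketch that keeps binding sites lying inside a region of interest and collapses them to TF-gene edges via the transcript mapping. It is only an illustration: the function name is hypothetical, the rule that a site must lie fully inside the region is an assumption, the gene name GENE_A and the site coordinates are made up, and the real filtering and aggregation in SPONGE have more options than this.

```python
def build_toy_motif_prior(promoters, transcript_to_gene, sites):
    """Return sorted (TF name, gene) edges for sites inside promoters.

    promoters: iterable of (chrom, start, end, transcript_id)
    transcript_to_gene: dict mapping transcript ID to gene name
    sites: iterable of (chrom, start, end, tf_name)
    """
    # Group promoters by chromosome to avoid comparing unrelated regions
    by_chrom = {}
    for chrom, start, end, transcript in promoters:
        by_chrom.setdefault(chrom, []).append((start, end, transcript))

    edges = set()
    for chrom, start, end, tf_name in sites:
        for p_start, p_end, transcript in by_chrom.get(chrom, ()):
            # Assumption: a site counts only if fully inside the promoter
            if p_start <= start and end <= p_end:
                gene = transcript_to_gene.get(transcript)
                if gene is not None:
                    # Several sites or transcripts collapse to one edge
                    edges.add((tf_name, gene))
    return sorted(edges)


promoters = [("chr1", 11119, 12119, "ENST00000456328")]
transcript_to_gene = {"ENST00000456328": "GENE_A"}  # hypothetical gene name
sites = [
    ("chr1", 10000, 10006, "Crx"),    # outside the promoter
    ("chr1", 11200, 11206, "PITX1"),  # made-up site inside the promoter
    ("chr1", 11500, 11506, "PITX1"),  # second site, same edge
]
for tf_name, gene in build_toy_motif_prior(promoters, transcript_to_gene, sites):
    print(tf_name, gene, sep="\t")
```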

Container

SPONGE releases are also provided as Docker containers. The most basic way of running would involve mounting a directory to the /data directory on the container, where networks will be written by default:

docker run --mount type=bind,source="$(pwd)"/output,target=/data ghcr.io/ladislav-hovan/netzoopy_sponge:latest

Help can be requested with the --help argument. For the most part the arguments match those of the netzoopy-sponge command line script, but interactive prompts are disabled.

Project Status

The project is: in progress.

Room for Improvement

Room for improvement:

  • Try incorporating unipressed
  • Improve overlap computations

To do:

  • Support for more species

Acknowledgements

Many thanks to the members of the Kuijjer group at NCMM for their feedback and support.

This README is based on a template made by @flynerdpl.

Contact

Created by Ladislav Hovan (ladislav.hovan@ncmm.uio.no). Feel free to contact me!

License

This project is open source and available under the GNU General Public License v3.

