Loads STRING data into NDEx
NDEx STRING Content Loader
Python application for loading STRING data into NDEx.
This tool downloads and unpacks the STRING files below
9606.protein.links.full.v11.0.txt.gz
This loader generates one or more TSV files, converts them to CX, and uploads them to an NDEx server.
The number of networks generated is dictated by the --cutoffscore parameter, which by default generates two networks: one with all edges (cutoffscore 0.0) and one with edges that have a score of 0.7 and above.
Duplicate edges (edges that have the same Source and Target nodes and the same value of combined_score) are included in the generated TSV and CX files only once.
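The duplicate handling described above can be sketched as follows. This is only an illustration, not the loader's actual code: the function name deduplicate_edges and the tuple layout are hypothetical, and treating a reversed pair (B to A) as a duplicate of A to B is an assumption based on the 0.2.0 entry in the History section.

```python
def deduplicate_edges(rows):
    """Yield each undirected edge once.

    rows: iterable of (protein1, protein2, combined_score) tuples.
    Hypothetical helper; the real loader may organise this differently.
    """
    seen = set()
    for protein1, protein2, score in rows:
        # Sort the endpoints so that A-B and B-A produce the same key
        # (assumption: STRING edges are undirected).
        key = (tuple(sorted((protein1, protein2))), score)
        if key in seen:
            continue
        seen.add(key)
        yield protein1, protein2, score


edges = [
    ("9606.ENSP00000261819", "9606.ENSP00000353549", 999),
    ("9606.ENSP00000353549", "9606.ENSP00000261819", 999),  # reversed duplicate
]
unique_edges = list(deduplicate_edges(edges))
```

Only the first occurrence of each edge is kept, so the order of the input rows decides which direction survives.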
The name of the newly generated network includes the value of the cutoffscore argument, for example, STRING - Human Protein Links - High Confidence (Score >= 0.7).
If the user does not specify the --update UUID argument, a network with this name is overwritten if it already exists on the NDEx server; otherwise, a new network is created.
Specifying the --update UUID command-line argument overwrites the network with this UUID if it is found.
If it is not found, the user is asked whether to create a new network. When a network is updated, only edges and nodes are changed; network attributes other than version are not modified.
1) Below is an example of a record from 9606.protein.links.full.v11.0.txt.gz
9606.ENSP00000261819 9606.ENSP00000353549 0 0 0 0 0 102 90 987 260 900 0 754 622 999
To generate a STRING network, the loader reads rows from that file one by one and compares the value of the last column, combined_score, with the value of the cutoffscore argument. A row is not added to the generated network if its combined_score is less than the command-line argument cutoffscore.
2) If combined_score is not less than cutoffscore, the loader processes the first two columns:
column 1 - protein1 (9606.ENSP00000261819)
column 2 - protein2 (9606.ENSP00000353549)
When processing the first column, protein1, the script
replaces the Ensembl Id with a display name; for example, 9606.ENSP00000261819 becomes ANAPC5. The mapping of display names to Ensembl Ids is found in human.name_2_string.tsv.gz
uses human.uniprot_2_string.2018.tsv.gz to create the represents value. For example, represents for 9606.ENSP00000261819 is uniprot:Q9UJX4
uses human.entrez_2_string.2018.tsv.gz to create the list of aliases for the current protein. Thus, the list of aliases for 9606.ENSP00000261819 is ncbigene:51433|ensembl:ENSP00000261819
3) The second column, protein2, is processed the same way as column 1.
4) In the generated TSV file 9606.protein.links.tsv, the protein1 and protein2 values from the original file are replaced with
protein_display_name_1 represents_1 alias_1 protein_display_name_2 represents_2 alias_2
So, the original
9606.ENSP00000261819 9606.ENSP00000353549 0 0 0 0 0 102 90 987 260 900 0 754 622 999
becomes
ANAPC5 uniprot:Q9UJX4 ncbigene:51433|ensembl:ENSP00000261819 CDC16 uniprot:Q13042 ncbigene:8881|ensembl:ENSP00000353549 0 0 0 0 0 102 90 987 260 900 0 754 622 999
5) The generated TSV file 9606.protein.links.tsv is then transformed to CX (9606.protein.links.cx).
The default style defined in style.cx, distributed with this loader, is applied to the generated network if neither --style nor --template is specified.
The user can specify a style template file with the --style argument, or a style template network with --template UUID_of_style_template_network.
Specifying both --template and --style is not allowed.
6) 9606.protein.links.cx is then uploaded to the NDEx server, either replacing an existing network (if --update UUID is specified) or creating a new network.
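Steps 1 to 4 above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the loader's implementation: the dictionaries stand in for lookups that the loader builds from the human.*_2_string files, the function names describe_protein and convert_row are hypothetical, and the scaling of cutoffscore by 1000 is inferred from the History note that a cutoffscore of 0.8 keeps edges with a combined_score of 800 or higher. How the loader handles identifiers missing from a mapping file is not covered here.

```python
# Hypothetical in-memory lookups keyed by STRING identifier; the values are
# the ones used in the worked example above.
DISPLAY_NAMES = {
    "9606.ENSP00000261819": "ANAPC5",
    "9606.ENSP00000353549": "CDC16",
}
UNIPROT_IDS = {
    "9606.ENSP00000261819": "Q9UJX4",
    "9606.ENSP00000353549": "Q13042",
}
ENTREZ_IDS = {
    "9606.ENSP00000261819": "51433",
    "9606.ENSP00000353549": "8881",
}


def describe_protein(string_id):
    """Return [display name, represents, alias] for one STRING identifier."""
    ensembl_id = string_id.split(".", 1)[1]  # drop the "9606." prefix
    return [
        DISPLAY_NAMES[string_id],
        "uniprot:" + UNIPROT_IDS[string_id],
        "ncbigene:" + ENTREZ_IDS[string_id] + "|ensembl:" + ensembl_id,
    ]


def convert_row(line, cutoffscore):
    """Return the rewritten columns, or None if the row is filtered out."""
    columns = line.split()
    combined_score = int(columns[-1])
    # Assumption: combined_score is on a 0-1000 scale and cutoffscore on 0.0-1.0.
    if combined_score < round(cutoffscore * 1000):
        return None
    return describe_protein(columns[0]) + describe_protein(columns[1]) + columns[2:]


row = ("9606.ENSP00000261819 9606.ENSP00000353549 "
       "0 0 0 0 0 102 90 987 260 900 0 754 622 999")
converted = convert_row(row, 0.7)
```

Joining converted with single spaces gives the rewritten record shown in step 4; calling convert_row(row, 1.0) returns None because 999 is below the scaled cutoff of 1000.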
Dependencies
ndex2
ndexutil
networkx
scipy
requests
py4cytoscape
pandas
Compatibility
Python 3.6+
Installation
git clone https://github.com/ndexcontent/ndexstringloader
cd ndexstringloader
make dist
pip install dist/ndexloadstring*whl
Run make with no arguments to see other build/deploy options, including creation of a Docker image:
make
Output:
clean remove all build, test, coverage and Python artifacts
clean-build remove build artifacts
clean-pyc remove Python file artifacts
clean-test remove test and coverage artifacts
lint check style with flake8
test run tests quickly with the default Python
test-all run tests on every Python version with tox
coverage check code coverage quickly with the default Python
docs generate Sphinx HTML documentation, including API docs
servedocs compile the docs watching for changes
testrelease package and upload a TEST release
release package and upload a release
dist builds source and wheel package
install install the package to the active Python's site-packages
dockerbuild build docker image and store in local repository
dockerpush push image to dockerhub
Configuration
ndexloadstring.py requires a configuration file.
The default path for this file is ~/.ndexutils.conf, but it can be overridden with the --conf flag.
Configuration file
Networks listed in [network_ids] section need to be visible to the user
[ndexstringloader]
user = joe123
password = somepassword123
server = dev.ndexbio.org
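To illustrate the layout above, here is a small sketch that parses such a configuration with Python's standard configparser module. The function name load_credentials is hypothetical, and the idea that the profile name selects the section to read is an assumption; the loader itself may read the file differently.

```python
import configparser

SAMPLE_CONF = """
[ndexstringloader]
user = joe123
password = somepassword123
server = dev.ndexbio.org
"""


def load_credentials(conf_text, profile="ndexstringloader"):
    """Return (user, password, server) from the given configuration text.

    Assumption: the profile name matches a section name in the file.
    """
    parser = configparser.ConfigParser()
    parser.read_string(conf_text)
    if profile not in parser:
        raise KeyError("no [%s] section in configuration" % profile)
    section = parser[profile]
    return section["user"], section["password"], section["server"]


credentials = load_credentials(SAMPLE_CONF)
```

For a file on disk, the same parser could be filled with parser.read(path) instead of read_string; note that read silently skips files that do not exist.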
Needed files
A load plan is required to run this script. The file string_plan.json, found in ndexstringloader/ndexstringloader, can be used for this purpose.
Usage
For information invoke ndexloadstring.py -h
Example usage
Here is how this command can be run for dev and prod targets:
ndexloadstring.py --profile dev tmpdir/
ndexloadstring.py --profile prod tmpdir/ --cutoffscore 0.99 0.95
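The command lines above suggest an argument layout along the following lines. This argparse sketch is an assumption about how such options could be declared, not a copy of the loader's parser: the help texts, the default profile name, and the positional name datadir are illustrative, and only a subset of the documented flags is shown. Run ndexloadstring.py -h for the authoritative list.

```python
import argparse


def build_parser():
    """Build a parser mirroring a subset of the documented options (a sketch)."""
    parser = argparse.ArgumentParser(prog="ndexloadstring.py")
    parser.add_argument("datadir",
                        help="working directory for downloaded STRING files")
    # Assumed default; the real default profile name may differ.
    parser.add_argument("--profile", default="ndexstringloader",
                        help="section of the configuration file to use")
    # One network is generated per cutoff value.
    parser.add_argument("--cutoffscore", nargs="+", type=float,
                        default=[0.0, 0.7])
    parser.add_argument("--update", metavar="UUID",
                        help="UUID of an existing network to overwrite")
    # --style and --template may not be combined.
    style_group = parser.add_mutually_exclusive_group()
    style_group.add_argument("--style", help="path to a CX style file")
    style_group.add_argument("--template", metavar="UUID",
                             help="UUID of a style template network")
    return parser


args = build_parser().parse_args(
    ["--profile", "prod", "tmpdir/", "--cutoffscore", "0.99", "0.95"]
)
```

With this layout, passing both --style and --template makes argparse exit with an error, which matches the restriction described earlier.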
Credits
This package was created with Cookiecutter and the audreyr/cookiecutter-pypackage project template.
History
1.0.3 (2023-09-20)
Updated URL paths to data files because they moved on STRING server
Set default STRING version to “12.0”
Version of STRING data used is now appended to the network name.
Updated default network description and default style template file
1.0.2 (2022-06-29)
Fixed bug where --stringversion was being ignored when downloading data files
Set default version to 11.5
Fixed bug where version network attribute was not being updated with value of --stringversion
Changed URL to human.entrez_2_string.2018.tsv.gz because it moved on STRING server
--cutoffscore parameter can now take multiple values, and a network for each value will be generated and uploaded to NDEx. The default is set to generate a network with all edges (--cutoffscore 0.0) and a network with edges scoring 0.7 and above
1.0.0 (2020-11-11)
New default behavior: force-directed-cl layout is now applied on networks via the py4cytoscape library and a running instance of Cytoscape. Alternate Cytoscape layouts and the networkx “spring” layout can be run by setting the appropriate value via the new --layout flag
0.3.0 (2020-10-28)
Added --skipupload that lets caller skip upload of network to NDEx
Spring layout applied by default for all networks that have fewer than 2,000,000 edges. This can be overridden with new flag --layoutedgecutoff
0.2.4 (2019-12-01)
Fixed defect UD-462 Verify new network attributes are correctly set in ndexstringloader (https://ndexbio.atlassian.net/browse/UD-462).
0.2.3 (2019-09-13)
If the user loads the entire STRING network (i.e., runs the script with --cutoffscore 0), the name of the resulting network should be “STRING - Human Protein Links”, not “STRING - Human Protein Links - High Confidence”.
0.2.2 (2019-09-12)
Added new feature specified by UD-577 Quick improvement for new String loader (added optional --update argument that specifies the UUID of a target network to update; added optional --template argument that specifies the UUID of a network to use as style template; the update operation now only changes nodes and edges, but leaves network properties untouched).
0.2.1 (2019-08-23)
Improved README file.
Added new JUnit tests (JUnit test coverage is 87%).
0.2.0 (2019-07-26)
Removed duplicate edges. Every pair of connected nodes in STRING networks had the same edge duplicated (one edge going from A to B, and another going from B to A). Since edges in STRING are not directed, we can safely remove half of them.
Added new arguments to command line:
optional --cutoffscore (default is 0.7) - used to filter on combined_score column. To include edges with combined_score of 800 or higher, --cutoffscore 0.8 should be specified
required --datadir - specifies a working directory where STRING files will be downloaded to and processed
The style.cx file that contains the style is supplied with the STRING loader and used by default. It can be overridden with the --style argument.
0.1.0 (2019-03-13)
First release on PyPI.
Hashes for ndexstringloader-1.0.3-py2.py3-none-any.whl
Algorithm | Hash digest
---|---
SHA256 | 1fe091323da239b981aa455a50ac75bb962091a5b29ef42fb239971476fb9165
MD5 | 9c5db274d0c185adf3c769ace12dbf98
BLAKE2b-256 | 48ac9a2d5a1b43a688fce85402608d9ddd828be2ae4ab97bcf3e05748abf5398