
A Python package to split machine learning data sets using graph partitioning

Project description

mlNODS

Split machine learning data sets using graph partitioning

Appropriate assessments require appropriate splits of training and evaluation data sets, and this in turn requires clustering. For many problems, single-linkage clustering suffices. Having encountered a problem that such standard procedures could not solve, we developed a simple graph-based tool for creating independent data sets.

mlNODS is a graph-based method that splits original data sets into non-overlapping sets which cannot be grouped without removing some of the data. mlNODS optimizes two constraints: (1) retain as many data points as possible, and (2) remove any overlap between two split sets. The nodes of the graph are the original data points, and the edges are measures of similarity between nodes (e.g. sequence similarity for protein sets). The method begins by building the full graph and proceeds by removing nodes to optimally fit the constraints to the similarity table. mlNODS is applicable to any problem and has the additional benefit of allowing overlap within one set (i.e. training on homologues) while disallowing it between two sets (i.e. training and testing do not overlap).
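The core idea can be sketched in a few lines of Python. This is an illustrative simplification, not the mlNODS implementation: it thresholds edges at the cutoff, finds connected components (single-linkage clusters), and distributes whole components across splits, omitting the node-removal step that mlNODS adds for entangled data. All names here are ours.

```python
from collections import defaultdict

def split_by_components(nodes, edges, cutoff, n_splits):
    """nodes: iterable of IDs; edges: (id1, id2, score) triples."""
    # Keep only edges whose similarity meets the cutoff.
    adjacency = defaultdict(set)
    for a, b, score in edges:
        if score >= cutoff:
            adjacency[a].add(b)
            adjacency[b].add(a)

    # Connected components via depth-first search.
    seen, components = set(), []
    for node in nodes:
        if node in seen:
            continue
        stack, component = [node], []
        seen.add(node)
        while stack:
            current = stack.pop()
            component.append(current)
            for neighbor in adjacency[current]:
                if neighbor not in seen:
                    seen.add(neighbor)
                    stack.append(neighbor)
        components.append(component)

    # Greedily assign each component (largest first) to the smallest split,
    # so no above-cutoff similarity link ever crosses a split boundary.
    splits = [[] for _ in range(n_splits)]
    for component in sorted(components, key=len, reverse=True):
        min(splits, key=len).extend(component)
    return splits
```

Because components are assigned whole, two points linked above the cutoff always land in the same split; overlap within a split remains allowed.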

usage: mlnods [-h] -s SPLITS -c CUTOFF [-l LIMIT] -e EDGES_FILE
                [-f EDGES_FORMAT] -n NODES_FILE [-a] [-r RANDOM]
                [-o OUTFOLDER] [-v] [-q] [--version]

This is a script that will create independent sets of data

Version: 1.0 [03/14/20]

optional arguments:
  -h, --help            show this help message and exit
  -s SPLITS, --splits SPLITS
                        number of splits required
  -c CUTOFF, --cutoff CUTOFF
                        similarity cutoff in the units of link scores
  -l LIMIT, --limit LIMIT
                        limit on the number of links for each node (default=0, infinity)
  -e EDGES_FILE, --edges EDGES_FILE
                        file containing a table of instances with link scores for each pair
  -f EDGES_FORMAT, --format EDGES_FORMAT
                        format of the table file

                        blast     : takes a list of -m 9 formatted blast files and builds a table based on seqID
                        hssp      : takes a list of -m 9 formatted blast files, runs HSSP scoring script and builds an HSSP distance table
                        self<int> : space/tab separated table file, similarity score in column <int>
                                    eg "ID1 ID2 similarity_score" will be addressed as self3 (default=self5)
  -n NODES_FILE, --nodes NODES_FILE
                        instance file containing IDs of all instances being considered

                        IDs are case-independent (eg ABC = abc)
                        IDs are always preceded by ">" and followed by whitespace.
                        No white spaces are allowed in an ID.
                        If a score is provided for an ID, it must be surrounded by spaces and directly follow the ID
                        (eg. >abl1_human 10 gene associated with ....)
                        Everything between two IDs is printed in the junction files, but not considered in evaluation
  -a, --abundance       how instance scores are determined

                        false : score retrieved from instance file, range [0-100], default=50 when missing
                        true  : score approximated by actual number of times an ID appears in the instance file
  -r RANDOM, --random RANDOM
                        set a fixed random seed to generate consistent partitions
  -o OUTFOLDER, --outfolder OUTFOLDER
                        path to output folder (default=<current directory>)
  -v, --verbose         set verbosity level
  -q, --quiet           no logging to stdout
  --version             show program's version number and exit

If an ID is present in the instance file but not in the table file, the ID is considered not linked to anything else.
If an ID is present in the table file but not in the instance file, it is ignored.
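The input conventions above can be illustrated with a short, self-contained parsing sketch. The IDs, scores, and descriptions below are invented, and this is not code from the package; the edge table follows the default "self<int>" format (here self3: score in column 3).

```python
import re

# Instance file: ">ID [score] free-text description" per entry.
instance_file = (
    ">abl1_human 10 gene associated with chronic myeloid leukemia\n"
    ">src_human 25 proto-oncogene tyrosine-protein kinase\n"
    ">egfr_human epidermal growth factor receptor\n"  # no score: defaults to 50
)

# Edge table in self3 format: "ID1 ID2 similarity_score".
edges_file = (
    "abl1_human src_human 62.5\n"
    "abl1_human egfr_human 12.0\n"
)

# Parse IDs (">"-prefixed, whitespace-terminated, case-independent) together
# with the optional score that directly follows the ID.
scores = {}
for match in re.finditer(r"^>(\S+)(?:\s+(\d+)\s)?", instance_file, re.MULTILINE):
    identifier, score = match.group(1).lower(), match.group(2)
    scores[identifier] = int(score) if score else 50

# Parse the edge table: similarity score in column 3.
edges = [(a, b, float(s)) for a, b, s in
         (line.split() for line in edges_file.splitlines())]
```

Anything after the optional score and before the next ">" (the free-text description) is carried through to the junction files but ignored during evaluation.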

mlnods was developed by Yana Bromberg and refactored by Maximilian Miller.

Feel free to contact us for support at services@bromberglab.org.
This software is licensed under [NPOSL-3.0](http://opensource.org/licenses/NPOSL-3.0).

Download files


Source Distribution

mlnods-1.1.tar.gz (25.6 kB view details)

Uploaded Source

Built Distribution


mlnods-1.1-py3-none-any.whl (26.5 kB view details)

Uploaded Python 3

File details

Details for the file mlnods-1.1.tar.gz.

File metadata

  • Download URL: mlnods-1.1.tar.gz
  • Upload date:
  • Size: 25.6 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/3.1.1 pkginfo/1.5.0.1 requests/2.23.0 setuptools/46.0.0 requests-toolbelt/0.9.1 tqdm/4.43.0 CPython/3.8.1

File hashes

Hashes for mlnods-1.1.tar.gz
Algorithm Hash digest
SHA256 880dbaa15b3ef11268d2cbf18a925022504f43dff06841a9624d08dcc5973364
MD5 8d3ed9befbc4e73838cffe3924760f54
BLAKE2b-256 3d7bc214a06fcde5369ba16cca408def377a6355d1c3def1b5716dca22629732


File details

Details for the file mlnods-1.1-py3-none-any.whl.

File metadata

  • Download URL: mlnods-1.1-py3-none-any.whl
  • Upload date:
  • Size: 26.5 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/3.1.1 pkginfo/1.5.0.1 requests/2.23.0 setuptools/46.0.0 requests-toolbelt/0.9.1 tqdm/4.43.0 CPython/3.8.1

File hashes

Hashes for mlnods-1.1-py3-none-any.whl
Algorithm Hash digest
SHA256 731bae895d8e91644033ed8d2861167d23a934146f10241b385f15d78c354b5d
MD5 ac087a4cbebeda077483648dfb14f1de
BLAKE2b-256 0c9478ed5eef2e9a99fcaa073d432cdbb47f40e48f0e347dd8214c3a86eed35c

