Skip to main content

A parser for ISCN data.

Project description

ISCNSNAKE : ISCN Structural and Numerical Analysis of Karyotype Entries

ISCNSNAKE is a tool to use with ISCN : the International System for Cytogenic Nomenclature written in python 3.7. The repository contains three python files, one is the core of the module (ISCNParser.py) and the second (run_SNAKE.py) is a quick script that executes the main function of ISCNParser. Most features are accessible through the console menu that opens when the script is ran but if you want to automate multiple files or do something more complicated it is suggested that you import the module into your own script. The third script (quick_run_SNAKE.py) runs ISCNParser at the default settings with users only having to specify the location of the Mitelman database on their computer, the desired output folder and filenames, and any criteria they wish to use as filters. For a description on how to use these filters consult readthedocs.org/ISCNSNAKE/

SYSTEM REQUIREMENTS

Python 3.6 or greater (anaconda strongly recommended)

Python modules

  • pandas
  • numpy
  • easy_install (for installation)

INSTALLATION

  1. Open either terminal or command line in the fold you place the download (this folder).
  2. Input the following:

easy_install --always-unzip .

  1. You should be able to run the program from anywhere with the upcoming instructions.

HOW TO USE

quick_SNAKE

runs ISCNSNAKE.parse_file at the default settings, so if want to change the handling of clones, filenames, or analyze aberration cooccurance you will have to interact with the menu provided by run_SNAKE.py. For the vast majority of use cases this version is more than adequate.

After installation, quick_SNAKE can be ran from any location using the following prompt entered in terminal or command prompt:

python -m ISCNSNAKE.quick_SNAKE [-options]

This can also be done by entering interactive mode (just type python) and then:

import ISCNSNAKE.quick_SNAKE

With no options specified, the program will ask for some user input. These can also be specified by command line options.

[-f] : Mitelman filters for selection rows and columns of the Mitelman database.

Filter syntax is “column:attribute”. For example, if you wanted to select only cases with topography indicated as breast, you would enter “Topography:Breast” as a filter. If you want to add additional filters, such as wanting to specify only adenocarcinomas with the same topography, create a second filter with the same syntax and separate it with two forward slashes.

Example: “Topography:Breast//Morphology:Adenocarcinoma”.

Double slashes are used as commas, spaces, and dashes are all within the attributes in the Mitelman Database. Filters place restrictions on which categories are acceptable so it is possible to analyze multiple types by adding more filters.

Example: “Topography:Breast//Topography:Lung//Morphology:Adenocarcinoma” This would analyze all Breast and Lung adenocarcinomas

[-p] : File prefix, determines names of output files. ex. -p Skeletal makes output files Skeletal_DeepDel.txt, Skeletal_DeepAmp.txt [-o] : Name of output folder, if it doesn’t exist it will make one. Defaults to current directory. [-v] : Verbosity. If specified, will output each ISCN it comes across. Useful if you’re unsure whether or not your filters were correct on filter usage but does slow the program down a bit. [-i] : Path to the plain text file of the Mitelman database

Step by Step Tutorial:

In this example I will be using the Mitelman.txt file included in the download, looking to analyze Retinoblastomas.

  1. After installation, navigate to the folder where your desired data file is.
  2. Enter the command: python -m ISCNSNAKE.quick_SNAKE -i -p RB -o retinoblastoma -v

You will then be prompted to add filters. Since we want to analyze retinoblastomas we should enter "Topography:Eye//Morphology:Retinoblastoma"

You will then see the output to the terminal (because you specified verbose) and the files created from the analysis in the folder "retinoblastoma"

The full version of the program with many experimental settings can be ran with the following prompt:

python -m ISCNSNAKE.run_SNAKE

This can also be done in interactive mode.

This will open a menu that will allow you to configure many options. Here is a brief description of what each of the of the more complex options do. Many of the options simply specify filenames and should be fairly self-explanitory.

1 -- input filename Required. Enter the path to the file containing the karyotype data. ex: ../../Mitelman.txt

2 -- input data type (ISCN/Mitelman) Can be set to either ISCN or Mitelman. Default ISCN. Mitelman is the format that the text version of the database is in, as provided in Mitelman.txt. ISCN is a format where each karyotype in ISCN format is on a seperate line by itself, as in ISCN_example.txt

3 -- clone handling

Set this to one the 3 following options: first_only: only analyzes the first clone in each ISCN, good for small data sets with few clonal karyotypes and one or two massive karyotypes with 10+ clones seperate: analyzes each clone as a seperate patient, your frequency percentages will be very low, I personally don't recommend this option. merge: creates a consensus karyotype out of all the clonal karyotypes

4 -- Output to file

If this is False, you will get no output. Can be good if you just want the raw data. To use this you will have to change how you call run_SNAKE or assign _ to a variable afterwards.

6 -- output mode relative if you want percentages, minmax if you want it to be a result between 0 and 1 (for machine learning and possibly other purposes)\

7 -- current filters

Enter filters as described for quick_SNAKE but instead of seperating with double slashes just enter each one seperately.

You may also incorporate elements of the ISCNParser algorithm into your script by importing ISCNSNAKE.ISCNParser and then calling the functions and classes for your needs.

15 -- analyze dependancies Will make a list of co occurances of various aberrations. Very very very very slow. Creates a list of links in circos format.

16 -- dependence quantile Trims links to the quantile selected. Useful as most dependancy matrixes end up with hundreds of thousands of co-occurances.

18 -- analyze errors Set to True if you want a seperate file containing every instance of an unparsable karyotype. Could be used pretty easily to check your own karyotypes to check for proper ISCN syntax. Will try to explain errors but isn't very descriptive.

20 -- verbose Set to True if you want all the karyotypes the program reads to be printed to console. (same as -v in quick_SNAKE)

Authors

  • Connor Denomy (connordenomy@gmail.com) - Lead author Feel free to email me any questions you have about using this program.

Acknowledgments

  • Samuel Germain for creating the prototype of this program in shell/perl and providing the basis for the core algorithm
  • Bjorn Haave for assisting
  • Vizeacoumar lab

Project details


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

ISCNSNAKE-2.2.tar.gz (1.7 MB view hashes)

Uploaded Source

Built Distribution

ISCNSNAKE-2.2-py3-none-any.whl (48.5 kB view hashes)

Uploaded Python 3

Supported by

AWS AWS Cloud computing and Security Sponsor Datadog Datadog Monitoring Fastly Fastly CDN Google Google Download Analytics Microsoft Microsoft PSF Sponsor Pingdom Pingdom Monitoring Sentry Sentry Error logging StatusPage StatusPage Status page