
## datamast
A package that handles a lot of things for you, including detecting useful columns in mass spectrometry output CSV files, processing them, and converting them to GCT.

---
## Motivation
Polly is collecting data at a fast pace, but it is still difficult for the end user to run an analysis on Polly. Users are restricted to Polly's hard-coded input format rules and are often unable to run anything because of inconsistencies or type/format errors in their datasets. They also have to enter information that is frequently redundant and could easily be automated.

---

## Tech/framework used

**Built with**
- [NumPy](https://numpy.org/)
- [pandas](https://pandas.pydata.org/)
- [cmapPy](https://github.com/cmap/cmapPy)
- [Elasticsearch](https://www.elastic.co/elasticsearch)
---
## Features
* Detects special kinds of columns (**Intensity, Metabolites and Samples**) from mass spectrometry related CSV files
* Converts the input CSV/XLSX files into a standardised format, i.e., GCT
---
## Code Example
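A condensed sketch of the end-to-end workflow (the full walkthrough is under "How to use?" below; the file names here are placeholders):

```Python
from datamast.DataAnalyzer import DataAnalyzer
from datamast.makeGCT import makeGCT

# Placeholder file paths: substitute your own data
da = DataAnalyzer(main_file_path="intensities.csv",
                  cohort_file_path="cohort.csv",
                  std_sample_path="std_samples.csv")

# Detect the intensity, sample and compound columns and the file layout
df, pa, ps, pc, form = da.analyze()

# Build a single GCT (GCToo) object from the detected pieces
gct = makeGCT(df, da.get_cohort_df(), da.get_std_df(), pa, ps, pc, form).toGCT()
```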
---

## Installation

#### Prerequisite Installation:
To use this package you first have to install Elasticsearch on your system. After installing it, create an Elasticsearch index and bulk-index your metabolites data onto it by running the following commands, in order, from a bash shell. For the third command, download the file **csv_to_elastic.py** from https://drive.google.com/open?id=1cYDX71rzgAV1M6dCZB2TRpNpv2zewjJX and place it in your current directory.

* **First Command** (configuring the analyzer and tokenizer):

```
curl -X PUT "localhost:9200/compound?pretty" -H 'Content-Type: application/json' -d'
{
  "settings": {
    "number_of_shards": 1,
    "analysis": {
      "filter": {
        "filter_for_search": {
          "type": "length",
          "max": "20",
          "min": "5"
        },
        "autocomplete_filter": {
          "type": "edge_ngram",
          "min_gram": "5",
          "max_gram": "20"
        }
      },
      "analyzer": {
        "analyser_for_search": {
          "filter": [
            "lowercase",
            "filter_for_search"
          ],
          "type": "custom",
          "tokenizer": "standard"
        },
        "autocomplete": {
          "filter": [
            "lowercase",
            "autocomplete_filter"
          ],
          "type": "custom",
          "tokenizer": "standard"
        }
      }
    }
  }
}
'
```
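To sanity-check the autocomplete analyzer before indexing, you can run a sample term through Elasticsearch's `_analyze` API (an optional check; "glucose" is just an example input):

```
curl -X GET "localhost:9200/compound/_analyze?pretty" -H 'Content-Type: application/json' -d'
{
  "analyzer": "autocomplete",
  "text": "glucose"
}
'
```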

* **Second Command** (applying these settings to the mapping "my_type"):

```
curl -X PUT "localhost:9200/compound/_mapping/my_type?pretty" -H 'Content-Type: application/json' -d'
{
  "my_type": {
    "properties": {
      "name": {
        "type": "text",
        "analyzer": "autocomplete",
        "search_analyzer": "analyser_for_search"
      }
    }
  }
}
'
```
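You can confirm that the mapping was applied by inspecting it (optional):

```
curl -X GET "localhost:9200/compound/_mapping?pretty"
```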

* **Third Command** (indexing the data):

```
python csv_to_elastic.py \
    --elastic-address 'localhost:9200' \
    --csv-file <path of your metabolites csv file> \
    --elastic-index compound \
    --datetime-field=dateField \
    --elastic-type my_type \
    --delimiter ',' \
    --json-struct '{
        "name": "%<name of the metabolite column>%"
    }'
```
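Once indexing finishes, a quick match query confirms that the metabolite names are searchable ("glucose" here is just an example; use any metabolite present in your CSV):

```
curl -X GET "localhost:9200/compound/_search?pretty" -H 'Content-Type: application/json' -d'
{
  "query": { "match": { "name": "glucose" } }
}
'
```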


---

#### Installing the package

Clone the repository, then run **pip install .** from a terminal inside the repository directory.
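For example (the repository URL and directory name are placeholders):

```
git clone <repository URL>
cd datamast
pip install .
```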

---
## How to use?

First start the Elasticsearch cluster that you configured earlier. The following code demonstration can help you get started.
```Python
from datamast.DataAnalyzer import DataAnalyzer
from datamast.makeGCT import makeGCT

#Creating a DataAnalyzer instance
'''
Parameters :
* main_file_path: Path to the main intensity file [Required]
* cohort_file_path: Path to the Cohort file [Optional]
* std_sample_path: Path to the Standard Sample Metadata [Optional]
'''
# path, path2 and path3 are placeholder strings pointing to your own files
da = DataAnalyzer(main_file_path=path, cohort_file_path=path2, std_sample_path=path3) #DataAnalyzer instance

#Analyzing the files
'''
Returns a list of five important values
* df: The corresponding dataframe to the input main file
* probable_area_columns: The list of detected area(intensity) columns
* probable_sample: The list of detected sample columns
* probable_comp: The list of detected compound columns
* form: The format of the input main file[ 'wide' or 'long']
'''
df, pa, ps, pc, form = da.analyze()

#Function to get dataframe corresponding to cohort file
cohort_df = da.get_cohort_df()
#Function to get dataframe corresponding to standard metadata sample
std_df = da.get_std_df()

#Creating a makeGCT instance from the outputs of DataAnalyzer.analyze()
'''
Parameters:
* df: The 'df' output of da.analyze() [Required]
* cohort_df: The output of da.get_cohort_df() [Optional]
* std_df: The output of da.get_std_df() [Optional]
* probable_area: The 'probable_area_columns' output of da.analyze() [Required]
* probable_sample: The 'probable_sample' output of da.analyze() [Required]
* probable_comp: The 'probable_comp' output of da.analyze() [Required]
* form: The 'form' output of da.analyze() [Required]
'''
mg = makeGCT(df, cohort_df, std_df, pa, ps, pc, form) #makeGCT instance

#Converting the files to a single GCT file by toGCT() method of makeGCT class
gct = mg.toGCT() #Returns a GCToo object
```
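Since `toGCT()` returns a GCToo object, it can be written to disk with cmapPy's GCT writer (a small follow-up sketch; the output filename is arbitrary):

```Python
from cmapPy.pandasGEXpress.write_gct import write as write_gct

# Persist the GCToo object produced above to a .gct file
write_gct(gct, "output.gct")
```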
---
## Contribute


---
## Credits

---
## License
This package is released under the MIT license.

MIT © [Shantanu Tripathi]()
