Skip to main content

Store and access gene expression datasets and gene definitions.

Project description

genedataset is a package to store and access gene expression datasets and gene definitions. It consists of two main classes, geneset and dataset.

Changes in version 1.0

Some significant changes have been made in this version: 1. “MedianTranscriptLength” property of Geneset has been replaced with “TranscriptLengths”, which holds a list of transcript lengths, one per transcript (eg. [2000,1530]). Note that the value contained is a string, so convert into a list of integers before using the value. 2. The Gene annotation has been upgraded to Ensembl version 88. 3. Dataset no longer supports microarrays, so it’s been simplified, so related methods are deprecated. 4. The package supports both python 3 and 2 - tested on 2.7.14 and 3.6.3.

Installation

pip install -e genedataset

geneset

geneset stores gene information combined from both Ensembl and NCBI/Entrez (mouse and human only), so that you can query it:

from genedataset import geneset
gs = geneset.Geneset().subset(queryStrings='ccr3')
print(gs.geneIds())
 ['ENSG00000183625', 'ENSMUSG00000035448']
gs.dataframe()

Ensemb lId

Specie s

Entrez Id

GeneSy mbol

Synony ms

Descri ption

Transc riptLe ngths

Orthol ogue

ENSG00 000183 625

HomoSa piens

1232

CCR3

CC-CKR -3

CD193

CKR3

CMKBR3

ENSMUS G00000 035448

MusMus culus

12771

Ccr3

CC-CKR 3

CKR3

Cmkbr1 l2

Cmkbr3

dataset

dataset can store gene expression data so that it can be queried. The stored data consists of expression values (usually rna-seq) and sample data packaged into HDF5 format.

from genedataset import dataset
ds = dataset.Dataset("genedataset/data/testdata.1.0.h5")
ds
 <Dataset name:testdata 4 samples>
ds.expressionMatrix()

featureId

s01

s02

s03

s04

gene1

3.45

4.65

2.65

8.23

gene2

5.54

0.00

1.43

6.43

gene3

0.00

0.00

4.34

5.44

ds.sampleTable()

sampleId

celltype

tissue

s01

B1

BM

s02

B1

BM

s03

B2

BM

s04

B2

BM

Dataset creation example

Here is an example to create a Dataset file from text files. Once the file has been created, it can be accessed through the Dataset instance. The advantage of this is to store all related information for a dataset in one file, and gives you a python object that can be used for analyses and for application development.

import pandas
from genedataset import dataset
attributes = {"name": "testdata",
              "fullname": "Test Dataset",
              "version": "1.0",
              "description": "This dataset comes with the package for testing purposes.",
              "expression_data_keys": ["counts","cpm"],
              "pubmed_id": None,
              "species": "MusMusculus"}
samples = pandas.DataFrame([['B1', 'BM'], ['B1', 'BM'], ['B2', 'BM'], ['B2', 'BM']],
                           index=['s01','s02','s03','s04'],
                           columns=['celltype','tissue'])
samples.index.name = "sampleId"
counts = pandas.DataFrame([[35, 44, 21, 101], [50, 0, 14, 62], [0, 0, 39, 73]],
                          index=['gene1', 'gene2', 'gene3'], columns=['s01', 's02', 's03', 's04'])
counts.index.name = "geneId"
cpm = pandas.DataFrame([[3.45, 4.65, 2.65, 8.23], [5.54, 0.0, 1.43, 6.43], [0.0, 0.0, 4.34, 5.44]],
                       index=['gene1', 'gene2', 'gene3'], columns=['s01', 's02', 's03', 's04'])
cpm.index.name = "geneId"
dataset.createDatasetFile("/datasets", attributes=attributes, samples=samples, expressions=[counts,cpm])

Contact

Jarny Choi, University of Melbourne

Changes

  • v1.0 - Major upgrade.

  • v0.1.x - Initial release with minor adjustments to test pypi and github upload/download.

License

MIT License

Project details


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

genedataset-1.0.0a2.tar.gz (4.1 MB view hashes)

Uploaded Source

Built Distribution

genedataset-1.0.0a2-py2.py3-none-any.whl (4.2 MB view hashes)

Uploaded Python 2 Python 3

Supported by

AWS AWS Cloud computing and Security Sponsor Datadog Datadog Monitoring Fastly Fastly CDN Google Google Download Analytics Microsoft Microsoft PSF Sponsor Pingdom Pingdom Monitoring Sentry Sentry Error logging StatusPage StatusPage Status page