
Script to load CSR data to TranSMART

This package contains a script that transforms Central Subject Registry data to a format that can be loaded into the TranSMART platform, an open-source data sharing and analytics platform for translational biomedical research.

The output of the transformation is a collection of tab-separated files that can be loaded into a TranSMART database using the transmart-copy tool.

⚠️ Note: this is a very preliminary version, still under development. Issues can be reported at https://github.com/thehyve/python_csr2transmart/issues.

Installation and usage

To install csr2transmart, do:

pip install csr2transmart

or from sources:

git clone https://github.com/thehyve/python_csr2transmart.git
cd python_csr2transmart
pip install .

Data model

The Central Subject Registry (CSR) data model contains individual, diagnosis, biosource and biomaterial entities. The data model is defined as a data class in csr/csr.py.
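The authoritative definitions are the data classes in csr/csr.py; purely as an illustration, the four entity types and their links might be sketched like this (only individual_id and birth_date appear in this document, the other field names are assumptions):

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional

# Hypothetical sketch of the four CSR entity types. The real model lives
# in csr/csr.py; field names other than individual_id and birth_date are
# illustrative assumptions.

@dataclass
class Individual:
    individual_id: str
    birth_date: Optional[date] = None

@dataclass
class Diagnosis:
    diagnosis_id: str
    individual_id: str  # link to the Individual entity

@dataclass
class Biosource:
    biosource_id: str
    diagnosis_id: str  # link to the Diagnosis entity

@dataclass
class Biomaterial:
    biomaterial_id: str
    biosource_id: str  # link to the Biosource entity
```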

Usage

This repository contains a number of command line tools:

  • sources2csr: Reads from source files and produces tab delimited CSR files.

  • csr2transmart: Reads CSR files and transforms the data to the TranSMART data model, creating files that can be imported to TranSMART using transmart-copy.

  • csr2cbioportal: Reads CSR files and transforms the data to patient and sample files to be imported into cBioPortal.

sources2csr

sources2csr <input_dir> <output_dir> <config_dir>

The tool reads input files from <input_dir> and writes CSR files in tab-delimited format (one file per entity type) to <output_dir>. The output directory <output_dir> must be empty or not yet exist.

The sources configuration will be read from <config_dir>/sources_config.json, a JSON file that contains three attributes:

  • entities: a map from entity type name to a description of the sources for that entity type. E.g.,

    {
      "Individual": {
        "attributes": [
          {
            "name": "individual_id",
            "sources": [
              {
                "file": "individual.tsv",
                "column": "individual_id"
              }
            ]
          },
          {
            "name": "birth_date",
            "sources": [
              {
                "file": "individual.tsv",
                "date_format": "%d-%m-%Y"
              }
            ]
          }
        ]
      }
    }

    The entity type names have to match the entity type names in the CSR data model, and the attribute names have to match the attribute names in the data model as well. The column field is optional; by default the column name is assumed to be the same as the attribute name. For date fields, a date_format can be specified. If not specified, %Y-%m-%d or any other date format supported by Pydantic is accepted. If multiple input files are specified for an attribute, they are read in order: only if the first file has no data for that attribute for a specific entity is the data for that entity read from the next file, and so on.
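    The multi-file fallback described above can be sketched as follows (the function name and data shapes are illustrative, not the package's actual API):

```python
def resolve_attribute(entity_id, sources):
    """Return the first non-empty value for an attribute, trying each
    configured source file in order. `sources` is a list of dicts mapping
    entity id -> value, one per source file, in configuration order."""
    for source in sources:
        value = source.get(entity_id)
        if value not in (None, ""):
            return value
    return None

# Example: the first file has no birth_date for P2, so the value
# falls through to the second file.
file1 = {"P1": "01-02-1990"}
file2 = {"P1": "ignored", "P2": "15-06-1985"}
resolve_attribute("P1", [file1, file2])  # -> "01-02-1990"
resolve_attribute("P2", [file1, file2])  # -> "15-06-1985"
```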

  • codebooks: a map from input file name to codebook file name, e.g., {"individual.tsv": "codebook.txt"}.

  • file_format: a map from input file name to file format configuration, which allows to configure the delimiter character (default: \t). E.g., {"individual.tsv": {"delimiter": ","}}.

See test_data/input_data/config/sources_config.json for an example.
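Putting the three attributes together, a minimal sources_config.json might look like this (file, column and codebook names are illustrative):

```json
{
  "entities": {
    "Individual": {
      "attributes": [
        {"name": "individual_id", "sources": [{"file": "individual.tsv"}]}
      ]
    }
  },
  "codebooks": {"individual.tsv": "codebook.txt"},
  "file_format": {"individual.tsv": {"delimiter": "\t"}}
}
```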

Content of the codebook files has to match the following format:

  • First, a header line with a number and the column names the codes apply to: the first field contains a number, the second a space-separated list of column names, e.g., 1\tSEX GENDER.

  • The lines following the header start with an empty field, followed by code\tvalue pairs until the end of the line, e.g., \t1\tMale\t2\tFemale.

  • A new header, detected by a non-empty first field, starts the process over again.

See test_data/input_data/codebooks/valid_codebook.txt for a codebook file example.
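Based on the format described above, a minimal codebook parser could look like this (illustrative only; the package's own parser may differ):

```python
def parse_codebook(text):
    """Parse codebook text into {column_name: {code: value}}.
    Header lines have a non-empty first field (a number), followed by a
    space-separated list of column names. Mapping lines start with an
    empty first field, followed by code/value pairs."""
    mappings = {}
    current_columns = []
    for line in text.splitlines():
        fields = line.split("\t")
        if not fields[0]:  # mapping line: empty first field
            pairs = fields[1:]
            codes = dict(zip(pairs[::2], pairs[1::2]))
            for column in current_columns:
                mappings.setdefault(column, {}).update(codes)
        else:  # header line: number, then space-separated column names
            current_columns = fields[1].split(" ")
    return mappings

# The SEX and GENDER columns share one code mapping:
parse_codebook("1\tSEX GENDER\n\t1\tMale\t2\tFemale")
# -> {"SEX": {"1": "Male", "2": "Female"},
#     "GENDER": {"1": "Male", "2": "Female"}}
```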

csr2transmart

csr2transmart <input_dir> <output_dir> <config_dir>

The tool reads CSR files from <input_dir> (one file per entity type) and transforms the CSR data to the TranSMART data model. In addition, if there is an NGS folder inside <input_dir>, the tool reads the NGS files inside it to determine the values of additional CSR biomaterial variables. The tool writes the output in transmart-copy format to <output_dir>. The output directory <output_dir> must be empty or not yet exist.

The ontology configuration will be read from <config_dir>/ontology_config.json. See test_data/input_data/config/ontology_config.json for an example.

csr2cbioportal

csr2cbioportal <input_dir> [--ngs-dir <ngs_dir>] <output_dir>

The tool reads CSR files from <input_dir> (one file per entity type) and, optionally, NGS (genomics) data from <ngs_dir>; it transforms the CSR data to the clinical data format for cBioPortal and writes the following data types to <output_dir>:

  • Clinical data

  • Mutation data

  • CNA Segment data

  • CNA Continuous data

  • CNA Discrete data

File structure, case lists and meta files are also added to the output folder. See the cBioPortal file formats documentation for further details.

The output directory <output_dir> must be empty or not yet exist.

Python versions

This package supports Python versions 3.6 and 3.7.

Package management and dependencies

This project uses pip for installing dependencies and package management.

Testing and code coverage

  • Tests are in the tests folder.

  • The tests folder contains tests for each of the tools and a test that checks whether your code conforms to the Python style guide (PEP 8) (file: test_lint.py)

  • The testing framework used is PyTest

  • Tests can be run with python setup.py test

Coding style conventions and code quality

  • Check your code style with prospector

  • You may need to run pip install .[dev] first to install the required dependencies

License

Copyright (c) 2019 The Hyve B.V.

The CSR to TranSMART loader is licensed under the MIT License. See the file LICENSE.
