
Methods to help track the scripts and datafiles in a project.


Datatracker

Datatracker is a basic logging package for Python that keeps track of the files and code within a project. Each script is logged as an entry, and its input and output data files are recorded in order. Datatracker manages versioning of both files and scripts and can identify the most up-to-date version of each.

This Python package is still in alpha, and I may make breaking changes to both the UI and the file format.

Installation

To install, run the following command:

pip install git+ssh://git@github.com/TarjinderSingh/datatracker

Usage

New entries

For an entry:

  1. tag is a unique identifier for the script and should make its general purpose and output clear (i.e., Merge on its own is not what we want to see here).
  2. description is a one- or two-sentence equivalent of a Git commit message that thoroughly describes the general purpose and output of the script.
  3. category indicates the general step of the analysis that the script belongs to.
  4. module is the sub-category that the script belongs to. Type category_template in interactive Python to see the appropriate categories and modules.

For an InputFile or OutputFile:

  1. tag is a unique identifier for the file and should make the file's general purpose clear (i.e., Merge on its own is not what we want to see here).
  2. description is a one- or two-sentence equivalent of a Git commit message that thoroughly describes the general purpose of the file.

import os

from datatracker import *

tr = Tracker()

# Set the VERSION environment variable
os.environ['VERSION'] = '0.1.0'

# Log the script as an entry
entry = Entry(tag='filter-common-variants',
              description='Filtering common variants in new GWAS data set.',
              category='Processing',
              module='Variant QC')

# Record the input file
infile = entry.add(
    InputFile(tag='raw-plink-file',
              path='gs://bucket/raw-plink-file.bed',
              description='Raw PLINK file.'))

# Record the output file (written to its own path, not the input path)
outfile = entry.add(
    OutputFile(tag='filt-plink-file',
               path='gs://bucket/filt-plink-file.bed',
               description='Filtered PLINK file.'))

# Save the entry
tr.save(entry)
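
The tag guidelines above can be checked mechanically before an entry is saved. The following is a minimal sketch and not part of Datatracker: check_tag and GENERIC_TAGS are hypothetical names, and the lowercase, hyphen-separated convention is only an assumption inferred from the example tags (filter-common-variants, raw-plink-file).

```python
import re

# Hypothetical set of words that are too generic to stand alone as a tag.
GENERIC_TAGS = {"merge", "filter", "process", "run", "test", "tmp", "data"}


def check_tag(tag):
    """Return a list of problems with a proposed tag (empty if it looks fine)."""
    problems = []
    # Assumed convention: lowercase alphanumeric words joined by single hyphens.
    if not re.fullmatch(r"[a-z0-9]+(-[a-z0-9]+)*", tag):
        problems.append("use lowercase words separated by hyphens")
    # A single word such as "Merge" says nothing about purpose or output.
    if tag.lower() in GENERIC_TAGS or "-" not in tag:
        problems.append("too generic: say what the script does and produces")
    return problems


print(check_tag("Merge"))
print(check_tag("filter-common-variants"))
```

A tag such as Merge is flagged on both counts, while filter-common-variants passes. The word list and the pattern are placeholders to adapt to a project's own naming rules.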

View existing entries

from datatracker import *
tr = Tracker()

tr.table

Use existing entries for pipeline

infile = entry.add(InputFile(entry_tag='filter-common-variants', tag='raw-plink-file', database=tr))

Filter and remove

# filter to entry
tr.filter(tr.entry.tag_version == 'import-array_0.1.6')

# remove entry
tr.remove(tr.entry.tag_version == 'import-array_0.1.6')
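
The tag_version value in the example above looks like an entry tag joined to a version string by an underscore. Assuming that convention holds (it is inferred from 'import-array_0.1.6', not confirmed by the package), the sketch below shows how such strings could be split and compared to find the most recent version of a tag. The helpers split_tag_version and latest are hypothetical and do not use Datatracker itself.

```python
def split_tag_version(tag_version):
    """Split 'import-array_0.1.6' into ('import-array', (0, 1, 6))."""
    # Split on the last underscore so that tags may themselves contain underscores.
    tag, _, version = tag_version.rpartition("_")
    # Compare versions numerically, so that 0.1.10 sorts after 0.1.6.
    return tag, tuple(int(part) for part in version.split("."))


def latest(tag_versions, tag):
    """Return the tag_version string with the highest version for the given tag."""
    candidates = [tv for tv in tag_versions if split_tag_version(tv)[0] == tag]
    return max(candidates, key=lambda tv: split_tag_version(tv)[1])


logged = [
    "import-array_0.1.6",
    "import-array_0.1.10",
    "filter-common-variants_0.1.0",
]
print(latest(logged, "import-array"))  # import-array_0.1.10
```

Comparing versions as integer tuples rather than as strings avoids ranking 0.1.6 above 0.1.10. The sketch assumes purely numeric, dot-separated versions and raises ValueError if no entry matches the tag.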

Pandas and Excel

df = tr.explode()
df = tr.explode('filt-plink-file')

df = tr.to_pandas()
df = tr.table

df.to_excel('spreadsheet.xlsx')

Data artifacts

infile = entry.add(InputFile(path='gs://checkpoint-cache/tmp/1.bed'))

License

MIT License (see repository)

Maintainer

TJ Singh @ tsingh@broadinstitute.org
