Skip to main content

Data cleaning functions and pipelines for morphological profiles.

Project description

trommel

This is a collection of clean-up functions and small pipelines for morphological profiling.

A trommel is a revolving cylindrical sieve used for screening or sizing rock and ore, it helps separate the minerals from the waste. This tool aims to fulfill the same purpose for morphological profiling, and possibly many other high-throughput datasets.

Quick Start

import polars as pl
import polars.selectors as cs
from trommel.core import basic_cleanup

meta_selector = cs.by_dtype(pl.String)
profiles = pl.scan_parquet("https://cellpainting-gallery.s3.amazonaws.com/cpg0016-jump-assembled/source_all/workspace/profiles/jump-profiling-recipe_2024_a917fa7/CRISPR/profiles_wellpos_cc_var_mad_outlier_featselect_sphering_harmony_PCA_corrected/profiles.parquet", n_rows=100).collect()

"""
shape: (100, 3_677)
┌─────────────────┬────────────────┬───────────────┬───┬─────────────────┬─────────────────┬────────────────┐
│ Metadata_Source ┆ Metadata_Plate ┆ Metadata_Well ┆ … ┆ Nuclei_Texture_ ┆ Nuclei_Texture_ ┆ Nuclei_Texture │
│ ---             ┆ ---            ┆ ---           ┆   ┆ Variance_RNA_5_ ┆ Variance_RNA_5_ ┆ _Variance_RNA_ │
│ str             ┆ str            ┆ str           ┆   ┆ …               ┆ …               ┆ 5_…            │
│                 ┆                ┆               ┆   ┆ ---             ┆ ---             ┆ ---            │
│                 ┆                ┆               ┆   ┆ f32             ┆ f32             ┆ f32            │
╞═════════════════╪════════════════╪═══════════════╪═══╪═════════════════╪═════════════════╪════════════════╡
│ source_13       ┆ CP-CC9-R1-01   ┆ A02           ┆ … ┆ 6.449576        ┆ 6.233986        ┆ 6.447817       │
│ source_13       ┆ CP-CC9-R1-01   ┆ A03           ┆ … ┆ 7.359348        ┆ 7.119856        ┆ 7.359909       │
│ source_13       ┆ CP-CC9-R1-01   ┆ A04           ┆ … ┆ 9.2922          ┆ 8.964124        ┆ 9.255968       │
│ source_13       ┆ CP-CC9-R1-01   ┆ A05           ┆ … ┆ 8.243299        ┆ 7.974916        ┆ 8.25239        │
│ source_13       ┆ CP-CC9-R1-01   ┆ A06           ┆ … ┆ 10.728938       ┆ 10.346541       ┆ 10.691082      │
│ …               ┆ …              ┆ …             ┆ … ┆ …               ┆ …               ┆ …              │
│ source_13       ┆ CP-CC9-R1-01   ┆ E10           ┆ … ┆ 6.414464        ┆ 6.199627        ┆ 6.40822        │
│ source_13       ┆ CP-CC9-R1-01   ┆ E11           ┆ … ┆ 5.445997        ┆ 5.277493        ┆ 5.447682       │
│ source_13       ┆ CP-CC9-R1-01   ┆ E12           ┆ … ┆ 5.501099        ┆ 5.344191        ┆ 5.507084       │
│ source_13       ┆ CP-CC9-R1-01   ┆ E13           ┆ … ┆ 7.312291        ┆ 7.087072        ┆ 7.332959       │
│ source_13       ┆ CP-CC9-R1-01   ┆ E14           ┆ … ┆ 6.326293        ┆ 6.127594        ┆ 6.340693       │
└─────────────────┴────────────────┴───────────────┴───┴─────────────────┴─────────────────┴────────────────┘
"""

cleanup = basic_cleanup(profiles, meta_selector = meta_selector)

"""
shape: (100, 554)
┌─────────────────┬────────────────┬───────────────┬───┬─────────────────┬─────────────────┬────────────────┐
│ Metadata_Source ┆ Metadata_Plate ┆ Metadata_Well ┆ … ┆ Nuclei_Texture_ ┆ Nuclei_Texture_ ┆ Nuclei_Texture │
│ ---             ┆ ---            ┆ ---           ┆   ┆ SumAverage_ER_3 ┆ SumVariance_DNA ┆ _SumVariance_M │
│ str             ┆ str            ┆ str           ┆   ┆ …               ┆ …               ┆ it…            │
│                 ┆                ┆               ┆   ┆ ---             ┆ ---             ┆ ---            │
│                 ┆                ┆               ┆   ┆ f32             ┆ f32             ┆ f32            │
╞═════════════════╪════════════════╪═══════════════╪═══╪═════════════════╪═════════════════╪════════════════╡
│ source_13       ┆ CP-CC9-R1-01   ┆ A02           ┆ … ┆ -0.577417       ┆ -0.138683       ┆ 17.711971      │
│ source_13       ┆ CP-CC9-R1-01   ┆ A03           ┆ … ┆ 0.259718        ┆ -0.028451       ┆ 7.942208       │
│ source_13       ┆ CP-CC9-R1-01   ┆ A04           ┆ … ┆ 0.682264        ┆ -0.001948       ┆ 3.534184       │
│ source_13       ┆ CP-CC9-R1-01   ┆ A05           ┆ … ┆ 0.305402        ┆ -0.032553       ┆ 5.978285       │
│ source_13       ┆ CP-CC9-R1-01   ┆ A06           ┆ … ┆ 0.932589        ┆ 0.086287        ┆ 14.690929      │
│ …               ┆ …              ┆ …             ┆ … ┆ …               ┆ …               ┆ …              │
│ source_13       ┆ CP-CC9-R1-01   ┆ E10           ┆ … ┆ 0.063227        ┆ 0.024047        ┆ -0.151976      │
│ source_13       ┆ CP-CC9-R1-01   ┆ E11           ┆ … ┆ -0.168455       ┆ 0.045889        ┆ -0.012995      │
│ source_13       ┆ CP-CC9-R1-01   ┆ E12           ┆ … ┆ -0.071743       ┆ 0.09979         ┆ -3.231946      │
│ source_13       ┆ CP-CC9-R1-01   ┆ E13           ┆ … ┆ -0.124911       ┆ 0.163038        ┆ 4.087936       │
│ source_13       ┆ CP-CC9-R1-01   ┆ E14           ┆ … ┆ -0.806152       ┆ 0.055316        ┆ -2.082987      │
└─────────────────┴────────────────┴───────────────┴───┴─────────────────┴─────────────────┴────────────────┘
"""

The basic cleanup steps are:

  1. Remove NaNs
  2. Calculate Robust Mean Average Deviation (following pycytominer's implementation)
  3. Remove outliers
  4. Remove redundant (highly correlated) features

Installation

Pip

Aimed towards users of the functions/pipelines.

pip install trommel

Poetry

Predominantly for developers who want to edit the code.

git clone git@github.com:broadinstitute/monorepo.git
cd monorepo/libs/trommel

poetry install

Additional information

Related projects

  • pycytominer: The closest match, but with more complexity imbued and many of the math functions are pandas-centric.
  • EFAAR: Much simpler implementation, but it commits similar hard-coding of cellprofiler features. We instead try to be agnostic to the way of the selectors, but we do commit to using polars.

Future features

  • Full separation of data + metadata
  • Additional processing and clean-up functions
  • Additional default pipelines

Project details


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

trommel-0.0.1.tar.gz (6.1 kB view details)

Uploaded Source

Built Distribution

If you're not sure about the file name format, learn more about wheel file names.

trommel-0.0.1-py3-none-any.whl (6.0 kB view details)

Uploaded Python 3

File details

Details for the file trommel-0.0.1.tar.gz.

File metadata

  • Download URL: trommel-0.0.1.tar.gz
  • Upload date:
  • Size: 6.1 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: uv/0.4.30

File hashes

Hashes for trommel-0.0.1.tar.gz
Algorithm Hash digest
SHA256 d9143c50b9dbf22c36c76134f7490675cdc37a4110b98854a46ee4e9e6fe8548
MD5 600195eb8cbb2c8fd79470407c37a614
BLAKE2b-256 a25df5e5c8df83186e7926ce00bd71158a833c08af4cef4c27293d43e3dbbc08

See more details on using hashes here.

File details

Details for the file trommel-0.0.1-py3-none-any.whl.

File metadata

  • Download URL: trommel-0.0.1-py3-none-any.whl
  • Upload date:
  • Size: 6.0 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: uv/0.4.30

File hashes

Hashes for trommel-0.0.1-py3-none-any.whl
Algorithm Hash digest
SHA256 98c7d24afbb51611f7ec3a8a175de0de98acfff71aa876b37d26872dabd4f9f2
MD5 2ff9c7e812bf397f77eeb46e04d12cca
BLAKE2b-256 8235842411978074aff1a255fcf1bac76176f1f526a489941a95af40cace9c39

See more details on using hashes here.

Supported by

AWS Cloud computing and Security Sponsor Datadog Monitoring Depot Continuous Integration Fastly CDN Google Download Analytics Pingdom Monitoring Sentry Error logging StatusPage Status page