Skip to main content

Data cleaning functions and pipelines for morphological profiles.

Project description

trommel

This is a collection of clean-up functions and small pipelines for morphological profiling.

A trommel is a revolving cylindrical sieve used for screening or sizing rock and ore, it helps separate the minerals from the waste. This tool aims to fulfill the same purpose for morphological profiling, and possibly many other high-throughput datasets.

Quick Start

import polars as pl
import polars.selectors as cs
from trommel.core import basic_cleanup

meta_selector = cs.by_dtype(pl.String)
profiles = pl.scan_parquet("https://cellpainting-gallery.s3.amazonaws.com/cpg0016-jump-assembled/source_all/workspace/profiles/jump-profiling-recipe_2024_a917fa7/CRISPR/profiles_wellpos_cc_var_mad_outlier_featselect_sphering_harmony_PCA_corrected/profiles.parquet", n_rows=100).collect()

"""
shape: (100, 3_677)
┌─────────────────┬────────────────┬───────────────┬───┬─────────────────┬─────────────────┬────────────────┐
│ Metadata_Source ┆ Metadata_Plate ┆ Metadata_Well ┆ … ┆ Nuclei_Texture_ ┆ Nuclei_Texture_ ┆ Nuclei_Texture │
│ ---             ┆ ---            ┆ ---           ┆   ┆ Variance_RNA_5_ ┆ Variance_RNA_5_ ┆ _Variance_RNA_ │
│ str             ┆ str            ┆ str           ┆   ┆ …               ┆ …               ┆ 5_…            │
│                 ┆                ┆               ┆   ┆ ---             ┆ ---             ┆ ---            │
│                 ┆                ┆               ┆   ┆ f32             ┆ f32             ┆ f32            │
╞═════════════════╪════════════════╪═══════════════╪═══╪═════════════════╪═════════════════╪════════════════╡
│ source_13       ┆ CP-CC9-R1-01   ┆ A02           ┆ … ┆ 6.449576        ┆ 6.233986        ┆ 6.447817       │
│ source_13       ┆ CP-CC9-R1-01   ┆ A03           ┆ … ┆ 7.359348        ┆ 7.119856        ┆ 7.359909       │
│ source_13       ┆ CP-CC9-R1-01   ┆ A04           ┆ … ┆ 9.2922          ┆ 8.964124        ┆ 9.255968       │
│ source_13       ┆ CP-CC9-R1-01   ┆ A05           ┆ … ┆ 8.243299        ┆ 7.974916        ┆ 8.25239        │
│ source_13       ┆ CP-CC9-R1-01   ┆ A06           ┆ … ┆ 10.728938       ┆ 10.346541       ┆ 10.691082      │
│ …               ┆ …              ┆ …             ┆ … ┆ …               ┆ …               ┆ …              │
│ source_13       ┆ CP-CC9-R1-01   ┆ E10           ┆ … ┆ 6.414464        ┆ 6.199627        ┆ 6.40822        │
│ source_13       ┆ CP-CC9-R1-01   ┆ E11           ┆ … ┆ 5.445997        ┆ 5.277493        ┆ 5.447682       │
│ source_13       ┆ CP-CC9-R1-01   ┆ E12           ┆ … ┆ 5.501099        ┆ 5.344191        ┆ 5.507084       │
│ source_13       ┆ CP-CC9-R1-01   ┆ E13           ┆ … ┆ 7.312291        ┆ 7.087072        ┆ 7.332959       │
│ source_13       ┆ CP-CC9-R1-01   ┆ E14           ┆ … ┆ 6.326293        ┆ 6.127594        ┆ 6.340693       │
└─────────────────┴────────────────┴───────────────┴───┴─────────────────┴─────────────────┴────────────────┘
"""

cleanup = basic_cleanup(profiles, meta_selector = meta_selector)

"""
shape: (100, 554)
┌─────────────────┬────────────────┬───────────────┬───┬─────────────────┬─────────────────┬────────────────┐
│ Metadata_Source ┆ Metadata_Plate ┆ Metadata_Well ┆ … ┆ Nuclei_Texture_ ┆ Nuclei_Texture_ ┆ Nuclei_Texture │
│ ---             ┆ ---            ┆ ---           ┆   ┆ SumAverage_ER_3 ┆ SumVariance_DNA ┆ _SumVariance_M │
│ str             ┆ str            ┆ str           ┆   ┆ …               ┆ …               ┆ it…            │
│                 ┆                ┆               ┆   ┆ ---             ┆ ---             ┆ ---            │
│                 ┆                ┆               ┆   ┆ f32             ┆ f32             ┆ f32            │
╞═════════════════╪════════════════╪═══════════════╪═══╪═════════════════╪═════════════════╪════════════════╡
│ source_13       ┆ CP-CC9-R1-01   ┆ A02           ┆ … ┆ -0.577417       ┆ -0.138683       ┆ 17.711971      │
│ source_13       ┆ CP-CC9-R1-01   ┆ A03           ┆ … ┆ 0.259718        ┆ -0.028451       ┆ 7.942208       │
│ source_13       ┆ CP-CC9-R1-01   ┆ A04           ┆ … ┆ 0.682264        ┆ -0.001948       ┆ 3.534184       │
│ source_13       ┆ CP-CC9-R1-01   ┆ A05           ┆ … ┆ 0.305402        ┆ -0.032553       ┆ 5.978285       │
│ source_13       ┆ CP-CC9-R1-01   ┆ A06           ┆ … ┆ 0.932589        ┆ 0.086287        ┆ 14.690929      │
│ …               ┆ …              ┆ …             ┆ … ┆ …               ┆ …               ┆ …              │
│ source_13       ┆ CP-CC9-R1-01   ┆ E10           ┆ … ┆ 0.063227        ┆ 0.024047        ┆ -0.151976      │
│ source_13       ┆ CP-CC9-R1-01   ┆ E11           ┆ … ┆ -0.168455       ┆ 0.045889        ┆ -0.012995      │
│ source_13       ┆ CP-CC9-R1-01   ┆ E12           ┆ … ┆ -0.071743       ┆ 0.09979         ┆ -3.231946      │
│ source_13       ┆ CP-CC9-R1-01   ┆ E13           ┆ … ┆ -0.124911       ┆ 0.163038        ┆ 4.087936       │
│ source_13       ┆ CP-CC9-R1-01   ┆ E14           ┆ … ┆ -0.806152       ┆ 0.055316        ┆ -2.082987      │
└─────────────────┴────────────────┴───────────────┴───┴─────────────────┴─────────────────┴────────────────┘
"""

The basic cleanup steps are:

  1. Remove NaNs
  2. Calculate Robust Mean Average Deviation (following pycytominer's implementation)
  3. Remove outliers
  4. Remove redundant (highly correlated) features

Installation

Pip

Aimed towards users of the functions/pipelines.

pip install trommel

Poetry

Predominantly for developers who want to edit the code.

git clone git@github.com:broadinstitute/monorepo.git
cd monorepo/libs/trommel

poetry install

Additional information

Related projects

  • pycytominer: The closest match, but with more complexity imbued and many of the math functions are pandas-centric.
  • EFAAR: Much simpler implementation, but it commits similar hard-coding of cellprofiler features. We instead try to be agnostic to the way of the selectors, but we do commit to using polars.

Future features

  • Full separation of data + metadata
  • Additional processing and clean-up functions
  • Additional default pipelines

Project details


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

trommel-0.1.5.tar.gz (6.1 kB view details)

Uploaded Source

Built Distribution

If you're not sure about the file name format, learn more about wheel file names.

trommel-0.1.5-py3-none-any.whl (6.0 kB view details)

Uploaded Python 3

File details

Details for the file trommel-0.1.5.tar.gz.

File metadata

  • Download URL: trommel-0.1.5.tar.gz
  • Upload date:
  • Size: 6.1 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: uv/0.4.30

File hashes

Hashes for trommel-0.1.5.tar.gz
Algorithm Hash digest
SHA256 886c7fb39f10163399180b9dcb496f1065d2d1d906a1973cd3e5af01b41e44c3
MD5 20dcf90ba4e6bb749c193bf653012cbe
BLAKE2b-256 7e50b0f96dc5c7305fad892b598cb9bb476c54114fe1bb39c66c97f43217b390

See more details on using hashes here.

File details

Details for the file trommel-0.1.5-py3-none-any.whl.

File metadata

  • Download URL: trommel-0.1.5-py3-none-any.whl
  • Upload date:
  • Size: 6.0 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: uv/0.4.30

File hashes

Hashes for trommel-0.1.5-py3-none-any.whl
Algorithm Hash digest
SHA256 9e80095ca129215d68ee6c2204c803291cb164e4b138dc922d356d018244f2a2
MD5 d763afa5d10ca5997c38a2755014de4c
BLAKE2b-256 92bfba826085c0c26a86aa0def6a7016085b7f62fbb6fb762d01fe9f37c664e4

See more details on using hashes here.

Supported by

AWS Cloud computing and Security Sponsor Datadog Monitoring Depot Continuous Integration Fastly CDN Google Download Analytics Pingdom Monitoring Sentry Error logging StatusPage Status page