Data cleaning functions and pipelines for morphological profiles.

Project description

trommel

This is a collection of clean-up functions and small pipelines for morphological profiling.

A trommel is a revolving cylindrical sieve used for screening or sizing rock and ore; it helps separate the minerals from the waste. This tool aims to serve the same purpose for morphological profiling, and possibly for many other high-throughput datasets.

Quick Start

import polars as pl
import polars.selectors as cs
from trommel.core import basic_cleanup

# Treat every string-typed column as metadata.
meta_selector = cs.by_dtype(pl.String)
# Lazily scan the remote parquet file and materialise only the first 100 rows.
profiles = pl.scan_parquet(
    "https://cellpainting-gallery.s3.amazonaws.com/cpg0016-jump-assembled/source_all/workspace/profiles/jump-profiling-recipe_2024_a917fa7/CRISPR/profiles_wellpos_cc_var_mad_outlier_featselect_sphering_harmony_PCA_corrected/profiles.parquet",
    n_rows=100,
).collect()

"""
shape: (100, 3_677)
┌─────────────────┬────────────────┬───────────────┬───┬─────────────────┬─────────────────┬────────────────┐
│ Metadata_Source ┆ Metadata_Plate ┆ Metadata_Well ┆ … ┆ Nuclei_Texture_ ┆ Nuclei_Texture_ ┆ Nuclei_Texture │
│ ---             ┆ ---            ┆ ---           ┆   ┆ Variance_RNA_5_ ┆ Variance_RNA_5_ ┆ _Variance_RNA_ │
│ str             ┆ str            ┆ str           ┆   ┆ …               ┆ …               ┆ 5_…            │
│                 ┆                ┆               ┆   ┆ ---             ┆ ---             ┆ ---            │
│                 ┆                ┆               ┆   ┆ f32             ┆ f32             ┆ f32            │
╞═════════════════╪════════════════╪═══════════════╪═══╪═════════════════╪═════════════════╪════════════════╡
│ source_13       ┆ CP-CC9-R1-01   ┆ A02           ┆ … ┆ 6.449576        ┆ 6.233986        ┆ 6.447817       │
│ source_13       ┆ CP-CC9-R1-01   ┆ A03           ┆ … ┆ 7.359348        ┆ 7.119856        ┆ 7.359909       │
│ source_13       ┆ CP-CC9-R1-01   ┆ A04           ┆ … ┆ 9.2922          ┆ 8.964124        ┆ 9.255968       │
│ source_13       ┆ CP-CC9-R1-01   ┆ A05           ┆ … ┆ 8.243299        ┆ 7.974916        ┆ 8.25239        │
│ source_13       ┆ CP-CC9-R1-01   ┆ A06           ┆ … ┆ 10.728938       ┆ 10.346541       ┆ 10.691082      │
│ …               ┆ …              ┆ …             ┆ … ┆ …               ┆ …               ┆ …              │
│ source_13       ┆ CP-CC9-R1-01   ┆ E10           ┆ … ┆ 6.414464        ┆ 6.199627        ┆ 6.40822        │
│ source_13       ┆ CP-CC9-R1-01   ┆ E11           ┆ … ┆ 5.445997        ┆ 5.277493        ┆ 5.447682       │
│ source_13       ┆ CP-CC9-R1-01   ┆ E12           ┆ … ┆ 5.501099        ┆ 5.344191        ┆ 5.507084       │
│ source_13       ┆ CP-CC9-R1-01   ┆ E13           ┆ … ┆ 7.312291        ┆ 7.087072        ┆ 7.332959       │
│ source_13       ┆ CP-CC9-R1-01   ┆ E14           ┆ … ┆ 6.326293        ┆ 6.127594        ┆ 6.340693       │
└─────────────────┴────────────────┴───────────────┴───┴─────────────────┴─────────────────┴────────────────┘
"""

# Run the default clean-up pipeline, passing the metadata selector defined above.
cleanup = basic_cleanup(profiles, meta_selector=meta_selector)

"""
shape: (100, 554)
┌─────────────────┬────────────────┬───────────────┬───┬─────────────────┬─────────────────┬────────────────┐
│ Metadata_Source ┆ Metadata_Plate ┆ Metadata_Well ┆ … ┆ Nuclei_Texture_ ┆ Nuclei_Texture_ ┆ Nuclei_Texture │
│ ---             ┆ ---            ┆ ---           ┆   ┆ SumAverage_ER_3 ┆ SumVariance_DNA ┆ _SumVariance_M │
│ str             ┆ str            ┆ str           ┆   ┆ …               ┆ …               ┆ it…            │
│                 ┆                ┆               ┆   ┆ ---             ┆ ---             ┆ ---            │
│                 ┆                ┆               ┆   ┆ f32             ┆ f32             ┆ f32            │
╞═════════════════╪════════════════╪═══════════════╪═══╪═════════════════╪═════════════════╪════════════════╡
│ source_13       ┆ CP-CC9-R1-01   ┆ A02           ┆ … ┆ -0.577417       ┆ -0.138683       ┆ 17.711971      │
│ source_13       ┆ CP-CC9-R1-01   ┆ A03           ┆ … ┆ 0.259718        ┆ -0.028451       ┆ 7.942208       │
│ source_13       ┆ CP-CC9-R1-01   ┆ A04           ┆ … ┆ 0.682264        ┆ -0.001948       ┆ 3.534184       │
│ source_13       ┆ CP-CC9-R1-01   ┆ A05           ┆ … ┆ 0.305402        ┆ -0.032553       ┆ 5.978285       │
│ source_13       ┆ CP-CC9-R1-01   ┆ A06           ┆ … ┆ 0.932589        ┆ 0.086287        ┆ 14.690929      │
│ …               ┆ …              ┆ …             ┆ … ┆ …               ┆ …               ┆ …              │
│ source_13       ┆ CP-CC9-R1-01   ┆ E10           ┆ … ┆ 0.063227        ┆ 0.024047        ┆ -0.151976      │
│ source_13       ┆ CP-CC9-R1-01   ┆ E11           ┆ … ┆ -0.168455       ┆ 0.045889        ┆ -0.012995      │
│ source_13       ┆ CP-CC9-R1-01   ┆ E12           ┆ … ┆ -0.071743       ┆ 0.09979         ┆ -3.231946      │
│ source_13       ┆ CP-CC9-R1-01   ┆ E13           ┆ … ┆ -0.124911       ┆ 0.163038        ┆ 4.087936       │
│ source_13       ┆ CP-CC9-R1-01   ┆ E14           ┆ … ┆ -0.806152       ┆ 0.055316        ┆ -2.082987      │
└─────────────────┴────────────────┴───────────────┴───┴─────────────────┴─────────────────┴────────────────┘
"""

The basic cleanup steps are:

  1. Remove NaNs
  2. Calculate the robust median absolute deviation (MAD), following pycytominer's implementation
  3. Remove outliers
  4. Remove redundant (highly correlated) features
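
The four steps above can be sketched in plain Python to make the order of operations concrete. This is a minimal illustration and not trommel's actual code: the function names, the `epsilon` value, the outlier `cutoff`, the correlation `threshold`, and the choice to drop whole feature columns are all assumptions made for the example.

```python
import math
from statistics import median


def drop_nan_features(features):
    """Step 1: discard any feature column that contains a NaN (assumed rule)."""
    return {
        name: values
        for name, values in features.items()
        if not any(math.isnan(v) for v in values)
    }


def robust_mad(values, epsilon=1e-18):
    """Step 2: centre on the median and divide by the scaled MAD.

    1.4826 makes the MAD comparable to a standard deviation for normal data;
    epsilon is an assumed small constant that avoids division by zero.
    """
    centre = median(values)
    mad = 1.4826 * median(abs(v - centre) for v in values)
    return [(v - centre) / (mad + epsilon) for v in values]


def drop_outlier_features(features, cutoff=500.0):
    """Step 3: discard features with any absolute value above cutoff (assumed rule)."""
    return {
        name: values
        for name, values in features.items()
        if max(abs(v) for v in values) <= cutoff
    }


def pearson(x, y):
    """Pearson correlation; returns 0.0 when either input is constant."""
    mean_x, mean_y = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    norm_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    norm_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (norm_x * norm_y) if norm_x and norm_y else 0.0


def drop_redundant_features(features, threshold=0.9):
    """Step 4: keep a feature only if it is not highly correlated with a kept one."""
    kept = {}
    for name, values in features.items():
        if all(abs(pearson(values, other)) < threshold for other in kept.values()):
            kept[name] = values
    return kept


def sketch_cleanup(features):
    """Chain the four steps in the order listed above."""
    features = drop_nan_features(features)
    features = {name: robust_mad(values) for name, values in features.items()}
    features = drop_outlier_features(features)
    return drop_redundant_features(features)


# Toy feature columns (made-up names and values).
toy = {
    "f_a": [1.0, 2.0, 3.0, 4.0, 5.0],
    "f_b": [2.0, 4.0, 6.0, 8.0, 10.0],         # perfectly correlated with f_a
    "f_c": [1.0, float("nan"), 3.0, 4.0, 5.0],  # contains a NaN
    "f_d": [5.0, 1.0, 4.0, 2.0, 3.0],
    "f_e": [0.0, 0.0, 0.0, 0.0, 1000.0],        # one extreme value
}
cleaned = sketch_cleanup(toy)
print(list(cleaned))  # ['f_a', 'f_d']
```

In this toy run `f_c` is dropped for its NaN, `f_e` for its extreme value after scaling, and `f_b` because it duplicates `f_a`. The real pipeline works on polars data frames and may use different rules and thresholds.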

Installation

Pip

Aimed towards users of the functions/pipelines.

pip install trommel

Poetry

Predominantly for developers who want to edit the code.

git clone git@github.com:broadinstitute/monorepo.git
cd monorepo/libs/trommel

poetry install

Additional information

Related projects

  • pycytominer: The closest match, but it is more complex, and many of its math functions are pandas-centric.
  • EFAAR: A much simpler implementation, but it similarly hard-codes CellProfiler features. We instead try to stay agnostic about how columns are chosen by relying on selectors, though we do commit to using polars.

Future features

  • Full separation of data + metadata
  • Additional processing and clean-up functions
  • Additional default pipelines

