A data layer for single-cell, spatial and bulk immunomics, providing a unified structure for immune receptor repertoire data.

immundata-py

An efficient data framework for single-cell and bulk immune repertoire datasets of practically any scale.

Think AnnData, SingleCellExperiment or a Seurat object, but for AIRR data, with full support for out-of-memory datasets and easier access to additional receptor data provided by the user, such as gene expression from single-cell transcriptomics files, spatial data coordinates, or antigen specificity data. The goal of immundata is to standardize I/O and basic data manipulation, following the AIRR Community Data Standard for immune repertoire representation. Its primary users are bioinformatics developers and data engineers who don't want to write an abstraction layer over the data from scratch. Biologists and medical scientists could benefit as well, provided they learn the code syntax. However, the overall philosophy is that immune repertoire analysis tools such as immunarch should cover more than 80% of use cases without the end user ever calling immundata explicitly.

Installation

poetry install

Usage

See notebooks/immundata-experiments.ipynb.

Challenges and use cases

Plans

Benchmark before each release?

  1. bulk, single chain
  2. single-cell, single chain
  3. Release 0.1.0
  4. bulk, paired chain
  5. single-cell, paired chain
  6. Release 0.2.0
  7. add new receptor metadata
  8. write some metadata back to the single cell using barcodes
  9. Release 0.3.0
  10. Decide the priority later: multiple metadata sources, like single-cell + spatial
  11. Decide the priority later: merge multiple immundatas, so support for multiple data sources
  12. Decide the priority later: optimizations

I/O

Processing input files

Some input files may require preprocessing, such as adding a sample_id column.
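A minimal sketch of such a preprocessing step, using only the standard library. The helper name `add_constant_column` and the example column names are assumptions made for illustration; they are not part of immundata.

```python
import csv
import io


def add_constant_column(rows, name, value):
    """Lazily add the same value under `name` to every row (hypothetical helper)."""
    for row in rows:
        yield {**row, name: value}


# A tiny in-memory file standing in for a real repertoire export.
raw = io.StringIO("cdr3_aa,v_call\nCASSLGTDTQYF,TRBV5-1\nCASSPGQGYEQYF,TRBV7-9\n")
rows = list(add_constant_column(csv.DictReader(raw), "sample_id", "donor1"))
```

Because the helper is a generator, it can be chained with other per-row steps without loading the whole file first.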

Overall, saving to Parquet could speed up subsequent computations compared to reading from CSV.

Data management 1: dump raw data to Parquet for faster future analysis. This is either one huge Parquet file with all the receptors or several Parquet files.

Data management 2: repertoire files with built receptors.

What is the cost and frequency of building a repertoire with new receptor signature vs. building with a new sample signature?

  • receptor signature - costly, slow, rare
  • sample signature - ??? Do we really need to re-create samples often? Either way, it should be much less costly

Receptor building:

  • build unique receptors
  • filter out / save non-coding
  • figure out how to process multiple chain sequence data
  • normalize V/D/J genes, remove/move to a separate column list of segments, leave only the important ones using some strategy (like take first)
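The receptor-building steps above can be sketched roughly as follows. This is an illustration under stated assumptions, not the immundata implementation: the column names (`cdr3_aa`, `v_call`, `j_call`), the "take first" gene strategy, and the convention that `*` or `_` in the amino acid sequence marks a non-coding receptor are all assumed here.

```python
from collections import Counter


def first_gene(call):
    """'Take first' strategy: keep only the first gene of a comma-separated list."""
    return call.split(",")[0].strip() if call else ""


def is_coding(cdr3_aa):
    """Assumed convention: '*' (stop codon) or '_' (frameshift) marks non-coding."""
    return bool(cdr3_aa) and "*" not in cdr3_aa and "_" not in cdr3_aa


def build_receptors(rows, signature=("cdr3_aa", "v_call")):
    """Count unique receptors by signature; set non-coding rows aside."""
    receptors = Counter()
    non_coding = []
    for row in rows:
        # Normalize gene calls before the signature is computed.
        row = dict(row, v_call=first_gene(row["v_call"]), j_call=first_gene(row["j_call"]))
        if not is_coding(row["cdr3_aa"]):
            non_coding.append(row)
            continue
        receptors[tuple(row[col] for col in signature)] += 1
    return receptors, non_coding


rows = [
    {"cdr3_aa": "CASSLGTDTQYF", "v_call": "TRBV5-1,TRBV5-4", "j_call": "TRBJ2-3"},
    {"cdr3_aa": "CASSLGTDTQYF", "v_call": "TRBV5-1", "j_call": "TRBJ2-3"},
    {"cdr3_aa": "CASS*GTDTQYF", "v_call": "TRBV7-9", "j_call": "TRBJ2-7"},
]
receptors, non_coding = build_receptors(rows)
```

Changing `signature` is what makes rebuilding expensive: every row has to be regrouped under the new key.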

Validate pre-input data / connect to external database as a source of truth instead of parquet

  1. raw data -> build receptors -> dump to parquet
  2. raw data -> dump to parquet[raw] -> build receptors -> dump to parquet[receptor model]
  3. database[raw] with some table -> build receptors -> dump to parquet[receptor model]
  4. database[receptor model]

The problem is that the source[receptor model] must follow the receptor model and the ImmunData format with multiple tables + allow scanning.

Data operations

Metadata storing

If we have a unique receptor pair, how do we store different metadata values coming from different samples? Terms used below:

  • mvalue: a metadata value, like gene expression or immunogenicity
  • uid: unique receptor ID
  • barcode: unique cell ID

Simple version:

  • table1: barcode - receptor - sample id - mvalue

Complex version:

  • table1: uid - receptor - one hot encoding for multiple samples
  • table2: barcode -> mvalues for this receptor from some sample
  • table3: barcode -> uid

In this case, immundata is a manager that accurately pre-filters the data and concatenates the necessary columns.
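To make the multi-table layout concrete, here is a small sketch using the standard-library `sqlite3` module. The table and column names are hypothetical, and for brevity a plain `sample_id` column on the cell table stands in for the one-hot sample encoding mentioned above.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript(
    """
    CREATE TABLE receptors (uid INTEGER PRIMARY KEY, cdr3_aa TEXT, v_call TEXT);
    CREATE TABLE cells (
        barcode TEXT PRIMARY KEY,
        uid INTEGER REFERENCES receptors(uid),
        sample_id TEXT
    );
    CREATE TABLE cell_values (
        barcode TEXT REFERENCES cells(barcode),
        name TEXT,
        mvalue REAL
    );
    """
)
con.executemany("INSERT INTO receptors VALUES (?, ?, ?)", [(1, "CASSLGTDTQYF", "TRBV5-1")])
con.executemany(
    "INSERT INTO cells VALUES (?, ?, ?)",
    [("AAAC-1", 1, "S1"), ("TTTG-1", 1, "S2")],
)
con.executemany(
    "INSERT INTO cell_values VALUES (?, ?, ?)",
    [("AAAC-1", "IL2", 2.0), ("TTTG-1", "IL2", 6.0)],
)

# The same receptor (uid 1) keeps a separate mvalue for each sample it came from.
rows = con.execute(
    """
    SELECT r.uid, c.sample_id, v.mvalue
    FROM receptors AS r
    JOIN cells AS c ON c.uid = r.uid
    JOIN cell_values AS v ON v.barcode = c.barcode
    WHERE v.name = 'IL2'
    ORDER BY c.sample_id
    """
).fetchall()
```

The joins are the "pre-filter and concatenate" work that the manager would hide from the user.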

This is virtually a task of database normalization. https://www.freecodecamp.org/news/database-normalization-1nf-2nf-3nf-table-examples/

  • The First Normal Form – 1NF

    • For a table to be in the first normal form, it must meet the following criteria:
      • a single cell must not hold more than one value (atomicity)
      • there must be a primary key for identification
      • no duplicated rows or columns
      • each column must have only one value for each row in the table
  • The Second Normal Form – 2NF

    • The 1NF only eliminates repeating groups, not redundancy. That’s why there is 2NF.
    • A table is said to be in 2NF if it meets the following criteria:
      • it’s already in 1NF
      • has no partial dependency. That is, every non-key attribute depends on the whole primary key, not on just a part of it.
  • The Third Normal Form – 3NF

    • When a table is in 2NF, it eliminates repeating groups and partial dependencies, but it does not eliminate transitive dependencies.
    • A transitive dependency means a non-prime attribute (an attribute that is not part of any candidate key) depends on another non-prime attribute.
    • This is what the third normal form (3NF) eliminates.
    • So, for a table to be in 3NF, it must:
      • be in 2NF
      • have no transitive dependency.

Metadata operations

Selecting samples (that is, all receptors from a sample) with a specific property looks like a slightly more complex filter, for example a filter_sample() that runs on samples only. But the regular filter can do that as well if receptors carry repertoire-level metadata. Moving from sample-level to receptor-level metadata makes our lives much easier. We should probably just store the metadata columns. In theory, they are stored automatically when we create a sample signature. Maybe signature + additional information?
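A small sketch of why receptor-level metadata simplifies things: once sample metadata is copied onto each receptor row, selecting samples is just an ordinary row filter. The helper names and the `therapy` column are made up for illustration.

```python
def attach_sample_metadata(receptors, sample_metadata):
    """Copy each sample's metadata columns onto its receptor rows."""
    return [{**row, **sample_metadata[row["sample_id"]]} for row in receptors]


def filter_rows(rows, predicate):
    """An ordinary row-level filter; no sample-specific variant is needed."""
    return [row for row in rows if predicate(row)]


receptors = [
    {"uid": 1, "sample_id": "S1"},
    {"uid": 2, "sample_id": "S1"},
    {"uid": 3, "sample_id": "S2"},
]
sample_metadata = {"S1": {"therapy": "anti-PD1"}, "S2": {"therapy": "control"}}

annotated = attach_sample_metadata(receptors, sample_metadata)
treated = filter_rows(annotated, lambda row: row["therapy"] == "anti-PD1")
```

The cost is redundancy: the metadata is repeated on every receptor row of a sample, which is the trade-off against the normalized layout discussed earlier.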

Computations on sample metadata

Select all samples that we encounter in specific repertoire group more than 3 times.
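One possible reading of this, sketched with the standard library: each record is one observation of a sample in a repertoire group, and we keep the samples observed more than three times in the chosen group. The record layout and the function name are assumptions.

```python
from collections import Counter


def frequent_samples(records, group, min_count=3):
    """Return sample IDs seen more than `min_count` times within `group`."""
    counts = Counter(r["sample_id"] for r in records if r["group"] == group)
    return sorted(sample for sample, n in counts.items() if n > min_count)


records = (
    [{"sample_id": "S1", "group": "responder"}] * 4
    + [{"sample_id": "S2", "group": "responder"}] * 2
    + [{"sample_id": "S3", "group": "control"}] * 5
)
selected = frequent_samples(records, "responder")
```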

How to process paired data

One big table with missing columns? Several tables? Tabular schema written to ImmunData.schema.receptor?

Caching

Run the analysis and cache the resulting data when assigning to a new ImmunData: immun_data_new = immun_data_old.filter(blah blah)

Or should we? Hmm. immun_data_new.cache()? immun_data_new.execute()?
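To illustrate the difference between the two options, here is a toy lazy table. `LazyTable` is a made-up class for this sketch, not the ImmunData API: `filter()` only records the operation, `execute()` runs the pipeline every time, and `cache()` runs it once and keeps the result.

```python
class LazyTable:
    """A toy immutable, lazy table: operations are recorded, not run."""

    def __init__(self, rows, steps=()):
        self._rows = rows
        self._steps = tuple(steps)
        self._cached = None

    def filter(self, predicate):
        # Return a new object; the old one stays valid and unchanged.
        return LazyTable(self._rows, self._steps + (predicate,))

    def execute(self):
        """Run all recorded filters and return the result without storing it."""
        rows = self._rows
        for predicate in self._steps:
            rows = [row for row in rows if predicate(row)]
        return rows

    def cache(self):
        """Run the pipeline once and reuse the stored result on later calls."""
        if self._cached is None:
            self._cached = self.execute()
        return self._cached


old = LazyTable([{"count": 1}, {"count": 5}, {"count": 9}])
new = old.filter(lambda r: r["count"] > 2).filter(lambda r: r["count"] < 9)
```

Here `new.execute()` recomputes on each call, while repeated `new.cache()` calls return the same stored list. Whether assignment alone should trigger caching is exactly the open question above.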

Modalities

Writing back after analysis

filter clonotypes -> do some analysis like hyperexpansion -> write back the metadata that some clones are expanded. Not sure how to do that because we don't save the UIDs. Or ARE we??? unique_receptors (or just receptors in the case of a public repertoire) can return all the UIDs, and we can have a function that writes this column back.

New problem then: the write is lazy. Do we force caching? Do we wait? What if the user removes this variable? [!!!]
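Setting the laziness question aside, the write-back step itself could look roughly like this. The sketch assumes a barcode -> uid mapping is available; `write_back`, the `expanded` column, and the expansion threshold of two cells are all hypothetical.

```python
from collections import Counter


def write_back(cells, receptor_flags, column):
    """Copy a per-receptor flag onto every cell (barcode) carrying that receptor."""
    return {
        barcode: {**info, column: receptor_flags.get(info["uid"], False)}
        for barcode, info in cells.items()
    }


cells = {"AAAC-1": {"uid": 1}, "CCCG-1": {"uid": 1}, "TTTG-1": {"uid": 2}}

# Toy "hyperexpansion" analysis: a receptor seen in two or more cells is expanded.
clone_sizes = Counter(info["uid"] for info in cells.values())
expanded = {uid: n >= 2 for uid, n in clone_sizes.items()}

annotated = write_back(cells, expanded, "expanded")
```

The function returns a new mapping instead of mutating `cells`, which keeps the original data usable if the user discards the annotated result.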

How to create a fast / convenient AIRR <-> SC layers data loading and writing

DECISION: Because we can extract barcodes from the original data quite easily, we could drop this and just go with barcode extraction via AnnData (or whatever the user works with), leaving that step to the user. Same with writing some info back.

View into the data: get the list of genes. Or process the receptors, get barcodes, extract data with a merging strategy (mean, median, custom function), and then do stuff.

bc_vec = scdata.select(IL2 >= 2)
imdata.filter(bc_vec).filter(...)
# OR
bc_vec = imdata.scdata.select(IL2 >= 2)
imdata.filter(bc_vec).filter(...)

imdata.filter(sequence == "CDR3blahlbha").extract("IL2", "IL12", combine="mean")

imdata.extract("IL2", "IL12", combine="mean").filter(IL2 >= 5)

imdata.filter(sc.IL2 >= 5, combine="mean")

`sc` - data source slot
main data source slot, other data source slot

imdata.data_slot["123"].some_operation(blah blah filter by) => barcodes => .build_receptors(strategy = "mean").filter().analyze()
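The "extract with a merging strategy" idea from the pseudocode above can be sketched in plain Python. Everything here is illustrative: `extract`, `COMBINERS`, and the dictionary layout are assumptions and do not reflect an existing immundata API.

```python
from statistics import mean, median

# Merging strategies for collapsing per-cell values into one value per receptor.
COMBINERS = {"mean": mean, "median": median}


def extract(receptor_barcodes, expression, genes, combine="mean"):
    """Aggregate per-cell gene expression into one value per receptor and gene."""
    combiner = COMBINERS[combine]
    return {
        uid: {gene: combiner(expression[bc][gene] for bc in barcodes) for gene in genes}
        for uid, barcodes in receptor_barcodes.items()
    }


receptor_barcodes = {1: ["AAAC-1", "CCCG-1"], 2: ["TTTG-1"]}
expression = {
    "AAAC-1": {"IL2": 2.0, "IL12": 0.0},
    "CCCG-1": {"IL2": 6.0, "IL12": 1.0},
    "TTTG-1": {"IL2": 8.0, "IL12": 3.0},
}

per_receptor = extract(receptor_barcodes, expression, ["IL2", "IL12"], combine="mean")
# Equivalent in spirit to: imdata.extract("IL2", "IL12", combine="mean").filter(IL2 >= 5)
high_il2 = [uid for uid, values in per_receptor.items() if values["IL2"] >= 5]
```

Note that filtering cells first and then building receptors can give a different result from aggregating first and then filtering, so the order of operations is part of the API design.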

Advanced analysis

Convenient sequence distance calculations

Other thoughts and notes

Operations that change the clonotype table structure (select columns) or group it (group_by), versus operations that don't change it (filter, mutate)

Do we need to support the clonotypes? Probably, it's all about barcodes

What about supporting the state for ProcessedImmunData, i.e., saving the info about clonotype models and clonotype ids?

Clonotype model = heavy computations and quite foundational for the analysis; this is what you think a data point is.

Repertoire model = no computations; this is a view on your data, a grouping that you want to be light and easy to work with

Optional format vs. full format: fewer columns vs. all columns

We need to rewrite sample_id with the filename if the file's sample ID is bad
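A possible shape for this fallback, combined with a user-supplied callback for processing the file name. What counts as a "bad" sample ID is an assumption here (missing, blank, or containing characters other than letters, digits, `_` and `-`), and both function names are hypothetical.

```python
import re
from pathlib import Path


def default_name_callback(file_name):
    """Hypothetical default: drop the directory and everything after the first dot."""
    return Path(file_name).name.split(".")[0]


def resolve_sample_id(sample_id, file_name, name_callback=default_name_callback):
    """Keep a usable sample ID; otherwise derive one from the file name."""
    # Assumed definition of a usable ID: non-empty, only letters, digits, "_" or "-".
    if sample_id and re.fullmatch(r"[A-Za-z0-9_-]+", sample_id.strip()):
        return sample_id.strip()
    return name_callback(file_name)


kept = resolve_sample_id("S1", "anything.tsv")
from_file = resolve_sample_id("", "data/donor7.tsv.gz")
custom = resolve_sample_id(None, "run_A.csv", name_callback=lambda name: name.upper())
```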

Benchmarking

  • many small repertoires (single-cell use case)
  • several small repertoires (WTF use case)
  • several big repertoires (typical RepSeq use case)
  • many big repertoires (we are heading there)

Pre-ordering: https://duckdb.org/docs/guides/performance/indexing

To readers:

  • strategy for dealing with samples
  • callback for processing file name
