A data layer for single-cell, spatial and bulk immunomics, providing a unified structure for immune receptor repertoire data.

Project description

immundata-py

An efficient data framework for single-cell and bulk immune repertoire datasets of practically any scale.

Think of AnnData, SingleCellExperiment, or a Seurat object, but for AIRR data, with full support for out-of-memory datasets and easier access to additional receptor data supplied by the user, such as gene expression from single-cell transcriptomics files, spatial coordinates, or antigen specificity. The goal of immundata is to standardize I/O and basic data manipulation, following the AIRR Community Data Standard for immune repertoire representation. Its primary users are bioinformatics developers and data engineers who don't want to write a data abstraction layer from scratch. Biologists and medical scientists can benefit as well, provided they learn the syntax. However, the overall philosophy is that immune repertoire analysis tools such as immunarch should cover more than 80% of use cases without the end user ever calling immundata explicitly.

Installation

poetry install

Usage

See notebooks/immundata-experiments.ipynb.

Challenges and use cases

Plans

Benchmark before each release?

bulk, single chain
single-cell, single chain
Release 0.1.0
bulk, paired chain
single-cell, paired chain
Release 0.2.0
add new receptor metadata
write some metadata back to the single cell using barcodes
Release 0.3.0
Decide the priority later: multiple metadata sources, like single-cell + spatial
Decide the priority later: merge multiple immundatas, so support for multiple data sources
Decide the priority later: optimizations

I/O

Processing input files

Some input files may require preprocessing, such as adding a sample_id column.

Overall, saving to Parquet should speed up downstream computations compared with reading from CSV.

data management 1 - dump raw data to parquet for better future analysis. Just this huge parquet file with all the receptors, or several parquet files.

data management 2 - repertoire files with built receptors

What is the cost and frequency of building a repertoire with new receptor signature vs. building with a new sample signature?

receptor signature - pricy, long, rare
sample signature - ??? do we really need to often re-create samples? But it should be much less costly

Receptor building:

build unique receptors
filter out / save non-coding
figure out how to process multiple chain sequence data
normalize V/D/J genes, remove/move to a separate column list of segments, leave only the important ones using some strategy (like take first)
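
The receptor-building steps above can be sketched in plain Python. This is only an illustration, not the immundata implementation: the column names (cdr3_aa, v_call, j_call) follow AIRR naming, while the functions `first_gene` and `build_receptors`, the default signature, and the rule that `*` or `_` in a CDR3 marks a non-coding sequence are all assumptions made for this example.

```python
from collections import defaultdict

def first_gene(call: str) -> str:
    """'Take first' strategy: keep the first gene of a comma-separated list, without the allele."""
    return call.split(",")[0].strip().split("*")[0]

def build_receptors(rows, signature=("cdr3_aa", "v_call", "j_call")):
    """Group raw rows into unique receptors keyed by the (hypothetical) signature columns."""
    receptors = {}               # signature tuple -> uid
    members = defaultdict(list)  # uid -> normalized raw rows
    non_coding = []              # rows set aside instead of being dropped
    for row in rows:
        cdr3 = row["cdr3_aa"]
        # Assumption: empty CDR3, stop codon (*) or frameshift (_) means non-coding.
        if not cdr3 or "*" in cdr3 or "_" in cdr3:
            non_coding.append(row)
            continue
        row = {**row,
               "v_call": first_gene(row["v_call"]),
               "j_call": first_gene(row["j_call"])}
        key = tuple(row[col] for col in signature)
        uid = receptors.setdefault(key, len(receptors))
        members[uid].append(row)
    return receptors, dict(members), non_coding

rows = [
    {"cdr3_aa": "CASSLGQETQYF", "v_call": "TRBV7-9*01,TRBV7-8*01", "j_call": "TRBJ2-5*01"},
    {"cdr3_aa": "CASSLGQETQYF", "v_call": "TRBV7-9*03", "j_call": "TRBJ2-5*01"},
    {"cdr3_aa": "CASS*GQETQYF", "v_call": "TRBV7-9*01", "j_call": "TRBJ2-5*01"},
]
receptors, members, non_coding = build_receptors(rows)
```

Here the first two rows collapse into one receptor after gene normalization, and the third is kept aside as non-coding. Changing `signature` changes what counts as a unique receptor, which is why rebuilding with a new receptor signature is the expensive operation.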

Validate pre-input data / connect to external database as a source of truth instead of parquet

raw data -> build receptors -> dump to parquet
raw data -> dump to parquet[raw] -> build receptors -> dump to parquet[receptor model]
database[raw] with some table -> build receptors -> dump to parquet[receptor model]
database[receptor model]

The problem is that the source[receptor model] must follow the receptor model and the ImmunData format with multiple tables + allow scanning.

Data operations

Metadata storing

If we have a unique receptor pair, how do we store different metadata values coming from different samples?

mvalue == metadata value, like gene expression or immunogenicity
uid == unique receptor id
barcode == unique cell id

Simple version: table1: barcode - receptor - sample id - mvalue

Complex version:
table1: uid - receptor - one hot encoding for multiple samples
table2: barcode -> mvalues for this receptor from some sample
table3: barcode -> uid
In this case, immundata is a manager that accurately pre-filters data and concatenates the necessary columns.
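
A minimal sketch of the multi-table layout, using the standard-library sqlite3 module as a stand-in for whatever backend immundata ends up using. Table and column names are hypothetical, and the one-hot sample encoding is swapped for a plain sample_id column on the barcode table to keep the example short.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE receptors (uid INTEGER PRIMARY KEY, cdr3_aa TEXT, v_call TEXT, j_call TEXT);
CREATE TABLE cells (barcode TEXT PRIMARY KEY,
                    uid INTEGER REFERENCES receptors(uid),
                    sample_id TEXT);
CREATE TABLE mvalues (barcode TEXT REFERENCES cells(barcode),
                      name TEXT, value REAL,
                      PRIMARY KEY (barcode, name));
""")
con.executemany("INSERT INTO receptors VALUES (?, ?, ?, ?)",
                [(1, "CASSLGQETQYF", "TRBV7-9", "TRBJ2-5")])
con.executemany("INSERT INTO cells VALUES (?, ?, ?)",
                [("AAAC-1", 1, "S1"), ("AAAG-1", 1, "S1"), ("TTTG-1", 1, "S2")])
con.executemany("INSERT INTO mvalues VALUES (?, ?, ?)",
                [("AAAC-1", "IL2", 2.0), ("AAAG-1", "IL2", 4.0), ("TTTG-1", "IL2", 7.0)])

# The same receptor keeps a separate mean IL2 value for each sample.
per_sample = con.execute("""
    SELECT c.uid, c.sample_id, AVG(m.value)
    FROM cells AS c JOIN mvalues AS m ON m.barcode = c.barcode
    WHERE m.name = 'IL2'
    GROUP BY c.uid, c.sample_id
    ORDER BY c.sample_id
""").fetchall()
```

The receptor is stored once, while per-cell metadata values stay attached to barcodes, so metadata from different samples never has to share a single cell of the receptor table.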

This is essentially a database normalization problem: https://www.freecodecamp.org/news/database-normalization-1nf-2nf-3nf-table-examples/

The First Normal Form – 1NF
- For a table to be in the first normal form, it must meet the following criteria:
  - a single cell must not hold more than one value (atomicity)
  - there must be a primary key for identification
  - no duplicated rows or columns
  - each column must have only one value for each row in the table
The Second Normal Form – 2NF
- The 1NF only eliminates repeating groups, not redundancy. That’s why there is 2NF.
- A table is said to be in 2NF if it meets the following criteria:
  - it’s already in 1NF
  - has no partial dependency. That is, all non-key attributes are fully dependent on a primary key.
The Third Normal Form – 3NF
- When a table is in 2NF, it eliminates repeating groups and redundancy, but it does not eliminate transitive dependencies.
- A transitive dependency means a non-prime attribute (an attribute that is not part of any candidate key) depends on another non-prime attribute.
- This is what the third normal form (3NF) eliminates.
- So, for a table to be in 3NF, it must:
  - be in 2NF
  - have no transitive dependencies.

Metadata operations

Selecting samples (== all receptors from a sample) with a specific property looks like a slightly more complex filter, something like a filter_sample() that runs on samples only. But the typical filter can do that as well if receptors carry repertoire-level metadata. Moving from sample-level to receptor-level metadata makes our lives much easier. We should probably just store the metadata columns. Theoretically, they are stored automatically if we create sample signatures. Maybe signature + additional information?

Computations on sample metadata

Select all samples that we encounter in specific repertoire group more than 3 times.
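
One possible reading of this query, sketched with the standard library: given one (repertoire group, sample_id) record per repertoire, keep the samples that occur in a chosen group more than three times. The record layout and the function name `frequent_samples` are assumptions for illustration only.

```python
from collections import Counter

# One (repertoire group, sample_id) record per repertoire; the values are made up.
records = (
    [("tumor", "S1")] * 4
    + [("tumor", "S2")] * 2
    + [("blood", "S3")] * 4
    + [("blood", "S1")] * 1
)

def frequent_samples(records, group, min_count=3):
    """Return sample IDs seen in `group` more than `min_count` times."""
    counts = Counter(sample for g, sample in records if g == group)
    return sorted(sample for sample, n in counts.items() if n > min_count)

tumor_samples = frequent_samples(records, "tumor")
blood_samples = frequent_samples(records, "blood")
```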

How to process paired data

One big table with missing columns? Several tables? Tabular schema written to ImmunData.schema.receptor?

Caching

Run the analysis and cache the results when the output is assigned to a new ImmunData: immun_data_new = immun_data_old.filter(blah blah)

Or should we do it? Hmm. immun_data_new.cache()? immun_data_new.execute()?
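
To make the trade-off concrete, here is a toy sketch of lazy evaluation with explicit cache() and execute() methods. `LazyTable` is a hypothetical stand-in, not the ImmunData class; it only shows that filter() can record work without running it, and that cache() pays the cost once.

```python
class LazyTable:
    """Toy lazy table: filters are recorded and only run on execute()."""

    def __init__(self, rows, steps=()):
        self._rows = rows
        self._steps = tuple(steps)

    def filter(self, predicate):
        # Returns a new object; nothing is computed yet.
        return LazyTable(self._rows, self._steps + (predicate,))

    def execute(self):
        """Run all recorded filters and return the resulting rows."""
        rows = self._rows
        for predicate in self._steps:
            rows = [row for row in rows if predicate(row)]
        return rows

    def cache(self):
        """Materialize once and return a table that starts from the result."""
        return LazyTable(self.execute())

calls = []

def is_expanded(row):
    calls.append(row["uid"])  # track how often the predicate really runs
    return row["count"] >= 10

immun_data_old = LazyTable([{"uid": 1, "count": 50}, {"uid": 2, "count": 1}])
immun_data_new = immun_data_old.filter(is_expanded)  # lazy: nothing has run
calls_before_cache = len(calls)
cached = immun_data_new.cache()                      # the filter runs here, once
calls_after_cache = len(calls)
result = cached.execute()                            # no recomputation
```

With an explicit cache(), the user decides when to pay for materialization. Caching automatically on assignment would hide that cost but make it harder to control.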

Modalities

Writing back after analysis

filter clonotypes -> do some analysis like hyperexpansion -> write back the metadata that some clones are expanded. Not sure how to do that, because we don't save the UIDs. Or ARE we??? unique_receptors (or just receptors in the case of a public repertoire) can return all the UIDs, and we can have a function that writes this column back.

New problem then: we write the data (lazy). Do we force the caching? Do we wait? What if the user removes this variable? [!!!]
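
A minimal sketch of the write-back idea, assuming a barcode -> uid mapping is available. The function `write_back` and the "expanded means at least two cells" rule are hypothetical and only show the shape of the operation.

```python
from collections import Counter

def write_back(barcode_to_uid, uid_annotation, default=None):
    """Copy a receptor-level annotation onto every barcode carrying that receptor."""
    return {barcode: uid_annotation.get(uid, default)
            for barcode, uid in barcode_to_uid.items()}

barcode_to_uid = {"AAAC-1": 1, "AAAG-1": 1, "TTTG-1": 2}

# Receptor-level analysis: call a clone "expanded" if it is seen in at least 2 cells.
clone_sizes = Counter(barcode_to_uid.values())
expanded = {uid: size >= 2 for uid, size in clone_sizes.items()}

# Cell-level column, ready to be joined to single-cell data by barcode.
cell_column = write_back(barcode_to_uid, expanded, default=False)
```

Because the result is keyed by barcode, it can be joined to any single-cell container without that container knowing about receptor UIDs.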

How to create a fast / convenient AIRR <-> SC layers data loading and writing

DECISION: Because we can extract barcodes from the original data quite easily, we could forget about it and leave barcode extraction (via AnnData or whatever) to the user. Same with writing some info back.

View into the data

Get the list of genes. Or process the receptors, get barcodes, extract data with a merging strategy (mean, median, custom function), and then do stuff

bc_vec = scdata.select(IL2 >= 2)
imdata.filter(bc_vec).filter(...)
# OR
bc_vec = imdata.scdata.select(IL2 >= 2)
imdata.filter(bc_vec).filter(...)

imdata.filter(sequence == "CDR3blahlbha").extract("IL2", "IL12", combine="mean")

imdata.extract("IL2", "IL12", combine="mean").filter(IL2 >= 5)

imdata.filter(sc.IL2 >= 5, combine="mean")

`sc` - data source slot
main data source slot, other data source slot

imdata.data_slot["123"].some_operation(blah blah filter by) => barcodes => .build_receptors(strategy = "mean").filter().analyze()
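
The extract-with-combine idea above could look roughly like this in plain Python. The function `extract`, its arguments, and the dictionary-based data layout are assumptions for illustration. The real API is still undecided in these notes.

```python
from statistics import mean, median

COMBINERS = {"mean": mean, "median": median}

def extract(barcode_to_uid, expression, genes, combine="mean"):
    """Aggregate per-cell expression of `genes` into one value per receptor."""
    fun = COMBINERS[combine] if isinstance(combine, str) else combine
    per_uid = {}
    for barcode, uid in barcode_to_uid.items():
        cell = expression.get(barcode)
        if cell is None:  # barcode missing from the single-cell data
            continue
        for gene in genes:
            per_uid.setdefault(uid, {}).setdefault(gene, []).append(cell[gene])
    return {uid: {gene: fun(values) for gene, values in by_gene.items()}
            for uid, by_gene in per_uid.items()}

barcode_to_uid = {"AAAC-1": 1, "AAAG-1": 1, "TTTG-1": 2}
expression = {
    "AAAC-1": {"IL2": 2.0, "IL12": 0.0},
    "AAAG-1": {"IL2": 6.0, "IL12": 1.0},
    "TTTG-1": {"IL2": 1.0, "IL12": 3.0},
}

per_receptor = extract(barcode_to_uid, expression, ["IL2", "IL12"], combine="mean")
# Filter receptors on the combined value, as in extract(...).filter(IL2 >= ...).
high_il2 = [uid for uid, values in per_receptor.items() if values["IL2"] >= 4]
```

Passing a callable as `combine` covers the custom-function case without changing the interface.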

Advanced analysis

Convenient sequence distance calculations

Use case 1 - give a string and find similar ones with a cutoff for max length
- https://docs.rs/rapidfuzz/latest/rapidfuzz/
- https://github.com/ion-elgreco/polars-distance/tree/main/polars_distance
Use case 2 - compute pairwise distances and build a graph
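
Both use cases can be sketched with a pure-Python Levenshtein distance, reading the cutoff as a maximum edit distance. This stands in for the linked libraries, which would be much faster. The function names and example sequences are made up for this sketch.

```python
from itertools import combinations

def levenshtein(a, b, max_dist=None):
    """Edit distance between a and b; None if it exceeds max_dist."""
    if max_dist is not None and abs(len(a) - len(b)) > max_dist:
        return None
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                # deletion
                           cur[j - 1] + 1,             # insertion
                           prev[j - 1] + (ca != cb)))  # substitution or match
        if max_dist is not None and min(cur) > max_dist:
            return None  # the final distance cannot drop below the row minimum
        prev = cur
    dist = prev[-1]
    return dist if max_dist is None or dist <= max_dist else None

def find_similar(query, sequences, max_dist):
    """Use case 1: sequences within max_dist edits of the query."""
    hits = []
    for seq in sequences:
        dist = levenshtein(query, seq, max_dist)
        if dist is not None:
            hits.append((seq, dist))
    return hits

def build_graph(sequences, max_dist):
    """Use case 2: adjacency sets linking sequences within max_dist edits."""
    graph = {seq: set() for seq in sequences}
    for a, b in combinations(sequences, 2):
        if levenshtein(a, b, max_dist) is not None:
            graph[a].add(b)
            graph[b].add(a)
    return graph

seqs = ["CASSLGQETQYF", "CASSLGQETQFF", "CASSPGQETQYF", "CAWSVGNTIYF"]
hits = find_similar("CASSLGQETQYF", seqs[1:], max_dist=1)
graph = build_graph(seqs, max_dist=1)
```

The all-pairs loop is quadratic in the number of sequences, so for large repertoires some pre-filtering (for example by length or V gene) would be needed before building the graph.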

Other thoughts and notes

Operations that change the clonotype table structure (select columns) or group it (group_by), versus operations that don't (filter, mutate)

Do we need to support the clonotypes? Probably, it's all about barcodes

What about supporting the state for ProcessedImmunData, i.e., saving the info about clonotype models and clonotype ids?

Clonotype model = heavy computations and quite foundational for the analysis; this is what you think a data point is.

Repertoire model = no computations; this is a view on your data, a grouping that you want to be light and easy to work with

Optional format / full format – fewer columns / all columns

We need to rewrite sample_id with the filename if the file's sample ID is bad
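
A small sketch of that fallback, with an optional callback for parsing the file name. The function `resolve_sample_id` and the definition of a "bad" sample ID (empty, or a placeholder such as NA) are assumptions for illustration.

```python
from pathlib import Path

BAD_IDS = {"", "na", "nan", "none"}  # assumed placeholders for a missing sample ID

def resolve_sample_id(path, file_sample_id=None, callback=None):
    """Use the file's own sample ID if it looks valid, else derive one from the file name."""
    if file_sample_id is not None and file_sample_id.strip().lower() not in BAD_IDS:
        return file_sample_id.strip()
    name = Path(path).name
    # Default: everything before the first dot; a callback can override this.
    return callback(name) if callback else name.split(".")[0]

from_name = resolve_sample_id("data/S1.clonotypes.tsv.gz")
from_file = resolve_sample_id("data/S1.tsv", "patient7")
from_callback = resolve_sample_id("run42_S3_L001.csv", "NA",
                                  callback=lambda name: name.split("_")[1])
```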

Benchmarking

many small repertoires (single-cell use case)
several small repertoires (WTF use case)
several big repertoires (typical RepSeq use case)
many big repertoires (we are heading there)

Pre-ordering: https://duckdb.org/docs/guides/performance/indexing

To readers:

strategy for dealing with samples
callback for processing file name

Project details

Release history

0.1.1.dev0 pre-release

Mar 16, 2025

This version

0.1.0

Mar 16, 2025

Download files


Source Distribution

immundata-0.1.0.tar.gz (14.7 kB)

Uploaded Mar 16, 2025 Source

Built Distribution


immundata-0.1.0-py3-none-any.whl (12.4 kB)

Uploaded Mar 16, 2025 Python 3

File details

Details for the file immundata-0.1.0.tar.gz.

File metadata

Download URL: immundata-0.1.0.tar.gz
Upload date: Mar 16, 2025
Size: 14.7 kB
Tags: Source
Uploaded using Trusted Publishing? No
Uploaded via: uv/0.6.3

File hashes

Hashes for immundata-0.1.0.tar.gz
Algorithm	Hash digest
SHA256	`d2cd7b524673d49a44dbc113c5ae499f657896400c5c1ddc7048a356ac4df492`
MD5	`24cee59e528c052fe079a599e12109d4`
BLAKE2b-256	`1ff148333e2ec72efab137fa5068cefcd6a80287656ec780f8161df5bb019050`


File details

Details for the file immundata-0.1.0-py3-none-any.whl.

File metadata

Download URL: immundata-0.1.0-py3-none-any.whl
Upload date: Mar 16, 2025
Size: 12.4 kB
Tags: Python 3
Uploaded using Trusted Publishing? No
Uploaded via: uv/0.6.3

File hashes

Hashes for immundata-0.1.0-py3-none-any.whl
Algorithm	Hash digest
SHA256	`6ab85c6a71cf5659fb5042e4597b43ae18e85282ab763cb1b06fa2afd8e5c9a3`
MD5	`6049d12c535d81fae9148661c30290d9`
BLAKE2b-256	`aa307cc20b10f6019c7729e077ee396b9542826e5666cec619c9d91ca63bec0d`
