A data layer for single-cell, spatial and bulk immunomics, providing a unified structure for immune receptor repertoire data.
immundata-py
An efficient data framework for single-cell and bulk immune repertoire datasets of practically any scale.
Think AnnData, SingleCellExperiment, or a Seurat object, but for AIRR data, with full support for out-of-memory datasets and easier access to additional user-provided receptor data such as gene expression from single-cell transcriptomics files, spatial coordinates, or antigen specificity.
The goal of immundata is to standardize I/O and basic data manipulation, following the AIRR Community Data Standard for immune repertoire representation.
Its primary users are bioinformatics developers and data engineers who don't want to write a data abstraction layer from scratch.
Biologists and medical scientists could benefit as well, provided they learn the code syntax. However, the overall philosophy is to make sure that
immune repertoire data analysis tools such as immunarch cover more than 80% of use cases, so that end users rarely need to use immundata explicitly.
Installation
poetry install
Usage
See notebooks/immundata-experiments.ipynb.
Challenges and use cases
Plans
Benchmark before each release?
- bulk, single chain
- single-cell, single chain
- Release 0.1.0
- bulk, paired chain
- single-cell, paired chain
- Release 0.2.0
- add new receptor metadata
- write some metadata back to the single cell using barcodes
- Release 0.3.0
- Decide the priority later: multiple metadata sources, like single-cell + spatial
- Decide the priority later: merge multiple immundatas, so support for multiple data sources
- Decide the priority later: optimizations
I/O
Processing input files
Some input files may require preprocessing, such as adding a sample_id column.
Overall, saving to Parquet could speed up computations compared with reading from CSV.
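A minimal sketch of the preprocessing step mentioned above, using only the standard library. The function names `sample_id_from_filename` and `read_with_sample_id`, and the idea of passing a file name callback, are assumptions for illustration and not part of the immundata API. The sketch fills in `sample_id` from the file name when the column is missing or empty.

```python
import csv
import io
from pathlib import Path
from typing import Callable, Iterator, TextIO


def sample_id_from_filename(path: str) -> str:
    """Default callback (hypothetical): use the file name up to the first dot."""
    return Path(path).name.split(".")[0]


def read_with_sample_id(
    handle: TextIO,
    path: str,
    name_callback: Callable[[str], str] = sample_id_from_filename,
    delimiter: str = "\t",
) -> Iterator[dict]:
    """Yield rows, filling in sample_id when it is missing or empty."""
    fallback = name_callback(path)
    for row in csv.DictReader(handle, delimiter=delimiter):
        if not (row.get("sample_id") or "").strip():
            row["sample_id"] = fallback
        yield row


# Toy AIRR-like input without a sample_id column.
raw = "junction_aa\tv_call\nCASSLGTDTQYF\tTRBV5-1\n"
rows = list(read_with_sample_id(io.StringIO(raw), "data/patient01.airr.tsv"))
print(rows[0]["sample_id"])
```

Rows that already carry a non-empty `sample_id` are left untouched, so the callback only acts as a fallback.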
data management 1 - dump raw data to parquet for better future analysis. Just this huge parquet file with all the receptors, or several parquet files.
data management 2 - repertoire files with built receptors
What is the cost and frequency of building a repertoire with new receptor signature vs. building with a new sample signature?
- receptor signature - pricey, long, rare
- sample signature - ??? Do we really need to re-create samples often? But it should be much less costly
Receptor building:
- build unique receptors
- filter out / save non-coding
- figure out how to process multiple chain sequence data
- normalize V/D/J genes: remove the list of candidate segments or move it to a separate column, and leave only the important ones using some strategy (like take first)
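The receptor-building steps above could look roughly like the sketch below. Everything here is an assumption made for illustration: the column names (`junction_aa`, `v_call`, `productive`, `barcode`), the default signature, and the choice to also drop the allele suffix when taking the first gene. It is not the immundata implementation.

```python
from collections import defaultdict


def first_gene(call: str) -> str:
    """'Take first' strategy: keep the first gene of an ambiguous call, drop the allele."""
    return call.split(",")[0].strip().split("*")[0]


def build_receptors(rows, signature=("junction_aa", "v_gene")):
    """Collapse productive rows into unique receptors keyed by `signature`."""
    receptors = defaultdict(lambda: {"count": 0, "barcodes": []})
    non_coding = []
    for row in rows:
        if not row["productive"]:
            non_coding.append(row)  # kept aside rather than silently dropped
            continue
        row = dict(row, v_gene=first_gene(row["v_call"]))
        key = tuple(row[col] for col in signature)
        receptors[key]["count"] += 1
        receptors[key]["barcodes"].append(row["barcode"])
    return dict(receptors), non_coding


rows = [
    {"barcode": "AAAC", "junction_aa": "CASSLGTDTQYF",
     "v_call": "TRBV5-1*01,TRBV5-4*01", "productive": True},
    {"barcode": "AAAG", "junction_aa": "CASSLGTDTQYF",
     "v_call": "TRBV5-1*02", "productive": True},
    {"barcode": "AAAT", "junction_aa": "CASS*GTDTQYF",
     "v_call": "TRBV7-2*01", "productive": False},
]
receptors, non_coding = build_receptors(rows)
print(receptors)
```

Changing `signature` is what makes rebuilding expensive: every row has to be re-keyed and re-aggregated, which matches the "pricy, long, rare" note above.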
Validate pre-input data / connect to external database as a source of truth instead of parquet
- raw data -> build receptors -> dump to parquet
- raw data -> dump to parquet[raw] -> build receptors -> dump to parquet[receptor model]
- database[raw] with some table -> build receptors -> dump to parquet[receptor model]
- database[receptor model]
The problem is that the source[receptor model] must follow the receptor model and the ImmunData format with multiple tables + allow scanning.
Data operations
Metadata storing
If we have a unique receptor pair, how do we store different metadata values coming from different samples?
- mvalue == metadata value, like gene expression or immunogenicity
- uid == unique receptor id
- barcode == unique cell id
Simple version:
- table1: barcode - receptor - sample id - mvalue
Complex version:
- table1: uid - receptor - one hot encoding for multiple samples
- table2: barcode -> mvalues for this receptor from some sample
- table3: barcode -> uid
In this case, immundata is a manager that accurately pre-filters the data and concatenates the necessary columns.
This is essentially a database normalization task. https://www.freecodecamp.org/news/database-normalization-1nf-2nf-3nf-table-examples/
- The First Normal Form – 1NF
  - For a table to be in the first normal form, it must meet the following criteria:
    - a single cell must not hold more than one value (atomicity)
    - there must be a primary key for identification
    - no duplicated rows or columns
    - each column must have only one value for each row in the table
- The Second Normal Form – 2NF
  - 1NF only eliminates repeating groups, not redundancy. That’s why there is 2NF.
  - A table is in 2NF if it meets the following criteria:
    - it’s already in 1NF
    - it has no partial dependency. That is, all non-key attributes are fully dependent on the whole primary key.
- The Third Normal Form – 3NF
  - A table in 2NF is free of repeating groups and partial dependencies, but it can still contain transitive dependencies.
  - This means a non-prime attribute (an attribute that is not part of any candidate key) depends on another non-prime attribute.
  - This is what the third normal form (3NF) eliminates.
  - So, for a table to be in 3NF, it must:
    - be in 2NF
    - have no transitive dependency.
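To make the "complex version" concrete, here is a small sketch with the standard-library `sqlite3` module. The table and column names are hypothetical, and the sketch stores a `sample_id` per cell instead of a one-hot encoding of samples, which is a deliberate simplification. It only shows how per-sample metadata for a unique receptor can be assembled by joins.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript(
    """
    CREATE TABLE receptor (uid INTEGER PRIMARY KEY, junction_aa TEXT, v_gene TEXT);
    CREATE TABLE cell (barcode TEXT PRIMARY KEY, sample_id TEXT,
                       uid INTEGER REFERENCES receptor(uid));
    CREATE TABLE mvalue (barcode TEXT REFERENCES cell(barcode), name TEXT, value REAL,
                         PRIMARY KEY (barcode, name));
    """
)
con.execute("INSERT INTO receptor VALUES (1, 'CASSLGTDTQYF', 'TRBV5-1')")
con.executemany("INSERT INTO cell VALUES (?, ?, ?)",
                [("AAAC", "S1", 1), ("AAAG", "S2", 1)])
con.executemany("INSERT INTO mvalue VALUES (?, ?, ?)",
                [("AAAC", "IL2", 2.0), ("AAAG", "IL2", 6.0)])

# Per-sample metadata for one unique receptor, assembled by joins on demand.
query = """
    SELECT c.sample_id, m.name, m.value
    FROM receptor r
    JOIN cell c ON c.uid = r.uid
    JOIN mvalue m ON m.barcode = c.barcode
    WHERE r.uid = ?
    ORDER BY c.sample_id
"""
result = con.execute(query, (1,)).fetchall()
print(result)
```

The same receptor keeps a single row in `receptor`, while its metadata values stay attached to the cells they came from, so nothing is duplicated across samples.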
Metadata operations
Selecting samples (== all receptors from a sample) with a specific property looks like a slightly more complex filter, like a filter_sample() that runs on samples only. But the typical filter can do that as well if receptors carry repertoire-level metadata. Moving from sample-level to receptor-level metadata makes our lives much easier. We should probably just store the metadata columns. Theoretically, they are stored automatically if we create sample signatures. Maybe signature + additional information?
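A small sketch of the idea above: once repertoire-level metadata is copied onto receptor rows, a sample-level selection becomes an ordinary row filter. The helper names `attach_sample_metadata` and `filter_rows` and the metadata fields are hypothetical.

```python
sample_meta = {
    "S1": {"therapy": "anti-PD1", "response": "CR"},
    "S2": {"therapy": "anti-PD1", "response": "PD"},
}
receptors = [
    {"uid": 1, "sample_id": "S1", "junction_aa": "CASSLGTDTQYF"},
    {"uid": 2, "sample_id": "S2", "junction_aa": "CASSPGQGNEQFF"},
    {"uid": 3, "sample_id": "S1", "junction_aa": "CASRDRGYEQYF"},
]


def attach_sample_metadata(receptors, sample_meta):
    """Copy repertoire-level metadata onto every receptor row of that sample."""
    return [{**row, **sample_meta[row["sample_id"]]} for row in receptors]


def filter_rows(rows, predicate):
    """A plain row filter; no sample-specific logic needed."""
    return [row for row in rows if predicate(row)]


annotated = attach_sample_metadata(receptors, sample_meta)
responders = filter_rows(annotated, lambda r: r["response"] == "CR")
print([r["uid"] for r in responders])
```

The cost is redundancy: the metadata is repeated on every row of a sample, which a columnar format compresses well but which is the opposite of the normalized layout discussed earlier.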
Computations on sample metadata
Select all samples that we encounter in specific repertoire group more than 3 times.
How to process paired data
One big table with missing columns? Several tables? Tabular schema written to ImmunData.schema.receptor?
Caching
Run the analysis and cache the resulting data when assigning it to a new ImmunData: immun_data_new = immun_data_old.filter(blah blah)
Or should we cache implicitly at all? Hmm. An explicit immun_data_new.cache()? immun_data_new.execute()?
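One way to think about the `cache()` versus `execute()` question is a lazy container that only records operations. The `LazyTable` class below is a hypothetical toy, not the ImmunData class; it shows the difference between running a plan each time and remembering its result.

```python
class LazyTable:
    """Hypothetical lazy container: operations are recorded, not run."""

    def __init__(self, rows, ops=()):
        self._rows = rows
        self._ops = tuple(ops)
        self._cache = None

    def filter(self, predicate):
        # Returns a new object; the old one stays usable and unchanged.
        return LazyTable(self._rows, self._ops + (predicate,))

    def execute(self):
        """Run the recorded plan without keeping the result."""
        rows = self._rows
        for predicate in self._ops:
            rows = [row for row in rows if predicate(row)]
        return rows

    def cache(self):
        """Run the plan once and remember the result on this object."""
        if self._cache is None:
            self._cache = self.execute()
        return self._cache


calls = []


def is_expanded(row):
    calls.append(row["uid"])  # track how often the predicate runs
    return row["count"] >= 3


old = LazyTable([{"uid": 1, "count": 5}, {"uid": 2, "count": 1}])
new = old.filter(is_expanded)   # nothing computed yet
first = new.cache()             # computed here
second = new.cache()            # served from the cache
```

With explicit caching the user decides when the cost is paid; implicit caching on assignment would hide that decision but could also materialize results nobody needs.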
Modalities
Writing back after analysis
filter clonotypes -> do some analysis like hyperexpansion -> write back the metadata that some clones are expanded.
Not sure how to do that because we don't save the UIDs. Or ARE we??? unique_receptors (or just receptors in case of public repertoire) can return all the UIDs, and we can have a function that writes this column back.
New problem then: we write the data (lazy). Do we force the caching? Do we wait? What if the user removes this variable? [!!!]
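Assuming receptor UIDs are available on the cell-level rows, the write-back step could be as simple as a lookup by UID. The functions `label_expanded` and `write_back` and the threshold of three cells are hypothetical and only illustrate the flow from analysis result to a new metadata column.

```python
from collections import Counter

cells = [
    {"barcode": "AAAC", "uid": 1},
    {"barcode": "AAAG", "uid": 1},
    {"barcode": "AAAT", "uid": 1},
    {"barcode": "AACA", "uid": 2},
]


def label_expanded(cells, min_cells=3):
    """Analysis step: map each receptor uid to an 'expanded' flag."""
    sizes = Counter(cell["uid"] for cell in cells)
    return {uid: n >= min_cells for uid, n in sizes.items()}


def write_back(cells, column, values_by_uid):
    """Return new cell rows with `column` filled in from a uid -> value mapping."""
    return [{**cell, column: values_by_uid.get(cell["uid"])} for cell in cells]


flags = label_expanded(cells)
annotated = write_back(cells, "expanded", flags)
by_barcode = {c["barcode"]: c["expanded"] for c in annotated}
print(by_barcode)
```

Because `write_back` returns new rows instead of mutating the input, it sidesteps part of the "what if the user removes this variable" problem, at the price of holding two copies until one is released.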
How to create a fast / convenient AIRR <-> SC layers data loading and writing
DECISION: Because we can extract barcodes from the original data quite easily, we could forget about it and just rely on barcode extraction via AnnData or whatever, leaving that step to the user. Same with writing some info back.
View into the data
Get the list of genes, or process the receptors, get barcodes, extract data with a merging strategy (mean, median, fun), and then do stuff:
bc_vec = scdata.select(IL2 >= 2)
imdata.filter(bc_vec).filter(...)
# OR
bc_vec = imdata.scdata.select(IL2 >= 2)
imdata.filter(bc_vec).filter(...)
imdata.filter(sequence == "CDR3blahlbha").extract("IL2", "IL12", combine="mean")
imdata.extract("IL2", "IL12", combine="mean").filter(IL2 >= 5)
imdata.filter(sc.IL2 >= 5, combine="mean")
`sc` - data source slot
main data source slot, other data source slot
imdata.data_slot["123"].some_operation(blah blah filter by) => barcodes => .build_receptors(strategy = "mean").filter().analyze()
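The `extract(..., combine="mean")` idea from the pseudocode above can be sketched with plain dictionaries. The data layout (receptor uid to barcodes, barcode to gene values) and the standalone `extract` function are assumptions; the real interface is still undecided in these notes.

```python
from statistics import mean, median

COMBINE = {"mean": mean, "median": median}

# Hypothetical inputs: receptor uid -> barcodes, and barcode -> gene expression.
receptor_barcodes = {1: ["AAAC", "AAAG"], 2: ["AAAT"]}
expression = {
    "AAAC": {"IL2": 2.0, "IL12": 0.0},
    "AAAG": {"IL2": 6.0, "IL12": 1.0},
    "AAAT": {"IL2": 1.0, "IL12": 3.0},
}


def extract(receptor_barcodes, expression, *genes, combine="mean"):
    """Aggregate per-cell values of `genes` into one value per receptor."""
    func = COMBINE[combine]
    return {
        uid: {gene: func([expression[bc][gene] for bc in barcodes]) for gene in genes}
        for uid, barcodes in receptor_barcodes.items()
    }


per_receptor = extract(receptor_barcodes, expression, "IL2", "IL12", combine="mean")
# Filtering after extraction, as in imdata.extract(...).filter(IL2 >= 5) above.
high_il2 = [uid for uid, values in per_receptor.items() if values["IL2"] >= 4]
print(per_receptor, high_il2)
```

The order of operations matters: filtering receptors first and extracting afterwards touches fewer barcodes, while extracting first allows filtering on the combined values.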
Advanced analysis
Convenient sequence distance calculations
- Use case 1 - give a string and find similar ones with a cutoff for max length
- Use case 2 - compute pairwise distances and build a graph
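Both use cases can be sketched with a plain Levenshtein distance. This is a naive pure-Python version for illustration only; it assumes the cutoff refers to a maximum edit distance, and the function names are hypothetical. A real implementation would likely rely on an optimized library or an index rather than all-pairs comparison.

```python
from itertools import combinations


def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        cur = [i]
        for j, cb in enumerate(b, start=1):
            cur.append(min(prev[j] + 1,                 # deletion
                           cur[j - 1] + 1,              # insertion
                           prev[j - 1] + (ca != cb)))   # substitution
        prev = cur
    return prev[-1]


def find_similar(query, sequences, max_dist=1):
    """Use case 1: sequences within `max_dist` edits of `query`."""
    return [s for s in sequences
            if abs(len(s) - len(query)) <= max_dist   # cheap length pre-filter
            and levenshtein(query, s) <= max_dist]


def distance_graph(sequences, max_dist=1):
    """Use case 2: adjacency sets linking sequences within `max_dist` edits."""
    graph = {s: set() for s in sequences}
    for a, b in combinations(sequences, 2):
        if levenshtein(a, b) <= max_dist:
            graph[a].add(b)
            graph[b].add(a)
    return graph


seqs = ["CASSLGTDTQYF", "CASSLGADTQYF", "CASSPGQGNEQFF"]
hits = find_similar("CASSLGTDTQYF", seqs, max_dist=1)
graph = distance_graph(seqs, max_dist=1)
print(hits, graph)
```

The pairwise graph is quadratic in the number of sequences, so for large repertoires it would need blocking (for example by length or V gene) before it becomes practical.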
Other thoughts and notes
Operations that change the clonotype table structure (select columns) or group it (group_by), and operations that don't (filter, mutate)
Do we need to support the clonotypes? Probably, it's all about barcodes
What about supporting the state for ProcessedImmunData, i.e., saving the info about clonotype models and clonotype ids?
Clonotype model = heavy computations and quite foundational for the analysis; this is what you think a data point is.
Repertoire model = no computations; this is a view on your data, a grouping that you want to be light and easy to work with
Optional format / full format – fewer columns / full format
We need to rewrite sample_id with the filename if the file's sample ID is bad
Benchmarking
- many small repertoires (single-cell use case)
- several small repertoires (WTF use case)
- several big repertoires (typical RepSeq use case)
- many big repertoires (we are heading there)
Pre-ordering: https://duckdb.org/docs/guides/performance/indexing
To readers:
- strategy for dealing with samples
- callback for processing file name