Skip to main content

Map MapMyCells ABA taxonomy IDs to Cell Ontology (CL) terms

Project description

MapMyCells2CL

Annotate MapMyCells output with Cell Ontology (CL) terms.

MapMyCells assigns cells to Allen Brain Atlas (ABA) taxonomy nodes (e.g. CS20230722_SUBC_053). This library maps those IDs to CL or Provisional Cell Ontology (PCL) terms and selects the most specific CL term using information-content (IC) ranking — ready for CELLxGENE schema compliance.


Quick start

pip install mapmycells2cl

# Annotate a MapMyCells CSV
mmc2cl annotate results.csv
# → results_annotated.csv

# Annotate an h5ad file (CxG-compliant obs columns)
mmc2cl annotate-h5ad results.csv cells.h5ad
# → cells_annotated.h5ad

Installation

pip install mapmycells2cl
# or
uv add mapmycells2cl

Installing the package places two equivalent commands in your PATH: mmc2cl (short form) and mapmycells2cl.

From source (development)

Requires uv.

git clone https://github.com/Cellular-Semantics/MapMyCells2CL.git
cd MapMyCells2CL
uv sync

# Run via uv (no venv activation needed)
uv run mmc2cl annotate results.csv

# Or activate the venv once and use the command directly
source .venv/bin/activate
mmc2cl annotate results.csv

CLI reference

annotate

Annotate a MapMyCells CSV or JSON output file with CL terms.

mmc2cl annotate INPUT_FILE [OPTIONS]
Option Description
-o, --output PATH Output file path. Defaults to <input>_annotated.<ext>
--mapping PATH Path to a custom mapping JSON (default: bundled mapping.json)

Examples:

# Annotate CSV
mmc2cl annotate results.csv

# Annotate JSON
mmc2cl annotate results.json

# Specify output path
mmc2cl annotate results.csv -o /data/annotated.csv

CSV output columns — added after each {level}_label column using the CAP/HCA double-dash convention:

Column Content When
{level}--cell_type_ontology_term_id Most specific CL CURIE (IC-ranked) Always
{level}--cell_type Label for the above Always
{level}--cell_type_pcl_ontology_term_id PCL exact match CURIE PCL exact only
{level}--cell_type_pcl PCL exact label PCL exact only
{level}--cell_type_cl_broad_ontology_term_ids All CL broad CURIEs, |-joined PCL exact only

Example (subclass level, PCL exact match):

subclass_label                              → CS20230722_SUBC_053
subclass--cell_type_ontology_term_id        → CL:4023017
subclass--cell_type                         → sst GABAergic cortical interneuron
subclass--cell_type_pcl_ontology_term_id    → PCL:0110113
subclass--cell_type_pcl                     → Sst Gaba sst GABAergic cortical interneuron (Mmus)
subclass--cell_type_cl_broad_ontology_term_ids → CL:4023017|CL:4023069

JSON outputcell_type_ontology_term_id, cell_type, and (for PCL) cell_type_pcl_ontology_term_id, cell_type_pcl, cell_type_cl_broad_ontology_term_ids are added to each level's assignment dict.


annotate-h5ad

Annotate an AnnData h5ad file with CL terms from a MapMyCells CSV. Adds CL columns directly to adata.obs, including the unprefixed cell_type_ontology_term_id / cell_type pair required by the CELLxGENE schema.

mmc2cl annotate-h5ad MMC_CSV H5AD_IN [OPTIONS]
Option Description
-o, --output PATH Output h5ad path. Defaults to <input>_annotated.h5ad
--cxg-level TEXT Taxonomy level used for unprefixed CxG columns (default: cluster)
--mapping PATH Path to a custom mapping JSON

Examples:

# Annotate h5ad — output written to cells_annotated.h5ad
mmc2cl annotate-h5ad results.csv cells.h5ad

# Use supertype level for the CxG cell_type columns
mmc2cl annotate-h5ad results.csv cells.h5ad --cxg-level supertype

obs columns added:

Column Content
cell_type_ontology_term_id IC-best CL CURIE from --cxg-level (CxG required)
cell_type Label for the above (CxG required)
{level}--cell_type_ontology_term_id Per-level IC-best CL CURIE
{level}--cell_type Per-level label
{level}--cell_type_pcl_ontology_term_id PCL CURIE (PCL exact only)
{level}--cell_type_pcl PCL label (PCL exact only)
{level}--cell_type_cl_broad_ontology_term_ids |-joined broad CL CURIEs (PCL exact only)

Cells present in the h5ad but absent from the mmc CSV get empty strings.


update-mappings

Download the latest pcl.owl and regenerate the bundled mapping.json. Pass --cl-owl to include IC-ranked best-CL data (strongly recommended).

mmc2cl update-mappings [OPTIONS]
Option Description
--owl PATH Use a local pcl.owl instead of downloading
--cl-owl PATH Path to base cl.owl for IC computation. Downloads if omitted
--output PATH Output path (default: bundled src/mapmycells2cl/data/mapping.json)

Examples:

# Download latest pcl.owl and regenerate (no IC)
mmc2cl update-mappings

# With IC ranking (recommended) — requires cl.owl (~63 MB)
mmc2cl update-mappings --cl-owl cl.owl

# Use locally cached files
mmc2cl update-mappings --owl pcl.owl --cl-owl cl.owl

Note: cl.owl is large (~63 MB). The PURL http://purl.obolibrary.org/obo/cl.owl redirects to GitHub; download it manually if needed and pass the path with --cl-owl.


Python API

CellTypeMapper

from mapmycells2cl import CellTypeMapper

mapper = CellTypeMapper()              # bundled mapping
print(mapper.mapping_version)          # e.g. "2025-07-07"
print(mapper.has_ic)                   # True when mapping includes IC data

Single lookup

result = mapper.lookup("CS20230722_SUBC_313")

result.found            # True
result.exact_id         # "CL:4300353"
result.exact_label      # "Purkinje cell (Mmus)"
result.ontology         # "CL"
result.broad            # [] — already CL, no broad match needed
result.best_cl_id       # "CL:4300353" — IC-ranked most specific CL term
result.best_cl_label    # "Purkinje cell (Mmus)"
result.best_cl_ic       # IC score (higher = more specific)
result.mapping_version  # "2025-07-07"
result = mapper.lookup("CS20230722_SUBC_053")

result.exact_id     # "PCL:0110113"
result.ontology     # "PCL"
result.best_cl_id   # "CL:4023017"  — IC-ranked best CL broad match
result.broad        # [BroadMatch(id="CL:4023017", ...), BroadMatch(id="CL:4023069", ...)]

for b in result.broad:
    print(b.id, b.label, b.via)
result = mapper.lookup("CS20230722_UNKNOWN_999")
result.found        # False
result.best_cl_id   # ""

Batch lookup

results = mapper.lookup_many([
    "CS20230722_SUBC_313",
    "CS20230722_SUBC_053",
    "CS20230722_CLUS_0768",
])
# Returns List[MatchResult] in the same order

Annotator (programmatic use)

from pathlib import Path
from mapmycells2cl import CellTypeMapper
from mapmycells2cl.annotator import annotate_csv, annotate_json, annotate_h5ad

mapper = CellTypeMapper()

# CSV / JSON
annotate_csv(Path("results.csv"), Path("results_annotated.csv"), mapper)
annotate_json(Path("results.json"), Path("results_annotated.json"), mapper)

# h5ad — CxG-compliant obs columns
annotate_h5ad(
    Path("results.csv"),
    Path("cells.h5ad"),
    Path("cells_annotated.h5ad"),
    mapper,
    cxg_level="cluster",   # level used for unprefixed cell_type columns
)

How it works

Exact matches

Extracted from owl:equivalentClass axioms in pcl.owl:

CL/PCL_class ≡ CL_0000000 ∧ (RO_0015001 hasValue <ABA_individual>)

Every ABA taxonomy ID maps to either a CL term (direct Cell Ontology entry) or a PCL term (Provisional Cell Ontology — finer-grained types not yet promoted to CL).

Broad matches

For PCL exact matches, the library walks rdfs:subClassOf edges upward until CL terms are reached. Because the hierarchy is a DAG (not a tree), a single PCL term may yield multiple CL broad matches (polyhierarchy).

IC-ranked best CL term

When multiple CL broad matches exist, the most specific is selected using structure-based Information Content computed over the base CL hierarchy (no PCL):

IC(c) = -log2(|distinct leaf descendants of c| / |total CL leaves|)

Higher IC = more specific. This is pre-computed at update-mappings time and stored in mapping.json, so there is no runtime CL dependency.

Coverage (CCN20230722 taxonomy)

Level → CL → PCL Total
CLAS (class) 3 24 27
SUBC (subclass) 15 230 245
SUPT (supertype) 32 983 1,015
CLUS (cluster) 80 5,234 5,314
Total 130 6,471 6,601

Data sources

  • pcl.owl — Provisional Cell Ontology; primary mapping source
  • cl.owl — Base Cell Ontology (no imports); used for IC computation

Both large OWL files are excluded from the repo. The bundled mapping.json is versioned with the PCL release date and includes all pre-computed IC scores.


Development

uv sync --dev

uv run mypy src/                          # type check
uv run ruff check --fix src/ tests/       # lint
uv run ruff format src/ tests/            # format

uv run pytest -m unit --cov              # unit tests (fast, no external deps)
uv run pytest -m integration             # integration tests (requires test_resources/)

CI runs mypy, ruff, and unit tests on every PR via GitHub Actions.


Known gaps

  • Basal Ganglia ABA mappings are absent from CL — fix planned for a future CL release.
  • oaklib integration deferred (not yet needed for current use cases).

Project details


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

mapmycells2cl-0.1.0.tar.gz (169.4 kB view details)

Uploaded Source

Built Distribution

If you're not sure about the file name format, learn more about wheel file names.

mapmycells2cl-0.1.0-py3-none-any.whl (183.3 kB view details)

Uploaded Python 3

File details

Details for the file mapmycells2cl-0.1.0.tar.gz.

File metadata

  • Download URL: mapmycells2cl-0.1.0.tar.gz
  • Upload date:
  • Size: 169.4 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.1.0 CPython/3.13.12

File hashes

Hashes for mapmycells2cl-0.1.0.tar.gz
Algorithm Hash digest
SHA256 3a4fd8d753e856534ff48ad5423851e1cd38889fc3593683224703ce72bb4641
MD5 7351659b097406db631152911b6fb82f
BLAKE2b-256 263fecc91b44163838e75e49c58b6ce3833956446856051668d6b5ea8f589b3e

See more details on using hashes here.

File details

Details for the file mapmycells2cl-0.1.0-py3-none-any.whl.

File metadata

  • Download URL: mapmycells2cl-0.1.0-py3-none-any.whl
  • Upload date:
  • Size: 183.3 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.1.0 CPython/3.13.12

File hashes

Hashes for mapmycells2cl-0.1.0-py3-none-any.whl
Algorithm Hash digest
SHA256 d371ff84ab89fd941e9cf5bd489cb34c018098703ea83808cd7ad7e9bb34070a
MD5 8f5df044c3268af9921053c497cf6518
BLAKE2b-256 501c2ed1c5b3b7fe7d4eae98ecf01b89aeccc813b51a4e4798ae676f98c312b2

See more details on using hashes here.

Supported by

AWS Cloud computing and Security Sponsor Datadog Monitoring Depot Continuous Integration Fastly CDN Google Download Analytics Pingdom Monitoring Sentry Error logging StatusPage Status page