Cell type Identification using Transcription factor Analysis and Chromatin accessibility

cellitac

A pipeline for processing single-cell ATAC + RNA multiome data and classifying cell types with machine learning.


What It Does

| Stage | Steps | Tools |
|-------|-------|-------|
| Preprocessing | RNA QC → normalization → cell-type annotation | Seurat + SingleR (R via rpy2) |
| Preprocessing | ATAC QC → TF-IDF → LSI | Signac (R via rpy2) |
| Preprocessing | RNA + ATAC integration → ML-ready CSVs | Pure Python |
| ML | Imbalance analysis → SMOTE → feature selection | scikit-learn, imbalanced-learn |
| ML | RF + XGBoost + SVM training & evaluation | scikit-learn, xgboost |
| ML | 19 plots + JSON report + XLSX | matplotlib, seaborn, networkx |
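
To make the TF-IDF step of the ATAC branch more concrete, here is a minimal pure-Python sketch of a log-scaled TF-IDF transform on a tiny peaks × cells count matrix. It is an illustration only: cellitac delegates this step to Signac in R, and the function name `tf_idf`, the nested-list input, and the `scale_factor` default of 1e4 are assumptions made for this example rather than part of cellitac's API.

```python
import math


def tf_idf(counts, scale_factor=1e4):
    """Log-scaled TF-IDF for a peaks x cells count matrix (nested lists)."""
    n_peaks = len(counts)
    n_cells = len(counts[0]) if counts else 0
    # Total fragments per cell (column sums) and per peak (row sums).
    cell_totals = [sum(counts[p][c] for p in range(n_peaks)) for c in range(n_cells)]
    peak_totals = [sum(row) for row in counts]

    result = []
    for p, row in enumerate(counts):
        # IDF: peaks with fewer total counts across cells get a larger weight.
        idf = n_cells / peak_totals[p] if peak_totals[p] else 0.0
        out_row = []
        for c, value in enumerate(row):
            # TF: fraction of this cell's fragments that fall in this peak.
            tf = value / cell_totals[c] if cell_totals[c] else 0.0
            out_row.append(math.log1p(tf * idf * scale_factor))
        result.append(out_row)
    return result


# Two peaks (rows) measured in two cells (columns).
counts = [[2, 0], [2, 4]]
weights = tf_idf(counts)
```

Zero counts stay at zero after the transform, so sparsity is preserved; LSI would then apply a truncated SVD to a matrix like `weights`.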

⚠️ Note: cellitac has been developed and tested on PBMC (Peripheral Blood Mononuclear Cells) multiome data. Performance on other cell types or tissues may vary.

Requirements

Before installing cellitac, you need:

  • Linux or macOS (Ubuntu 20.04+ recommended)
  • Python 3.9 – 3.12 (not 3.13+)
  • Conda / Miniconda
  • ~5 GB free disk space
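
The requirements above can be checked before starting the installation. The following is a small sketch using only the Python standard library; the function names and the 5 GB threshold parameter are chosen for this example, and the operating-system requirement is not checked.

```python
import shutil
import sys


def python_supported(version=None):
    """True for Python 3.9 through 3.12, the range listed above."""
    major, minor = (version or sys.version_info)[:2]
    return major == 3 and 9 <= minor <= 12


def free_gb(path="."):
    """Free disk space, in gigabytes, on the filesystem holding `path`."""
    return shutil.disk_usage(path).free / 1e9


def check_requirements(path=".", min_free_gb=5.0):
    """Return a name -> bool mapping for each prerequisite."""
    return {
        "python": python_supported(),
        "conda": shutil.which("conda") is not None,  # conda found on PATH
        "disk": free_gb(path) >= min_free_gb,
    }


if __name__ == "__main__":
    for name, ok in check_requirements().items():
        print(f"{name}: {'ok' if ok else 'MISSING'}")
```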

Installation

Step 1 — Create a Conda environment

conda create -n cellitac python=3.11 -y
conda activate cellitac

Step 2 — Install R and core R libraries via conda

conda install -c conda-forge r-base=4.4.3 -y

conda install -c conda-forge -c bioconda \
  r-matrix r-hdf5r rpy2 \
  bioconductor-summarizedexperiment \
  bioconductor-singlecellexperiment \
  bioconductor-genomicranges \
  bioconductor-delayedarray \
  bioconductor-biocsingular \
  bioconductor-biocneighbors \
  bioconductor-genomicalignments \
  bioconductor-genomicfeatures \
  bioconductor-rtracklayer \
  r-seurat \
  bioconductor-celldex \
  bioconductor-biovizbase -y

Step 3 — Install remaining R packages (takes 10–30 min)

Rscript -e "install.packages('BiocManager', repos='https://cran.r-project.org')"

Rscript -e "BiocManager::install(c(
  'Seurat', 'Signac', 'SingleR', 'celldex',
  'EnsDb.Hsapiens.v75', 'biovizBase', 'data.table'
), ask=FALSE)"

Step 4 — Install cellitac

pip install cellitac

Step 5 — Verify installation

cellitac --help

If you see the help message, you are ready to go ✅
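
The same verification can be scripted, for example in a setup check for a shared environment. This sketch only tests whether the package is importable and whether the three command names listed in this README are on `PATH`; it does not run them. The function name `installation_status` is made up for this example.

```python
import importlib.util
import shutil

# Command-line entry points named in this README.
COMMANDS = ("cellitac", "cellitac-preprocess", "cellitac-model")


def installation_status():
    """Report whether the package and its commands can be found."""
    status = {"package": importlib.util.find_spec("cellitac") is not None}
    for command in COMMANDS:
        status[command] = shutil.which(command) is not None
    return status


if __name__ == "__main__":
    for name, found in installation_status().items():
        print(f"{name}: {'found' if found else 'not found'}")
```

If `package` is found but the commands are not, the Conda environment is probably not activated in the current shell.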


Quick Start

Download test data (PBMC 3k cells, ~560 MB)

mkdir -p ~/data && cd ~/data

wget https://cf.10xgenomics.com/samples/cell-arc/2.0.0/pbmc_granulocyte_sorted_3k/pbmc_granulocyte_sorted_3k_filtered_feature_bc_matrix.h5
wget https://cf.10xgenomics.com/samples/cell-arc/2.0.0/pbmc_granulocyte_sorted_3k/pbmc_granulocyte_sorted_3k_atac_fragments.tsv.gz
wget https://cf.10xgenomics.com/samples/cell-arc/2.0.0/pbmc_granulocyte_sorted_3k/pbmc_granulocyte_sorted_3k_atac_fragments.tsv.gz.tbi
wget https://cf.10xgenomics.com/samples/cell-arc/2.0.0/pbmc_granulocyte_sorted_3k/pbmc_granulocyte_sorted_3k_atac_peaks.bed
wget https://cf.10xgenomics.com/samples/cell-arc/2.0.0/pbmc_granulocyte_sorted_3k/pbmc_granulocyte_sorted_3k_per_barcode_metrics.csv

Run the pipeline

conda activate cellitac
cellitac --input ~/data --output ~/results

Full Dataset (PBMC 10k)

mkdir -p ~/data && cd ~/data

wget https://cf.10xgenomics.com/samples/cell-arc/1.0.0/pbmc_unsorted_10k/pbmc_unsorted_10k_filtered_feature_bc_matrix.h5
wget https://cf.10xgenomics.com/samples/cell-arc/1.0.0/pbmc_unsorted_10k/pbmc_unsorted_10k_per_barcode_metrics.csv
wget https://cf.10xgenomics.com/samples/cell-arc/1.0.0/pbmc_unsorted_10k/pbmc_unsorted_10k_atac_fragments.tsv.gz
wget https://cf.10xgenomics.com/samples/cell-arc/1.0.0/pbmc_unsorted_10k/pbmc_unsorted_10k_atac_fragments.tsv.gz.tbi
wget https://cf.10xgenomics.com/samples/cell-arc/1.0.0/pbmc_unsorted_10k/pbmc_unsorted_10k_atac_peaks.bed

Note: cellitac auto-detects file names — your files do not need to follow the 10x naming convention.


Usage

Command Line

# Full pipeline (preprocessing + ML)
cellitac --input ~/data --output my_results

# Preprocessing only
cellitac-preprocess --input ~/data --output my_results

# ML only (if preprocessing already done)
cellitac-model --data my_results/python_ready_data --output my_results/ml

Python API

from cellitac import run_full_pipeline, run_preprocessing, run_model

# Full pipeline
run_full_pipeline(input_dir="~/data", output_dir="my_results")

# Preprocessing only
run_preprocessing(input_dir="~/data", output_dir_python="python_ready_data")

# ML only
run_model(data_dir="python_ready_data", output_dir="ml_results")

# Use the ML class directly
from cellitac.mainModel import scATACMLPipeline
pipeline = scATACMLPipeline(data_dir="python_ready_data", output_dir="ml_results")
pipeline.run_complete_pipeline()
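
The two stages can also be chained from a script with explicit, absolute paths. The sketch below uses only the `run_preprocessing` and `run_model` keyword arguments shown above; the helper names `plan_run` and `run_two_stage` are hypothetical, and the output layout (`python_ready_data` and `ml` under one root) simply mirrors the command-line example. Expanding `~` up front is a precaution, since this README does not say whether the API expands it.

```python
from pathlib import Path


def plan_run(input_dir, output_root):
    """Resolve the directories for a two-stage run to absolute paths."""
    root = Path(output_root).expanduser().resolve()
    return {
        "input_dir": str(Path(input_dir).expanduser().resolve()),
        "data_dir": str(root / "python_ready_data"),
        "ml_dir": str(root / "ml"),
    }


def run_two_stage(input_dir, output_root):
    """Run preprocessing, then ML, on the planned directories."""
    # Imported here so plan_run stays usable without cellitac installed.
    from cellitac import run_model, run_preprocessing

    paths = plan_run(input_dir, output_root)
    run_preprocessing(input_dir=paths["input_dir"],
                      output_dir_python=paths["data_dir"])
    run_model(data_dir=paths["data_dir"], output_dir=paths["ml_dir"])
    return paths
```

Calling `run_two_stage("~/data", "my_results")` would then be roughly equivalent to running `cellitac-preprocess` followed by `cellitac-model`, assuming the API writes only to the directories it is given.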

Input Files

| File | Extension | Required |
|------|-----------|----------|
| Feature-barcode matrix | .h5 | ✅ Yes |
| ATAC fragments | .tsv.gz | ✅ Yes |
| Fragments index | .tsv.gz.tbi | ✅ Yes |
| Peaks BED file | .bed | ✅ Yes |
| Per-barcode QC metrics | .csv | ⭕ Optional |
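
Before launching a long run, it can help to confirm that an input directory contains one file of each required type. The sketch below matches files by extension only; it is not cellitac's own detection logic, and the role names (`matrix`, `fragments`, and so on) are labels invented for this example.

```python
from pathlib import Path

# (suffix, role, required). Longest suffixes come first so that
# ".tsv.gz.tbi" is not mistaken for ".tsv.gz".
EXPECTED = [
    (".tsv.gz.tbi", "fragments_index", True),
    (".tsv.gz", "fragments", True),
    (".h5", "matrix", True),
    (".bed", "peaks", True),
    (".csv", "barcode_metrics", False),
]


def classify_inputs(input_dir):
    """Map each role to the first matching file in `input_dir`."""
    found = {}
    for path in sorted(Path(input_dir).expanduser().iterdir()):
        if not path.is_file():
            continue
        for suffix, role, _required in EXPECTED:
            if path.name.endswith(suffix):
                found.setdefault(role, path)  # keep the first match per role
                break
    return found


def missing_required(found):
    """List the required roles that have no matching file."""
    return [role for _suffix, role, required in EXPECTED
            if required and role not in found]
```

An empty list from `missing_required(classify_inputs("~/data"))` means every required file type is present.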

Output Files

| File | Description |
|------|-------------|
| ml_pipeline_report.json | Full JSON report |
| model_performance_summary.csv | Accuracy / F1 / AUC per model |
| detailed_model_results.xlsx | Per-class metrics, CV results |
| model_performance_comparison.png | Bar chart comparison |
| confusion_matrices.png | Confusion matrices |
| class_distribution_analysis.png | Cell type distribution |
| class_balancing_comparison.png | Before/after SMOTE |
| feature_importance.png | RF + XGBoost top 20 features |
| simple_feature_heatmap.png | Feature importance heatmap |
| overfitting_analysis.png | CV train vs validation |
| learning_curves.png | Learning curves per model |
| performance_radar.png | Radar chart |
| feature_distributions.png | Violin plots |
| class_separation_pca.png | PCA scatter |
| basic_tf_network.png | Feature–cell-type network |
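
The CSV summary can be inspected without extra dependencies. The sketch below reads `model_performance_summary.csv` with the standard library and picks the row with the highest value in a metric column. The exact column headers are not documented here, so the metric name is passed in by the caller; print the first row's keys to see what the file actually contains.

```python
import csv
from pathlib import Path


def load_summary(path):
    """Read the summary CSV into a list of dicts, one per row."""
    with Path(path).open(newline="") as handle:
        return list(csv.DictReader(handle))


def best_row(rows, metric):
    """Return the row with the highest numeric value in `metric`, or None."""
    scored = []
    for row in rows:
        try:
            scored.append((float(row[metric]), row))
        except (KeyError, TypeError, ValueError):
            continue  # skip rows where the metric is missing or not numeric
    if not scored:
        return None
    return max(scored, key=lambda pair: pair[0])[1]
```

For example, `rows = load_summary("my_results/ml/model_performance_summary.csv")` followed by `print(rows[0].keys())` lists the available columns, assuming the file sits in the ML output directory used earlier.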

Package Structure

cellitac/
├── src/cellitac/
│   ├── __init__.py          # Public API
│   ├── config.py            # Parameters (paths, QC thresholds, ML hyperparams)
│   ├── pipeline.py          # run_preprocessing, run_model, run_full_pipeline
│   ├── preprocessing.py     # R preprocessing via rpy2
│   ├── mainModel.py         # scATACMLPipeline class (19-step ML pipeline)
│   ├── cli.py               # cellitac / cellitac-preprocess / cellitac-model
│   └── rscripts/
│       ├── team1_rna.R      # Seurat + SingleR
│       └── team2_atac.R     # Signac
├── tests/
│   └── test_model.py
├── pyproject.toml
└── README.md

Troubleshooting

| Problem | Solution |
|---------|----------|
| `conda activate cellitac` not working | Run `conda init`, then restart the terminal |
| R packages fail to install | Make sure you installed from conda first (Step 2) before BiocManager (Step 3) |
| `hdf5r` error | Run `conda install -c conda-forge hdf5 r-hdf5r -y` |
| `peak_region_fragments` not found | Normal for some datasets; the pipeline continues automatically |
| `slot` deprecated error | Make sure you have the latest cellitac version: `pip install --upgrade cellitac` |

Tests

pip install "cellitac[dev]"
pytest tests/ -v

Contributors

1. Rana H. Abu-Zeid 📧 ranahamed2111@gmail.com
2. Syrus Semawule 📧 semawulesyrus@gmail.com
3. Emmanuel Aroma 📧 emmatitusaroma@gmail.com
4. Toheeb Jumah 📧 jumahtoheeb@gmail.com
5. Derek Reiman, Ph.D. 📧 dreiman@ttic.edu
6. Olaitan I. Awe, Ph.D. 📧 laitanawe@gmail.com


License

MIT
