Cell type Identification using Transcription factor Analysis and Chromatin accessibility
cellitac
A pipeline for processing single-cell ATAC + RNA multiome data and classifying cell types with machine learning.
What It Does
| Stage | Steps | Tools |
|---|---|---|
| Preprocessing | RNA QC → normalization → cell-type annotation | Seurat + SingleR (R via rpy2) |
| Preprocessing | ATAC QC → TF-IDF → LSI | Signac (R via rpy2) |
| Preprocessing | RNA + ATAC integration → ML-ready CSVs | Pure Python |
| ML | Imbalance analysis → SMOTE → feature selection | scikit-learn, imbalanced-learn |
| ML | RF + XGBoost + SVM training & evaluation | scikit-learn, xgboost |
| ML | 19 plots + JSON report + XLSX | matplotlib, seaborn, networkx |
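To make the TF-IDF step in the ATAC preprocessing row more concrete, the sketch below shows one common formulation for a peaks-by-cells count matrix, log(1 + TF × IDF × scale). It is plain Python for illustration only: cellitac delegates this step to Signac in R, and the function name `tfidf_log`, the `scale` default, and the toy matrix are assumptions rather than part of the package.

```python
import math

def tfidf_log(counts, scale=1e4):
    """Illustrative TF-IDF for a peaks x cells matrix given as a list of rows.

    TF  = count / total counts in that cell (column sum)
    IDF = number of cells / total counts for that peak (row sum)
    The result is log(1 + TF * IDF * scale). This is a sketch, not cellitac code.
    """
    n_peaks = len(counts)
    n_cells = len(counts[0]) if n_peaks else 0
    cell_totals = [sum(row[j] for row in counts) for j in range(n_cells)]
    peak_totals = [sum(row) for row in counts]

    weighted = []
    for i, row in enumerate(counts):
        # Peaks with no counts get an IDF of 0 to avoid dividing by zero.
        idf = n_cells / peak_totals[i] if peak_totals[i] else 0.0
        new_row = []
        for j, count in enumerate(row):
            tf = count / cell_totals[j] if cell_totals[j] else 0.0
            new_row.append(math.log1p(tf * idf * scale))
        weighted.append(new_row)
    return weighted

# Toy matrix: 2 peaks (rows) x 2 cells (columns), made up for the example.
toy = [[1, 0], [1, 2]]
weighted = tfidf_log(toy)
```

In the toy matrix the first peak is open in only one cell, so it receives a larger weight in that cell than the second peak, which is open in both.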
⚠️ Note: cellitac has been developed and tested on PBMC (Peripheral Blood Mononuclear Cells) multiome data. Performance on other cell types or tissues may vary.
Requirements
Before installing cellitac, you need:
- Linux or macOS (Ubuntu 20.04+ recommended)
- Python 3.9 – 3.12 (not 3.13+)
- Conda or Miniconda
- ~5 GB free disk space
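The requirements above can be checked before installation with a short script. This is a minimal sketch using only the standard library; the helper names (`python_supported`, `free_gb`, `conda_available`) are hypothetical and not part of cellitac, and the 5 GB figure is simply the approximate value quoted in the list.

```python
import shutil
import sys

def python_supported(version_info=sys.version_info):
    """True if the interpreter is within the stated 3.9 to 3.12 range."""
    return (3, 9) <= tuple(version_info[:2]) <= (3, 12)

def free_gb(path="."):
    """Free disk space, in gigabytes, on the filesystem containing `path`."""
    return shutil.disk_usage(path).free / 1e9

def conda_available():
    """True if a `conda` executable is found on PATH."""
    return shutil.which("conda") is not None

if __name__ == "__main__":
    print("Python version OK:", python_supported())
    print("Conda on PATH:    ", conda_available())
    print("Free disk (GB):   ", round(free_gb(), 1), "(about 5 GB needed)")
```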
Installation
Step 1 — Create a Conda environment
conda create -n cellitac python=3.11 -y
conda activate cellitac
Step 2 — Install R and core R libraries via conda
conda install -c conda-forge r-base=4.4.3 -y
conda install -c conda-forge -c bioconda \
r-matrix r-hdf5r rpy2 \
bioconductor-summarizedexperiment \
bioconductor-singlecellexperiment \
bioconductor-genomicranges \
bioconductor-delayedarray \
bioconductor-biocsingular \
bioconductor-biocneighbors \
bioconductor-genomicalignments \
bioconductor-genomicfeatures \
bioconductor-rtracklayer -y
Step 3 — Install remaining R packages (takes 10–30 min)
Rscript -e "install.packages('BiocManager', repos='https://cran.r-project.org')"
Rscript -e "BiocManager::install(c(
'Seurat', 'Signac', 'SingleR', 'celldex',
'EnsDb.Hsapiens.v75', 'biovizBase', 'data.table'
), ask=FALSE)"
Step 4 — Install cellitac
pip install cellitac
Step 5 — Verify installation
cellitac --help
If you see the help message, you are ready to go ✅
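The same check can be scripted, for example in a setup test on a shared machine. The sketch below only looks for the three command names listed under Usage on PATH; it does not run them, and `missing_commands` is a hypothetical helper rather than a cellitac function.

```python
import shutil

# Command names taken from the Usage section of this README.
COMMANDS = ("cellitac", "cellitac-preprocess", "cellitac-model")

def missing_commands(commands=COMMANDS, which=shutil.which):
    """Return the command names that cannot be found on PATH."""
    return [name for name in commands if which(name) is None]

if __name__ == "__main__":
    missing = missing_commands()
    if missing:
        print("Not found on PATH:", ", ".join(missing))
        print("Is the cellitac conda environment activated?")
    else:
        print("All cellitac commands are available.")
```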
Quick Start
Download test data (PBMC 3k cells, ~560 MB)
mkdir -p ~/data && cd ~/data
wget https://cf.10xgenomics.com/samples/cell-arc/2.0.0/pbmc_granulocyte_sorted_3k/pbmc_granulocyte_sorted_3k_filtered_feature_bc_matrix.h5
wget https://cf.10xgenomics.com/samples/cell-arc/2.0.0/pbmc_granulocyte_sorted_3k/pbmc_granulocyte_sorted_3k_atac_fragments.tsv.gz
wget https://cf.10xgenomics.com/samples/cell-arc/2.0.0/pbmc_granulocyte_sorted_3k/pbmc_granulocyte_sorted_3k_atac_fragments.tsv.gz.tbi
wget https://cf.10xgenomics.com/samples/cell-arc/2.0.0/pbmc_granulocyte_sorted_3k/pbmc_granulocyte_sorted_3k_atac_peaks.bed
wget https://cf.10xgenomics.com/samples/cell-arc/2.0.0/pbmc_granulocyte_sorted_3k/pbmc_granulocyte_sorted_3k_per_barcode_metrics.csv
Run the pipeline
conda activate cellitac
cellitac --input ~/data --output ~/results
Full Dataset (PBMC 10k)
mkdir -p ~/data && cd ~/data
wget https://cf.10xgenomics.com/samples/cell-arc/1.0.0/pbmc_unsorted_10k/pbmc_unsorted_10k_filtered_feature_bc_matrix.h5
wget https://cf.10xgenomics.com/samples/cell-arc/1.0.0/pbmc_unsorted_10k/pbmc_unsorted_10k_per_barcode_metrics.csv
wget https://cf.10xgenomics.com/samples/cell-arc/1.0.0/pbmc_unsorted_10k/pbmc_unsorted_10k_atac_fragments.tsv.gz
wget https://cf.10xgenomics.com/samples/cell-arc/1.0.0/pbmc_unsorted_10k/pbmc_unsorted_10k_atac_fragments.tsv.gz.tbi
wget https://cf.10xgenomics.com/samples/cell-arc/1.0.0/pbmc_unsorted_10k/pbmc_unsorted_10k_atac_peaks.bed
Note: cellitac auto-detects file names — your files do not need to follow the 10x naming convention.
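Both download lists above follow the same URL pattern, so they can also be generated from Python where `wget` is not available. This is a sketch under that assumption; `dataset_urls` and `download` are hypothetical helpers, and the pattern is inferred only from the two datasets shown here.

```python
import urllib.request
from pathlib import Path

BASE = "https://cf.10xgenomics.com/samples/cell-arc"

# File suffixes shared by the 3k and 10k download lists above.
SUFFIXES = (
    "filtered_feature_bc_matrix.h5",
    "atac_fragments.tsv.gz",
    "atac_fragments.tsv.gz.tbi",
    "atac_peaks.bed",
    "per_barcode_metrics.csv",
)

def dataset_urls(release, sample):
    """Build the five download URLs for one 10x multiome sample."""
    return [f"{BASE}/{release}/{sample}/{sample}_{suffix}" for suffix in SUFFIXES]

def download(urls, dest_dir):
    """Download each URL into dest_dir, skipping files that already exist."""
    dest = Path(dest_dir).expanduser()
    dest.mkdir(parents=True, exist_ok=True)
    for url in urls:
        target = dest / url.rsplit("/", 1)[-1]
        if not target.exists():
            urllib.request.urlretrieve(url, target)

urls_3k = dataset_urls("2.0.0", "pbmc_granulocyte_sorted_3k")
urls_10k = dataset_urls("1.0.0", "pbmc_unsorted_10k")
# download(urls_3k, "~/data")  # uncomment to fetch the test data
```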
Usage
Command Line
# Full pipeline (preprocessing + ML)
cellitac --input ~/data --output my_results
# Preprocessing only
cellitac-preprocess --input ~/data --output my_results
# ML only (if preprocessing already done)
cellitac-model --data my_results/python_ready_data --output my_results/ml
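When several samples need processing, the full-pipeline command can be assembled from a script. The sketch below only builds the argument list using the `--input` and `--output` flags shown above; the sample names and directories are made up, and `cellitac_command` is a hypothetical helper.

```python
import os
import shlex

def cellitac_command(input_dir, output_dir):
    """Argument list for one full-pipeline run, with ~ expanded up front."""
    return [
        "cellitac",
        "--input", os.path.expanduser(input_dir),
        "--output", os.path.expanduser(output_dir),
    ]

# Hypothetical samples: name -> directory holding that sample's input files.
samples = {"donor_a": "data/donor_a", "donor_b": "data/donor_b"}

commands = [cellitac_command(path, f"results/{name}") for name, path in samples.items()]
for cmd in commands:
    # Print a shell-safe version; pass `cmd` to subprocess.run(cmd, check=True) to execute.
    print(shlex.join(cmd))
```

Passing a list to `subprocess.run` avoids shell quoting problems when directory names contain spaces.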
Python API
from cellitac import run_full_pipeline, run_preprocessing, run_model
# Full pipeline
run_full_pipeline(input_dir="~/data", output_dir="my_results")
# Preprocessing only
run_preprocessing(input_dir="~/data", output_dir_python="python_ready_data")
# ML only
run_model(data_dir="python_ready_data", output_dir="ml_results")
# Use the ML class directly
from cellitac.mainModel import scATACMLPipeline
pipeline = scATACMLPipeline(data_dir="python_ready_data", output_dir="ml_results")
pipeline.run_complete_pipeline()
Input Files
| File | Extension | Required |
|---|---|---|
| Feature-barcode matrix | .h5 | ✅ Yes |
| ATAC fragments | .tsv.gz | ✅ Yes |
| Fragments index | .tsv.gz.tbi | ✅ Yes |
| Peaks BED file | .bed | ✅ Yes |
| Per-barcode QC metrics | .csv | ⭕ Optional |
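Before starting a long run it can help to confirm that a directory contains one file of each required type. The sketch below matches files by the extensions in the table; it is not cellitac's own detection logic, and `find_inputs`, `missing_required`, and the role names are assumptions made for the example.

```python
import tempfile
from pathlib import Path

# Extensions from the Input Files table; role names are made up for this sketch.
REQUIRED = {
    "matrix": ".h5",
    "fragments": ".tsv.gz",
    "fragments_index": ".tsv.gz.tbi",
    "peaks": ".bed",
}
OPTIONAL = {"barcode_metrics": ".csv"}

def find_inputs(input_dir):
    """Map each role to the first matching file name (sorted), or None."""
    names = sorted(p.name for p in Path(input_dir).iterdir() if p.is_file())
    found = {}
    for role, suffix in {**REQUIRED, **OPTIONAL}.items():
        matches = [name for name in names if name.endswith(suffix)]
        found[role] = matches[0] if matches else None
    return found

def missing_required(found):
    """Roles from REQUIRED that have no matching file."""
    return [role for role in REQUIRED if found.get(role) is None]

# Demonstration on a temporary directory that lacks a peaks file.
with tempfile.TemporaryDirectory() as tmp:
    for name in ("sample.h5", "frags.tsv.gz", "frags.tsv.gz.tbi"):
        (Path(tmp) / name).touch()
    demo_found = find_inputs(tmp)
    demo_missing = missing_required(demo_found)
```

Matching on the full suffix keeps the fragments file and its `.tbi` index apart, since a name ending in `.tsv.gz.tbi` does not end in `.tsv.gz`.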
Output Files
| File | Description |
|---|---|
| ml_pipeline_report.json | Full JSON report |
| model_performance_summary.csv | Accuracy / F1 / AUC per model |
| detailed_model_results.xlsx | Per-class metrics, CV results |
| model_performance_comparison.png | Bar chart comparison |
| confusion_matrices.png | Confusion matrices |
| class_distribution_analysis.png | Cell type distribution |
| class_balancing_comparison.png | Before/after SMOTE |
| feature_importance.png | RF + XGBoost top 20 features |
| simple_feature_heatmap.png | Feature importance heatmap |
| overfitting_analysis.png | CV train vs validation |
| learning_curves.png | Learning curves per model |
| performance_radar.png | Radar chart |
| feature_distributions.png | Violin plots |
| class_separation_pca.png | PCA scatter |
| basic_tf_network.png | Feature–cell-type network |
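After a run, a quick script can confirm that the tabular outputs exist and show what the JSON report contains. This sketch assumes nothing about the report beyond it being valid JSON, so it only lists top-level keys; the helper names and the placeholder report written in the demonstration are hypothetical, and the exact output directory depends on how the pipeline was invoked.

```python
import json
import tempfile
from pathlib import Path

# Non-image outputs named in the Output Files table.
EXPECTED_TABLES = (
    "ml_pipeline_report.json",
    "model_performance_summary.csv",
    "detailed_model_results.xlsx",
)

def missing_outputs(output_dir, expected=EXPECTED_TABLES):
    """Return expected file names that are absent from output_dir."""
    out = Path(output_dir)
    return [name for name in expected if not (out / name).is_file()]

def report_sections(report_path):
    """Sorted top-level keys of the JSON report (empty list if not an object)."""
    with open(report_path, encoding="utf-8") as handle:
        data = json.load(handle)
    return sorted(data) if isinstance(data, dict) else []

# Demonstration with a placeholder report; real reports will have other keys.
with tempfile.TemporaryDirectory() as tmp:
    report = Path(tmp) / "ml_pipeline_report.json"
    report.write_text(json.dumps({"placeholder_b": 1, "placeholder_a": 2}), encoding="utf-8")
    demo_missing = missing_outputs(tmp)
    demo_sections = report_sections(report)
```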
Package Structure
cellitac/
├── src/cellitac/
│ ├── __init__.py # Public API
│ ├── config.py # Parameters (paths, QC thresholds, ML hyperparams)
│ ├── pipeline.py # run_preprocessing, run_model, run_full_pipeline
│ ├── preprocessing.py # R preprocessing via rpy2
│ ├── mainModel.py # scATACMLPipeline class (19-step ML pipeline)
│ ├── cli.py # cellitac / cellitac-preprocess / cellitac-model
│ └── rscripts/
│ ├── team1_rna.R # Seurat + SingleR
│ └── team2_atac.R # Signac
├── tests/
│ └── test_model.py
├── pyproject.toml
└── README.md
Troubleshooting
| Problem | Solution |
|---|---|
| conda activate cellitac not working | Run conda init, then restart the terminal |
| R packages fail to install | Make sure you installed the conda packages (Step 2) before running BiocManager (Step 3) |
| hdf5r error | Run conda install -c conda-forge hdf5 r-hdf5r -y |
| peak_region_fragments not found | Normal for some datasets; the pipeline continues automatically |
| slot deprecated error | Make sure you have the latest cellitac version: pip install --upgrade cellitac |
Tests
pip install "cellitac[dev]"
pytest tests/ -v
Contributors
1. Rana H. Abu-Zeid <ranahamed2111@gmail.com>
2. Syrus Semawule <semawulesyrus@gmail.com>
3. Emmanuel Aroma <emmatitusaroma@gmail.com>
4. Toheeb Jumah <jumahtoheeb@gmail.com>
5. Derek Reiman, Ph.D. <dreiman@ttic.edu>
6. Olaitan I. Awe, Ph.D. <laitanawe@gmail.com>
License
MIT