Cell type identification using Transcription factor Analysis and Chromatin accessibility

scatactf

Single-Cell ATAC + RNA Multiome Processing & ML Classification Pipeline


What It Does

| Stage | Steps | Tools |
|---|---|---|
| Preprocessing | RNA QC → normalization → cell-type annotation | Seurat + SingleR (R via rpy2) |
| Preprocessing | ATAC QC → TF-IDF → LSI | Signac (R via rpy2) |
| Preprocessing | RNA + ATAC integration → ML-ready CSVs | Pure Python |
| ML | Imbalance analysis → SMOTE → feature selection | scikit-learn, imbalanced-learn |
| ML | RF + XGBoost + SVM training & evaluation | scikit-learn, xgboost |
| ML | 19 plots + JSON report + XLSX | matplotlib, seaborn, networkx |
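
To make the ATAC "TF-IDF" step more concrete, here is a minimal pure-Python sketch of one widely used variant, log(1 + TF × IDF × scale factor), applied to a toy peak-by-cell count matrix. It is an illustration only and not the code in team2_atac.R (the pipeline delegates this step to Signac); the function name `tfidf`, the default scale factor, and the toy counts are assumptions made for this example.

```python
import math


def tfidf(counts, scale_factor=1e4):
    """Weight a peak-by-cell count matrix with log(1 + TF * IDF * scale_factor).

    counts: list of rows (one per peak), each a list of per-cell counts.
    """
    n_peaks = len(counts)
    n_cells = len(counts[0])
    # Column sums: total counts per cell, used as the term-frequency denominator.
    cell_totals = [sum(counts[p][c] for p in range(n_peaks)) for c in range(n_cells)]
    # Row sums: total counts per peak, used for the inverse document frequency.
    peak_totals = [sum(row) for row in counts]

    weighted = []
    for p, row in enumerate(counts):
        idf = n_cells / peak_totals[p] if peak_totals[p] else 0.0
        new_row = []
        for c, value in enumerate(row):
            tf = value / cell_totals[c] if cell_totals[c] else 0.0
            new_row.append(math.log1p(tf * idf * scale_factor))
        weighted.append(new_row)
    return weighted


# Hypothetical toy data: 3 peaks (rows) x 3 cells (columns).
toy_counts = [
    [2, 0, 1],
    [0, 3, 1],
    [1, 1, 0],
]
weights = tfidf(toy_counts)
```

Zero counts stay at zero after weighting, so the sparsity pattern of the input is preserved, which is one reason this kind of transform is a common choice before LSI.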

Installation

Option A – Local / Team (pip install -e)

git clone https://github.com/your-org/scatactf.git
cd scatactf

# Install R packages (run once from the shell; requires R on your PATH)
Rscript -e "
  install.packages('BiocManager', repos = 'https://cloud.r-project.org')
  BiocManager::install(c(
    'Seurat', 'Signac', 'SingleR', 'celldex',
    'SingleCellExperiment', 'GenomicRanges',
    'EnsDb.Hsapiens.v75', 'biovizBase', 'hdf5r'
  ))
"

# Install Python package
pip install -e ".[dev]"

Option B – PyPI

pip install scatactf
# R and the R packages listed under Option A must be installed separately

Option C – Docker (recommended for full reproducibility)

docker build -t scatactf:1.0.0 -f docker/Dockerfile .

docker run --rm \
  -v /your/data:/data \
  -v "$(pwd)/results":/results \
  scatactf:1.0.0 \
  --input /data --output /results

Data Download

Download the 10x Genomics PBMC dataset (healthy donor, no cell sorting, 10k) from:

https://www.10xgenomics.com/datasets/pbmc-from-a-healthy-donor-no-cell-sorting-10-k-1-standard-1-0-0

Required files (place in your --input directory):

pbmc_unsorted_10k_filtered_feature_bc_matrix.h5
pbmc_unsorted_10k_per_barcode_metrics.csv
pbmc_unsorted_10k_atac_fragments.tsv.gz
pbmc_unsorted_10k_atac_fragments.tsv.gz.tbi
pbmc_unsorted_10k_atac_peaks.bed
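
Before starting a long run, it can help to confirm that all five files are actually in place. The following is a small standalone sketch using only the standard library; the helper name `missing_inputs` is hypothetical and is not part of the scatactf API.

```python
from pathlib import Path

# File names copied from the "Required files" list above.
REQUIRED_FILES = [
    "pbmc_unsorted_10k_filtered_feature_bc_matrix.h5",
    "pbmc_unsorted_10k_per_barcode_metrics.csv",
    "pbmc_unsorted_10k_atac_fragments.tsv.gz",
    "pbmc_unsorted_10k_atac_fragments.tsv.gz.tbi",
    "pbmc_unsorted_10k_atac_peaks.bed",
]


def missing_inputs(input_dir):
    """Return the required file names that are absent from input_dir."""
    root = Path(input_dir).expanduser()
    return [name for name in REQUIRED_FILES if not (root / name).is_file()]


# Example usage:
# missing = missing_inputs("~/singlecell/ATAC")
# if missing:
#     raise SystemExit("Missing input files: " + ", ".join(missing))
```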

Usage

Command Line

# Full pipeline (preprocessing + ML)
scatactf --input ~/singlecell/ATAC --output my_results

# Preprocessing only (generates python_ready_data/)
scatactf-preprocess --input ~/singlecell/ATAC --output my_results

# ML only (if you already have python_ready_data/)
scatactf-model --data my_results/python_ready_data --output my_results/ml

Python API

from scatactf import run_full_pipeline, run_preprocessing, run_model

# Full pipeline
run_full_pipeline(input_dir="~/singlecell/ATAC", output_dir="my_results")

# Preprocessing only
run_preprocessing(input_dir="~/singlecell/ATAC", output_dir_python="python_ready_data")

# ML only
run_model(data_dir="python_ready_data", output_dir="ml_results")

# Use the ML class directly for more control
from scatactf.mainModel import scATACMLPipeline
pipeline = scATACMLPipeline(data_dir="python_ready_data", output_dir="ml_results")
pipeline.run_complete_pipeline()

Environment Variables

export SCATAC_INPUT_DIR=~/singlecell/ATAC
export SCATAC_OUT_ML=ml_results
scatactf
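
If you launch the CLI from Python rather than a shell, the same two variables can be passed through the child process environment. This is a sketch under assumptions: the helper `build_env` is hypothetical, and only the two variable names shown above are taken from this README. Note that the shell expands `~` in the `export` line, whereas Python does not, so the sketch expands it explicitly.

```python
import os


def build_env(input_dir, ml_output_dir, base_env=None):
    """Return a copy of the environment with the two scatactf variables set."""
    env = dict(os.environ if base_env is None else base_env)
    # The shell expands "~" during `export`; in Python we must do it ourselves.
    env["SCATAC_INPUT_DIR"] = os.path.expanduser(input_dir)
    env["SCATAC_OUT_ML"] = ml_output_dir
    return env


# Example usage (requires scatactf to be installed and on PATH):
# import subprocess
# subprocess.run(
#     ["scatactf"],
#     env=build_env("~/singlecell/ATAC", "ml_results"),
#     check=True,
# )
```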

Output Files

ml_results/

| File | Description |
|---|---|
| ml_pipeline_report.json | Full JSON report |
| model_performance_summary.csv | Accuracy/F1/AUC per model |
| detailed_model_results.xlsx | Per-class metrics, CV results |
| model_performance_comparison.png | Bar chart comparison |
| confusion_matrices.png | Confusion matrices |
| class_distribution_analysis.png | Cell type distribution |
| class_balancing_comparison.png | Before/after SMOTE |
| feature_importance.png | RF + XGBoost top 20 features |
| simple_feature_heatmap.png | Feature importance heatmap |
| overfitting_analysis.png | CV train vs validation |
| learning_curves.png | Learning curves per model |
| performance_radar.png | Radar chart |
| feature_distributions.png | Violin plots |
| class_separation_pca.png | PCA scatter |
| basic_tf_network.png | Feature–cell-type network |
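
The summary CSV is the easiest of these outputs to inspect programmatically. Below is a minimal sketch that loads it with the standard library; the helper name `load_performance_summary` is hypothetical, and because the exact column names are not documented here, the code returns each row as a dictionary keyed by whatever header the file contains rather than assuming specific columns.

```python
import csv
from pathlib import Path


def load_performance_summary(results_dir):
    """Read model_performance_summary.csv into a list of dicts (one per row)."""
    path = Path(results_dir).expanduser() / "model_performance_summary.csv"
    with path.open(newline="") as handle:
        return list(csv.DictReader(handle))


# Example usage:
# for row in load_performance_summary("ml_results"):
#     print(row)
```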

Package Structure

scatactf/
├── src/scatactf/
│   ├── __init__.py          # Public API
│   ├── _version.py
│   ├── config.py            # All parameters (paths, QC thresholds, ML hyperparams)
│   ├── pipeline.py          # run_preprocessing, run_model, run_full_pipeline
│   ├── preprocessing.py     # R preprocessing via rpy2
│   ├── mainModel.py         # scATACMLPipeline class (19-step ML pipeline)
│   ├── cli.py               # scatactf / scatactf-preprocess / scatactf-model
│   └── rscripts/
│       ├── team1_rna.R      # Exact Seurat + SingleR code
│       └── team2_atac.R     # Exact Signac code
├── tests/
│   └── test_model.py
├── pyproject.toml
└── README.md

Tests

pip install -e ".[dev]"
pytest tests/ -v

License

MIT
