The Clustermatch Correlation Coefficient (CCC) with GPU acceleration

Clustermatch Correlation Coefficient GPU (CCC-GPU)

The Clustermatch Correlation Coefficient (CCC) is a highly efficient, next-generation correlation coefficient that captures not-only-linear relationships and works on both numerical and categorical data. CCC-GPU is a GPU-accelerated implementation that uses CUDA to provide significant performance improvements on large-scale datasets.

📖 Full documentation available at: https://ccc-gpu.readthedocs.io/en/latest/

CCC is based on the simple idea of clustering data points and then computing the Adjusted Rand Index (ARI) between the two clusterings. It is a robust and efficient method that can detect linear and non-linear relationships, making it suitable for a wide range of applications in genomics, machine learning, and data science.
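The idea can be sketched in a few lines of dependency-free Python. This is a simplified illustration, not the library's implementation: it assumes quantile-based partitions with 2 to `max_k` clusters per variable and keeps the largest ARI found across all pairs of partitions. The helper names `quantile_labels`, `adjusted_rand_index`, and `ccc_sketch` are hypothetical and exist only for this example.

```python
from collections import Counter
from math import comb
import random


def quantile_labels(values, k):
    """Assign each value to one of k roughly equal-sized clusters by rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    labels = [0] * len(values)
    for rank, idx in enumerate(order):
        labels[idx] = rank * k // len(values)
    return labels


def adjusted_rand_index(a, b):
    """Adjusted Rand Index between two label lists of equal length."""
    n = len(a)
    pairs_both = sum(comb(c, 2) for c in Counter(zip(a, b)).values())
    pairs_a = sum(comb(c, 2) for c in Counter(a).values())
    pairs_b = sum(comb(c, 2) for c in Counter(b).values())
    expected = pairs_a * pairs_b / comb(n, 2)
    max_index = (pairs_a + pairs_b) / 2
    if max_index == expected:  # degenerate case, e.g. a single cluster
        return 1.0
    return (pairs_both - expected) / (max_index - expected)


def ccc_sketch(x, y, max_k=10):
    """Largest ARI over all pairs of quantile partitions, floored at 0."""
    best = 0.0
    for kx in range(2, max_k + 1):
        labels_x = quantile_labels(x, kx)
        for ky in range(2, max_k + 1):
            ari = adjusted_rand_index(labels_x, quantile_labels(y, ky))
            best = max(best, ari)
    return best


rng = random.Random(0)
x = [rng.gauss(0, 1) for _ in range(500)]
quadratic = [v * v + rng.gauss(0, 0.1) for v in x]  # non-linear dependence on x
noise = [rng.gauss(0, 1) for _ in range(500)]       # independent of x

print(f"quadratic: {ccc_sketch(x, quadratic):.3f}")
print(f"noise:     {ccc_sketch(x, noise):.3f}")
```

Because the score depends only on how the two partitions agree, the quadratic relationship is detected even though its linear correlation with x is close to zero.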

Code Structure

  • libs/ccc: Python code for CCC-GPU
  • libs/ccc_cuda_ext: CUDA C++ code for CCC-GPU
  • tests: Test suites
  • nbs: Notebooks for analysis and visualization

Installation

Requirements

Hardware:

  • Nvidia GPU with CUDA Compute Capability 8.6 or higher

Software:

  • OS: Linux x86_64 distributions using glibc 2.28 or later, including:
    • Debian 10+
    • Ubuntu 18.10+
    • Fedora 29+
    • CentOS/RHEL 8+
  • Python 3.10 to 3.14
  • Nvidia driver with CUDA 12.5 or higher (for GPU acceleration)

Note: Use the nvidia-smi command to check your Nvidia driver and CUDA versions.

Note: If you use another operating system, or an architecture other than x86_64, you need to build from source.
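The requirements above can be checked with a short script. The following is a minimal sketch rather than an official tool: the function names `parse_version` and `check_requirements` are hypothetical, and it assumes the installed driver is recent enough for nvidia-smi to report the `compute_cap` query field.

```python
import platform
import shutil
import subprocess
import sys


def parse_version(text):
    """Turn a dotted version string such as '2.28' into a tuple of ints."""
    return tuple(int(part) for part in text.split(".") if part.isdigit())


def check_requirements():
    """Return a list of human-readable problems; empty means all checks passed."""
    problems = []

    if sys.version_info < (3, 10):
        problems.append(f"Python 3.10+ required, found {platform.python_version()}")

    if platform.system() != "Linux" or platform.machine() != "x86_64":
        problems.append("prebuilt wheels target Linux x86_64; build from source")
    else:
        libc, version = platform.libc_ver()
        if libc == "glibc" and parse_version(version) < (2, 28):
            problems.append(f"glibc 2.28+ required, found {version}")

    if shutil.which("nvidia-smi") is None:
        problems.append("nvidia-smi not found; is the Nvidia driver installed?")
    else:
        # Assumption: the driver supports the compute_cap query field.
        result = subprocess.run(
            ["nvidia-smi", "--query-gpu=compute_cap", "--format=csv,noheader"],
            capture_output=True,
            text=True,
        )
        caps = [parse_version(line.strip())
                for line in result.stdout.splitlines() if line.strip()]
        if result.returncode != 0 or not caps:
            problems.append("could not query GPU compute capability")
        elif max(caps) < (8, 6):
            problems.append(f"compute capability 8.6+ required, found {max(caps)}")

    return problems


for problem in check_requirements() or ["all checks passed"]:
    print(problem)
```

The script does not check the CUDA version supported by the driver; the header of the plain nvidia-smi output shows it.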

Quick Install with pip

The cccgpu package is available via pip from Test PyPI. Note that cccgpu depends on libstdc++. To avoid compatibility issues with your local system, we recommend installing it inside a dedicated conda environment:

# Create conda environment with required dependencies
conda create -n ccc-gpu-env -c conda-forge python=3.10 pip pytest libstdcxx-ng
conda activate ccc-gpu-env

# Install cccgpu from test PyPI
pip install --index-url https://test.pypi.org/simple/ \
            --extra-index-url https://pypi.org/simple/ \
            --only-binary=cccgpu cccgpu

# Verify installation
python -c "from ccc.coef.impl_gpu import ccc as ccc_gpu; import numpy as np; print(ccc_gpu(np.random.rand(100), np.random.rand(100)))"

Support for more Python versions and architectures is planned.

Note: This installs from Test PyPI while the package is in its testing phase. Once stable, it will be available from the main PyPI repository with pip install cccgpu.

Command options explained:

  • --index-url https://test.pypi.org/simple/: Specifies test PyPI as the primary package index to search for cccgpu
  • --extra-index-url https://pypi.org/simple/: Adds the main PyPI repository as a fallback to install dependencies (numpy, scipy, numba, etc.) that may not be available on test PyPI
  • --only-binary=cccgpu: Ensures that only binary wheels are installed for the cccgpu package, so you don't need to compile it from source
  • cccgpu: The package name to install
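After running the command, you can confirm which version pip installed by using the standard library. This is a small sketch; the helper name `installed_version` is hypothetical, and the distribution name cccgpu is taken from the install command above.

```python
from importlib import metadata


def installed_version(package="cccgpu"):
    """Return the installed version of a distribution, or None if it is absent."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


version = installed_version()
if version is None:
    print("cccgpu is not installed in this environment")
else:
    print(f"cccgpu {version}")
```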

Install from Source

Install from source using the provided conda-lock environment:

1. Clone Repository

# Clone the repository
git clone https://github.com/pivlab/ccc-gpu
cd ccc-gpu

2. Setup Environment with conda-lock

This process uses pipx to install conda-lock in an isolated environment, keeping your base environment clean.

Why conda-lock? We use conda-lock to ensure reproducible installations across different systems. Unlike regular environment.yml files, conda-lock provides exact version pins for all packages and their dependencies, preventing version conflicts and ensuring you get the same environment that was tested during development.

# Install conda-lock using pipx (installs in isolated environment)
pipx install conda-lock

# Create the main ccc-gpu environment from lock file
conda-lock install --name ccc-gpu conda-lock.yml  # or: conda-lock install --name ccc-gpu conda-lock.yml --conda mamba

# Activate the main environment
conda activate ccc-gpu

# Install the package from source
pip install .

Note: If you prefer to use Mamba for faster package resolution, you can install MiniForge which includes Mamba:

curl -L -O "https://github.com/conda-forge/miniforge/releases/latest/download/Miniforge3-$(uname)-$(uname -m).sh"
bash Miniforge3-$(uname)-$(uname -m).sh -b

Then replace conda with mamba in the commands above.

Testing

To run all the test suites, execute the following from the root of the repository:

conda activate ccc-gpu
bash ./scripts/run_tests.sh python

Usage

End-to-End Tutorial

This notebook provides a tutorial that walks through a simplified version of the analysis from our paper, using the public GTEx v8 data.

Basic Usage

CCC-GPU provides a simple API identical to the original CCC implementation:

import numpy as np
# New CCC-GPU implementation import
from ccc.coef.impl_gpu import ccc
# Original CCC implementation import
# from ccc.coef.impl import ccc

# Generate sample data
np.random.seed(0)
x = np.random.randn(1000)
y = x**2 + np.random.randn(1000) * 0.1  # Non-linear relationship

# Compute CCC coefficient
correlation = ccc(x, y)
print(f"CCC coefficient: {correlation:.3f}")

Working with Gene Expression Data

CCC-GPU is particularly useful for genomics applications:

import pandas as pd
# New CCC-GPU implementation import
from ccc.coef.impl_gpu import ccc
# Original CCC implementation import
# from ccc.coef.impl import ccc

# Load gene expression data
# Assume genes are in columns, samples in rows
gene_expr = pd.read_csv('gene_expression.csv', index_col=0)

# Compute gene-gene correlations
gene_correlations = ccc(gene_expr.T)  # Transpose so genes are in rows

# Find highly correlated gene pairs
import numpy as np
from scipy.spatial.distance import squareform

# Convert the condensed result to a square, symmetric matrix
corr_matrix = squareform(gene_correlations)

# Rank each unordered pair once by using only the upper triangle;
# ranking every entry of the symmetric matrix would list each pair twice
rows, cols = np.triu_indices_from(corr_matrix, k=1)
order = np.argsort(corr_matrix[rows, cols])[-10:][::-1]  # highest first
gene_names = gene_expr.columns.tolist()

print("Top 10 gene pairs by CCC:")
for i, j in zip(rows[order], cols[order]):
    print(f"{gene_names[i]} - {gene_names[j]}: {corr_matrix[i, j]:.3f}")
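The example above passes the result of ccc to squareform, which implies the pairwise values arrive as a condensed one-dimensional vector. The sketch below shows how positions in such a vector map to pairs, assuming the same ordering that SciPy's pdist and squareform use; the helper name `condensed_index` and the gene names are made up for illustration.

```python
from itertools import combinations


def condensed_index(i, j, n):
    """Position of the pair (i, j), with i < j, in a condensed vector over n items."""
    if not 0 <= i < j < n:
        raise ValueError("expected 0 <= i < j < n")
    return n * i - i * (i + 1) // 2 + (j - i - 1)


# Hypothetical gene names, used only to show the ordering of pairs.
genes = ["GENE_A", "GENE_B", "GENE_C", "GENE_D"]
n = len(genes)

for i, j in combinations(range(n), 2):
    print(condensed_index(i, j, n), genes[i], genes[j])
```

For n genes the vector has n * (n - 1) / 2 entries, one per unordered pair, so a pair can be looked up directly without building the full square matrix.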

Refer to the original CCC Repository for more usage examples: https://github.com/greenelab/ccc

Controlling Debug Logging

By default, CCC-GPU runs silently without debug output. You can enable detailed logging (including CUDA device information, memory usage, and processing details) using the CCC_GPU_LOGGING environment variable:

# Run with default behavior (no debug output)
python your_script.py

# Enable debug logging for troubleshooting
CCC_GPU_LOGGING=1 python your_script.py

# Or set it for the session
export CCC_GPU_LOGGING=1
python your_script.py

This is particularly useful for:

  • Debugging GPU memory issues
  • Understanding CUDA device utilization
  • Monitoring batch processing performance
  • Troubleshooting installation problems
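If you launch analyses from Python rather than from a shell, the same variable can be passed to a child process. The sketch below mirrors the shell commands above; the helper name `run_with_debug_logging` is hypothetical, and the inline command is only a stand-in for your_script.py that reports what the child process sees.

```python
import os
import subprocess
import sys


def run_with_debug_logging(args):
    """Run a Python command in a child process with CCC_GPU_LOGGING=1 set."""
    env = {**os.environ, "CCC_GPU_LOGGING": "1"}
    return subprocess.run(
        [sys.executable, *args], env=env, capture_output=True, text=True
    )


# Stand-in for `your_script.py`: print the variable as seen by the child.
result = run_with_debug_logging(
    ["-c", "import os; print(os.environ.get('CCC_GPU_LOGGING'))"]
)
print(result.stdout.strip())
```

Setting the variable for the child process only leaves the parent environment unchanged, which matches the one-off form of the shell command.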

Performance Benchmarks

CCC-GPU provides significant performance improvements over CPU-only implementations:

Number of genes    CCC-GPU vs. CCC (12 cores)
500                16.52
1000               30.65
2000               45.72
4000               59.46
6000               67.46
8000               71.48
10000              72.38
16000              73.83
20000              73.88

Benchmarks performed on synthetic gene expression data with 1000 fixed samples. Hardware: AMD Ryzen Threadripper 7960X CPU and an NVIDIA RTX 4090 GPU. Git commit on which the benchmark results were collected: 05f129dfa47ad801eff963b4189484c7c64bd28e
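To give a sense of the workload behind each row, the sketch below counts the gene pairs involved, assuming every unordered pair of genes is compared exactly once. The helper name `pair_count` is hypothetical.

```python
from math import comb


def pair_count(n_genes):
    """Number of unordered gene pairs among n_genes genes."""
    return comb(n_genes, 2)


for n_genes in (500, 1000, 2000, 4000, 6000, 8000, 10000, 16000, 20000):
    print(f"{n_genes:>6} genes -> {pair_count(n_genes):>12,} pairs")
```

The number of pairs grows quadratically with the number of genes, so the largest benchmark covers roughly 1,600 times as many pairs as the smallest.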

Documentation

Build and view the full documentation locally:

cd docs
make html

Then open docs/build/html/index.html in your browser.

If using VS Code, the Live Preview extension provides convenient in-editor viewing.

Citation

If you use CCC-GPU in your research, please cite the CCC-GPU paper:

@article{zhang2025cccgpu,
  title={CCC-GPU: A graphics processing unit (GPU)-optimized nonlinear correlation coefficient for large transcriptomic analyses},
  author={Zhang, Haoyu and Fotso, Kevin and Pividori, Milton},
  journal={bioRxiv},
  year={2025},
  doi={10.1101/2025.06.03.657735},
}

Contributing

Contributions are welcome! Please feel free to submit a Pull Request. For major changes, please open an issue first to discuss what you would like to change.

License

This project is licensed under the BSD 2-Clause License - see the LICENSE file for details.

Acknowledgments

  • Original CCC implementation: https://github.com/greenelab/ccc
  • CUDA development team for the excellent CUDA toolkit
  • pybind11 for seamless Python-C++ integration
