The Clustermatch Correlation Coefficient (CCC) with GPU acceleration
Clustermatch Correlation Coefficient GPU (CCC-GPU)
The Clustermatch Correlation Coefficient (CCC) is a highly-efficient, next-generation correlation coefficient that captures not-only-linear relationships and can work on numerical and categorical data types. CCC-GPU is a GPU-accelerated implementation that provides significant performance improvements for large-scale datasets using CUDA.
📖 Full documentation available at: https://ccc-gpu.readthedocs.io/en/latest/
CCC is based on the simple idea of clustering data points and then computing the Adjusted Rand Index (ARI) between the two clusterings. It is a robust and efficient method that can detect linear and non-linear relationships, making it suitable for a wide range of applications in genomics, machine learning, and data science.
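The idea above can be sketched in a few lines of NumPy. This is a simplified CPU illustration, not the library's implementation: the function names are hypothetical, and both the quantile-based partitioning and the choice of taking the maximum ARI over 2 to 10 clusters per variable are assumptions made for this sketch.

```python
import numpy as np


def quantile_clusters(x, k):
    """Assign each value to one of k clusters using quantile cut points."""
    cuts = np.quantile(x, np.linspace(0, 1, k + 1)[1:-1])
    return np.searchsorted(cuts, x, side="left")


def adjusted_rand_index(a, b):
    """Adjusted Rand Index between two integer label arrays of equal length."""
    table = np.zeros((a.max() + 1, b.max() + 1), dtype=np.int64)
    np.add.at(table, (a, b), 1)  # contingency table of co-assignments

    def comb2(m):
        return m * (m - 1) / 2.0  # number of pairs among m items

    sum_cells = comb2(table).sum()
    sum_rows = comb2(table.sum(axis=1)).sum()
    sum_cols = comb2(table.sum(axis=0)).sum()
    expected = sum_rows * sum_cols / comb2(a.size)
    max_index = 0.5 * (sum_rows + sum_cols)
    if max_index == expected:  # degenerate case, e.g. a single cluster on both sides
        return 0.0
    return (sum_cells - expected) / (max_index - expected)


def ccc_sketch(x, y, max_k=10):
    """Maximum ARI over all pairs of quantile clusterings of x and y (assumed scheme)."""
    parts_x = [quantile_clusters(x, k) for k in range(2, max_k + 1)]
    parts_y = [quantile_clusters(y, k) for k in range(2, max_k + 1)]
    return max(0.0, max(adjusted_rand_index(px, py) for px in parts_x for py in parts_y))


rng = np.random.default_rng(0)
x = rng.standard_normal(1000)
print("quadratic:  ", round(ccc_sketch(x, x**2), 3))
print("independent:", round(ccc_sketch(x, rng.standard_normal(1000)), 3))
```

Because the score only depends on how the two partitions agree, a quadratic relationship scores well above independent noise even though it is not linear.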
Code Structure
- libs/ccc: Python code for CCC-GPU
- libs/ccc_cuda_ext: CUDA C++ code for CCC-GPU
- tests: Test suites
- nbs: Notebooks for analysis and visualization
Installation
Requirements
Hardware:
- Nvidia GPU with CUDA Compute Capability 8.6 or higher
Software:
- OS: Linux x86_64 distributions using glibc 2.28 or later, including:
- Debian 10+
- Ubuntu 18.10+
- Fedora 29+
- CentOS/RHEL 8+
- Python 3.10 to 3.14
- Nvidia driver with CUDA 12.5 or higher (for GPU acceleration)
Note: You can use the nvidia-smi command to check your Nvidia driver and CUDA version.
Note: If you are using another operating system, or architecture other than x86_64, you need to build from source.
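Most of the software requirements listed above can be checked from Python before installing. The following is a minimal sketch using only the standard library; the function name and the dictionary keys are hypothetical, and it only checks that nvidia-smi is on the PATH, not the driver version or the GPU's compute capability.

```python
import platform
import shutil
import sys


def check_requirements():
    """Return a dict mapping each requirement to True/False for this machine."""
    libc_name, libc_version = platform.libc_ver()
    try:
        libc_tuple = tuple(int(part) for part in libc_version.split(".")[:2])
    except ValueError:
        libc_tuple = ()  # version string missing or not numeric
    return {
        "python_3.10_to_3.14": (3, 10) <= sys.version_info[:2] <= (3, 14),
        "linux_x86_64": sys.platform.startswith("linux")
        and platform.machine() == "x86_64",
        "glibc_2.28_or_later": libc_name == "glibc" and libc_tuple >= (2, 28),
        "nvidia_smi_on_path": shutil.which("nvidia-smi") is not None,
    }


for name, ok in check_requirements().items():
    print(f"{name}: {'ok' if ok else 'not satisfied'}")
```

If any entry is reported as not satisfied, the prebuilt wheels may not work on that machine and building from source is the fallback.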
Quick Install with pip
The cccgpu package is available via pip from test PyPI. However, note that cccgpu depends on libstdc++. For a smooth installation without compatibility issues with your local system, we recommend using a wrapper conda environment to install it:
# Create conda environment with required dependencies
conda create -n ccc-gpu-env -c conda-forge python=3.10 pip pytest libstdcxx-ng
conda activate ccc-gpu-env
# Install cccgpu from test PyPI
pip install --index-url https://test.pypi.org/simple/ \
--extra-index-url https://pypi.org/simple/ \
--only-binary=cccgpu cccgpu
# Verify installation
python -c "from ccc.coef.impl_gpu import ccc as ccc_gpu; import numpy as np; print(ccc_gpu(np.random.rand(100), np.random.rand(100)))"
Support for more Python versions and architectures requires extra effort, and will be added soon.
Note: This installs from test PyPI while the package is in testing phase. Once stable, it will be available from the main PyPI repository with pip install cccgpu.
Command options explained:
- --index-url https://test.pypi.org/simple/: Specifies test PyPI as the primary package index to search for cccgpu
- --extra-index-url https://pypi.org/simple/: Adds the main PyPI repository as a fallback to install dependencies (numpy, scipy, numba, etc.) that may not be available on test PyPI
- --only-binary=cccgpu: Ensures that only binary wheels are installed for the cccgpu package, so you don't need to compile it from source
- cccgpu: The package name to install
Install from Source
Install from source using the provided conda-lock environment:
1. Clone Repository
# Clone the repository
git clone https://github.com/pivlab/ccc-gpu
cd ccc-gpu
2. Setup Environment with conda-lock
This process uses pipx to install conda-lock in an isolated environment, keeping your base environment clean:
Why conda-lock? We use conda-lock to ensure reproducible installations across different systems. Unlike regular environment.yml files, conda-lock provides exact version pins for all packages and their dependencies, preventing version conflicts and ensuring you get the same environment that was tested during development.
# Install conda-lock using pipx (installs in isolated environment)
pipx install conda-lock
# Create the main ccc-gpu environment from lock file
conda-lock install --name ccc-gpu conda-lock.yml # or: conda-lock install --name ccc-gpu conda-lock.yml --conda mamba
# Activate the main environment
conda activate ccc-gpu
# Install the package from source
pip install .
Note: If you prefer to use Mamba for faster package resolution, you can install MiniForge which includes Mamba:
curl -L -O "https://github.com/conda-forge/miniforge/releases/latest/download/Miniforge3-$(uname)-$(uname -m).sh"
bash Miniforge3-$(uname)-$(uname -m).sh -b
Then replace conda with mamba in the commands above.
Testing
To execute all the test suites, at the root of the repository, run:
conda activate ccc-gpu
bash ./scripts/run_tests.sh python
Usage
End-to-End Tutorial
This notebook contains a tutorial that walks through a simplified version of the analysis steps used in our paper, using the public GTEx v8 data.
Basic Usage
CCC-GPU provides a simple API identical to the original CCC implementation:
import numpy as np
# New CCC-GPU implementation import
from ccc.coef.impl_gpu import ccc
# Original CCC implementation import
# from ccc.coef.impl import ccc
# Generate sample data
np.random.seed(0)
x = np.random.randn(1000)
y = x**2 + np.random.randn(1000) * 0.1 # Non-linear relationship
# Compute CCC coefficient
correlation = ccc(x, y)
print(f"CCC coefficient: {correlation:.3f}")
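To see why the sample data uses a quadratic relationship, it helps to look at what a linear coefficient reports on data of the same shape. The sketch below uses only NumPy and does not call CCC-GPU, so it runs without a GPU; the variable names are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.standard_normal(1000)
y = x**2 + rng.standard_normal(1000) * 0.1  # same quadratic shape as the example data

# Pearson measures linear association only, so a symmetric parabola scores near zero
pearson_r = np.corrcoef(x, y)[0, 1]
print(f"Pearson r: {pearson_r:.3f}")
```

The dependence between x and y is strong, yet Pearson's r stays close to zero because positive and negative values of x map to the same y. This is the kind of pattern a coefficient that is not limited to linear relationships is meant to pick up.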
Working with Gene Expression Data
CCC-GPU is particularly useful for genomics applications:
import pandas as pd
# New CCC-GPU implementation import
from ccc.coef.impl_gpu import ccc
# Original CCC implementation import
# from ccc.coef.impl import ccc
# Load gene expression data
# Assume genes are in columns, samples in rows
gene_expr = pd.read_csv('gene_expression.csv', index_col=0)
# Compute gene-gene correlations
gene_correlations = ccc(gene_expr.T) # Transpose so genes are in rows
# Find highly correlated gene pairs
import numpy as np
from scipy.spatial.distance import squareform
# Convert to square matrix
corr_matrix = squareform(gene_correlations)
np.fill_diagonal(corr_matrix, 0) # Remove self-correlations
# Find top correlations, using the upper triangle so each pair appears only once
rows, cols = np.triu_indices_from(corr_matrix, k=1)
top = np.argsort(corr_matrix[rows, cols])[-10:][::-1]  # highest first
gene_names = gene_expr.columns.tolist()
print("Top 10 gene pairs by CCC:")
for i, j in zip(rows[top], cols[top]):
    print(f"{gene_names[i]} - {gene_names[j]}: {corr_matrix[i, j]:.3f}")
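The call to squareform in the example implies that the pairwise result comes back as a condensed vector, one value per unordered pair. Assuming it follows the usual SciPy ordering, (0,1), (0,2), ..., (n-2,n-1), the pair behind each entry can be recovered with NumPy alone, without building the full square matrix. The helper name, gene names, and values below are made up for illustration.

```python
import numpy as np


def condensed_pairs(n_features):
    """Return (rows, cols) such that condensed[k] belongs to pair (rows[k], cols[k])."""
    return np.triu_indices(n_features, k=1)


gene_names = ["G1", "G2", "G3", "G4"]  # hypothetical gene names
condensed = np.array([0.10, 0.80, 0.05, 0.30, 0.60, 0.20])  # made-up coefficients

rows, cols = condensed_pairs(len(gene_names))
best = int(np.argmax(condensed))
best_pair = (gene_names[rows[best]], gene_names[cols[best]])
print(best_pair, condensed[best])
```

For large gene sets this avoids allocating an n-by-n matrix just to look up which genes a given coefficient refers to.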
Refer to the original CCC Repository for more usage examples: https://github.com/greenelab/ccc
Controlling Debug Logging
By default, CCC-GPU runs silently without debug output. You can enable detailed logging (including CUDA device information, memory usage, and processing details) using the CCC_GPU_LOGGING environment variable:
# Run with default behavior (no debug output)
python your_script.py
# Enable debug logging for troubleshooting
CCC_GPU_LOGGING=1 python your_script.py
# Or set it for the session
export CCC_GPU_LOGGING=1
python your_script.py
This is particularly useful for:
- Debugging GPU memory issues
- Understanding CUDA device utilization
- Monitoring batch processing performance
- Troubleshooting installation problems
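When a script is launched from another Python program rather than from a shell, the same variable can be passed through the child process environment. This is a minimal sketch; the helper name is hypothetical, and the child command here only echoes the variable back instead of running a real CCC-GPU script.

```python
import os
import subprocess
import sys


def run_with_debug_logging(args):
    """Run the current Python interpreter with CCC_GPU_LOGGING=1 set for the child only."""
    env = dict(os.environ, CCC_GPU_LOGGING="1")
    return subprocess.run(
        [sys.executable, *args],
        env=env,
        capture_output=True,
        text=True,
        check=True,
    )


# Stand-in for: run_with_debug_logging(["your_script.py"])
result = run_with_debug_logging(
    ["-c", "import os; print(os.environ['CCC_GPU_LOGGING'])"]
)
print(result.stdout.strip())
```

Setting the variable on the child process keeps the parent environment unchanged, which is convenient when only one run needs the extra output.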
Performance Benchmarks
CCC-GPU provides significant performance improvements over CPU-only implementations:
| Number of genes | Speedup of CCC-GPU over CCC (12 CPU cores) |
|---|---|
| 500 | 16.52 |
| 1000 | 30.65 |
| 2000 | 45.72 |
| 4000 | 59.46 |
| 6000 | 67.46 |
| 8000 | 71.48 |
| 10000 | 72.38 |
| 16000 | 73.83 |
| 20000 | 73.88 |
Benchmarks performed on synthetic gene expression data with 1000 fixed samples. Hardware: AMD Ryzen Threadripper 7960X CPU and an NVIDIA RTX 4090 GPU. Git commit on which the benchmark results were collected: 05f129dfa47ad801eff963b4189484c7c64bd28e
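For a sense of scale, the number of gene pairs grows quadratically with the number of genes. Assuming each benchmark run computes one coefficient per unordered gene pair (an assumption; the source does not spell this out), the workload per row of the table is:

```python
def n_pairs(n_genes):
    """Number of unordered gene pairs among n_genes genes."""
    return n_genes * (n_genes - 1) // 2


for n_genes in (500, 1000, 2000, 4000, 6000, 8000, 10000, 16000, 20000):
    print(f"{n_genes:>6} genes -> {n_pairs(n_genes):>11,} pairs")
```

Going from 500 to 20000 genes multiplies the number of pairs by roughly 1600.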
Documentation
Build and view the full documentation locally:
cd docs
make html
Then open docs/build/html/index.html in your browser.
If using VS Code, the Live Preview extension provides convenient in-editor viewing.
Citation
If you use CCC-GPU in your research, please cite the CCC-GPU paper:
@article{zhang2025cccgpu,
title={CCC-GPU: A graphics processing unit (GPU)-optimized nonlinear correlation coefficient for large transcriptomic analyses},
author={Zhang, Haoyu and Fotso, Kevin and Pividori, Milton},
journal={bioRxiv},
year={2025},
doi={10.1101/2025.06.03.657735},
}
Contributing
Contributions are welcome! Please feel free to submit a Pull Request. For major changes, please open an issue first to discuss what you would like to change.
License
This project is licensed under the BSD 2-Clause License - see the LICENSE file for details.
Acknowledgments
- Original CCC implementation: https://github.com/greenelab/ccc
- CUDA development team for the excellent CUDA toolkit
- pybind11 for seamless Python-C++ integration