Anti-correlation based feature selection for single cell (and other) omics datasets
Project description
anticor_features
Anti-correlation based feature selection for single cell (and other) omics datasets.
Features
- Unsupervised feature selection based on gene-gene anti-correlations.
- Automatically filters out genes in mitochondrial, ribosomal, and other pathways (customizable).
- Scales to large datasets using HDF5-backed intermediate files.
- Integrated Python API and command-line interface.
- Passes null-dataset tests for robust selection.
Installation
Requires Python 3.6 or higher.
Install from PyPI:
pip install anticor_features
Or install from source:
git clone https://github.com/scottyler89/anticor_fs.git
cd anticor_fs
pip install .
Dependencies
- h5py
- numpy
- pandas
- scipy
- seaborn
- matplotlib
- numba
- ray
- gprofiler-official (>=0.3.5)
- psutil
Quickstart Python API
from anticor_features.anticor_features import get_anti_cor_genes
# exprs: array-like or HDF5 dataset with genes in rows and cells in columns
# feature_ids: list of gene IDs matching rows of exprs
# species: g:Profiler species code (e.g., "hsapiens" or "mmusculus")
anti_cor_table = get_anti_cor_genes(exprs, feature_ids, species="hsapiens")
# Filter selected genes
selected = anti_cor_table.loc[anti_cor_table["selected"], "gene"].tolist()
print(selected)
See the g:Profiler organism list for valid species codes: https://biit.cs.ut.ee/gprofiler/page/organism-list
Customization
pre_remove_features: list of gene IDs to exclude before analysis.pre_remove_pathways: list of GO term codes whose genes will be removed.min_express_n: minimum number of cells a gene must be expressed in to be considered (set to -1 to disable filtering, e.g., for non-expression or non-single-cell data).scratch_dir: directory for temporary HDF5 files (default: system temp directory).bin_size: number of features per batch when computing correlation matrix.FPRandFDR: false positive rate and false discovery rate for negative correlations.num_pos_cor: minimum number of positive correlations to select a feature.offline_mode: whenTrue, disallow network calls (requires a local ID bank for default pathway removal).id_bank_dir: directory containing precomputed ID banks (defaults to the packaged/shipped bank; override withANTICOR_FEATURES_ID_BANK_DIR).use_live_pathway_lookup: whenTrue, force live GO-term resolution (g:Profiler) instead of using the shipped/local ID bank.
Offline / HPC usage (no g:Profiler dependency)
anticor_features uses the packaged/shipped ID bank by default for the default pathway removal (no g:Profiler needed).
To ensure fully offline runs (and to avoid any fallback network calls), set offline_mode=True and generate a local ID bank (in an environment with network access):
python3 scripts/build_id_bank.py --species hsapiens --provider ncbi
Then run feature selection with offline_mode=True (point to your custom bank via ANTICOR_FEATURES_ID_BANK_DIR or id_bank_dir=).
Using with Non-Expression or Other Omics Data
For datasets that are not single-cell or gene-expression matrices (e.g., bulk omics, proteomics, metabolomics, or other feature embeddings), you can skip the minimum-expression filter and run only the anti-correlation statistics by setting min_express_n=-1. For example:
anti_cor_df = get_anti_cor_genes(
embed_df,
feature_ids=embed_df.index.tolist(),
pre_remove_features=[],
pre_remove_pathways=[],
min_express_n=-1
)
Setting min_express_n=-1 disables the minimum-expression requirement (only meaningful for count-based single-cell data), allowing all features to be included in the statistical analysis.
Scanpy Integration
When using Scanpy (AnnData), transpose the data matrix:
from anticor_features.anticor_features import get_anti_cor_genes
anti_cor_table = get_anti_cor_genes(
adata.X.T,
adata.var.index.tolist(),
species="hsapiens"
)
import pandas as pd
adata.var = pd.concat([adata.var, anti_cor_table], axis=1)
selected = anti_cor_table.loc[anti_cor_table["selected"], "gene"].tolist()
adata.raw = adata
adata = adata[:, selected]
Command-Line Interface
python3 -m anticor_features.anticor_features \
-i exprs.tsv \
-species mmusculus \
-out_file anti_cor_features.tsv \
-scratch_dir /path/to/tmp \
-use_default_pathway_removal
Options:
-i,--infile: input expression matrix (TSV or HDF5).-species: g:Profiler species code (default: "hsapiens").-out_file: output file path for the results table.-hdf5: treat input as HDF5 with dataset key "infile".-ids: file with feature (gene) IDs (no header) for HDF5 input.-cols: file with sample (cell) IDs (with header) for HDF5 input.-scratch_dir: directory for temporary files.-use_default_pathway_removal: remove default mitochondrial, ribosomal, and related pathways.-h, --help: display full help message.
Performance
Computing time scales with number of features and batch size. Selecting anti-correlated features on ~10k genes and ~3k cells typically takes 1–2 minutes (network time for g:Profiler). Larger datasets may take longer.
License
This project is licensed under the GNU Affero General Public License v3.0 (AGPL-3.0).
Contact
Scott Tyler scottyler89+bitbucket@gmail.com
Project details
Release history Release notifications | RSS feed
Download files
Download the file for your platform. If you're not sure which to choose, learn more about installing packages.
Source Distribution
Built Distribution
Filter files by name, interpreter, ABI, and platform.
If you're not sure about the file name format, learn more about wheel file names.
Copy a direct link to the current filters
File details
Details for the file anticor_features-0.2.7.tar.gz.
File metadata
- Download URL: anticor_features-0.2.7.tar.gz
- Upload date:
- Size: 118.4 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.12.2
File hashes
| Algorithm | Hash digest | |
|---|---|---|
| SHA256 |
b28f21fb1d5f690d9748e9f5e8cdc7daf36e3d44dca4ab515efcb8720c886d71
|
|
| MD5 |
d7f8320c202bf88cf318f1a3d915323c
|
|
| BLAKE2b-256 |
0fd73d2c3f1e3e95d3bcc11caa4148b178361f0a249e2f1e8b0fc575b29fbeb4
|
File details
Details for the file anticor_features-0.2.7-py3-none-any.whl.
File metadata
- Download URL: anticor_features-0.2.7-py3-none-any.whl
- Upload date:
- Size: 119.0 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.12.2
File hashes
| Algorithm | Hash digest | |
|---|---|---|
| SHA256 |
062ba54356617ed9c021cbb8c931ccc942b071d5f0d66906c37772c4d4dabc56
|
|
| MD5 |
c1e7e9c28755a36c2be914efe741989d
|
|
| BLAKE2b-256 |
2e4ef475f708cf499ce7452faca3a8abfbf12c2e1345a767d9de719764b9951d
|