anticor_features

Anti-correlation based feature selection for single cell (and other) omics datasets.

Features

  • Unsupervised feature selection based on gene-gene anti-correlations.
  • Automatically filters out genes in mitochondrial, ribosomal, and other pathways (customizable).
  • Scales to large datasets using HDF5-backed intermediate files.
  • Integrated Python API and command-line interface.
  • Passes null-dataset tests for robust selection.
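To make the first bullet concrete, here is a minimal pure-Python sketch of counting strongly negative gene-gene correlations. It is an illustration only, not the package's algorithm: the Pearson statistic, the fixed cutoff of -0.5, and the helper names pearson and count_negative_partners are assumptions made for this example. The package itself exposes FPR and FDR parameters rather than a fixed cutoff.

```python
from math import sqrt

def pearson(x, y):
    """Pearson correlation of two equal-length sequences (0.0 if either is constant)."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    dx = [v - mean_x for v in x]
    dy = [v - mean_y for v in y]
    denom = sqrt(sum(a * a for a in dx) * sum(b * b for b in dy))
    if denom == 0:
        return 0.0
    return sum(a * b for a, b in zip(dx, dy)) / denom

def count_negative_partners(exprs, feature_ids, cutoff=-0.5):
    """For each gene (row), count the genes whose correlation with it is below cutoff."""
    counts = {gene: 0 for gene in feature_ids}
    for i in range(len(exprs)):
        for j in range(i + 1, len(exprs)):
            if pearson(exprs[i], exprs[j]) < cutoff:
                counts[feature_ids[i]] += 1
                counts[feature_ids[j]] += 1
    return counts

# Toy matrix: genes in rows, cells in columns.
exprs = [
    [5, 4, 6, 0, 0, 1],  # geneA: high in the first three cells
    [0, 1, 0, 6, 5, 4],  # geneB: high in the last three cells (opposes geneA)
    [2, 2, 2, 2, 2, 2],  # geneC: constant, carries no information
]
counts = count_negative_partners(exprs, ["geneA", "geneB", "geneC"])
print(counts)
```

In this toy case geneA and geneB mark mutually exclusive groups of cells, so each gains one anti-correlated partner, while the constant geneC gains none.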

Installation

Requires Python 3.6 or higher.

Install from PyPI:

pip install anticor_features

Or install from source:

git clone https://github.com/scottyler89/anticor_fs.git
cd anticor_fs
pip install .

Dependencies

  • h5py
  • numpy
  • pandas
  • scipy
  • seaborn
  • matplotlib
  • numba
  • ray
  • gprofiler-official (>=0.3.5)
  • psutil

Quickstart Python API

from anticor_features.anticor_features import get_anti_cor_genes

# exprs: array-like or HDF5 dataset with genes in rows and cells in columns
# feature_ids: list of gene IDs matching rows of exprs
# species: g:Profiler species code (e.g., "hsapiens" or "mmusculus")
anti_cor_table = get_anti_cor_genes(exprs, feature_ids, species="hsapiens")

# Filter selected genes
selected = anti_cor_table.loc[anti_cor_table["selected"], "gene"].tolist()
print(selected)

See the g:Profiler organism list for valid species codes: https://biit.cs.ut.ee/gprofiler/page/organism-list
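Because exprs and feature_ids must line up row for row, a quick pre-flight check can catch an orientation mistake before a long run. The helper below is a hypothetical convenience, not part of anticor_features; it only assumes that exprs supports len() and that its rows are genes.

```python
def check_inputs(exprs, feature_ids):
    """Raise ValueError if feature_ids cannot label the rows of exprs."""
    n_rows = len(exprs)
    if n_rows != len(feature_ids):
        raise ValueError(
            f"exprs has {n_rows} rows but {len(feature_ids)} feature IDs were given; "
            "is the matrix cells-by-genes instead of genes-by-cells?"
        )
    if len(set(feature_ids)) != len(feature_ids):
        raise ValueError("feature_ids contains duplicates")
    return n_rows

# Two genes (rows) measured in three cells (columns).
toy_exprs = [[0, 3, 1], [2, 0, 0]]
n_genes = check_inputs(toy_exprs, ["geneA", "geneB"])
print(n_genes)
```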

Customization

  • pre_remove_features: list of gene IDs to exclude before analysis.
  • pre_remove_pathways: list of GO term codes whose genes will be removed.
  • min_express_n: minimum number of cells a gene must be expressed in to be considered (set to -1 to disable filtering, e.g., for non-expression or non-single-cell data).
  • scratch_dir: directory for temporary HDF5 files (default: system temp directory).
  • bin_size: number of features per batch when computing correlation matrix.
  • FPR and FDR: false positive rate and false discovery rate for negative correlations.
  • num_pos_cor: minimum number of positive correlations to select a feature.
  • offline_mode: when True, disallow network calls (requires a local ID bank for default pathway removal).
  • id_bank_dir: directory containing precomputed ID banks (defaults to the packaged/shipped bank; override with ANTICOR_FEATURES_ID_BANK_DIR).
  • use_live_pathway_lookup: when True, force live GO-term resolution (g:Profiler) instead of using the shipped/local ID bank.
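One way to keep these settings together is to collect them in a dictionary and unpack it into the call. This is a sketch under stated assumptions: the keyword names are taken from the list above, but every value shown is illustrative rather than a documented default, build_options is a hypothetical helper, and "geneX" is a placeholder ID.

```python
import tempfile

def build_options(scratch_dir=None):
    """Collect keyword arguments for get_anti_cor_genes in one place."""
    return {
        "pre_remove_features": ["geneX"],  # placeholder ID to drop before analysis
        "min_express_n": 3,                # illustrative value
        # Use the caller's directory, or create a fresh temporary one.
        "scratch_dir": scratch_dir or tempfile.mkdtemp(prefix="anticor_"),
        "bin_size": 2000,                  # illustrative value
        "FPR": 0.001,                      # illustrative value
        "FDR": 0.05,                       # illustrative value
        "num_pos_cor": 10,                 # illustrative value
    }

options = build_options()
print(sorted(options))

# With the package installed and data loaded, the call would look like:
# anti_cor_table = get_anti_cor_genes(exprs, feature_ids, species="hsapiens", **options)
```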

Offline / HPC usage (no g:Profiler dependency)

By default, anticor_features resolves the default pathway-removal list from the ID bank shipped with the package, so g:Profiler is not needed.

To ensure fully offline runs (and to avoid any fallback network calls), set offline_mode=True and generate a local ID bank (in an environment with network access):

python3 scripts/build_id_bank.py --species hsapiens --provider ncbi

Then run feature selection with offline_mode=True (point to your custom bank via ANTICOR_FEATURES_ID_BANK_DIR or id_bank_dir=).
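The two ways of pointing at a custom bank can be sketched as follows. The bank path is hypothetical, and the sketch assumes ANTICOR_FEATURES_ID_BANK_DIR is read as an environment variable, as its name suggests; setting it before importing the package is the cautious choice, since the source does not say when it is read.

```python
import os

# Hypothetical location of a bank produced by scripts/build_id_bank.py.
bank_dir = os.path.join(os.path.expanduser("~"), "anticor_id_bank")

# Option 1: environment variable (assumed), set before the selection call runs.
os.environ["ANTICOR_FEATURES_ID_BANK_DIR"] = bank_dir

# Option 2: explicit keyword arguments named in the Customization list.
offline_kwargs = {"offline_mode": True, "id_bank_dir": bank_dir}
print(offline_kwargs)

# With the package installed and data loaded, the call would look like:
# anti_cor_table = get_anti_cor_genes(exprs, feature_ids, species="hsapiens", **offline_kwargs)
```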

Using with Non-Expression or Other Omics Data

For datasets that are not single-cell or gene-expression matrices (e.g., bulk omics, proteomics, metabolomics, or other feature embeddings), you can skip the minimum-expression filter and run only the anti-correlation statistics by setting min_express_n=-1. For example:

anti_cor_df = get_anti_cor_genes(
    embed_df,
    feature_ids=embed_df.index.tolist(),
    pre_remove_features=[],
    pre_remove_pathways=[],
    min_express_n=-1
)

Setting min_express_n=-1 disables the minimum-expression requirement, which is only meaningful for count-based single-cell data, so every feature enters the statistical analysis.
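The example above expects embed_df to hold features in rows. Many non-expression tables are stored the other way around, so a transpose may be needed first. The sketch below uses a made-up proteomics table; the names samples_by_features, protA, and protB are hypothetical.

```python
import pandas as pd

# Hypothetical proteomics table: samples in rows, features in columns.
samples_by_features = pd.DataFrame(
    {"protA": [1.2, 0.4, 3.1], "protB": [0.0, 2.2, 0.5]},
    index=["s1", "s2", "s3"],
)

# Transpose so that features are rows and samples are columns.
embed_df = samples_by_features.T
feature_ids = embed_df.index.tolist()
print(embed_df.shape, feature_ids)
```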

Scanpy Integration

When using Scanpy (AnnData), transpose the data matrix:

import pandas as pd
from anticor_features.anticor_features import get_anti_cor_genes

# AnnData stores cells in rows, so transpose to genes-by-cells
anti_cor_table = get_anti_cor_genes(
    adata.X.T,
    adata.var.index.tolist(),
    species="hsapiens"
)

# Attach the results to the gene annotations
adata.var = pd.concat([adata.var, anti_cor_table], axis=1)
selected = anti_cor_table.loc[anti_cor_table["selected"], "gene"].tolist()

# Keep the full dataset in .raw, then subset to the selected genes
adata.raw = adata
adata = adata[:, selected]

Command-Line Interface

python3 -m anticor_features.anticor_features \
  -i exprs.tsv \
  -species mmusculus \
  -out_file anti_cor_features.tsv \
  -scratch_dir /path/to/tmp \
  -use_default_pathway_removal

Options:

  • -i, --infile: input expression matrix (TSV or HDF5).
  • -species: g:Profiler species code (default: "hsapiens").
  • -out_file: output file path for the results table.
  • -hdf5: treat input as HDF5 with dataset key "infile".
  • -ids: file with feature (gene) IDs (no header) for HDF5 input.
  • -cols: file with sample (cell) IDs (with header) for HDF5 input.
  • -scratch_dir: directory for temporary files.
  • -use_default_pathway_removal: remove default mitochondrial, ribosomal, and related pathways.
  • -h, --help: display full help message.
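When the CLI is driven from a pipeline script, the argument list can be assembled in Python and handed to subprocess. This is a sketch rather than an official wrapper: build_command and run are hypothetical helpers, and only flags listed above are used.

```python
import subprocess
import sys

def build_command(infile, out_file, species="hsapiens", scratch_dir=None,
                  default_pathway_removal=True):
    """Assemble the argument list for the anticor_features CLI."""
    cmd = [sys.executable, "-m", "anticor_features.anticor_features",
           "-i", infile, "-species", species, "-out_file", out_file]
    if scratch_dir is not None:
        cmd += ["-scratch_dir", scratch_dir]
    if default_pathway_removal:
        cmd.append("-use_default_pathway_removal")
    return cmd

def run(cmd):
    """Execute the command; requires the package and the input file to exist."""
    return subprocess.run(cmd, check=True)

cmd = build_command("exprs.tsv", "anti_cor_features.tsv", species="mmusculus")
print(" ".join(cmd))
```

Passing a list instead of a shell string avoids quoting problems with paths that contain spaces.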

Performance

Compute time scales with the number of features and the batch size (bin_size). Selecting anti-correlated features from ~10k genes and ~3k cells typically takes 1–2 minutes, including any network time spent on g:Profiler lookups. Larger datasets take longer.
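A rough way to see why feature count dominates: the number of unique feature pairs grows quadratically, while bin_size only controls how many batches that work is split into. The sketch below assumes all feature pairs are compared; workload is a hypothetical helper and the bin_size of 5,000 is an illustrative value.

```python
from math import ceil

def workload(n_features, bin_size):
    """Return (number of feature batches, number of unique feature pairs)."""
    n_batches = ceil(n_features / bin_size)
    n_pairs = n_features * (n_features - 1) // 2
    return n_batches, n_pairs

n_batches, n_pairs = workload(10_000, 5_000)
print(n_batches, n_pairs)
```

For 10,000 features this gives 2 batches and 49,995,000 pairs; doubling the feature count roughly quadruples the number of pairs.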

License

This project is licensed under the GNU Affero General Public License v3.0 (AGPL-3.0).

Contact

Scott Tyler scottyler89+bitbucket@gmail.com
