A Python package for identifying essential genes in cancer.
GenePioneer: A Comprehensive Python Package for Identification of Essential Genes and Modules in Cancer
Description
GenePioneer was developed as a fast and straightforward way to integrate gene ranking and module detection into a practical, Python-based tool for cancer researchers. It requires minimal input, delivers clear output, and runs within a Python environment, making it accessible to non-expert programmers while supporting large-scale dataset analysis. By evaluating gene importance and identifying gene interactions within cancer networks, GenePioneer provides insight into the genetic drivers of cancer. Key features include ranking genes by their network significance and identifying the modules they belong to, which helps explore cancer-related pathways and aids in developing precise therapies. Its user-centric design lets researchers of all skill levels make use of its capabilities. By combining comprehensive data integration, network-based analysis, and statistical evaluation, GenePioneer serves as a versatile resource for cancer research across multiple cancer types.
Features
- Gene Ranking: Determines gene importance based on network significance.
- Module Detection: Identifies gene clusters within cancer pathways.
- Statistical Analysis: Evaluates detected gene modules and their association with known pathways.
- User-Friendly API: Allows researchers of all skill levels to analyze cancer-related genetic data efficiently.
- Full Reproducibility: Option to either use precomputed data or regenerate all components from raw datasets.
Installation
GenePioneer is available via PyPI. Install it using:
pip install genepioneer
Two Usage Modes
You can either:
- Use Preprocessed Data: Run analysis using prebuilt datasets in Data/cancer-gene-data/ and Data/module-data/.
- Reproduce Everything: Build networks, generate rankings, and detect modules from raw cancer data stored in GenesData/.
Option 1: Using Preprocessed Data
This is the simplest approach. You only need to provide a list of genes and specify the cancer type.
Step 1: Prepare a Gene List
Create a .txt file containing one gene name per line in the OFFICIAL_GENE_SYMBOL format.
Example (gene_list.txt):
BRCA1
TP53
PTEN
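Before passing a gene list to the package, it can help to check that the file really holds one symbol per line. The following is a minimal sketch using only the standard library; `read_gene_list` and `write_gene_list` are hypothetical helpers for illustration and are not part of GenePioneer.

```python
from pathlib import Path


def write_gene_list(path, genes):
    """Write one gene symbol per line (hypothetical helper, not a GenePioneer API)."""
    Path(path).write_text("\n".join(genes) + "\n", encoding="utf-8")


def read_gene_list(path):
    """Return gene symbols in file order, skipping blank lines and duplicates."""
    genes = []
    seen = set()
    for line in Path(path).read_text(encoding="utf-8").splitlines():
        symbol = line.strip()
        if symbol and symbol not in seen:
            seen.add(symbol)
            genes.append(symbol)
    return genes
```

Calling `read_gene_list("gene_list.txt")` on the example above would return the three symbols in order; whether GenePioneer itself tolerates blank lines or duplicates is not stated here, so cleaning the file first is a cautious choice.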
Step 2: Run Gene Analysis
from genepioneer import GeneAnalysis
gene_analysis = GeneAnalysis("Ovary", "./Data/benchmark-data/gene_list.txt")
gene_analysis.analyze_genes()
Step 3: Output
This will generate an output.json file with:
- Gene Rankings: Sorted based on importance in the network.
- Modules: Groups of genes functionally related in cancer.
- Statistical Significance: Evaluation of identified modules.
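The exact schema of output.json is not described in this section, so a first look at the file is best done without assuming any key names. The sketch below only reports the top-level shape of the JSON; `summarize_output` is a hypothetical helper, not a GenePioneer function.

```python
import json
from pathlib import Path


def summarize_output(path="output.json"):
    """Describe the top-level shape of a JSON file without assuming its schema."""
    data = json.loads(Path(path).read_text(encoding="utf-8"))
    if isinstance(data, dict):
        return {"type": "object", "keys": sorted(data)}
    if isinstance(data, list):
        return {"type": "array", "length": len(data)}
    return {"type": type(data).__name__}
```

Once the top-level keys are known, the rankings and modules can be read with ordinary dictionary access.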
Supported Cancer Types
"Adrenal", "Bladder", "Brain", "Cervix", "Colon", "Corpus uteri", "Kidney", "Liver", "Ovary", "Prostate", "Skin", "Thyroid"
Option 2: Reproducing Everything (Building Data from Scratch)
If you want full control over data generation, follow these steps to build your own cancer-specific datasets.
Step 1: Add Required Data
You need:
- Raw TCGA Cancer Data (GenesData/): Cancer-specific gene expression data.
- IBM Gene Ontology (GenesData/IBP_GO_Terms.xlsx): Gene-to-biological process mappings.
Example Directory Structure
GenesData/
│-- IBP_GO_Terms.xlsx
│-- Adrenal/
│ │-- ABL1/
│ │ │-- ABL1.tsv
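A quick layout check can catch misplaced files before a long network build. This sketch assumes the pattern shown above, GenesData/<cancer type>/<gene>/<gene>.tsv, and lists the genes that follow it; `find_gene_files` is a hypothetical helper and not part of the package.

```python
from pathlib import Path


def find_gene_files(root, cancer_type):
    """Map gene name -> TSV path for <root>/<cancer_type>/<gene>/<gene>.tsv."""
    cancer_dir = Path(root) / cancer_type
    found = {}
    if not cancer_dir.is_dir():
        return found
    for gene_dir in sorted(cancer_dir.iterdir()):
        tsv = gene_dir / f"{gene_dir.name}.tsv"
        # Keep only gene folders that contain a TSV named after the folder.
        if gene_dir.is_dir() and tsv.is_file():
            found[gene_dir.name] = tsv
    return found
```

For example, `find_gene_files("./GenesData", "Adrenal")` returns an empty dictionary when the cancer-type folder is missing, which makes a misconfigured path easy to spot.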
Step 2: Build Network and Compute Features
from genepioneer import NetworkBuilder
network_builder = NetworkBuilder("Adrenal", "./GenesData")
graph = network_builder.build_network()
features = network_builder.calculate_all_features()
network_builder.save_features_to_csv(features, "./Data/cancer-gene-data/Adrenal")
This step:
- Builds a gene interaction network.
- Computes network-based features (e.g., centrality, entropy, Laplacian scores).
- Saves features to a CSV file.
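To give a sense of what a network-based feature looks like, here is a minimal pure-Python sketch of degree centrality on a toy edge list. It is an illustration of the general idea only; it is not GenePioneer's implementation, and the package's own features (including its entropy and Laplacian scores) may be defined differently.

```python
def degree_centrality(edges):
    """Degree centrality: number of neighbors divided by (number of nodes - 1)."""
    neighbors = {}
    for a, b in edges:
        if a == b:
            continue  # ignore self-loops in this toy version
        neighbors.setdefault(a, set()).add(b)
        neighbors.setdefault(b, set()).add(a)
    n = len(neighbors)
    if n < 2:
        return {node: 0.0 for node in neighbors}
    return {node: len(adj) / (n - 1) for node, adj in neighbors.items()}
```

In a three-gene toy network where TP53 is linked to both BRCA1 and PTEN, TP53 touches every other node and therefore gets the highest score.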
Step 3: Detect Modules
from genepioneer import NetworkAnalysis
network_analysis = NetworkAnalysis("Adrenal", features)
modules = network_analysis.module_detection()
This step:
- Identifies gene modules based on connectivity and functional relevance.
- Saves results as Data/module-data/Adrenal.json.
Step 4: Run Full Gene Analysis
Once networks and modules are generated, you can proceed with standard gene analysis:
from genepioneer import GeneAnalysis
gene_analysis = GeneAnalysis(
    "Adrenal",
    "./Data/benchmark-data/gene_list.txt",
    cancer_gene_path="./Data/cancer-gene-data",
    module_data_path="./Data/module-data",
)
gene_analysis.analyze_genes()
Dataset Structure and Format
1. TCGA Data (GenesData/*/*.tsv)
Contains gene expression and associated cases.
Example (GenesData/Adrenal/ABL1/ABL1.tsv):
Case ID Expression
TCGA-01 2.5
TCGA-02 1.8
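The per-gene files can be inspected with the standard `csv` module. This sketch assumes the file is tab-delimited with the two headers shown above, `Case ID` and `Expression`; real TCGA exports may carry more columns, and `read_expression_tsv` is a hypothetical helper.

```python
import csv
import io


def read_expression_tsv(text):
    """Parse tab-delimited text into {case_id: expression} (assumed two-column layout)."""
    reader = csv.DictReader(io.StringIO(text), delimiter="\t")
    return {row["Case ID"]: float(row["Expression"]) for row in reader}
```

To read from disk, pass the file's contents, for example `read_expression_tsv(open(path, encoding="utf-8").read())`.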
2. IBM Gene Ontology (GenesData/IBP_GO_Terms.xlsx)
Links genes to biological processes.
| Process | Gene1 | Gene2 |
|---|---|---|
| Cell Cycle | BRCA1 | TP53 |
3. Network Features (Data/cancer-gene-data/*.csv)
Stores computed network importance scores.
Example (Data/cancer-gene-data/Adrenal_network_features.csv):
node,ls_score
ABL1,0.85
TP53,0.92
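The feature file can be sorted to get a quick ranking. The sketch below assumes the `node` and `ls_score` columns shown above. Whether a higher or lower `ls_score` means a more important gene is not stated here, so the sort direction is left as a parameter; `rank_by_score` is a hypothetical helper.

```python
import csv
import io


def rank_by_score(csv_text, column="ls_score", descending=True):
    """Return (node, score) pairs sorted by the chosen score column."""
    rows = list(csv.DictReader(io.StringIO(csv_text)))
    rows.sort(key=lambda row: float(row[column]), reverse=descending)
    return [(row["node"], float(row[column])) for row in rows]
```

With the two-row example above and the default direction, TP53 comes before ABL1.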
4. Module Data (Data/module-data/*.json)
Contains detected gene modules.
Example (Data/module-data/Adrenal.json):
{
"module_1": [
["ABL1", "TP53"],
3.5,
1.2
]
}
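Given this layout, finding the modules that contain a particular gene takes a few lines. The sketch assumes each entry is a list whose first element is the gene list, followed by numeric values; the meaning of those two numbers is not documented in this section, so they are returned unlabeled. `modules_containing` is a hypothetical helper.

```python
import json


def modules_containing(module_json, gene):
    """Return modules whose gene list includes `gene`, keyed by module name."""
    modules = json.loads(module_json)
    hits = {}
    for name, entry in modules.items():
        genes, *scores = entry  # first element: genes; the rest: undocumented numbers
        if gene in genes:
            hits[name] = {"genes": genes, "scores": scores}
    return hits
```

Applied to the example above, a lookup for TP53 returns `module_1`, while a gene absent from every module yields an empty dictionary.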
Reproducibility Steps
- Clone the Repository
git clone https://github.com/amirhossein-haerian/GenePioneer.git
cd GenePioneer
- Install Dependencies
pip install -r requirements.txt
- Add or Generate Data
  - Use prebuilt data (Option 1), or
  - Generate data from raw sources (Option 2).
- Run Gene Analysis
python -m genepioneer.gene_analysis "Adrenal" "./Data/benchmark-data/gene_list.txt"
- Verify Output
output.json contains ranked genes and detected modules.
Questions about the implementation:
Amirhossein Haerianardakani, haerian.amirhossein[at]gmail.com
If you encounter a bug, experience a failed function, or have a feature request, please open an issue on GitHub or contact Amirhossein.
License
This project is licensed under the MIT License.