A Python package for QTL detection based on machine learning
🧬 ML-QTL: Machine Learning for Quantitative Trait Loci Mapping
ML-QTL is a machine learning–based Python tool for QTL mapping. It assesses SNP–trait associations using regression model performance and identifies candidate QTL regions through a sliding window approach. The tool enables efficient gene discovery and supports molecular breeding in crops.
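To make the sliding window idea concrete, here is a minimal sketch. It is not ML-QTL's actual implementation. It assumes each gene already has a single score (for example, one derived from regression model performance) and flags windows of consecutive genes whose mean score reaches a cutoff. The function name, window size, cutoff, and scores are all hypothetical.

```python
from statistics import mean


def sliding_window_hits(scores, window=3, cutoff=0.5):
    """Return (first_gene, last_gene, mean_score) for each window of
    `window` consecutive genes whose mean score is at least `cutoff`.

    Hypothetical illustration only: ML-QTL's real scoring and window
    rules may differ.
    """
    hits = []
    for start in range(len(scores) - window + 1):
        avg = mean(scores[start:start + window])
        if avg >= cutoff:
            hits.append((start, start + window - 1, avg))
    return hits


# Made-up per-gene scores along one chromosome.
gene_scores = [0.1, 0.2, 0.7, 0.8, 0.9, 0.3, 0.1]
hits = sliding_window_hits(gene_scores, window=3, cutoff=0.6)
for first, last, avg in hits:
    print(f"genes {first}-{last}: mean score {avg:.2f}")
```

Overlapping windows that pass the cutoff would typically be merged into one candidate region; that step is left out here to keep the sketch short.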
⚙️ Features
- Efficient Data Handling: Uses the plink binary file format for genotype data, enabling efficient handling of large-scale genomic datasets
- Flexible Modeling: Supports multiple regression models, including Decision Tree Regression, Random Forest Regression, and Support Vector Regression
- Clear Visualization: Produces sliding window prediction results and plots them
- Gene-Level Insights: Calculates and reports SNP importance scores within specific genes
- Parallelism: Built-in support for multiprocessing to dramatically speed up analysis
- Flexibility: Offers a Command-Line Interface (CLI) for automation and a Python API for custom scripting
📦 Installation
We highly recommend using a virtual environment to prevent dependency conflicts.
# Create and activate a virtual environment (optional but recommended)
python -m venv venv
source venv/bin/activate
Install with pip (Recommended)
Install the latest version directly from PyPI:
pip install mlqtl
Warning: As of version 2.3.0, NumPy no longer supports Linux systems with a glibc version below 2.28. If you are on an older Linux system, please use one of the following installation methods:
# Force install using a binary wheel for NumPy
pip install mlqtl --only-binary=numpy
# Or, install a compatible version of NumPy before installing mlqtl
pip install numpy==2.2.6 mlqtl
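If you are unsure which glibc version your system has, the following sketch checks it with Python's standard library before you pick an install command. The helper names `version_tuple` and `glibc_too_old` are made up for this example; `platform.libc_ver()` is the only library call involved, and it reports an empty result on systems that do not use glibc.

```python
import platform


def version_tuple(version):
    """Turn a string like '2.31' into a comparable tuple (2, 31)."""
    return tuple(int(part) for part in version.split(".")[:2])


def glibc_too_old(minimum=(2, 28)):
    """Return True if glibc is detected and older than `minimum`,
    False if it is new enough, and None if glibc is not detected
    (for example on macOS, Windows, or musl-based Linux)."""
    lib, version = platform.libc_ver()
    if lib != "glibc" or not version:
        return None
    return version_tuple(version) < minimum


status = glibc_too_old()
if status is None:
    print("glibc not detected; the warning above may not apply")
elif status:
    print("glibc is older than 2.28: use one of the two commands above")
else:
    print("glibc is 2.28 or newer: a plain 'pip install mlqtl' should work")
```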
Install from Source
- Download the Source Code
# Clone from GitHub
git clone https://github.com/huanglab-cbi/mlqtl.git
# Or download from our website
wget https://cbi.njau.edu.cn/mlqtl/download/source_code.tar.gz
- Navigate to the Directory
cd mlqtl
- Install Dependencies
pip install -r requirements.txt
- Build the Package
pip install build
python -m build
- Install the Built Package
# Replace {version} with the actual version number
pip install dist/mlqtl-{version}-py3-none-any.whl
🚀 Usage
ML-QTL requires genotype data in the plink binary format (.bed, .bim, .fam). If your data is in VCF format, you must first convert it using plink.
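Before starting a long run, it can help to confirm that all three files of the fileset are present. The sketch below is a standalone check, not part of ML-QTL: the function name `check_plink_fileset` is hypothetical, and the magic bytes are those of a PLINK 1 .bed file in SNP-major mode.

```python
import tempfile
from pathlib import Path

# First three bytes of a PLINK 1 .bed file in SNP-major mode.
BED_MAGIC = bytes([0x6C, 0x1B, 0x01])


def check_plink_fileset(prefix):
    """Return a list of problems for the fileset prefix.{bed,bim,fam}.
    An empty list means these basic checks passed."""
    problems = []
    for ext in (".bed", ".bim", ".fam"):
        if not Path(prefix + ext).is_file():
            problems.append(f"missing {prefix}{ext}")
    bed = Path(prefix + ".bed")
    if bed.is_file():
        with bed.open("rb") as handle:
            if handle.read(3) != BED_MAGIC:
                problems.append(f"{bed} does not start with the PLINK .bed magic bytes")
    return problems


# Small self-contained demo: a fileset with the .fam file missing.
with tempfile.TemporaryDirectory() as tmp:
    demo_prefix = str(Path(tmp) / "demo")
    Path(demo_prefix + ".bed").write_bytes(BED_MAGIC)
    Path(demo_prefix + ".bim").write_text("")
    demo_problems = check_plink_fileset(demo_prefix)
print(demo_problems)
```

For real data you would call it with the same prefix you pass to the CLI, for example `check_plink_fileset("imputed")`.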
The primary CLI tool provides several commands:
❯ mlqtl --help
Usage: mlqtl [OPTIONS] COMMAND [ARGS]...
ML-QTL: Machine Learning for QTL Analysis
Options:
--help Show this message and exit.
Commands:
gff2range Convert GFF3 file to plink gene range format
gtf2range Convert GTF file to plink gene range format
importance Calculate feature importance and plot bar chart
rerun Re-run sliding window analysis with new parameters
run Run ML-QTL analysis
For detailed instructions and API usage, please see the full documentation.
🧪 Example Walkthrough
Step 1: Download Sample Data
Visit the download page to get imputed_base_filtered_v0.7.vcf.gz, gene_location_range.txt, and grain_length.txt.
Alternatively, use the following commands to download them:
wget https://cbi.njau.edu.cn/mlqtl/download/imputed_base_filtered_v0.7.vcf.gz
wget https://cbi.njau.edu.cn/mlqtl/download/gene_location_range.txt
wget https://cbi.njau.edu.cn/mlqtl/download/grain_length.txt
Note: The gene_location_range.txt file is generated from the GFF file of the reference genome. For details, please refer to the documentation.
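The sketch below shows one way to read and sanity-check a gene range file. It assumes the file uses plink's set range layout (whitespace-separated chromosome, start position, end position, and set ID, one range per line), which is what plink's --extract range option expects. The `GeneRange` type, the `parse_ranges` function, and the sample gene IDs are hypothetical.

```python
from typing import NamedTuple


class GeneRange(NamedTuple):
    chrom: str
    start: int
    end: int
    gene_id: str


def parse_ranges(lines):
    """Parse range lines into GeneRange records, skipping blank lines.
    Raises ValueError on lines with too few fields or start > end."""
    ranges = []
    for lineno, line in enumerate(lines, start=1):
        fields = line.split()
        if not fields:
            continue
        if len(fields) < 4:
            raise ValueError(f"line {lineno}: expected at least 4 fields, got {len(fields)}")
        chrom, start, end, gene_id = fields[:4]
        start, end = int(start), int(end)
        if start > end:
            raise ValueError(f"line {lineno}: start {start} is greater than end {end}")
        ranges.append(GeneRange(chrom, start, end, gene_id))
    return ranges


# Made-up sample lines; real files would be read with open(path).
sample = ["1 1000 5000 GeneA", "1\t7000\t9000\tGeneB", ""]
ranges = parse_ranges(sample)
for r in ranges:
    print(r.gene_id, r.chrom, r.end - r.start + 1)
```

To check the downloaded file, pass an open file handle instead of the sample list, for example `parse_ranges(open("gene_location_range.txt"))`.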
Step 2: Preprocess the Data
Convert the VCF file to plink's binary format.
# Define the VCF file variable
vcf=imputed_base_filtered_v0.7.vcf.gz
# Run plink to convert and filter the data
plink --vcf ${vcf} \
--snps-only \
--allow-extra-chr \
--make-bed \
--double-id \
--vcf-half-call m \
--extract range gene_location_range.txt \
--out imputed
Step 3: Run ML-QTL Analysis
1. Run Analysis
mlqtl run -g imputed \
-p grain_length.txt \
-r gene_location_range.txt \
-j 8 \
--padj \
--threshold 2.74e-5 \
-o result
2. Calculate SNP Importance
mlqtl importance -g imputed \
-p grain_length.txt \
-r gene_location_range.txt \
--trait grain_length \
--gene Os03g0407400 \
-m DecisionTreeRegressor \
-o result
📊 Performance Benchmark
The -j option sets the number of parallel processes. In the benchmark below, runtime falls as the process count rises, while memory use grows. The benchmarks were conducted on an AMD EPYC 7543 CPU.
| Processes | Memory | Time |
|---|---|---|
| 1 | 1.76G | 5.5h |
| 2 | 2.22G | 2.5h |
| 4 | 3.15G | 1h |
| 8 | 5G | 35min |
| 16 | 8.74G | 19min |
| 32 | 16.18G | 10min |
| 64 | 31.04G | 6min |
Please select a number of processes that fits your system's available CPU cores and memory.
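The trade-off in the table can be put into numbers. The sketch below copies the benchmark rows (with hours converted to minutes), computes speedup and parallel efficiency relative to one process, and picks the largest benchmarked process count that fits a memory budget. The helper names and the 8 G budget are illustrative assumptions; your own data will have different memory needs.

```python
# (processes, memory in G, runtime in minutes), copied from the table above.
BENCHMARK = [
    (1, 1.76, 330),
    (2, 2.22, 150),
    (4, 3.15, 60),
    (8, 5.00, 35),
    (16, 8.74, 19),
    (32, 16.18, 10),
    (64, 31.04, 6),
]


def summarize(rows):
    """Return (processes, memory, speedup, efficiency) per row,
    with speedup measured against the first row."""
    base_minutes = rows[0][2]
    summary = []
    for processes, memory, minutes in rows:
        speedup = base_minutes / minutes
        summary.append((processes, memory, speedup, speedup / processes))
    return summary


def largest_fit(rows, memory_budget):
    """Return the largest benchmarked process count whose memory use
    stays within `memory_budget`, or None if none fits."""
    fitting = [processes for processes, memory, _ in rows if memory <= memory_budget]
    return max(fitting) if fitting else None


for processes, memory, speedup, efficiency in summarize(BENCHMARK):
    print(f"-j {processes:>2}: {memory:>5.2f}G  speedup {speedup:5.1f}x  efficiency {efficiency:.0%}")

print("largest -j within 8G:", largest_fit(BENCHMARK, 8))
```

With these figures, 64 processes run 55 times faster than one process (330 min versus 6 min) while using roughly 18 times the memory.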