Sort bisulfite-/enzymatic-converted reads from xenograft experiments
SF-methXsort
methXsort is a command-line toolkit for sorting bisulfite sequencing reads into host species and graft species in xenograft experiments.
methXsort supports both xengsort and bbsplit for sorting reads into host and graft species. We recommend xengsort, as benchmarking results show it to be accurate and much faster.
Installation
Requirements
- Python >= 3.12
- pysam >= 0.15.0
- xengsort >= 2.0.9
Install from PyPI (recommended)
Install methXsort directly from PyPI:
pip install methXsort
After installation, verify it works:
methXsort --version
methXsort --help
Install from source
For development or to get the latest changes:
git clone https://github.com/CCRSF-IFX/methXsort.git
cd methXsort
pip install -e .
This installs the package in editable mode, so any changes to the source code are immediately reflected.
See INSTALLATION.md for more details.
Usage
All commands are run via the installed command:
methXsort <subcommand> [options]
Main Subcommands
Convert Reference Genome
Convert a reference FASTA for bisulfite mapping (C→T and G→A):
methXsort convert-ref <ref_fasta> [-o OUTPUT]
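To illustrate what in-silico conversion of a reference means, the sketch below produces a C→T copy and a G→A copy of every FASTA record. It is not methXsort's implementation: the function names and the `_CT`/`_GA` record-name suffixes are assumptions made for this example, and the real output naming may differ.

```python
from typing import Iterable, Iterator, Tuple

# Translation tables for the two conversions; case is preserved so that
# soft-masked (lowercase) regions stay soft-masked.
C_TO_T = str.maketrans("Cc", "Tt")
G_TO_A = str.maketrans("Gg", "Aa")


def convert_sequence(seq: str, mode: str) -> str:
    """Return a fully converted copy of seq; mode is "CT" or "GA"."""
    if mode == "CT":
        return seq.translate(C_TO_T)
    if mode == "GA":
        return seq.translate(G_TO_A)
    raise ValueError(f"unknown mode: {mode!r}")


def read_fasta(lines: Iterable[str]) -> Iterator[Tuple[str, str]]:
    """Yield (name, sequence) pairs; name is the first token of the header."""
    name, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if name is not None:
                yield name, "".join(chunks)
            # Assumes well-formed, non-empty headers.
            name, chunks = line[1:].split()[0], []
        else:
            chunks.append(line)
    if name is not None:
        yield name, "".join(chunks)


def convert_records(lines: Iterable[str]) -> Iterator[Tuple[str, str]]:
    """Yield a C->T and a G->A converted copy of every record.

    The _CT/_GA suffixes are hypothetical, chosen for this sketch only.
    """
    for name, seq in read_fasta(lines):
        for mode in ("CT", "GA"):
            yield f"{name}_{mode}", convert_sequence(seq, mode)
```

Passing an open text file handle to `convert_records` works as well as a list of lines, since both are iterables of strings.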
Build xengsort Index
methXsort xengsort-index --host <host.fa> --graft <graft.fa> --index <index_dir> [-n N] [--fill FILL] [--statistics STAT] [-k K] [--xengsort_path <path>] [--xengsort_extra <extra>]
Convert Reads
Convert reads for bisulfite mapping (C→T for R1, G→A for R2):
methXsort convert-reads --read <R1.fastq.gz> [--read2 <R2.fastq.gz>] [--out <R1_out>] [--out2 <R2_out>] [--with_orig_seq]
--with_orig_seq: Store the original sequence in the header (slower, but traceable).
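The idea behind read conversion can be sketched as follows: R1 is converted C→T, R2 is converted G→A, and the original sequence can optionally be carried along in the read name so it can be recovered later. The `_ORIG:` tag and the function name are hypothetical; the header format methXsort actually writes is not documented here.

```python
from typing import Tuple

C_TO_T = str.maketrans("Cc", "Tt")
G_TO_A = str.maketrans("Gg", "Aa")


def convert_read(
    header: str, seq: str, is_read2: bool = False, with_orig_seq: bool = False
) -> Tuple[str, str]:
    """Return (header, converted_sequence) for one FASTQ record.

    R1 is converted C->T and R2 is converted G->A. When with_orig_seq is
    set, the unconverted sequence is appended to the read name using a
    hypothetical "_ORIG:" tag (an assumption of this sketch).
    """
    converted = seq.translate(G_TO_A if is_read2 else C_TO_T)
    if with_orig_seq:
        # Keep any comment after the first space intact.
        name, sep, rest = header.partition(" ")
        header = f"{name}_ORIG:{seq}{sep}{rest}"
    return header, converted
```

Storing the sequence in the name rather than in the comment field is a design choice of this sketch: downstream tools often drop header comments but keep read names.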
Classify Reads with xengsort
methXsort xengsort-classify --read <R1.fastq.gz> [--read2 <R2.fastq.gz>] --index <index_dir> --out_prefix <prefix> --threads <N> [--xengsort_path <path>] [--xengsort_extra <extra>]
Output:
- {prefix}.host.1.fq.gz: host reads
- {prefix}.graft.1.fq.gz: graft reads
- {prefix}.both.1.fq.gz: reads that could originate from both
- {prefix}.neither.1.fq.gz: reads that originate from neither host nor graft
- {prefix}.ambiguous.1.fq.gz: (few) ambiguous reads that cannot be classified
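When scripting around the classifier, it can help to compute the expected output paths up front. The helper below is a small sketch that follows the naming pattern in the list above; the function name is hypothetical, and the `.2.fq.gz` mate files for paired-end input are an assumption, so check the result against the files your run actually produces.

```python
from typing import Dict, List

# The five categories listed above.
CATEGORIES = ("host", "graft", "both", "neither", "ambiguous")


def expected_outputs(prefix: str, paired: bool = False) -> Dict[str, List[str]]:
    """Map each category to its expected FASTQ path(s) for a given prefix."""
    mates = (1, 2) if paired else (1,)
    return {
        category: [f"{prefix}.{category}.{mate}.fq.gz" for mate in mates]
        for category in CATEGORIES
    }
```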
Split Statistics
Output CSV statistics for split reads:
methXsort stat-split --raw <raw_R1.fastq.gz> --host <host_R1.fastq.gz> --graft <graft_R1.fastq.gz>
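For a rough idea of the kind of numbers such a summary contains, the sketch below counts FASTQ records and derives host and graft percentages. The function names and the output keys are assumptions for illustration; the columns that `stat-split` actually writes may differ.

```python
import gzip
from typing import Dict, Iterable


def count_fastq_records(lines: Iterable[str]) -> int:
    """Count records in a FASTQ stream, assuming four lines per record."""
    return sum(1 for _ in lines) // 4


def count_gzipped_fastq(path: str) -> int:
    """Count records in a gzip-compressed FASTQ file."""
    with gzip.open(path, "rt") as handle:
        return count_fastq_records(handle)


def split_summary(raw: int, host: int, graft: int) -> Dict[str, float]:
    """Summarize a split as counts and percentages of the raw read count."""

    def pct(n: int) -> float:
        # Guard against empty input files.
        return round(100.0 * n / raw, 2) if raw else 0.0

    return {
        "raw": raw,
        "host": host,
        "graft": graft,
        "host_pct": pct(host),
        "graft_pct": pct(graft),
    }
```

The percentages need not sum to 100, because reads classified as both, neither, or ambiguous are counted in the raw total only.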
Restore FASTQ from xengsort Output
Restore original sequences in FASTQ files classified by xengsort:
methXsort restore-fastq --read <classified_R1.fq.gz> --out <restored_R1.fq.gz> [--read2 <classified_R2.fq.gz> --out2 <restored_R2.fq.gz>]
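Restoring works only if the original sequence was kept during conversion (see `--with_orig_seq`). As a minimal sketch of the idea, assume the original sequence was appended to the read name with a hypothetical `_ORIG:` tag; this tag is an assumption of the example, not the header format methXsort uses.

```python
from typing import Tuple


def restore_read(header: str, seq: str) -> Tuple[str, str]:
    """Return (header, sequence) with the original sequence put back.

    Looks for a hypothetical "_ORIG:<sequence>" suffix on the read name.
    Records without the tag are returned unchanged.
    """
    name, sep, rest = header.partition(" ")
    base, tag, original = name.rpartition("_ORIG:")
    if not tag:
        # No stored sequence: nothing to restore.
        return header, seq
    return f"{base}{sep}{rest}", original
```

Quality strings need no change in this scheme, since conversion alters bases but not read length.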
Example Workflow
1. Convert reference genomes:
methXsort convert-ref mm10.fa -o mm10_converted.fa
methXsort convert-ref hg38.fa -o hg38_converted.fa
2. Build the xengsort index:
methXsort xengsort-index --host mm10_converted.fa --graft hg38_converted.fa --index xengsort_index_7B
3. Convert reads:
methXsort convert-reads --read sample_R1.fastq.gz --read2 sample_R2.fastq.gz --with_orig_seq
4. Classify reads with xengsort:
methXsort xengsort-classify --read sample_R1.meth.gz --read2 sample_R2.meth.gz --index xengsort_index_7B --out_prefix sample_xengsort --threads 8
5. Restore original FASTQ:
methXsort restore-fastq --read sample_xengsort-graft.1.fq.gz --out sample_graft_R1_restored.fq.gz --read2 sample_xengsort-graft.2.fq.gz --out2 sample_graft_R2_restored.fq.gz
Alternative workflow using bbsplit
1. Build bbsplit Index
methXsort bbsplit-index --host <host.fa> --graft <graft.fa> --host_name <host> --graft_name <graft> [--bbsplit_path <path>] [--bbsplit_index_path <dir>]
2. Run bbsplit
Split reads into host and graft using bbsplit:
methXsort bbsplit --read <R1.fastq.gz> [--read2 <R2.fastq.gz>] --host <host_name> --graft <graft_name> --out_host <host.bam> --out_graft <graft.bam> [--bbsplit_path <path>] [--bbsplit_extra <extra>]
3. Filter FASTQ by BAM
Extract reads from FASTQ that are present in a BAM file (e.g., after bbsplit):
methXsort filter-fastq-by-bam --read <R1.fastq.gz> [--read2 <R2.fastq.gz>] --bam <file.bam> --out <R1_out> [--out2 <R2_out>] [--filterbyname_path <path>]
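Conceptually, this step keeps only the FASTQ records whose read names appear in the BAM file. The sketch below shows that filtering logic in plain Python, given a set of names that has already been extracted from the BAM. It is an illustration only: the subcommand itself appears to rely on `filterbyname.sh`, and the function names and the name normalization here are assumptions.

```python
from itertools import islice
from typing import Iterable, Iterator, Set


def read_name(header: str) -> str:
    """Normalize a FASTQ header to a bare read name.

    Drops the leading "@", any comment after the first whitespace, and a
    trailing /1 or /2 mate suffix. Assumes a well-formed, non-empty header.
    """
    name = header[1:].split()[0]
    if name.endswith(("/1", "/2")):
        name = name[:-2]
    return name


def filter_fastq(lines: Iterable[str], keep: Set[str]) -> Iterator[str]:
    """Yield the lines of every four-line record whose name is in keep."""
    stream = iter(lines)
    while True:
        record = list(islice(stream, 4))
        if len(record) < 4:
            break  # end of input (or a truncated final record)
        if read_name(record[0]) in keep:
            yield from record
```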
Notes
- For all subcommands, use -h or --help to see detailed options.
- Make sure all required external tools (bbsplit.sh, filterbyname.sh, xengsort) are in your PATH or specify their locations with the appropriate options.
- For paired-end data, always provide both --read2 and --out2 where required.
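A quick way to check that the external tools are reachable before starting a run is to look them up on PATH. The helper below is a small sketch using the standard library; the function name is hypothetical and not part of methXsort.

```python
import shutil
from typing import Callable, Iterable, List, Optional

# External tools named in the notes above.
REQUIRED_TOOLS = ("bbsplit.sh", "filterbyname.sh", "xengsort")


def missing_tools(
    tools: Iterable[str] = REQUIRED_TOOLS,
    which: Callable[[str], Optional[str]] = shutil.which,
) -> List[str]:
    """Return the tools that cannot be found on PATH."""
    return [tool for tool in tools if which(tool) is None]


if __name__ == "__main__":
    missing = missing_tools()
    if missing:
        print("Not found on PATH:", ", ".join(missing))
    else:
        print("All external tools found.")
```

Only the tools for the backend you use need to be present; for the xengsort workflow, a missing `bbsplit.sh` is not a problem.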
Contact
Email: ccrsfifx@nih.gov