Sort bisulfite-/enzymatic-converted reads from xenograft experiments

SF-methXsort

methXsort is a command-line toolkit for sorting bisulfite sequencing reads into host species and graft species in xenograft experiments.

methXsort supports both xengsort and bbsplit for sorting reads into host and graft species. We recommend xengsort: in our benchmarks it was accurate and much faster.


Installation

Requirements

  • Python >= 3.12
  • pysam >= 0.15.0
  • xengsort >= 2.0.9

Install from PyPI (recommended)

Install methXsort directly from PyPI:

pip install methXsort

After installation, verify it works:

methXsort --version
methXsort --help

Install from source

For development or to get the latest changes:

git clone https://github.com/CCRSF-IFX/methXsort.git
cd methXsort
pip install -e .

This installs the package in editable mode, so any changes to the source code are immediately reflected.

See INSTALLATION.md for more details.


Usage

All commands are run via the installed command:

methXsort <subcommand> [options]

Main Subcommands

Convert Reference Genome

Convert a reference FASTA for bisulfite mapping (C→T and G→A):

methXsort convert-ref <ref_fasta> [-o OUTPUT]
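
The in-silico conversion can be pictured with a minimal Python sketch. It assumes the converted reference holds one fully C→T-converted copy and one fully G→A-converted copy of each sequence; the function name `convert_sequence` and the `_CT`/`_GA` suffixes are hypothetical and are not taken from methXsort's code or output naming.

```python
# Illustrative sketch only; not methXsort's actual implementation.
C_TO_T = str.maketrans("Cc", "Tt")  # converts one strand in silico
G_TO_A = str.maketrans("Gg", "Aa")  # same conversion seen from the opposite strand


def convert_sequence(name: str, seq: str) -> list[tuple[str, str]]:
    """Return (name, sequence) records for both in-silico conversions."""
    return [
        (f"{name}_CT", seq.translate(C_TO_T)),  # hypothetical suffix
        (f"{name}_GA", seq.translate(G_TO_A)),  # hypothetical suffix
    ]


# Print both converted copies of a toy sequence in FASTA layout.
for rec_name, rec_seq in convert_sequence("chr1", "ACGTTCGGAC"):
    print(f">{rec_name}\n{rec_seq}")
```

Converting the reference in the same way as the reads is what lets a k-mer based sorter compare them regardless of methylation state.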

Build xengsort Index

methXsort xengsort-index --host <host.fa> --graft <graft.fa> --index <index_dir> [-n N] [--fill FILL] [--statistics STAT] [-k K] [--xengsort_path <path>] [--xengsort_extra <extra>]

Convert Reads

Convert reads for bisulfite mapping (C→T for R1, G→A for R2):

methXsort convert-reads --read <R1.fastq.gz> [--read2 <R2.fastq.gz>] [--out <R1_out>] [--out2 <R2_out>] [--with_orig_seq]
  • --with_orig_seq: Store the original sequence in the header (slower, but traceable).
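
As a rough illustration of what read conversion with `--with_orig_seq` involves, the sketch below converts R1 with C→T and R2 with G→A and, on request, keeps the original sequence in the read name. The `_ORIGSEQ:` tag and the function name `convert_read` are assumptions made for this example; the header format methXsort actually writes may differ.

```python
# Illustrative sketch only; the header tag below is hypothetical.
C_TO_T = str.maketrans("C", "T")
G_TO_A = str.maketrans("G", "A")
ORIG_TAG = "_ORIGSEQ:"  # assumed tag, not methXsort's documented format


def convert_read(header: str, seq: str, mate: int,
                 with_orig_seq: bool = False) -> tuple[str, str]:
    """Convert one read: C->T for mate 1, G->A for mate 2."""
    table = C_TO_T if mate == 1 else G_TO_A
    if with_orig_seq:
        # Attach the original sequence to the read name, keeping any comment.
        name, _, comment = header.partition(" ")
        header = f"{name}{ORIG_TAG}{seq}" + (f" {comment}" if comment else "")
    return header, seq.translate(table)


print(convert_read("@r1 1:N:0", "ACGTCG", mate=1, with_orig_seq=True))
print(convert_read("@r1 2:N:0", "ACGTCG", mate=2))
```

The quality line is untouched by the conversion because the read length does not change.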

Classify Reads with xengsort

methXsort xengsort-classify --read <R1.fastq.gz> [--read2 <R2.fastq.gz>] --index <index_dir> --out_prefix <prefix> --threads <N> [--xengsort_path <path>] [--xengsort_extra <extra>]

Output:

  • {prefix}.host.1.fq.gz: host reads

  • {prefix}.graft.1.fq.gz: graft reads

  • {prefix}.both.1.fq.gz: reads that could originate from both

  • {prefix}.neither.1.fq.gz: reads that originate from neither host nor graft

  • {prefix}.ambiguous.1.fq.gz: the (few) ambiguous reads that cannot be classified

Split Statistics

Output CSV statistics for split reads:

methXsort stat-split --raw <raw_R1.fastq.gz> --host <host_R1.fastq.gz> --graft <graft_R1.fastq.gz>
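
The kind of summary this produces can be approximated with a short Python sketch that counts FASTQ records and reports each class as a share of the raw reads. The column names (`category`, `reads`, `percent_of_raw`) and all function names are assumptions for illustration; the CSV layout written by `stat-split` may be different.

```python
# Illustrative sketch only; column names are hypothetical.
import csv
import gzip
import sys


def open_fastq(path):
    """Open a plain or gzip-compressed FASTQ file in text mode."""
    return gzip.open(path, "rt") if str(path).endswith(".gz") else open(path)


def count_reads(lines):
    """Count FASTQ records, assuming four lines per record."""
    return sum(1 for _ in lines) // 4


def split_stats(total, host, graft):
    """Build CSV rows from read counts."""
    rows = [("category", "reads", "percent_of_raw")]
    for label, n in (("raw", total), ("host", host), ("graft", graft)):
        pct = round(100 * n / total, 2) if total else 0.0
        rows.append((label, n, pct))
    return rows


def main(raw_path, host_path, graft_path):
    counts = []
    for path in (raw_path, host_path, graft_path):
        with open_fastq(path) as handle:
            counts.append(count_reads(handle))
    csv.writer(sys.stdout).writerows(split_stats(*counts))


# Usage: python split_stats.py raw_R1.fastq.gz host_R1.fastq.gz graft_R1.fastq.gz
if __name__ == "__main__" and len(sys.argv) == 4:
    main(*sys.argv[1:])
```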

Restore FASTQ from xengsort Output

Restore original sequences in FASTQ files classified by xengsort:

methXsort restore-fastq --read <classified_R1.fq.gz> --out <restored_R1.fq.gz> [--read2 <classified_R2.fq.gz> --out2 <restored_R2.fq.gz>]
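
Conceptually, restoring means swapping the converted sequence back for the original one that was stored at conversion time. The sketch below assumes the original sequence was appended to the read name behind a hypothetical `_ORIGSEQ:` tag; both the tag and the function name `restore_read` are illustrative and not methXsort's documented format.

```python
# Illustrative sketch only; the header tag below is hypothetical.
ORIG_TAG = "_ORIGSEQ:"  # assumed tag, not methXsort's documented format


def restore_read(header: str, seq: str) -> tuple[str, str]:
    """Return (clean_header, original_sequence) for one FASTQ record."""
    name, sep, comment = header.partition(" ")
    base, tag, orig = name.partition(ORIG_TAG)
    if not tag:
        # No stored sequence: leave the record unchanged.
        return header, seq
    clean_header = base + (sep + comment if sep else "")
    return clean_header, orig


print(restore_read("@r1_ORIGSEQ:ACGTCG 1:N:0", "ATGTTG"))
```

Because the conversion never changes read length, the quality line of each record can be reused as is.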

Example Workflow

  1. Convert reference genomes:

    methXsort convert-ref mm10.fa -o mm10_converted.fa
    methXsort convert-ref hg38.fa -o hg38_converted.fa
    
  2. Build the xengsort index:

    methXsort xengsort-index --host mm10_converted.fa --graft hg38_converted.fa --index xengsort_index_7B
    
  3. Convert reads:

    methXsort convert-reads --read sample_R1.fastq.gz --read2 sample_R2.fastq.gz --with_orig_seq
    
  4. Classify reads with xengsort:

    methXsort xengsort-classify --read sample_R1.meth.gz --read2 sample_R2.meth.gz --index xengsort_index_7B --out_prefix sample_xengsort --threads 8
    
  5. Restore original FASTQ:

    methXsort restore-fastq --read sample_xengsort-graft.1.fq.gz --out sample_graft_R1_restored.fq.gz --read2 sample_xengsort-graft.2.fq.gz --out2 sample_graft_R2_restored.fq.gz
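
The five steps above can also be chained from a small Python driver. This is a sketch under stated assumptions: it only uses subcommands and flags shown in this README, it passes `--out`/`--out2` explicitly in step 3 rather than relying on default output names, and it copies the graft file names from step 5. The function names `workflow_commands` and `run_workflow` are hypothetical.

```python
# Illustrative driver; prints the commands by default instead of running them.
import shlex
import subprocess


def workflow_commands(sample="sample", host="mm10", graft="hg38",
                      index="xengsort_index_7B", threads=8):
    """Return the workflow as a list of argument lists."""
    prefix = f"{sample}_xengsort"
    r1_conv, r2_conv = f"{sample}_R1.meth.gz", f"{sample}_R2.meth.gz"
    return [
        ["methXsort", "convert-ref", f"{host}.fa", "-o", f"{host}_converted.fa"],
        ["methXsort", "convert-ref", f"{graft}.fa", "-o", f"{graft}_converted.fa"],
        ["methXsort", "xengsort-index", "--host", f"{host}_converted.fa",
         "--graft", f"{graft}_converted.fa", "--index", index],
        ["methXsort", "convert-reads", "--read", f"{sample}_R1.fastq.gz",
         "--read2", f"{sample}_R2.fastq.gz", "--out", r1_conv,
         "--out2", r2_conv, "--with_orig_seq"],
        ["methXsort", "xengsort-classify", "--read", r1_conv, "--read2", r2_conv,
         "--index", index, "--out_prefix", prefix, "--threads", str(threads)],
        # Graft file names follow step 5 of the example workflow.
        ["methXsort", "restore-fastq", "--read", f"{prefix}-graft.1.fq.gz",
         "--out", f"{sample}_graft_R1_restored.fq.gz",
         "--read2", f"{prefix}-graft.2.fq.gz",
         "--out2", f"{sample}_graft_R2_restored.fq.gz"],
    ]


def run_workflow(commands, dry_run=True):
    """Print each command; execute it only when dry_run is False."""
    for cmd in commands:
        print(shlex.join(cmd))
        if not dry_run:
            subprocess.run(cmd, check=True)  # stop at the first failure


run_workflow(workflow_commands())
```

Calling `run_workflow(workflow_commands(), dry_run=False)` would execute the steps in order and stop at the first failing command.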
    

Alternative workflow using bbsplit

1. Build bbsplit Index

methXsort bbsplit-index --host <host.fa> --graft <graft.fa> --host_name <host> --graft_name <graft> [--bbsplit_path <path>] [--bbsplit_index_path <dir>]

2. Run bbsplit

Split reads into host and graft using bbsplit:

methXsort bbsplit --read <R1.fastq.gz> [--read2 <R2.fastq.gz>] --host <host_name> --graft <graft_name> --out_host <host.bam> --out_graft <graft.bam> [--bbsplit_path <path>] [--bbsplit_extra <extra>]

3. Filter FASTQ by BAM

Extract reads from FASTQ that are present in a BAM file (e.g., after bbsplit):

methXsort filter-fastq-by-bam --read <R1.fastq.gz> [--read2 <R2.fastq.gz>] --bam <file.bam> --out <R1_out> [--out2 <R2_out>] [--filterbyname_path <path>]
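
The idea behind this step, keeping only FASTQ records whose names occur in a BAM file, can be shown with a pure-Python sketch. It is not how methXsort does it (the `--filterbyname_path` option suggests the work is delegated to `filterbyname.sh`); the function names are hypothetical, and the set of names is passed in directly, whereas in practice it might be collected from the BAM, for example from `query_name` of each pysam alignment.

```python
# Illustrative sketch only; not methXsort's actual implementation.
from itertools import islice


def read_name(header: str) -> str:
    """Extract the read name from a FASTQ header, dropping '@', comment and /1, /2."""
    name = header[1:].split()[0]
    if name.endswith(("/1", "/2")):
        name = name[:-2]
    return name


def filter_fastq(lines, keep_names):
    """Yield the lines of every four-line FASTQ record whose name is in keep_names."""
    it = iter(lines)
    while True:
        record = list(islice(it, 4))
        if len(record) < 4:
            break  # end of input (or a truncated final record)
        if read_name(record[0]) in keep_names:
            yield from record


fastq = ["@r1 1:N:0\n", "ACGT\n", "+\n", "IIII\n",
         "@r2/1\n", "TTTT\n", "+\n", "IIII\n"]
print("".join(filter_fastq(fastq, {"r2"})), end="")
```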

Notes

  • For all subcommands, use -h or --help to see detailed options.
  • Make sure all required external tools (bbsplit.sh, filterbyname.sh, xengsort) are in your PATH or specify their locations with the appropriate options.
  • For paired-end data, always provide both --read2 and --out2 where required.
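
Before starting a long run, it can help to confirm that the external tools are discoverable. A minimal sketch, using only the Python standard library and the tool names listed above (the function name `missing_tools` is hypothetical):

```python
# Check that the external tools named in the notes are on PATH.
import shutil

REQUIRED_TOOLS = ("bbsplit.sh", "filterbyname.sh", "xengsort")


def missing_tools(tools=REQUIRED_TOOLS, which=shutil.which):
    """Return the tools that cannot be found on PATH."""
    return [tool for tool in tools if which(tool) is None]


missing = missing_tools()
print("Missing tools:", ", ".join(missing) if missing else "none")
```

Tools reported as missing can still be used by passing their locations through the corresponding path options, such as --xengsort_path or --bbsplit_path.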

Contact

Email: ccrsfifx@nih.gov
