Chunk and scatter the regions in a bed or sequence dict file
chunked-scatter and scatter-regions
The chunked-scatter tool takes a bed file, fasta index, sequence dictionary
or vcf file as input and divides the contigs/chromosomes into overlapping
chunks of a given size. These chunks are then written to new bed files, one
chromosome per file. Small chromosomes are grouped together to avoid creating
thousands of files.
The scatter-regions tool works in a similar way, but with defaults and flags
tuned towards creating genome scatters for GATK tools.
Installation
- Install using pip:
  pip install chunked-scatter
- Install using conda (this requires conda with the bioconda channel configured):
  conda install chunked-scatter
Usage
chunked-scatter
usage: chunked-scatter [-h] [-p PREFIX] [-S] [-P] [-c SIZE]
[-m MINIMUM_BP_PER_FILE] [-o OVERLAP]
INPUT
Given a sequence dict, fasta index or a bed file, scatter over the defined
contigs/regions. Each contig/region will be split into multiple overlapping
regions, which will be written to a new bed file. Each contig will be placed
in a new file, unless the length of the contigs/regions doesn't exceed a given
number.
positional arguments:
INPUT The input file. The format is detected by the
extension. Supported extensions are: '.bed', '.dict',
'.fai', '.vcf', '.vcf.gz', '.bcf'.
optional arguments:
-h, --help show this help message and exit
-p PREFIX, --prefix PREFIX
The prefix of the output files. Output will be named
like: <PREFIX><N>.bed, in which N is an incrementing
number. Default 'scatter-'.
-S, --split-contigs If set, contigs are allowed to be split up over
multiple files.
-P, --print-paths If set, prints paths of the output files to STDOUT.
This makes the program usable in scripts and
workflows.
-c SIZE, --chunk-size SIZE
The size of the chunks. The first chunk in a region or
contig will be exactly length SIZE, subsequent chunks
will be SIZE + OVERLAP and the final chunk may be
anywhere from 0.5 to 1.5 times SIZE plus overlap. If a
region (or contig) is smaller than SIZE the original
region will be returned. Defaults to 1e6.
-m MINIMUM_BP_PER_FILE, --minimum-bp-per-file MINIMUM_BP_PER_FILE
The minimum number of bases represented within a
single output bed file. If an input contig or region
is smaller than this MINIMUM_BP_PER_FILE, then the
next contigs/regions will be placed in the same file
until this minimum is met. Defaults to 45e6.
-o OVERLAP, --overlap OVERLAP
The number of bases which each chunk should overlap
with the preceding one. Defaults to 150.
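The chunking rule described for --chunk-size and --overlap can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the tool's actual implementation: the name chunk_region is hypothetical, and the loop condition (keep cutting while more than 1.5 times SIZE remains) is one reading of the "0.5 to 1.5 times SIZE" rule for the final chunk.

```python
from typing import List, Tuple

Region = Tuple[str, int, int]  # (contig, start, end), as in a bed file


def chunk_region(region: Region, size: int = 1_000_000,
                 overlap: int = 150) -> List[Region]:
    """Split one region into overlapping chunks (hypothetical helper)."""
    if size <= 0:
        raise ValueError("size must be positive")
    contig, start, end = region
    chunks = []
    position = start
    # Cut full-size chunks while more than 1.5 * size remains, so the
    # final chunk covers between 0.5 and 1.5 times size (plus overlap).
    while end - position > 1.5 * size:
        # Every chunk except the first reaches back `overlap` bases.
        chunk_start = max(start, position - overlap)
        chunks.append((contig, chunk_start, position + size))
        position += size
    # The remainder is the final chunk; a short region is returned as-is.
    chunks.append((contig, max(start, position - overlap), end))
    return chunks
```

Under these assumptions, chunk_region(("chr1", 2000, 16000), 5000, 150) returns the chunks (2000, 7000), (6850, 12000) and (11850, 16000) on chr1.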
scatter-regions
usage: scatter-regions [-h] [-p PREFIX] [-S] [-P] [-s SCATTER_SIZE] INPUT
Given a sequence dict, fasta index or a bed file, scatter over the defined
contigs/regions. Creates a bed file where the contigs add up approximately to
the given scatter size.
positional arguments:
INPUT The input file. The format is detected by the
extension. Supported extensions are: '.bed', '.dict',
'.fai', '.vcf', '.vcf.gz', '.bcf'.
optional arguments:
-h, --help show this help message and exit
-p PREFIX, --prefix PREFIX
The prefix of the output files. Output will be named
like: <PREFIX><N>.bed, in which N is an incrementing
number. Default 'scatter-'.
-S, --split-contigs If set, contigs are allowed to be split up over
multiple files.
-P, --print-paths If set, prints paths of the output files to STDOUT.
This makes the program usable in scripts and
workflows.
-s SCATTER_SIZE, --scatter-size SCATTER_SIZE
The maximum size for the regions over which to
scatter. If contigs are not split, and a contig is
bigger than the maximum size, the contig will be
placed in its own file. Default: 1000000000.
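One way to read the --scatter-size behaviour when contigs are not split is as a greedy grouping of whole contigs. The sketch below is an assumption about how such a grouping could work, not a description of the tool's internals; the name scatter_contigs is hypothetical.

```python
from typing import List, Tuple

Contig = Tuple[str, int]  # (name, length in bases)


def scatter_contigs(contigs: List[Contig],
                    scatter_size: int = 1_000_000_000) -> List[List[Contig]]:
    """Greedily group whole contigs into files of at most scatter_size bases."""
    files: List[List[Contig]] = []
    current: List[Contig] = []
    current_bp = 0
    for name, length in contigs:
        # Start a new file when adding this contig would exceed the limit.
        if current and current_bp + length > scatter_size:
            files.append(current)
            current, current_bp = [], 0
        current.append((name, length))
        current_bp += length
    if current:
        files.append(current)
    return files
```

In this sketch a contig longer than scatter_size always ends up alone in its file, because the file before it is closed first and the next contig no longer fits beside it.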
Examples
bed file
Given a bed file located at /data/regions.bed:
chr1 100 1000
chr1 2000 16000
chr2 5000 10000
The command:
chunked-scatter -p /data/scatter_ -m 1000 -c 5000 /data/regions.bed
Will produce the following two output files:
/data/scatter_0.bed:
chr1 100 1000
chr1 2000 7000
chr1 6850 12000
chr1 11850 16000
/data/scatter_1.bed:
chr2 5000 10000
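The split into two files in this example follows from -m 1000: the chr1 chunks together exceed 1000 bases, so chr2 starts a new file. A minimal sketch of that grouping is given below. It assumes contigs are kept whole (no -S), regions arrive grouped by contig, and overlapping bases are counted once per chunk; group_into_files is a hypothetical name, not part of the tool.

```python
from typing import List, Tuple

Region = Tuple[str, int, int]  # (contig, start, end)


def group_into_files(regions: List[Region],
                     minimum_bp: int = 45_000_000) -> List[List[Region]]:
    """Group regions into files, keeping each contig within one file."""
    files: List[List[Region]] = []
    current: List[Region] = []
    current_bp = 0
    for contig, start, end in regions:
        contig_changed = bool(current) and current[-1][0] != contig
        # Only close a file at a contig boundary, once the minimum is met.
        if contig_changed and current_bp >= minimum_bp:
            files.append(current)
            current, current_bp = [], 0
        current.append((contig, start, end))
        current_bp += end - start
    if current:
        files.append(current)
    return files
```

With the default minimum of 45e6 bases, the same regions would all land in a single file.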
dict file
Given a dict file located at /data/ref.dict:
@SQ SN:chr1 LN:3000000
@SQ SN:chr2 LN:500000
The command:
chunked-scatter -p /data/scatter_ /data/ref.dict
Will produce the following output file at /data/scatter_0.bed:
chr1 0 1000000
chr1 999850 2000000
chr1 1999850 3000000
chr2 0 500000
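In the output above, every contig starts at 0 and ends at its LN value. A minimal sketch of turning @SQ lines into such whole-contig regions is shown below. It assumes whitespace-separated SN: and LN: tags as in the example; parse_dict_lines is a hypothetical helper, not the tool's own parser.

```python
from typing import Iterable, List, Tuple


def parse_dict_lines(lines: Iterable[str]) -> List[Tuple[str, int, int]]:
    """Turn @SQ header lines into (contig, 0, length) regions."""
    regions = []
    for line in lines:
        fields = line.split()
        if not fields or fields[0] != "@SQ":
            continue  # skip @HD and any other header lines
        # Split each TAG:value pair on the first colon only, so values
        # that contain colons themselves stay intact.
        tags = dict(field.split(":", 1) for field in fields[1:] if ":" in field)
        regions.append((tags["SN"], 0, int(tags["LN"])))
    return regions
```

These regions can then be chunked with the default chunk size and overlap to arrive at the four lines shown.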