fastQpick
Fast and memory-efficient sampling of DNA-seq or RNA-seq FASTQ data with or without replacement.
Installation
Install via PyPI
pip install fastQpick
Install from Source Code
Using pip:
pip install git+https://github.com/pachterlab/fastQpick.git
Or clone the repository and build manually:
git clone https://github.com/pachterlab/fastQpick.git
cd fastQpick
python -m build
python -m pip install dist/fastQpick-x.x.x-py3-none-any.whl
Usage
Command-line Interface
Run fastQpick with a specified fraction and options:
fastQpick --fraction FRACTION [OPTIONS] FASTQ_FILE1 FASTQ_FILE2 ...
Python API
Use fastQpick in your Python code:
from fastQpick import fastQpick
fastQpick(
input_files=['FASTQ_FILE1', 'FASTQ_FILE2', ...],
fraction=FRACTION,
...
)
Documentation
- Command-line Help: Use the following command to see all available options:
fastQpick --help
- Python API Help: Use the help function to explore the API:
help(fastQpick)
Options
- input_files (str, list, or tuple) List of input FASTQ files or directories containing FASTQ files. Required. Positional argument on command line.
- fraction (int or float) The fraction of reads to sample, as a float greater than 0. Any value equal to or greater than 1 forces sampling with replacement.
- seeds (int, str, or list) Random seed(s). Provide a single int (e.g. 42), an inclusive dash-delimited range string (e.g. "1-300" for seeds 1 through 300), or a list mixing ints and range strings (e.g. [1, 2, "5-7"]). On the command line, pass space-separated values (e.g. -s 1 2 5-7). Default: 42
- output_dir (str) Output directory. Default: ./fastQpick_output
- gzip_output (bool) Whether or not to gzip the output. Default: False (uncompressed)
- group_size (int) The number of files in each group (e.g., the files of a paired-end library). List the files of each group consecutively, separated by spaces. For example, a group of I1, R1, R2 has group_size=3. Default: 1 (unpaired)
- disable_replacement (bool) Sample without replacement. By default (flag omitted), sampling is done with replacement. In the Python API, pass replacement=False instead.
- overwrite (bool) Overwrite existing output files. Default: False
- low_memory (bool) Whether to use low memory mode (uses ~5x less memory than default, at the cost of a slower data structure-generation preprocessing step; see Benchmark). Has no effect in one_pass mode. Default: False
- one_pass (bool) Use the single-pass approximate sampler. Skips the read-counting pass and draws each read's output multiplicity independently (Poisson with replacement, Bernoulli without). Runs in O(1) memory with roughly half the read I/O; the output size equals fraction*n only in expectation. Default: False
- verbose (bool) Whether to print progress information. Default: True
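The seeds formats described above can be illustrated with a small stand-alone sketch. The helper name expand_seeds is hypothetical and is not part of the fastQpick API; it only shows how an int, an inclusive range string, or a mixed list could expand to a flat list of seeds, assuming non-negative seeds.

```python
def expand_seeds(spec):
    """Expand an int, a range string such as "5-7", or a list mixing both.

    Hypothetical helper for illustration; assumes non-negative seeds.
    """
    items = spec if isinstance(spec, (list, tuple)) else [spec]
    seeds = []
    for item in items:
        text = str(item)
        if "-" in text:
            start, end = (int(part) for part in text.split("-", 1))
            seeds.extend(range(start, end + 1))  # ranges are inclusive
        else:
            seeds.append(int(text))
    return seeds


print(expand_seeds(42))              # [42]
print(expand_seeds("1-3"))           # [1, 2, 3]
print(expand_seeds([1, 2, "5-7"]))   # [1, 2, 5, 6, 7]
```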
Features
- Efficient sampling of large FASTQ files.
- Works with both single and paired-end sequencing data.
- Supports sampling with or without replacement.
- Command-line interface and Python API for seamless integration.
- Memory efficient: the occurrence vector is sized to the largest per-read count actually drawn (one byte per read in the common case), and low-memory mode further avoids materializing the array of sampled indices.
- Time efficient: streams through the FASTQ and writes output in batches. Generates a full-size (fraction=1, with replacement) bootstrap replicate of a 500M-read FASTQ in ~26 minutes in standard mode, ~56 minutes in low-memory mode, and ~35 minutes in one-pass mode (see Benchmark below).
Sampling modes: exact (two-pass) vs. one-pass
fastQpick offers two samplers that trade exactness against resource usage.
Exact (default). Two passes are made over each file: a counting pass to learn the number of reads n, then a writing pass. Given n, exactly floor(fraction * n) reads are sampled (with or without replacement). This is the right choice when the output read count must be exact.
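The two-pass idea can be sketched on an in-memory FASTQ string. This is a minimal illustration, not fastQpick's implementation: the function name exact_sample is hypothetical, every record is assumed to span exactly four lines, and the standard-library random module stands in for whatever generator the tool uses.

```python
import io
import math
import random


def exact_sample(fastq_text, fraction, seed=42, replacement=True):
    """Two-pass sampling: count reads, then write read i counts[i] times."""
    # Pass 1: count reads (assumes exactly 4 lines per FASTQ record).
    n = sum(1 for _ in io.StringIO(fastq_text)) // 4
    m = math.floor(fraction * n)

    # Decide how many copies of each read to emit.
    rng = random.Random(seed)
    counts = [0] * n
    if replacement:
        for _ in range(m):
            counts[rng.randrange(n)] += 1
    else:
        for i in rng.sample(range(n), m):  # requires m <= n
            counts[i] = 1

    # Pass 2: stream the records again and write them according to counts.
    out = []
    handle = io.StringIO(fastq_text)
    for i in range(n):
        record = "".join(handle.readline() for _ in range(4))
        out.append(record * counts[i])
    return "".join(out)


reads = "".join(f"@read{i}\nACGT\n+\nIIII\n" for i in range(10))
sampled = exact_sample(reads, fraction=0.5)
print(sampled.count("@read"))  # 5, i.e. exactly floor(0.5 * 10) records
```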
One-pass (--one_pass / one_pass=True). A single pass is made. As each read streams by, its number of copies in the output is drawn independently:
- with replacement: Poisson(fraction),
- without replacement: Bernoulli(fraction) (the read is kept with probability fraction).
Because the draw for a read depends on neither its position nor the total read count, no read is favored over another, so the sample is unbiased and n never needs to be known. This eliminates the counting pass (roughly halving read I/O) and runs in O(1) memory (no occurrence vector is built). The cost is that the output size is fraction * n only in expectation, with relative standard deviation 1 / sqrt(fraction * n) under Poisson draws (slightly less under Bernoulli draws), which is about 0.03% for a 100M-read library at fraction=0.1. Paired/grouped files stay synchronized because all members of a group draw from the same per-group sub-seed.
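The one-pass scheme can be sketched as a generator over a stream of reads. The names poisson and one_pass_sample are hypothetical, and the Poisson draw uses Knuth's multiplication method as an assumption; the source does not say how fastQpick draws its variates. Reusing one seed for both mates stands in for the per-group sub-seed.

```python
import math
import random


def poisson(rng, lam):
    """Draw from Poisson(lam) with Knuth's method (fine for small lam)."""
    threshold = math.exp(-lam)
    k, p = 0, rng.random()
    while p > threshold:
        k += 1
        p *= rng.random()
    return k


def one_pass_sample(reads, fraction, seed, replacement=True):
    """Yield each read as many times as its independent draw says."""
    rng = random.Random(seed)
    for read in reads:  # the total number of reads is never needed
        if replacement:
            copies = poisson(rng, fraction)
        else:
            copies = 1 if rng.random() < fraction else 0
        for _ in range(copies):
            yield read


r1 = [f"read{i}/1" for i in range(1000)]
r2 = [f"read{i}/2" for i in range(1000)]
out1 = list(one_pass_sample(r1, 0.1, seed=42))
out2 = list(one_pass_sample(r2, 0.1, seed=42))
# Same seed, same sequence of draws: mates stay paired in the output.
print(all(a[:-2] == b[:-2] for a, b in zip(out1, out2)), len(out1) == len(out2))
```

Because the generator keeps only the current read and the random state, its memory use does not grow with the number of reads.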
Choosing between the three:
| Property | Exact, default | Exact, low-memory (-l) | One-pass (-p) |
|---|---|---|---|
| Passes over each file | 2 | 2 | 1 |
| Peak memory | index array of m + length-n count temporary, then ~1 byte/read | ~1 byte/read occurrence vector only | O(1) |
| Output size | exactly floor(fraction*n) | exactly floor(fraction*n) | fraction*n in expectation |
Both exact modes write from the same occurrence vector (~1 byte per read once built), so they use similar memory during the long writing pass. They differ at the moment the vector is built. The default draws all m indices at once, where m = floor(fraction * n) is the number of reads to sample, and counts them with np.bincount; this is fast, but the index array plus the length-n counting temporary raise the transient peak. Low-memory mode streams indices one at a time into the vector: there is no index array and the peak is lower (~5× lower peak memory end-to-end on a 500M-read bootstrap), at the cost of some speed. low_memory has no effect in one-pass mode, which is already O(1) memory; setting both simply prints a note.
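The difference between the two ways of building the occurrence vector can be sketched with the standard library alone. This is an illustration under assumptions: the function names are hypothetical, collections.Counter stands in for np.bincount, and a bytearray models the one-byte-per-read vector (so counts are capped at 255 in this sketch).

```python
import random
from collections import Counter


def build_all_at_once(n, m, seed):
    """Draw all m indices first, then count them (keeps an m-long index list)."""
    rng = random.Random(seed)
    indices = [rng.randrange(n) for _ in range(m)]
    tallies = Counter(indices)
    return bytearray(tallies.get(i, 0) for i in range(n))


def build_streaming(n, m, seed):
    """Increment the vector as each index is drawn (no index list is kept)."""
    rng = random.Random(seed)
    vector = bytearray(n)  # one byte per read
    for _ in range(m):
        vector[rng.randrange(n)] += 1
    return vector


a = build_all_at_once(1000, 1000, seed=42)
b = build_streaming(1000, 1000, seed=42)
print(a == b, sum(a))  # True 1000: both tally the same m draws
```

Both functions end with the same vector; only the transient storage differs, which is the trade-off described above.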
Benchmark
Generating one full-size bootstrap replicate (fraction=1, with replacement) of a 500-million-read uncompressed FASTQ (143 GB, 150 bp reads), single-threaded on an Intel Xeon Gold 6152 (2.10 GHz). Each mode reads its own file with the page cache evicted immediately beforehand, so every run starts cold.
| Mode | Wall time | Peak memory |
|---|---|---|
| Exact, default | ~26 min | ~9.4 GB |
| Exact, low-memory (-l) | ~56 min | ~1.4 GB |
| One-pass (-p) | ~35 min | ~0.11 GB |
Peak memory scales with read count, not read length or file size. The default mode is fastest in wall-clock time here because the 143 GB file fits in the machine's memory, so its second (writing) pass re-reads the file from the page cache rather than from disk; the one-pass mode instead reads and writes concurrently in a single pass, moving less total data but contending for the disk. The one-pass time advantage is expected when the library exceeds available memory or the sampled fraction is small, while its ~0.11 GB footprint is constant regardless of read count. The low-memory mode is the most CPU-intensive of the three because it draws each read's multiplicity through the standard-library random generators one read at a time.
Low memory mode vs. standard
When fraction=1 (i.e., the number of reads to sample equals the number of reads in the FASTQ), low-memory mode compares to standard mode as follows:
- Adds an extra ~3.5 seconds per million reads per group_size (e.g., a 500M-read FASTQ took ~56 minutes in low-memory mode vs. ~26 minutes in standard mode)
- Saves ~16 MB RAM per million reads (e.g., a 500M-read FASTQ used ~1.4 GB RAM in low-memory mode vs. ~9.4 GB RAM in standard mode)
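As a rough planning aid, the two per-million-read figures above can be turned into a back-of-the-envelope estimate. The function name low_memory_tradeoff is hypothetical, and extrapolating linearly from the single fraction=1 benchmark to other library sizes or hardware is an assumption.

```python
def low_memory_tradeoff(n_reads, group_size=1):
    """Estimate extra minutes and saved GB from the benchmark-derived rates."""
    millions = n_reads / 1e6
    extra_minutes = 3.5 * millions * group_size / 60  # ~3.5 s per million reads
    saved_gb = 16 * millions / 1000                   # ~16 MB per million reads
    return extra_minutes, saved_gb


extra, saved = low_memory_tradeoff(500_000_000)
print(f"~{extra:.0f} extra minutes, ~{saved:.0f} GB saved")
```

For the 500M-read benchmark this gives roughly 29 extra minutes and 8 GB saved, in line with the measured ~30-minute and ~8 GB differences.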
Examples
1. Sample 10% of reads with replacement from a FASTQ file:
Command-line
fastQpick --fraction 0.1 input.fastq
Python
from fastQpick import fastQpick
fastQpick(
input_files='input.fastq',
fraction=0.1
)
Sampling is done with replacement by default. Pass --disable_replacement (CLI) or replacement=False (Python) to sample without replacement.
2. Sample 100% of reads with replacement from multiple paired FASTQ files (R1, R2) across three seeds (i.e., bootstrapping):
Command-line
fastQpick --fraction 1 -s 42 43 44 -g 2 input1_R1.fastq input1_R2.fastq
Python
from fastQpick import fastQpick
fastQpick(
input_files=['input1_R1.fastq', 'input1_R2.fastq'],
fraction=1,
seeds=[42, 43, 44],
replacement=True,
group_size=2,
)
Seeds can also be given as inclusive dash-delimited ranges, which is convenient for many bootstrap replicates. For example, -s 1-300 (or seeds="1-300") runs seeds 1 through 300, and values can be mixed: -s 1 5 10-12 (or seeds=[1, 5, "10-12"]) runs seeds 1, 5, 10, 11, and 12.
3. Sample ~10% of reads in a single pass (approximate output size, O(1) memory):
Command-line
fastQpick --fraction 0.1 --one_pass input.fastq
Python
from fastQpick import fastQpick
fastQpick(
input_files='input.fastq',
fraction=0.1,
one_pass=True,
)
The one-pass sampler skips the counting pass and draws each read's multiplicity on the fly, so it never needs to know the read count. The output contains fraction * n reads in expectation. See Sampling modes for the exact/one-pass trade-off.
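To get a feel for how far the one-pass output size can drift from fraction * n, the expected size and its relative spread can be computed directly. This is a sketch: the function name is hypothetical, and it relies only on the standard variances of the Poisson and binomial distributions.

```python
import math


def one_pass_output_size(n, fraction, replacement=True):
    """Return the expected output size and its relative standard deviation."""
    mean = fraction * n
    # A sum of n Poisson(fraction) draws has variance fraction * n;
    # a sum of n Bernoulli(fraction) draws has variance n * fraction * (1 - fraction).
    variance = mean if replacement else mean * (1 - fraction)
    return mean, math.sqrt(variance) / mean


mean, rel_sd = one_pass_output_size(100_000_000, 0.1)
print(f"{mean:.0f} reads expected, relative SD {rel_sd:.4%}")
```

For a 100M-read library at fraction=0.1 the relative standard deviation comes out at about 0.03%, so the output size typically differs from the target by a few thousand reads out of ten million.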
License
fastQpick is licensed under the 2-clause BSD license. See the LICENSE file for details.
Contributing
We welcome contributions! Please see the CONTRIBUTING.md file for guidelines on how to get involved.