Fast and memory-efficient sampling of DNA-seq or RNA-seq FASTQ data with or without replacement.
fastQpick
Fast and memory-efficient sampling of DNA-seq or RNA-seq FASTQ data with or without replacement. Useful for generating bootstrap replicates to estimate technical variance in downstream analyses, and for subsampling large datasets for testing and benchmarking.
Installation
Install via PyPI
pip install fastQpick
Install from Source Code
Using pip:
pip install git+https://github.com/pachterlab/fastQpick.git
Usage
Command-line Interface
Run fastQpick with a specified fraction and options:
fastQpick [OPTIONS] FASTQ_FILE1 FASTQ_FILE2 ...
Python API
Use fastQpick in your Python code:
from fastQpick import fastQpick
fastQpick(
    input_files=['FASTQ_FILE1', 'FASTQ_FILE2', ...],
    ...
)
Documentation
- Command-line Help: Use the following command to see all available options: fastQpick --help
- Python API Help: Use the help function to explore the API: help(fastQpick)
Tutorials
Two Jupyter notebooks in notebooks/ walk through fastQpick end-to-end:
- intro.ipynb: Getting started on synthetic data. Simulates a small RNA-seq experiment with known transcript abundances, draws bootstrap replicates with replacement (fraction=1.0, replacement=True), and shows that the bootstrap standard errors recover the analytic multinomial sampling error.
- yeast_example.ipynb: Real-data application reproducing Figure 1 of the paper. Bootstraps a paired-end yeast RNA-seq dataset (SRA SRR453566), re-quantifies each replicate with kallisto, and characterizes the bootstrap distribution of the transcript abundance estimates.
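The claim that bootstrap standard errors recover the multinomial sampling error can be illustrated without any FASTQ files. The sketch below is not taken from the notebooks: the function name, abundances, read count, and replicate count are all hypothetical, and reads are reduced to the index of the transcript they came from. It resamples the reads with replacement at fraction 1.0 and compares the spread of the per-transcript counts with the analytic value sqrt(N * p * (1 - p)).

```python
import math
import random
import statistics


def bootstrap_vs_analytic(abundances, num_reads, num_replicates, seed=0):
    """Return (bootstrap_se, analytic_se) per transcript for a toy experiment.

    All names and values here are illustrative assumptions, not fastQpick's API.
    """
    rng = random.Random(seed)
    transcripts = range(len(abundances))

    # Simulate the experiment: each read is labelled by its transcript of origin.
    reads = rng.choices(transcripts, weights=abundances, k=num_reads)
    observed = [reads.count(t) for t in transcripts]

    # Bootstrap: resample the reads with replacement (fraction 1.0) and recount.
    replicate_counts = []
    for _ in range(num_replicates):
        resampled = rng.choices(reads, k=num_reads)
        replicate_counts.append([resampled.count(t) for t in transcripts])

    results = []
    for t in transcripts:
        bootstrap_se = statistics.stdev([rep[t] for rep in replicate_counts])
        p_hat = observed[t] / num_reads
        # Standard deviation of one multinomial count with probability p_hat.
        analytic_se = math.sqrt(num_reads * p_hat * (1 - p_hat))
        results.append((bootstrap_se, analytic_se))
    return results


for t, (boot, exact) in enumerate(bootstrap_vs_analytic([0.6, 0.3, 0.1], 2000, 200)):
    print(f"transcript {t}: bootstrap SE = {boot:.1f}, analytic SE = {exact:.1f}")
```

The two columns should agree to within the Monte Carlo noise of the replicate count. In a real analysis, the recount step would be replaced by re-quantifying each bootstrap FASTQ.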
Features
- Efficient sampling of large FASTQ files.
- Memory efficient: the occurrence vector is sized to the largest per-read count actually drawn (one byte per read in the common case), and low-memory mode further avoids materializing the array of sampled indices.
- Time efficient: streams through the FASTQ and writes output in batches, generating a full-size (fraction=1, with replacement) bootstrap replicate of a 500M-read FASTQ in ~26 minutes in standard mode, ~56 minutes in low-memory mode, and ~35 minutes in one-pass mode (see Benchmark below).
- Gzip-compressed output by default, using the ISA-L-accelerated isal library to keep compression from bottlenecking the write pass. Pass --disable-gzip (CLI) or disable_gzip=True (Python API) to write plain FASTQ instead.
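The occurrence-vector idea behind the memory and time bullets can be sketched in a few lines. This is a minimal illustration under stated assumptions, not fastQpick's actual implementation: the function names and the batch size are hypothetical, counts are held in a one-byte-per-read bytearray, records are assumed to be exactly four lines each, and output is written as plain text rather than gzip.

```python
import io
import random


def draw_occurrence_vector(num_reads, fraction, rng):
    """Count how many times each read index is drawn, sampling with replacement."""
    num_draws = round(num_reads * fraction)
    counts = bytearray(num_reads)  # one byte per read; assumes no count exceeds 255
    for _ in range(num_draws):
        idx = rng.randrange(num_reads)
        if counts[idx] == 255:
            raise OverflowError("count exceeds one byte; a wider type would be needed")
        counts[idx] += 1
    return counts


def write_bootstrap(fastq_in, fastq_out, counts, batch_size=1000):
    """Stream 4-line FASTQ records and write record i counts[i] times, in batches."""
    batch = []
    for count in counts:
        record = [fastq_in.readline() for _ in range(4)]
        batch.extend(record * count)  # a count of 0 drops the read
        if len(batch) >= 4 * batch_size:
            fastq_out.writelines(batch)
            batch.clear()
    fastq_out.writelines(batch)  # flush the final partial batch


# Toy input: ten identical-length reads held in memory.
reads = "".join(f"@read{i}\nACGT\n+\nIIII\n" for i in range(10))
counts = draw_occurrence_vector(10, 1.0, random.Random(0))
out = io.StringIO()
write_bootstrap(io.StringIO(reads), out, counts)
print(out.getvalue().count("@read"), "records written")
```

Only the count vector and the current batch are held in memory, so the input is read once from start to end. In this sketch, reusing the same `counts` for both files of a paired-end run would keep mates in the same order.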
License
fastQpick is licensed under the 2-clause BSD license. See the LICENSE file for details.
Contributing
We welcome contributions! Please see the CONTRIBUTING.md file for guidelines on how to get involved.
Download files
File details
Details for the file fastqpick-0.3.0.tar.gz.
File metadata
- Download URL: fastqpick-0.3.0.tar.gz
- Upload date:
- Size: 20.5 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.10.20
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 5113ed607f72a62d1a6ea33d095f48236db7c3b0a27424263ef6aaaa81d3ee5a |
| MD5 | 08c6249ad65d10f26d327162f2252ee2 |
| BLAKE2b-256 | c5c18649518878406843c1955d64b8e2ec1e49e756780106a6c40ab2098f7f41 |
File details
Details for the file fastqpick-0.3.0-py3-none-any.whl.
File metadata
- Download URL: fastqpick-0.3.0-py3-none-any.whl
- Upload date:
- Size: 15.7 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.10.20
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | dc93c239fd48250f12347df733e3a693011099a19c24fc4e16e48503056ea82e |
| MD5 | 3b7477867b95f057b01ae0fca6c57082 |
| BLAKE2b-256 | 2fe62e21a38447bfa930eb2b758a9fbf9763e3c053bdaaa80ae36c8ccdde6df9 |