Skip to main content

Fast and memory-efficient sampling of DNA-Seq or RNA-seq fastq data with or without replacement.

Project description

fastQpick

Fast and memory-efficient sampling of DNA-seq or RNA-seq FASTQ data with or without replacement.


Installation

Install via PyPI

pip install fastQpick

Install from Source Code

Using pip:

pip install git+https://github.com/pachterlab/fastQpick.git

Or clone the repository and build manually:

git clone https://github.com/pachterlab/fastQpick.git
cd fastQpick
python -m build
python -m pip install dist/fastQpick-x.x.x-py3-none-any.whl

Usage

Command-line Interface

Run fastQpick with a specified fraction and options:

fastQpick --fraction FRACTION [OPTIONS] FASTQ_FILE1 FASTQ_FILE2 ...

Python API

Use fastQpick in your Python code:

from fastQpick import fastQpick

fastQpick(
    input_file_list=['FASTQ_FILE1', 'FASTQ_FILE2', ...],
    fraction=FRACTION,
    ...
)

Documentation

  • Command-line Help: Use the following command to see all available options:

    fastQpick --help
    
  • Python API Help: Use the help function to explore the API:

    help(fastQpick)
    

Options

  • input_files (str, list, or tuple) List of input FASTQ files or directories containing FASTQ files. Required. Positional argument on command line.
  • fraction (int or float) The fraction of reads to sample, as a float greater than 0. Any value equal to or greater than 1 will turn on the -r flag automatically.
  • seed (int or str) Random seed(s). Can provide multiple seeds separated by commas. Default: 42
  • output_dir (str) Output directory. Default: ./fastQpick_output
  • gzip_output (bool) Whether or not to gzip the output. Default: False (uncompressed)
  • group_size (int) The size of grouped files. Provide each pair of files sequentially, separated by a space. E.g., I1, R1, R2 - would have group_size=3. Default: 1 (unpaired)
  • replacement (bool) Sample with replacement. Default: False (without replacement).
  • overwrite (bool) Overwrite existing output files. Default: False
  • low_memory (bool) Whether to use low memory mode (uses ~5.5x less memory than default, but adds marginal time to the data - structure generation preprocessing). Default: False
  • verbose (bool) Whether to print progress information. Default: True

Features

  • Efficient sampling of large FASTQ files.
  • Works with both single and paired-end sequencing data.
  • Supports sampling with or without replacement.
  • Command-line interface and Python API for seamless integration.
  • Memory efficient - in low-memory mode, only uses as much memory as a list of (small) integers the length of the number of reads in the fastq file for each file.
  • Time efficient - only passes through the fastq once and writes to output in batches - can process 600M reads in 10-15 minutes

Low memory mode vs. standard

Low memory mode vs. standard, when fraction=1 (i.e., number of reads to sample is the same as the number of reads in the fastq):

  • Adds an extra ~1-3 seconds per million reads per group_size (i.e., 500M reads would take 30 minutes instead of 20-25 minutes)
  • Saves an extra ~40MB RAM per million reads (i.e., 500M reads would take 3.75GB RAM vs 20.6GB RAM)

Examples

1. Sample 10% of reads with replacement from a FASTQ file:

Command-line

fastQpick --fraction 0.1 -r input.fastq

Python

from fastQpick import fastQpick

fastQpick(
    input_files='input.fastq',
    fraction=0.1,
    replacement=True
)

2. Sample 100% of reads with replacement from multiple paired FASTQ files (R1, R2) across three seeds (i.e., bootstrapping):

Command-line

fastQpick --fraction 1 -s 42,43,44 -r -g 2 input1_R1.fastq input1_R2.fastq

Python

from fastQpick import fastQpick

fastQpick(
    input_files='input.fastq',
    fraction=1,
    seed="42,43,44",
    replacement=True,
    group_size=2,
)

License

fastQpick is licensed under the 2-clause BSD license. See the LICENSE file for details.


Contributing

We welcome contributions! Please see the CONTRIBUTING.md file for guidelines on how to get involved.

Project details


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

fastqpick-0.1.0.tar.gz (12.7 kB view details)

Uploaded Source

Built Distribution

If you're not sure about the file name format, learn more about wheel file names.

fastQpick-0.1.0-py3-none-any.whl (10.4 kB view details)

Uploaded Python 3

File details

Details for the file fastqpick-0.1.0.tar.gz.

File metadata

  • Download URL: fastqpick-0.1.0.tar.gz
  • Upload date:
  • Size: 12.7 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.1.0 CPython/3.10.16

File hashes

Hashes for fastqpick-0.1.0.tar.gz
Algorithm Hash digest
SHA256 7d0d82c61952ef0a15f2cc3770b1b525ca6879551bfea17e8a7f725cb3d28156
MD5 e44f6fa04bc4d45ff42736fdc26847eb
BLAKE2b-256 39b05089ef5b978dcecf29f9994b3907d4b24a2efc2f93d549832bf67bf9938e

See more details on using hashes here.

File details

Details for the file fastQpick-0.1.0-py3-none-any.whl.

File metadata

  • Download URL: fastQpick-0.1.0-py3-none-any.whl
  • Upload date:
  • Size: 10.4 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.1.0 CPython/3.10.16

File hashes

Hashes for fastQpick-0.1.0-py3-none-any.whl
Algorithm Hash digest
SHA256 554ae856cb2df4229741946f47cee15cae27ab167473535820936d9ac8dced1c
MD5 ee55d79034e4d68e052c53f93224e38a
BLAKE2b-256 8d60f06a8ffa90c880a0d148e4b34b26f0068bcd7a16e283eabe5170a23843c8

See more details on using hashes here.

Supported by

AWS Cloud computing and Security Sponsor Datadog Monitoring Depot Continuous Integration Fastly CDN Google Download Analytics Pingdom Monitoring Sentry Error logging StatusPage Status page