Fastq Downloader (WIP)

This Python package lets you download FASTQ files from ENA. It can automatically merge and rename the FASTQ files based on the input file you provide. If you have trouble downloading this repo's release, please try fastgit to see whether that project can help you.

Notice for Readme

If you are reading this on PyPI, please go to GitHub for the latest README file; the PyPI README is only updated when a new version is released.

How to Use

Installation

conda create --name fastq-downloader -c conda-forge -c hcc -c bioconda aspera-cli snakemake-minimal httpx lxml click beautifulsoup4 python=3.9
## use what ever you want to download the gist mentioned above to thisname.smk
conda activate fastq-downloader
pip install fastq-downloader==0.4.2

Usage

Create an info.tsv before you start. You can copy the sample table from the GEO website: open vim, type :set paste to enter paste mode, and paste the table in. You can edit the sample names to suit your needs; the downloaded files will be renamed accordingly. Save the file under any name you like, then exit vim. Whitespace in the sample names will be converted to underscores automatically.

First, break the info.tsv down into individual sample accession files:

## step 1
## you can use fastq-downloader breakdown-tsv --help to view the help
fastq-downloader breakdown-tsv \
  -i path/to/info.tsv \
  -o path/to/output/dir

All paths can be relative.
Then start the download:

## step 2
fastq-downloader smk \
  -i path/to/info.tsv \
  -o path/to/output/dir \
  -t {number_of_threads_you_want} \
  --download-backend ascp

After the download is done, you can use the find command to collect all of the fastq.gz files and link them to another place. For example, if a batch of files was downloaded to a folder named download, the folder structure should look like this:

# this is what inside download folder
. 
└── merged_fastq
    ├── GSM5159835
    │   ├── wt_1_R1.fastq.gz
    │   └── wt_1_R2.fastq.gz
    ├── GSM5159836
    │   ├── wt_2_R1.fastq.gz
    │   └── wt_2_R2.fastq.gz
    └── GSM5159837
        ├── wt_3_R1.fastq.gz
        └── wt_3_R2.fastq.gz

Then run find -name "*fastq.gz" | xargs -I {} ln -s {} . from inside the download folder.
All fastq.gz files will be linked into the root of the download folder:

.
├── merged_fastq
│   ├── GSM5159835
│   │   ├── wt_1_R1.fastq.gz
│   │   └── wt_1_R2.fastq.gz
│   ├── GSM5159836
│   │   ├── wt_2_R1.fastq.gz
│   │   └── wt_2_R2.fastq.gz
│   └── GSM5159837
│       ├── wt_3_R1.fastq.gz
│       └── wt_3_R2.fastq.gz
├── wt_1_R1.fastq.gz -> merged_fastq/GSM5159835/wt_1_R1.fastq.gz
├── wt_1_R2.fastq.gz -> merged_fastq/GSM5159835/wt_1_R2.fastq.gz
├── wt_2_R1.fastq.gz -> merged_fastq/GSM5159836/wt_2_R1.fastq.gz
├── wt_2_R2.fastq.gz -> merged_fastq/GSM5159836/wt_2_R2.fastq.gz
├── wt_3_R1.fastq.gz -> merged_fastq/GSM5159837/wt_3_R1.fastq.gz
└── wt_3_R2.fastq.gz -> merged_fastq/GSM5159837/wt_3_R2.fastq.gz

This should make your subsequent processing more convenient.
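The same linking step can also be sketched in Python. This is only an illustration of the find and ln -s command above, not part of fastq-downloader; the function name link_fastq_files is hypothetical, and the sketch assumes a filesystem that supports symbolic links.

```python
from pathlib import Path


def link_fastq_files(download_dir):
    """Symlink every *fastq.gz found below download_dir into download_dir itself.

    Hypothetical helper that mirrors the find | xargs ln -s step of this README.
    """
    root = Path(download_dir)
    created = []
    # sorted() materializes the matches first, so links created in the loop
    # are not picked up by the directory walk.
    for source in sorted(root.rglob("*fastq.gz")):
        if source.parent == root or source.is_symlink():
            continue  # already at the top level, or a link from an earlier run
        link = root / source.name
        if link.exists() or link.is_symlink():
            continue  # never overwrite an existing file or link
        # A relative target keeps the links valid if the whole folder is moved.
        link.symlink_to(source.relative_to(root))
        created.append(link)
    return created
```

Calling link_fastq_files("download") returns the list of links it created; running it a second time creates nothing new.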

These commands should cover most situations. If you want more flexibility and control over the underlying snakemake workflow, you can pass extra arguments through the -s option of the smk subcommand, or you can use the snakemake file in this repo directly.

For other advanced uses, you can always use --help or read the source code.

The tool automatically tries to download each file, checks its md5, retries if the integrity check fails, merges the files if the number of files is more than 2, and finally renames the files to the description you provided.
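The check-and-retry part of that flow can be sketched as follows. This is a minimal illustration, not the package's actual implementation: the names md5_of and download_with_retry, the fetch callback, and the default of three attempts are all assumptions made for the example.

```python
import hashlib
from pathlib import Path


def md5_of(path, chunk_size=1 << 20):
    """Return the hex MD5 of a file, reading it in chunks to limit memory use."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def download_with_retry(fetch, target, expected_md5, max_attempts=3):
    """Call fetch(target) until the file's MD5 matches or attempts run out.

    fetch is a hypothetical callable that writes the download to target.
    Returns the number of the attempt that succeeded.
    """
    for attempt in range(1, max_attempts + 1):
        fetch(target)
        if md5_of(target) == expected_md5:
            return attempt
        # Discard the corrupt file before trying again.
        Path(target).unlink(missing_ok=True)
    raise RuntimeError(f"{target}: MD5 mismatch after {max_attempts} attempts")
```

Passing the downloader in as a callback keeps the integrity logic independent of the backend, so the same loop would work with either ascp or wget.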

Prepare the info.tsv like this. Note that the file must be tab-delimited (a tsv file); the simplest way to get one is to paste the table from Excel or the GEO website, or to start from a csv file downloaded from SRA Run Selector.

GSM12345  h3k9me3_rep1
GSM12345  h3k9me3_rep2
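Before running the tool, it may help to check that the file really is tab-delimited. The sketch below is not part of fastq-downloader; read_info_tsv is a hypothetical helper, the accessions in the example are made up, and the underscore replacement only mirrors the behaviour described in the usage notes above.

```python
import csv
import io
import re


def read_info_tsv(handle):
    """Return (accession, sample_name) pairs from a two-column, tab-delimited file."""
    pairs = []
    for line_no, row in enumerate(csv.reader(handle, delimiter="\t"), start=1):
        if not row or all(not cell.strip() for cell in row):
            continue  # skip blank lines
        if len(row) != 2:
            raise ValueError(
                f"line {line_no}: expected 2 tab-separated fields, got {len(row)}"
            )
        accession, name = (cell.strip() for cell in row)
        # Assumption: whitespace inside a sample name becomes an underscore.
        pairs.append((accession, re.sub(r"\s+", "_", name)))
    return pairs


# Made-up example rows; real files would be opened with open(path, newline="").
example = io.StringIO("GSM12345\th3k9me3 rep1\nGSM12346\th3k9me3 rep2\n")
print(read_info_tsv(example))
```

A line separated by spaces instead of a tab is parsed as a single field, so it raises a ValueError that names the offending line.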

Notice for Commonly Encountered Problems

  1. ascp reports that it failed to authenticate:
  • This can be a network issue (see this issue on GitHub) or a server issue at EBI (see this post on Biostars).
  • If you encounter this problem, delete the download target folder and change the --download-backend argument to wget to use FTP links instead.
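The workaround above could be scripted roughly as shown below. This is a sketch under assumptions: build_smk_command and rerun_with_wget are hypothetical helper names, the flags are the ones shown in step 2 of this README, and the output directory is assumed to be the download target folder and safe to delete. If that folder also holds the output of step 1, breakdown-tsv would have to be rerun first.

```python
import shutil
import subprocess


def build_smk_command(info_tsv, out_dir, threads, backend):
    """Assemble the step 2 command from this README as an argument list."""
    return [
        "fastq-downloader", "smk",
        "-i", str(info_tsv),
        "-o", str(out_dir),
        "-t", str(threads),
        "--download-backend", backend,
    ]


def rerun_with_wget(info_tsv, out_dir, threads):
    """Delete the download target folder, then rerun step 2 with the wget backend."""
    # Assumption: out_dir is the download target folder and is safe to remove.
    shutil.rmtree(out_dir, ignore_errors=True)
    command = build_smk_command(info_tsv, out_dir, threads, "wget")
    # check=True raises CalledProcessError if the rerun fails as well.
    return subprocess.run(command, check=True)
```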

Todo

  • test for paired-end reads run merge
  • publish to bioconda
  • if fail, retry
  • use dag to run the pipeline (sort of, implemented by using snakemake)
  • option to resume download when md5 not match
  • option to continue from last time download
  • implement second level parallelization

Known Issues

  • Downloading fails for files that contain both paired-end and single-end reads (yes, such files exist).

Update Content

  • 0.4.3:
    • Update the readme.
    • Split the download process into two steps and add wget as a new download backend.
  • 0.3.2:
    • Add a filter for library layout (some SRA entries have content that does not match their library layout).
