Skip to main content

No project description provided

Project description

Fastq Downloader (WIP)

This python package let you download fastq files from ena. It can automatic merge and rename fastq files based on the input file provided. If you have trouble downloading this repo's release, please go to fastgit to see if this project can help you.

Notice for Readme

If you are reading this from pypi, please go to github to read the latest readme file, for I won't modify pypi readme unless new version released.

How to Use

Installation

conda create --name fastq-downloader -c conda-forge -c hcc -c bioconda aspera-cli snakemake-minimal httpx lxml click beautifulsoup4 python=3.9
## use what ever you want to download the gist mentioned above to thisname.smk
conda activate fastq-downloader
pip install fastq-downloader==0.4.4

Usage

make sure to create an info.tsv before, you can just copy from the GEO website, then go to vim, type :set paste to get into paste mode, paste the table into vim, and you can modify the names of samples to suit your need, the downloaded file will then be renamed too. Save the file as whatever name you want, then exit vim the white space will be auto convert to underscore

first, we have to turn info tsv to individual sample accession files

## step 1
## you can use fastq-downloader breakdown-tsv --help to view the help
fastq-downloader breakdown-tsv \
  -i path/to/info.tsv \
  -o path/to/output/dir

All paths can be relative paths.
Then we can start to download.

## step 2
fastq-downloader smk \
  -i path/to/info.tsv \
  -o path/to/output/dir \
  -t {number_of_threads_you_want} \
  --download-backend ascp

after the download is done, you can use find command to get all of the fastq.gz files and link them to anoter place. For example I have a bunch of file downloaded to download folder, the folder structure should look like this:

# this is what inside download folder
. 
└── merged_fastq
    ├── GSM5159835
    │   ├── wt_1_R1.fastq.gz
    │   └── wt_1_R2.fastq.gz
    ├── GSM5159836
    │   ├── wt_2_R1.fastq.gz
    │   └── wt_2_R2.fastq.gz
    └── GSM5159837
        ├── wt_3_R1.fastq.gz
        └── wt_3_R2.fastq.gz

Then execute find -name "*fastq.gz" | xargs -I {} ln -s {} .
All fastq.gz files will be linked to the root of download folder:

.
├── merged_fastq
│   ├── GSM5159835
│   │   ├── wt_1_R1.fastq.gz
│   │   └── wt_1_R2.fastq.gz
│   ├── GSM5159836
│   │   ├── wt_2_R1.fastq.gz
│   │   └── wt_2_R2.fastq.gz
│   └── GSM5159837
│       ├── wt_3_R1.fastq.gz
│       └── wt_3_R2.fastq.gz
├── wt_1_R1.fastq.gz -> merged_fastq/GSM5159835/wt_1_R1.fastq.gz
├── wt_1_R2.fastq.gz -> merged_fastq/GSM5159835/wt_1_R2.fastq.gz
├── wt_2_R1.fastq.gz -> merged_fastq/GSM5159836/wt_2_R1.fastq.gz
├── wt_2_R2.fastq.gz -> merged_fastq/GSM5159836/wt_2_R2.fastq.gz
├── wt_3_R1.fastq.gz -> merged_fastq/GSM5159837/wt_3_R1.fastq.gz
└── wt_3_R2.fastq.gz -> merged_fastq/GSM5159837/wt_3_R2.fastq.gz

This should add some convinience for your subsequent process.

These command lines should suit your need in most situations, for those who want more flexiblity and control to the underlying snakemake workflow, you can append your argument to the -s option of the smk subcommand; or you can directly use the snakemake file in this repo.

For other advanced use you can always use --help, or read the source code.

It will automatically try to download the file, check md5, retry if file integrity check failed, and merge the files if the number of files is more than 2, finally rename the files to the description you provided.

prepare the info.tsv like this: note the file must be tab delimited (tsv file), you can simply achieve this by paste it from the Excel or GEO website. Or from SRA Run Selector downloaded csv file.

GSM12345  h3k9me3_rep1
GSM12345  h3k9me3_rep2

Notice for Commonly Encountered Problems

  1. error from ascp saying failed to authenticate:
  • It can be a network issue according to this issue on github or a server issue of EBI this post on biostar
  • If you have encountered this problem, please try to delete the download target folder and change the --download-backend argument to wget to use ftp links.

Todo

  • test for paired-end reads run merge
  • publish to bioconda
  • if fail, retry
  • use dag to run the pipeline (sort of, implemented by using snakemake)
  • option to resume download when md5 not match
  • option to continue from last time download
  • implement second level parallelization

Known Issues

  • Will fail to download the files contains both paired-end reads and single-end reads. (yes it exists).

Update Content

  • 0.4.4:
    • Bump version to trigger pypi readme update
    • Fix version number.
  • 0.4.3:
    • Update readme.
    • Breakdown the download process to two steps and add new download backend and wget.
  • 0.3.2:
    • Add filter for library layout (some sra entry has content mismatches its library layout)

Project details


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

fastq-downloader-0.4.4.tar.gz (14.3 kB view details)

Uploaded Source

Built Distribution

fastq_downloader-0.4.4-py3-none-any.whl (15.1 kB view details)

Uploaded Python 3

File details

Details for the file fastq-downloader-0.4.4.tar.gz.

File metadata

  • Download URL: fastq-downloader-0.4.4.tar.gz
  • Upload date:
  • Size: 14.3 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: poetry/1.1.13 CPython/3.10.2 Linux/5.10.102.1-microsoft-standard-WSL2

File hashes

Hashes for fastq-downloader-0.4.4.tar.gz
Algorithm Hash digest
SHA256 9a0bdd1266d5ecf8b827e60890492f65ca6fb02bf80d1c5db834580ad3422bb9
MD5 543aadfd6a5c6cd7308ee37a073d56c0
BLAKE2b-256 cadd8270505ee48f708029600af086dd2e871d69194f580cb9c28f329ef66c37

See more details on using hashes here.

File details

Details for the file fastq_downloader-0.4.4-py3-none-any.whl.

File metadata

  • Download URL: fastq_downloader-0.4.4-py3-none-any.whl
  • Upload date:
  • Size: 15.1 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: poetry/1.1.13 CPython/3.10.2 Linux/5.10.102.1-microsoft-standard-WSL2

File hashes

Hashes for fastq_downloader-0.4.4-py3-none-any.whl
Algorithm Hash digest
SHA256 fef29b89c49b76c4b8ca1cfe4b07411292ea7404392dad068db6d4e22b6f67f6
MD5 211fb805c16f833ddafd3cbdbd51f770
BLAKE2b-256 296e891e942811df196a695c8993098604a23924736c0e78628aec135d4df02d

See more details on using hashes here.

Supported by

AWS AWS Cloud computing and Security Sponsor Datadog Datadog Monitoring Fastly Fastly CDN Google Google Download Analytics Microsoft Microsoft PSF Sponsor Pingdom Pingdom Monitoring Sentry Sentry Error logging StatusPage StatusPage Status page