A CLI for running multiple PBS/qsub jobs with HTSeq's htseq-count script on a cluster.
htseq-count-cluster
A CLI wrapper for running HTSeq's htseq-count on a cluster.
View documentation.
Install
Requires Python 3.9 or higher.
pip install HTSeqCountCluster
Features
- For use with large datasets (we've previously used a dataset of 120 different human samples)
- For use with SGE/SGI cluster systems
- Submits multiple jobs
- Command line interface/script
- Merges counts files into one counts table/csv file
- Uses the accepted_hits.bam file output of tophat
Examples
Run htseq-count-cluster
After generating BAM output files from tophat, you can use our htseq-count-cluster
script instead of calling HTSeq's htseq-count directly. This script is intended for
clusters that use PBS (qsub) to schedule jobs.
Our default htseq-count command is htseq-count -f bam -s no file.bam file.gtf -o htseq.out.
This command reads BAM input files (-f bam), treats the data as unstranded (-s no), and uses the default union mode. For the default mode union, only the aligned read determines how the read pair is counted.
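As a rough illustration of how that default command could be assembled for a single sample, the sketch below builds the argument list in Python. The helper name `build_htseq_command` and the example paths are hypothetical and are not part of the package; only the flags shown in the command above are used.

```python
from pathlib import Path


def build_htseq_command(bam_file, gtf_file, out_file="htseq.out"):
    """Return the README's default htseq-count command as an argument list.

    Hypothetical helper for illustration only; the package may assemble
    its command differently.
    """
    return [
        "htseq-count",
        "-f", "bam",      # input files are BAM
        "-s", "no",       # treat the data as unstranded
        str(bam_file),
        str(gtf_file),
        "-o", str(out_file),
    ]


# Example paths are placeholders.
cmd = build_htseq_command(Path("sample1") / "accepted_hits.bam", "genes.gtf")
print(" ".join(cmd))
```

Keeping the command as a list rather than a single string avoids quoting problems if a sample folder name contains spaces.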
Legacy mode (still supported):
htseq-count-cluster -p path/to/bam-files/ -f samples.csv -g genes.gtf -o path/to/cluster-output/
New subcommand mode:
htseq-count-cluster run -p path/to/bam-files/ -f samples.csv -g genes.gtf -o path/to/cluster-output/
| Argument | Description | Required |
|---|---|---|
| -p | This is the path of your .bam files. Presently, this script looks for a folder that is the sample name and searches for an accepted_hits.bam file (tophat output). | Yes |
| -f | You should have a csv file list of your samples or folder names (no header). | Yes |
| -g | This should be the path to your genes.gtf file. | Yes |
| -o | This should be an existing directory for your output counts files. | Yes |
| -e | Email address to send script completion notifications to. | No |
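Before submitting jobs, it can help to confirm that the input layout described in the table is in place. The following is a minimal sketch, assuming one sample name per row in the first column of the csv file and one folder per sample under the -p path; the function name `find_bam_files` is hypothetical and not part of the package.

```python
import csv
from pathlib import Path


def find_bam_files(inpath, infile):
    """Map each sample in the CSV to its expected accepted_hits.bam path.

    Returns (found, missing): a dict of sample name -> Path for BAM files
    that exist, and a list of sample names whose BAM file was not found.
    Hypothetical helper; the package's own lookup may differ.
    """
    found, missing = {}, []
    with open(infile, newline="") as handle:
        for row in csv.reader(handle):
            if not row or not row[0].strip():
                continue  # skip blank lines
            sample = row[0].strip()
            bam = Path(inpath) / sample / "accepted_hits.bam"
            if bam.is_file():
                found[sample] = bam
            else:
                missing.append(sample)
    return found, missing
```

Calling `find_bam_files("path/to/bam-files/", "samples.csv")` and checking that the second return value is empty gives a quick sanity check before any cluster time is used.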
This script uses logzero, so color-coded logging information is printed to your shell.
A common Linux practice is to run a long-lived program inside screen, so that
even if it writes output to stdout, you can detach from that shell without
ending the program and keep working in another shell.
Help message output for htseq-count-cluster
usage: htseq-count-cluster [-h] COMMAND ...
This is a command line wrapper around htseq-count.
positional arguments:
  COMMAND
    run       Run htseq-count jobs on a cluster
    merge     Merge multiple counts tables into one CSV file
optional arguments:
  -h, --help  show this help message and exit
*Ensure that htseq-count is in your path.
For the run subcommand:
usage: htseq-count-cluster run [-h] -p INPATH -f INFILE -g GTF -o OUTPATH [-e EMAIL]
Submit multiple htseq-count jobs to a cluster.
optional arguments:
  -h, --help            show this help message and exit
  -p INPATH, --inpath INPATH
                        Path of your samples/sample folders.
  -f INFILE, --infile INFILE
                        Name or path to your input csv file.
  -g GTF, --gtf GTF     Name or path to your gtf/gff file.
  -o OUTPATH, --outpath OUTPATH
                        Directory of your output counts file. The counts file
                        will be named.
  -e EMAIL, --email EMAIL
                        Email address to send script completion to.
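To give a sense of what submitting one job per sample to PBS involves, here is a minimal sketch that generates the text of a job script. It is an assumption about the general shape of such a script, not the script this package actually writes; the helper name `make_pbs_script` and the job name prefix are hypothetical. The `#PBS -N`, `-j`, `-M` and `-m` directives and the `PBS_O_WORKDIR` variable are standard PBS features.

```python
def make_pbs_script(sample, command, email=None):
    """Return the text of a minimal PBS job script that runs `command`.

    Hypothetical helper for illustration; the directives the package
    writes may differ.
    """
    lines = [
        "#!/bin/bash",
        f"#PBS -N htseq_{sample}",  # job name
        "#PBS -j oe",               # merge stderr into stdout
    ]
    if email:
        lines.append(f"#PBS -M {email}")  # address for notifications
        lines.append("#PBS -m e")         # send mail when the job ends
    lines.append("cd $PBS_O_WORKDIR")     # run from the submission directory
    lines.append(command)
    return "\n".join(lines) + "\n"


# Placeholder sample name, command and address.
script = make_pbs_script(
    "sample1",
    "htseq-count -f bam -s no file.bam file.gtf -o htseq.out",
    email="user@example.com",
)
print(script)
```

A script like this would be saved to a file and submitted with qsub, once per sample.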
Merge output counts files
To prepare your data for DESeq2, limma or edgeR, it's best to have one merged
counts file instead of the multiple files produced by the htseq-count-cluster script.
Using the merge subcommand:
htseq-count-cluster merge -d path/to/cluster-output/
Or using the standalone command (still available):
merge-counts -d path/to/cluster-output/
Help message for merge subcommand
usage: htseq-count-cluster merge [-h] -d DIRECTORY
Merge multiple counts tables into 1 counts .csv file.
Your output file will be named: merged_counts_table.csv
optional arguments:
  -h, --help            show this help message and exit
  -d DIRECTORY, --directory DIRECTORY
                        Path to folder of counts files.
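For readers curious about what such a merge involves, the sketch below combines several counts files into one gene-by-sample table using only the standard library. It is not the package's implementation. It assumes each counts file is the usual two-column, tab-separated htseq-count output (feature ID, count), that summary rows start with `__` (for example `__no_feature`), and that files can be found with the hypothetical `*.out` pattern; the function name `merge_counts` is also hypothetical. The output file name matches the one given in the help message above.

```python
import csv
from pathlib import Path


def merge_counts(directory, pattern="*.out", output="merged_counts_table.csv"):
    """Merge two-column htseq-count outputs into one gene-by-sample CSV.

    Hypothetical helper; `pattern` is an assumed naming scheme.
    """
    directory = Path(directory)
    tables = {}
    for path in sorted(directory.glob(pattern)):
        counts = {}
        with open(path, newline="") as handle:
            for row in csv.reader(handle, delimiter="\t"):
                if len(row) < 2 or row[0].startswith("__"):
                    continue  # skip blank lines and htseq-count summary rows
                counts[row[0]] = int(row[1])
        tables[path.stem] = counts  # column name is the file name without suffix

    samples = list(tables)
    genes = sorted(set().union(*tables.values()))
    out_path = directory / output
    with open(out_path, "w", newline="") as handle:
        writer = csv.writer(handle)
        writer.writerow(["gene_id"] + samples)
        for gene in genes:
            # A gene absent from one sample's file is written as 0.
            writer.writerow([gene] + [tables[s].get(gene, 0) for s in samples])
    return out_path
```

Filling missing genes with 0 is a design choice made for this sketch; when every sample is counted against the same GTF file, each file should list the same features and no filling is needed.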
ToDo
- Monitor jobs.
- Enhance wrapper input for other use cases.
- Add example data.
Maintainers
Shaurita Hutchins | @sdhutchins | ✉
Rob Gilmore | @grabear | ✉
Help
Please feel free to open an issue if you have a question/feedback/problem or submit a pull request to add a feature/refactor/etc. to this project.
Citation
Simon Anders, Paul Theodor Pyl, Wolfgang Huber; HTSeq—a Python framework to work with high-throughput sequencing data, Bioinformatics, Volume 31, Issue 2, 15 January 2015, Pages 166–169, https://doi.org/10.1093/bioinformatics/btu638