Skip to main content

Expression-aware gene assignment of ROSE output.

Project description

rosewater

PyPI version

rosewater assigns genes to (super-) enhancer output from ROSE in an expression-aware manner. It allows users to set a TPM threshold to filter genes that are not expressed on a sample-by-sample basis.

Installation

rosewater can be installed via pip. For use, it requires bedtools be available on your PATH.

pip install rosewater

Usage

rosewater is fairly simple to use. It requires an annotation GTF file, a TSV file of TPMs with the gene name column named 'gene' (these should match the 'gene_name' attributes in the annotation GTF), the name of the sample column in the TPM file, and an output file from ROSE. Optionally, users can set a TPM threshold (set to 5 by default) for filtering out lowly/non-expressed genes prior to assignment.

Usage: rosewater [OPTIONS]

  rosewater assigns genes to ROSE output in an expression-aware manner.

Options:
  -a, --annotation PATH       Path to annotation GTF file.  [required]
  -s, --sample TEXT           Sample name that should match a column in the
                              TPM file.  [required]

  -e, --enh_file PATH         Output from ROSE ending with
                              'ENHANCER_TO_GENE.txt'.  [required]

  -t, --tpm_file PATH         A file containing a matrix of TPMs with genes as
                              rows and samples as columns. The gene label
                              column should be named 'gene'.  [required]

  -th, --tpm_threshold FLOAT  The minimum TPM to retain genes for assignment.
                              [default: 5]

  -o, --output_dir PATH       The output directory.  [default:
                              ./EnhancerGeneAssignments]

  --version                   Show the version and exit.
  -h, --help                  Show this message and exit.

Output

Two output files will be generated, named after the ROSE enhancer input file appended with either .rosewater.GeneAssignment.log or .rosewater.GeneAssignment.bed. The log file will contain useful stats such as how many TSSes are filtered by the TPM threshold, how many TSSes overlap each enhancer, the average enhancer size, and how many assignments change from the original ROSE assignments.

The BED-like file will contain the assignments for each enhancer. Two assignments are made for each enhancer - one utilizing all TSSes in the annotation file that meet the TPM threshold and another utilizing only the protein-coding TSSes. These assignments are the last 4 columns of the file. The additional columns are fairly self-explanatory. In short, they contain the overlapping TSSes, the closest TSS using all transcripts that meet the TPM threshold, the closest TSS using only protein-coding transcripts that meet the TPM threshold, and the TPMs for each of those.

Assignment Logic

The original ROSE gene mapper just assigns the TSS that is closest to the center of the enhancer. rosewater takes a more sophisticated (and therefore complicated approach):

  • If the enhancer overlaps no TSSes for a gene that meets the TPM threshold:
    • The "final_allgene_assignment" will be set to the gene that meets the TPM threshold for the closest TSS while "final_proteincoding_assignment" will be set to the gene that meets the TPM threshold for the closest 'protein_coding' TSS.
  • If the enhancer overlaps a single TSS for a gene that meets the TPM threshold:
    • If the 'gene_type' of the gene is protein_coding, the "final_allgene_assignment" and "final_proteincoding_assignment" will both be set to that gene.
    • If the 'gene_type' of the gene is not protein_coding, the "final_allgene_assignment" will be set to that gene while the "final_proteincoding_assignment" will be set to the gene for the closest 'protein_coding' TSS.
  • If the enhancer overlaps two or more TSS for a gene that meets the TPM threshold:
    • If the 'gene_type' of the most highly-expressed gene is protein_coding, the "final_allgene_assignment" and "final_proteincoding_assignment" will both be set to that gene.
    • If the 'gene_type' of the most highly-expressed gene is not protein_coding, the "final_allgene_assignment" will be set to that gene while the "final_proteincoding_assignment" will be set to the most highly-expressed overlapping 'protein_coding' gene. If there are no overlapping TSSes for 'protein_coding' genes, the "final_proteincoding_assignment" will be set to the gene for the closest 'protein_coding' TSS.

Users are free to use whichever assignment they feel is most appropriate for their use case.

Known Issues

Users may get a warning like RuntimeWarning: line buffering (buffering=1) isn't supported in binary mode depending on their version of python. This can be safely ignored. It stems from pybedtools and should be fixed in their next release.

Contributing

Feel free to submit a pull request or open an issue if you have an idea to enhance this tool.

License

rosewater is available under the GNU-GPLv3 license. It is provided as-is, with no warranty or guarantees.

Project details


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

rosewater-0.1.1.tar.gz (6.8 kB view details)

Uploaded Source

Built Distribution

rosewater-0.1.1-py3-none-any.whl (19.4 kB view details)

Uploaded Python 3

File details

Details for the file rosewater-0.1.1.tar.gz.

File metadata

  • Download URL: rosewater-0.1.1.tar.gz
  • Upload date:
  • Size: 6.8 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/3.4.1 importlib_metadata/3.7.3 pkginfo/1.7.0 requests/2.24.0 requests-toolbelt/0.9.1 tqdm/4.33.0 CPython/3.8.5

File hashes

Hashes for rosewater-0.1.1.tar.gz
Algorithm Hash digest
SHA256 00f7068894ff1dfb38354e2dd11b41c5b28ca45177774f09b9d6d75b9e3c3092
MD5 38dd5e9c6974e47b5d49404d1e283a70
BLAKE2b-256 6a9b9fca1bf86de75aed406f86a87d289a3ead2558c04fb120d360fe462f52aa

See more details on using hashes here.

File details

Details for the file rosewater-0.1.1-py3-none-any.whl.

File metadata

  • Download URL: rosewater-0.1.1-py3-none-any.whl
  • Upload date:
  • Size: 19.4 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/3.4.1 importlib_metadata/3.7.3 pkginfo/1.7.0 requests/2.24.0 requests-toolbelt/0.9.1 tqdm/4.33.0 CPython/3.8.5

File hashes

Hashes for rosewater-0.1.1-py3-none-any.whl
Algorithm Hash digest
SHA256 82a52126578aecb0fcc920c1a5cae522b37fd6f933dba0581bbea89c03d87ce6
MD5 55714a1e2c22828b5d2b7704029e8450
BLAKE2b-256 9dd03df2796b50a8c8f8e5609f7738ae2fb4b38167ddd1013cbd50a4de7b7a65

See more details on using hashes here.

Supported by

AWS AWS Cloud computing and Security Sponsor Datadog Datadog Monitoring Fastly Fastly CDN Google Google Download Analytics Microsoft Microsoft PSF Sponsor Pingdom Pingdom Monitoring Sentry Sentry Error logging StatusPage StatusPage Status page