Compute window aggregations and alter contents of Amethyst HDF5 files

Project description

Facet is an efficient utility for computing window aggregations on Amethyst HDF5 files produced via the premethyst pipeline.

Create environment

Install facet.py dependencies using mamba:

mamba create -n facet python=3.12 numpy click polars parse h5py duckdb

Compute Window Aggregations

python facet.py agg adds window aggregations to an existing Amethyst HDF5 file in the version 2.0.0 format (see below for information on file format conversion).

Example:

python facet.py agg -u 500 -u step_1000=1000:250 -w special_fancy_windows=windows.tsv -p 55 *.h5

This computes several types of windows.

  • -u 500 computes uniform non-overlapping 500bp windows. These will be stored in /[context]/[barcode]/[window_size] by default. A custom name can be chosen by prepending it to the window size, e.g. -u [dataset_name]=500.
  • -u step_1000=1000:250 computes 1000bp windows with a 250bp step, so intervals will be computed at $[0, 1000), [250, 1250), ...$. This example uses a custom name of step_1000. The default is to use [window_size]_by_[step_size], which in this case would have been 1000_by_250.
  • -w special_fancy_windows=windows.tsv computes aggregations over custom windows defined in a CSV-like file. The headers chr, start and end are required, but the delimiter is sniffed by DuckDB (CSV, TSV, etc. are allowed). Intervals are left-closed right-open, i.e. $[start, end)$, and may overlap or leave gaps.
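
The uniform window layout and default dataset names described above can be illustrated with a minimal Python sketch. This is not facet.py's own code: the function names and the chrom_length parameter are hypothetical, and the sketch does not clip the last window at the chromosome end, which facet.py may or may not do.

```python
from typing import Iterator, Optional, Tuple


def uniform_windows(
    chrom_length: int, size: int, step: Optional[int] = None
) -> Iterator[Tuple[int, int]]:
    """Yield left-closed, right-open [start, end) windows along one chromosome.

    chrom_length is a hypothetical parameter used only to stop the iteration.
    """
    # Without an explicit step, windows are non-overlapping (step == size).
    step = size if step is None else step
    if size <= 0 or step <= 0:
        raise ValueError("size and step must be positive")
    for start in range(0, chrom_length, step):
        yield start, start + size


def default_name(size: int, step: Optional[int] = None) -> str:
    """Default dataset name as described above: '500' or '1000_by_250'."""
    return str(size) if step is None else f"{size}_by_{step}"


# -u 500 on a 1000bp chromosome
print(list(uniform_windows(1000, 500)), default_name(500))
# -u 1000:250, first two windows
print(list(uniform_windows(600, 1000, 250))[:2], default_name(1000, 250))
```
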

The -p 55 option parallelizes the computation using 55 worker cores. In this case, windows are computed for every HDF5 file matched by *.h5. Multiple globs can be specified, e.g. -glob path1/*.h5 -glob path2/*.h5.
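
As a rough illustration of how several glob patterns combine into one list of input files, here is a small sketch using Python's standard glob module. The helper name expand_globs is hypothetical, and whether facet.py sorts or de-duplicates its inputs in this way is an assumption.

```python
import glob
from typing import Iterable, List


def expand_globs(patterns: Iterable[str]) -> List[str]:
    """Expand several glob patterns into one sorted list without duplicates."""
    paths = set()
    for pattern in patterns:
        # glob.glob returns every existing path matching the pattern
        paths.update(glob.glob(pattern))
    return sorted(paths)


if __name__ == "__main__":
    # A file matched by both patterns is listed only once.
    for path in expand_globs(["path1/*.h5", "path2/*.h5"]):
        print(path)
```
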

Other options are described in python facet.py agg --help.

Help

The options for facet.py can be explored at the command line by appending --help.

Example:

$ python facet.py --help
Usage: facet.py [OPTIONS] COMMAND [ARGS]...

Options:
  --help  Show this message and exit.

Commands:
  agg      Compute window sums over methylation observations stored in...
  convert  Convert an old Amethyst HDF5 file format to v2.0.0 format
  delete   Delete contexts, barcodes, or datasets from an Amethyst 2.0.0...
  version

You can also call --help on subcommands. Example:

python facet.py agg --help

Convert old Amethyst HDF5 file format to version 2.0.0

Amethyst HDF5 files produced by earlier scripts must be converted to the version 2.0.0 format before facet.py can compute window aggregations on them.

Example:

python facet.py convert old_format.h5 new_format.h5

Explanation and schema comparison:

The old Amethyst HDF5 format stored datasets under a cell barcode under a context group:

/[context]/[barcode]

context values are typically CH and CG. The barcode values are unique identifiers attributed to single cells. Typically each value of barcode is found in both the CH and CG contexts.

Each barcode dataset had the schema chr, pos, pct, c, t, where chr is the chromosome name, pos is the bp position of the observation, pct equals c/(c+t), and c and t are the methylated and unmethylated counts at that position.

This gave no clear way to store window aggregations alongside the bp-resolution observations. We therefore altered the schema to:

/[context]/[barcode]/[dataset]

The bp-resolution observations are stored under the dataset named 1 by default. Window aggregations are stored under the same context and barcode under other names. The schema for window aggregations is chr, start, end, c, t, c_nz, t_nz. The start and end values denote the interval $[start, end)$. The c and t values store the sums of the c and t counts over the observed positions in that interval. Intervals with no observations are not reported. The c_nz and t_nz fields store the number of positions where c >= 1 and t >= 1, respectively.
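
To make the aggregation schema concrete, the following sketch computes chr, start, end, c, t, c_nz, t_nz rows from a handful of bp-resolution observations. It is a brute-force illustration under stated assumptions, not facet.py's implementation: the Observation and WindowRow types and the aggregate function are hypothetical names, and the pct column is omitted because it is not needed for the sums.

```python
from typing import Iterable, List, NamedTuple, Tuple


class Observation(NamedTuple):
    chr: str
    pos: int
    c: int  # methylated count at this position
    t: int  # unmethylated count at this position


class WindowRow(NamedTuple):
    chr: str
    start: int
    end: int
    c: int
    t: int
    c_nz: int
    t_nz: int


def aggregate(
    observations: Iterable[Observation],
    windows: Iterable[Tuple[str, int, int]],
) -> List[WindowRow]:
    """Sum c and t over each [start, end) window; skip empty windows."""
    obs = list(observations)
    rows = []
    for chrom, start, end in windows:
        # Left-closed, right-open: pos == start is inside, pos == end is not.
        hits = [o for o in obs if o.chr == chrom and start <= o.pos < end]
        if not hits:
            continue  # intervals with no observations are not reported
        rows.append(
            WindowRow(
                chrom, start, end,
                sum(o.c for o in hits),
                sum(o.t for o in hits),
                sum(1 for o in hits if o.c >= 1),  # c_nz
                sum(1 for o in hits if o.t >= 1),  # t_nz
            )
        )
    return rows


obs = [
    Observation("chr1", 10, 2, 0),
    Observation("chr1", 499, 0, 3),
    Observation("chr1", 500, 1, 1),
]
windows = [("chr1", 0, 500), ("chr1", 500, 1000), ("chr1", 1000, 1500)]
for row in aggregate(obs, windows):
    print(row)
```

Here the window [0, 500) yields c=2, t=3, c_nz=1, t_nz=1, the window [500, 1000) yields c=1, t=1, c_nz=1, t_nz=1, and [1000, 1500) is dropped because it contains no observations. Because windows are passed in explicitly, the same function covers overlapping or gapped custom windows.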

Delete datasets

Examples:

python facet.py delete context CH *.h5
python facet.py delete barcode AGCGAGCGAGCAHHCAHH *.h5
python facet.py delete dataset 1 *.h5
