hugo-unifier

This Python package unifies gene symbols across datasets based on the HUGO database.

Installation

The package can be installed via pip, or any other Python package manager.

pip install hugo-unifier

Usage

The package can be used both as a command line tool and as a library. It operates in a two-step process:

1. Take the symbols from the input data and create a list of operations to unify them, including a reason for each change.
2. Apply the operations to the input data.

Command Line Tool

hugo-unifier get --outdir . test1.h5ad test2.h5ad

This creates two files in the current directory, test1_changes.csv and test2_changes.csv. Inspect them manually to see which changes will be made and the reason for each change.

The command line tool can also be used to apply the changes to the input data:

hugo-unifier apply --input test1.h5ad --changes test1_changes.csv --output test1_unified.h5ad
hugo-unifier apply --input test2.h5ad --changes test2_changes.csv --output test2_unified.h5ad

Library

Similar to the command line tool, the library can be used to get the changes and apply them to the input data.

from hugo_unifier import get_changes, apply_changes
import anndata as ad

adata_test1 = ad.read_h5ad("test1.h5ad")
adata_test2 = ad.read_h5ad("test2.h5ad")

dataset_symbols = {
    "test1": adata_test1.var.index.tolist(),
    "test2": adata_test2.var.index.tolist(),
}

# Get the changes
G, sample_changes = get_changes(dataset_symbols)

changes_test1 = sample_changes["test1"]
changes_test2 = sample_changes["test2"]

# Apply the changes
adata_test1_unified = apply_changes(adata_test1, changes_test1)
adata_test2_unified = apply_changes(adata_test2, changes_test2)

How it works

Step 1: Get HUGO data for symbols while applying manipulations

The first step is to get the HUGO data for the symbols in the input data. However, symbols sometimes contain artifacts, such as dots in place of dashes or a dot followed by a version number. Because the HUGO database usually does not recognize such symbols, we manipulate each symbol until the database returns a result. The manipulations are tried in the following order:

1. Keep the symbol as-is
2. Replace dots with dashes
3. Remove everything after the first dot

If one of the manipulations returns a result for a given symbol, we do not try the others for that symbol. Notably, we start with the most conservative approach, keeping the symbol as-is, and only try the other manipulations if that fails.
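The fallback order can be sketched in a few lines of Python. This is only an illustration, not the package's implementation: the function name `first_hit`, the manipulation labels, and the `known_symbols` set (a toy stand-in for a real HUGO lookup) are all assumptions.

```python
# Manipulations in the order described above. The labels are illustrative.
MANIPULATIONS = [
    ("identity", lambda s: s),
    ("dot_to_dash", lambda s: s.replace(".", "-")),
    ("discard_after_dot", lambda s: s.split(".", 1)[0]),
]


def first_hit(symbol, known):
    """Return (label, manipulated symbol) for the first manipulation that hits.

    `known` is a toy stand-in for the HUGO database (hypothetical).
    Returns None if no manipulation yields a known symbol.
    """
    for label, manipulate in MANIPULATIONS:
        candidate = manipulate(symbol)
        if candidate in known:
            # Stop at the first, most conservative manipulation that works.
            return label, candidate
    return None


known_symbols = {"HLA-A", "TP53"}

print(first_hit("TP53", known_symbols))    # ('identity', 'TP53')
print(first_hit("HLA.A", known_symbols))   # ('dot_to_dash', 'HLA-A')
print(first_hit("TP53.1", known_symbols))  # ('discard_after_dot', 'TP53')
print(first_hit("FOO.1", known_symbols))   # None
```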

Step 2: Build a symbol graph

Different symbols can sometimes have quite complex relationships. For example, a symbol can be an alias or a previous symbol for multiple other symbols, or a symbol can have multiple aliases or previous symbols. These relationships can be nicely visualized in a graph.

An example of this is shown here:

[Figure: Graph example]

Green nodes are approved symbols, blue ones are not.

The graph is constructed as follows:

1. Add a node for each of the following:
   - Original symbols from the input data
   - Manipulated symbols that arise within the process
   - Symbols returned by the HUGO database
2. In each node, store the datasets that contain a symbol with exactly that name
3. Draw edges for the following relationships:
   - Manipulations (e.g. dot to dash)
   - HUGO relations (Alias, Previous symbol, Approved symbol)
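As a rough sketch of the construction above, the graph can be modeled with plain dictionaries. Everything here is an assumption made for illustration: the function `build_symbol_graph`, the tuple layouts of its inputs, the edge direction (from the alias or manipulated symbol towards the symbol it points to), and the toy relation between STRA13 and CENPX. The package itself may use a dedicated graph library.

```python
def build_symbol_graph(dataset_symbols, manipulations, hugo_relations):
    """Build a directed symbol graph from toy inputs (illustrative only).

    dataset_symbols: dataset name -> list of symbols
    manipulations:   list of (original, label, manipulated) tuples
    hugo_relations:  list of (source, relation, target) tuples
    """
    nodes = {}  # symbol -> set of datasets that contain exactly this symbol
    edges = {}  # (source, target) -> relation label

    # Original symbols, annotated with the datasets they occur in.
    for dataset, symbols in dataset_symbols.items():
        for symbol in symbols:
            nodes.setdefault(symbol, set()).add(dataset)

    # Manipulated symbols and the manipulation edges leading to them.
    for original, label, manipulated in manipulations:
        nodes.setdefault(original, set())
        nodes.setdefault(manipulated, set())
        edges[(original, manipulated)] = label

    # Symbols returned by the database and their relations.
    for source, relation, target in hugo_relations:
        nodes.setdefault(source, set())
        nodes.setdefault(target, set())
        edges[(source, target)] = relation

    return nodes, edges


nodes, edges = build_symbol_graph(
    dataset_symbols={"test1": ["HLA.A", "STRA13"], "test2": ["HLA-A", "CENPX"]},
    manipulations=[("HLA.A", "dot_to_dash", "HLA-A")],
    hugo_relations=[("STRA13", "alias", "CENPX")],
)
print(sorted(nodes))
print(edges)
```

Note that the node "HLA-A" only records "test2": the dataset annotation follows the exact symbol name, not the manipulated one.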

Clean the graph

This includes only two steps:

1. Remove self-loops (edges from a node to itself)
2. Remove all nodes that meet the following conditions (and are thus irrelevant for the unification):
   - Node has exactly one incoming edge, which originates from an approved symbol
   - Node is an approved symbol that is not represented in the input data
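A minimal sketch of the two cleaning steps follows. It assumes that a node is removed only when both listed conditions hold at once, which is one reading of the description above. The function `clean_graph`, the node attribute names, and the placeholder symbols are hypothetical.

```python
def clean_graph(nodes, edges):
    """Return cleaned copies of nodes and edges (illustrative only).

    nodes: symbol -> {"approved": bool, "datasets": set of dataset names}
    edges: set of (source, target) pairs
    """
    # Step 1: drop self-loops.
    edges = {(s, t) for s, t in edges if s != t}

    def removable(name):
        incoming = [s for s, t in edges if t == name]
        return (
            len(incoming) == 1
            and nodes[incoming[0]]["approved"]   # single edge comes from an approved symbol
            and nodes[name]["approved"]          # node itself is approved
            and not nodes[name]["datasets"]      # and absent from the input data
        )

    # Step 2: drop irrelevant nodes together with their edges.
    drop = {name for name in nodes if removable(name)}
    kept_nodes = {n: attrs for n, attrs in nodes.items() if n not in drop}
    kept_edges = {(s, t) for s, t in edges if s not in drop and t not in drop}
    return kept_nodes, kept_edges


toy_nodes = {
    "GENE_A": {"approved": True, "datasets": {"d1"}},
    "GENE_B": {"approved": True, "datasets": set()},
    "OLD_A": {"approved": False, "datasets": {"d2"}},
}
toy_edges = {("GENE_A", "GENE_A"), ("GENE_A", "GENE_B"), ("OLD_A", "GENE_A")}

cleaned_nodes, cleaned_edges = clean_graph(toy_nodes, toy_edges)
print(sorted(cleaned_nodes))  # ['GENE_A', 'OLD_A']
print(cleaned_edges)          # {('OLD_A', 'GENE_A')}
```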

Step 3: Find unification opportunities

Two approaches are currently implemented; further ones can be added in the future.

Resolve unapproved symbols

Iterate over all nodes in the graph that represent unapproved symbols and try to find an optimal solution for them. The optimal solution is decided as follows:

- If the node has only one outgoing edge, the optimal solution is the target of that edge.
- If the node has multiple outgoing edges, we check whether the targets of the edges are represented in any dataset. If exactly one target is represented, we use that one. If several are, we mark the node as a conflict and do not resolve it. If none is, we do not resolve it either.
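The decision rule above might look roughly like this in Python. The function `choose_target`, its return format, and the status labels are assumptions for illustration and are not taken from the package.

```python
def choose_target(targets, datasets_by_symbol):
    """Pick the target for an unapproved symbol (illustrative only).

    targets:            symbols reachable via outgoing edges
    datasets_by_symbol: symbol -> set of datasets containing it
    Returns (target or None, status label).
    """
    if len(targets) == 1:
        return targets[0], "single_edge"

    # Keep only targets that occur in at least one dataset.
    represented = [t for t in targets if datasets_by_symbol.get(t)]
    if len(represented) == 1:
        return represented[0], "single_represented_target"
    if len(represented) > 1:
        return None, "conflict"
    return None, "unresolved"


datasets = {"GENE_A": {"d1"}, "GENE_B": set(), "GENE_C": {"d2"}}

print(choose_target(["GENE_B"], datasets))            # ('GENE_B', 'single_edge')
print(choose_target(["GENE_A", "GENE_B"], datasets))  # ('GENE_A', 'single_represented_target')
print(choose_target(["GENE_A", "GENE_C"], datasets))  # (None, 'conflict')
print(choose_target(["GENE_B", "GENE_D"], datasets))  # (None, 'unresolved')
```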

Now we have a source and a target node. Next, we check whether any dataset contains both the source symbol and the target symbol. If so, eliminating the source node could lose information. Thus, we do the following:

- If an overlap exists (like the "Devlin" dataset in the following example), copy the symbols that are exclusive to the source node to the target node
- If no overlap exists, we can safely remove the source node and rename all symbols from the source node to the target node
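The overlap check reduces to a set intersection. The sketch below assumes that "exclusive to the source node" means the datasets that contain the source symbol but not the target symbol. The function `plan_resolution`, the action labels, and the dataset names other than "Devlin" are hypothetical.

```python
def plan_resolution(source_datasets, target_datasets):
    """Decide between copying and renaming (illustrative only)."""
    overlap = source_datasets & target_datasets
    if overlap:
        # Some dataset has both symbols: keep the source and copy it to the
        # target only in datasets that have the source but not the target.
        return {"action": "copy", "datasets": source_datasets - target_datasets}
    # No dataset has both symbols: the source can simply be renamed everywhere.
    return {"action": "rename", "datasets": set(source_datasets)}


with_overlap = plan_resolution({"Devlin", "A"}, {"Devlin", "B"})
without_overlap = plan_resolution({"A"}, {"B"})

print(with_overlap)     # {'action': 'copy', 'datasets': {'A'}}
print(without_overlap)  # {'action': 'rename', 'datasets': {'A'}}
```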

Aggregate approved symbols

This resolves situations where one group of datasets contains one approved symbol and another group contains a different approved symbol, even though one is an alias of the other. The logic is as follows:

1. Iterate over all nodes representing approved symbols
2. Get all predecessors of the node
3. Get the union of the represented datasets of all predecessors and the node itself
4. Get the maximum number of datasets that are represented by any single predecessor or the node itself
5. Calculate the improvement ratio as the union size divided by the maximum size
6. If the improvement ratio is at least 1.5, copy the symbols from all predecessors to the node

In the example below, the STRA13 gene would be copied to CENPX for all samples that have CENPX but not STRA13. The union of the two nodes covers 9 datasets, and the larger of the two nodes, CENPX, covers 6. The improvement ratio is therefore 9 / 6, which is exactly 1.5, so the copy is done.

[Figure: Aggregation of approved symbols]
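The improvement ratio can be computed directly from the dataset sets. In this sketch the function `improvement_ratio` and the dataset names are made up. The numbers are chosen to reproduce the figures quoted for CENPX (6 datasets, union of 9), while the exact membership of STRA13 is an assumption.

```python
def improvement_ratio(node_datasets, predecessor_datasets):
    """Union size divided by the largest single group (illustrative only)."""
    groups = [node_datasets, *predecessor_datasets]
    union = set().union(*groups)
    largest = max(len(group) for group in groups)
    if largest == 0:
        return 0.0  # nothing is represented, so there is nothing to gain
    return len(union) / largest


cenpx = {f"dataset_{i}" for i in range(1, 7)}    # 6 datasets
stra13 = {f"dataset_{i}" for i in range(7, 10)}  # 3 further datasets (assumed)

ratio = improvement_ratio(cenpx, [stra13])
print(ratio)  # 1.5
```

The resulting ratio is then compared against the 1.5 threshold to decide whether the predecessors are copied to the node.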

Step 4: Provide change dataframe

All changes made to the graph are also stored in a dataframe that is made available to the user for inspection. Before the dataframe is returned, it is split into smaller per-dataset dataframes.

If hugo-unifier is used via the CLI, these dataframes are saved to the output directory. If hugo-unifier is used via the library, the dataframes are returned as a dictionary with the dataset names as keys and the dataframes as values.
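Splitting one change table into per-dataset tables is a standard pandas `groupby`. The column names below (`dataset`, `action`, `symbol`, `new`, `reason`) and the example rows are invented for this sketch; the actual columns produced by hugo-unifier may differ.

```python
import pandas as pd

# Hypothetical combined change table; column names are assumptions.
changes = pd.DataFrame(
    {
        "dataset": ["test1", "test2", "test1"],
        "action": ["rename", "copy", "copy"],
        "symbol": ["HLA.A", "STRA13", "STRA13"],
        "new": ["HLA-A", "CENPX", "CENPX"],
        "reason": ["dot to dash", "aggregate approved", "aggregate approved"],
    }
)

# One dataframe per dataset, keyed by dataset name, row order preserved.
per_dataset = {
    name: group.drop(columns="dataset").reset_index(drop=True)
    for name, group in changes.groupby("dataset", sort=False)
}

for name, frame in per_dataset.items():
    print(name)
    print(frame)
```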

Step 5: Apply changes to the input data

The entries of a per-dataset change dataframe are applied to the corresponding input dataset one by one, in the same order in which they were detected during the graph unification process.
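To illustrate the idea of applying entries in order, here is a simplified sketch that works on a plain list of symbols rather than on an AnnData object. The function `apply_in_order`, the action names "rename" and "copy", and the dictionary keys are assumptions and do not reflect the package's real change format.

```python
def apply_in_order(symbols, changes):
    """Apply change entries one by one to a list of symbols (illustrative only)."""
    result = list(symbols)
    for change in changes:
        old, new = change["symbol"], change["new"]
        if change["action"] == "rename":
            # Replace every occurrence of the old symbol.
            result = [new if s == old else s for s in result]
        elif change["action"] == "copy":
            # Add the new symbol next to the old one if it is not there yet.
            if old in result and new not in result:
                result.append(new)
    return result


toy_changes = [
    {"action": "rename", "symbol": "HLA.A", "new": "HLA-A"},
    {"action": "copy", "symbol": "STRA13", "new": "CENPX"},
]

unified = apply_in_order(["HLA.A", "STRA13", "TP53"], toy_changes)
print(unified)  # ['HLA-A', 'STRA13', 'TP53', 'CENPX']
```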

Project details

Release history

0.3.2

Mar 12, 2026

0.3.1

Jan 16, 2026

0.3.0

Aug 26, 2025

0.2.8

Jul 28, 2025

0.2.7

May 2, 2025

0.2.6

Apr 26, 2025

0.2.5

Apr 26, 2025

0.2.4

Apr 22, 2025

0.2.3

Apr 11, 2025

0.2.2

Apr 10, 2025

0.2.1

Apr 10, 2025

This version

0.2.0

Apr 10, 2025

0.1.2

Apr 8, 2025

0.1.1

Apr 8, 2025

0.1.0

Apr 8, 2025

Download files

Source Distribution

hugo_unifier-0.2.0.tar.gz (5.8 MB view details)

Uploaded Apr 10, 2025 Source

Built Distribution

hugo_unifier-0.2.0-py3-none-any.whl (12.9 kB view details)

Uploaded Apr 10, 2025 Python 3

File details

Details for the file hugo_unifier-0.2.0.tar.gz.

File metadata

Download URL: hugo_unifier-0.2.0.tar.gz
Upload date: Apr 10, 2025
Size: 5.8 MB
Tags: Source
Uploaded using Trusted Publishing? Yes
Uploaded via: twine/6.1.0 CPython/3.12.9

File hashes

Hashes for hugo_unifier-0.2.0.tar.gz
Algorithm	Hash digest
SHA256	`c187b34b7421ed1a3286dcb96624fb0ac8a9fc1a5a2cc703e2c073a067032552`
MD5	`0f8a0579768977f89169440ad7c658b2`
BLAKE2b-256	`c365e48f2e9aca5a7abc8fb3d0b4fbd2164f514bfeef73c07d970f77aff37b46`

Provenance

The following attestation bundles were made for hugo_unifier-0.2.0.tar.gz:

Publisher: ci.yml on Mye-InfoBank/hugo-unifier

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

Statement:
- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: hugo_unifier-0.2.0.tar.gz
- Subject digest: c187b34b7421ed1a3286dcb96624fb0ac8a9fc1a5a2cc703e2c073a067032552
- Sigstore transparency entry: 195084803
- Sigstore integration time: Apr 10, 2025
Source repository:
- Permalink: Mye-InfoBank/hugo-unifier@114b1b2f2a6818210a2678206192997af8c77c12
- Branch / Tag: refs/tags/v0.2.0
- Owner: https://github.com/Mye-InfoBank
- Access: public
Publication detail:
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: ci.yml@114b1b2f2a6818210a2678206192997af8c77c12
- Trigger Event: release

File details

Details for the file hugo_unifier-0.2.0-py3-none-any.whl.

File metadata

Download URL: hugo_unifier-0.2.0-py3-none-any.whl
Upload date: Apr 10, 2025
Size: 12.9 kB
Tags: Python 3
Uploaded using Trusted Publishing? Yes
Uploaded via: twine/6.1.0 CPython/3.12.9

File hashes

Hashes for hugo_unifier-0.2.0-py3-none-any.whl
Algorithm	Hash digest
SHA256	`36166f662c7dea3245847559a69749f37b68a13c99b552543237fc165c97a441`
MD5	`279074340132eb323e1845f13dfb7a0a`
BLAKE2b-256	`2118a3c13e416a3c3497a3191df2189767909328151dc608af41a4ed915610c9`

Provenance

The following attestation bundles were made for hugo_unifier-0.2.0-py3-none-any.whl:

Publisher: ci.yml on Mye-InfoBank/hugo-unifier

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

Statement:
- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: hugo_unifier-0.2.0-py3-none-any.whl
- Subject digest: 36166f662c7dea3245847559a69749f37b68a13c99b552543237fc165c97a441
- Sigstore transparency entry: 195084804
- Sigstore integration time: Apr 10, 2025
Source repository:
- Permalink: Mye-InfoBank/hugo-unifier@114b1b2f2a6818210a2678206192997af8c77c12
- Branch / Tag: refs/tags/v0.2.0
- Owner: https://github.com/Mye-InfoBank
- Access: public
Publication detail:
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: ci.yml@114b1b2f2a6818210a2678206192997af8c77c12
- Trigger Event: release
