No-code cleaning of messy gene/protein identifier tables, fully offline, with explicit ambiguity handling.

gene-tidy

Clean messy gene/protein identifier tables — fully offline, fully audited, no code required.

You copy a gene list out of a paper or a supplementary Excel file and half of it is garbage: p53 instead of TP53, SEPT2 silently turned into 2-Sep, dead symbols, IDs in five different formats. Drop the file in — get back clean HGNC symbols (plus Ensembl / UniProt / Entrez / RefSeq), an audit CSV, and a paste-ready methods paragraph. Ambiguous cases are flagged, never guessed, and nothing is dropped.

pip install gene-tidy
gene-tidy messy_genes.csv --out outputs/   # -> clean / ambiguous / failed + audit + methods

What that looks like

You typed	gene-tidy gives you
`p53`	→ TP53 (resolved from alias)
`HER2`	→ ERBB2 (resolved from alias)
`FRAP1`	→ MTOR (resolved from previous symbol)
`ENSG00000141510`	→ TP53 (Ensembl gene ID)
`Sep-7`	→ SEPTIN7 (recovered from Excel date corruption)
`2-Sep`	⚠️ ambiguous: SEPTIN2 / SEPTIN6 — flagged for review, not guessed
`FOOBAR1`	❌ no match → `failed_rows.csv` (nothing silently dropped)

Inspired by HGNChelper (R), but in Python, mapping to all major IDs, with explicit ambiguity handling and Excel date-corruption recovery (SEPT2 → "2-Sep", MARCH1 → "1-Mar").
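To make the date-corruption recovery concrete, here is a minimal sketch of how a date-like cell could be turned back into candidate legacy symbols. It is not gene-tidy's internal code: the `MONTH_TO_PREFIXES` table, the `DATE_LIKE` pattern, and the `legacy_candidates` function are all hypothetical names chosen for illustration, and the prefix table covers only the three months mentioned in this README.

```python
import re

# Month abbreviations Excel emits, mapped to the legacy symbol prefixes they may hide.
# Illustrative assumption only; not gene-tidy's internal table.
MONTH_TO_PREFIXES = {
    "Sep": ["SEPT"],
    "Mar": ["MARCH", "MARC"],
    "Dec": ["DEC"],
}

# Matches both "2-Sep" (day-month) and "Sep-7" (month-day) renderings.
DATE_LIKE = re.compile(r"^(?:(\d{1,2})-([A-Za-z]{3})|([A-Za-z]{3})-(\d{1,2}))$")


def legacy_candidates(value: str) -> list[str]:
    """Return legacy gene symbols that Excel could have turned into `value`."""
    m = DATE_LIKE.match(value.strip())
    if not m:
        return []
    number = m.group(1) or m.group(4)
    month = (m.group(2) or m.group(3)).capitalize()
    return [f"{prefix}{int(number)}" for prefix in MONTH_TO_PREFIXES.get(month, [])]


print(legacy_candidates("2-Sep"))  # ['SEPT2']
print(legacy_candidates("Sep-7"))  # ['SEPT7']
print(legacy_candidates("1-Mar"))  # ['MARCH1', 'MARC1']
print(legacy_candidates("TP53"))   # []
```

The legacy symbols are only an intermediate step: each one still has to be looked up in the HGNC previous-symbol and alias fields, and that lookup is where a single date string can fan out to more than one approved gene.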

Scope & limitations

gene-tidy is HGNC-centered, offline, and reproducible: it standardises human gene/protein identifiers against a bundled static HGNC complete set and records exactly which version it used. If you need a fast, offline, auditable HGNC cleanup you can cite in a methods section, this is for you.

What it deliberately does not do — use BioMart / VEP / UniProt for these:

Human only. No other species.
Static, bundled HGNC snapshot. No live lookups, no API keys, no always-current data; the exact version is pinned and recorded on every run.
Gene-level only. Ensembl transcript/protein IDs (ENST… / ENSP…) are detected but not resolved offline; they are flagged for manual review.
No variant or clinical interpretation. No HGVS / ClinVar / VEP / gnomAD / liftover / genome-build detection.
No reinterpretation of numeric Excel date serials (e.g. 44075) — they are indistinguishable from Entrez IDs and are intentionally left untouched.
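The last limitation is easier to see with a worked example. The sketch below uses only the standard library and assumes Excel's default 1900 date system; `excel_serial_to_date` is a hypothetical helper, not part of gene-tidy. The same string reads as a digits-only Entrez-style ID and as a perfectly plausible date, so there is no safe way to tell which one the author meant.

```python
from datetime import date, timedelta

# Excel's 1900 date system effectively counts days from this epoch
# (valid for serials after February 1900).
EXCEL_EPOCH = date(1899, 12, 30)


def excel_serial_to_date(serial: int) -> date:
    """Interpret an integer as an Excel date serial."""
    return EXCEL_EPOCH + timedelta(days=serial)


value = "44075"
print(value.isdigit())                   # True: has the shape of an Entrez Gene ID
print(excel_serial_to_date(int(value)))  # 2020-09-01: also a plausible date
```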

Why

Works offline, out of the box. The full HGNC complete set (all Approved gene records) ships inside the package as a gzipped TSV. No network, no API keys, no surprises — and the exact HGNC version is recorded in every run.
Never guesses silently. One-to-many or uncertain mappings are flagged ambiguous / manual_review_required and routed to a separate file.
Never drops a row. Clean, ambiguous, and failed rows are all accounted for.
Reproducible. Every run emits a methods_text.txt paragraph (tool version, HGNC version + date) ready to paste into a supplementary methods section.

Install

pip install gene-tidy

Requires Python 3.10+. Dependencies: pandas, openpyxl, typer.

To install the latest development version directly from GitHub:

pip install git+https://github.com/MargoSolo/gene-tidy.git

From source (recommended for development):

git clone https://github.com/MargoSolo/gene-tidy
cd gene-tidy
pip install -e .

Quickstart (CLI)

gene-tidy input.xlsx --out outputs/

That's it. outputs/ will contain six files (see below). Works the same on .txt, .csv, .tsv, and .xlsx:

gene-tidy my_genes.txt --out outputs/
gene-tidy supp_table.csv --out outputs/

Useful flags:

gene-tidy data.xlsx -o out/ --column gene_symbol   # force the identifier column
gene-tidy data.csv  -o out/ --column symbol -c ensembl_id   # multiple columns
gene-tidy data.xlsx -o out/ --hgnc-file hgnc_complete_set.txt  # use a different / newer HGNC release
gene-tidy --version                                 # tool + HGNC dump version

Quickstart (Python)

from gene_tidy import tidy_file, tidy_values

# Whole file -> writes the six output files, returns a result object.
result = tidy_file("supp_table.xlsx", "outputs/")
print(result.counts)   # {'total': 21, 'clean': 16, 'ambiguous': 3, 'failed': 2}

# Or clean an in-memory list of identifiers (no files written):
result = tidy_values(["TP53", "p53", "Sep-7", "ENSG00000141510", "1-Mar", "FOOBAR1"])
print(result.audit[["input_value", "approved_symbol", "match_status"]])

       input_value approved_symbol     match_status
0             TP53            TP53          matched
1              p53            TP53    matched_alias
2            Sep-7         SEPTIN7  recovered_excel
3  ENSG00000141510            TP53          matched
4            1-Mar  MARCHF1;MTARC1        ambiguous
5          FOOBAR1                          unmatched

1-Mar (ambiguous between MARCHF1 and MTARC1) lands in result.ambiguous; FOOBAR1 lands in result.failed. Nothing is dropped.

What it handles

Input	Example	Result
Approved symbol	`TP53`	`matched` → TP53
Alias symbol	`p53`, `HER2`	`matched_alias` (warns "resolved from alias")
Previous symbol	`FRAP1`, `VEGF`	`matched_prev` (warns "resolved from previous symbol")
Ensembl gene	`ENSG00000141510`	`matched` → TP53
UniProt	`P38398`	`matched` → BRCA1
Entrez	`672`	`matched` → BRCA1
RefSeq	`NM_000546`	`matched` → TP53
HGNC ID	`HGNC:11998`	`matched` → TP53
Excel date corruption	`Sep-7`	`recovered_excel` → SEPTIN7 (always warns)
Ambiguous corruption	`1-Mar`, `2-Sep`, `1-Dec`	`ambiguous` → e.g. MARCHF1/MTARC1, SEPTIN2/SEPTIN6 → manual review
Multiple IDs per cell	`KRAS, NRAS`	split and resolved independently
Case / whitespace	`tp53`	normalised → TP53
Duplicates	`TP53` ×2	kept, flagged in `warning`
No match	`FOOBAR1`	`unmatched` → `failed_rows.csv`
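The table above implies two early steps: splitting multi-ID cells and deciding what kind of identifier each token is. A rough sketch of both follows. The type labels, the `PATTERNS` list, and the `split_cell` / `detect_type` helpers are assumptions made for illustration; gene-tidy's actual `detected_type` values and rules may differ. The UniProt pattern follows the accession format published by UniProt, and pattern order matters because a bare number would otherwise never be reached.

```python
import re

# (label, pattern) pairs, checked in order. Labels are illustrative, not gene-tidy's.
PATTERNS = [
    ("hgnc_id", re.compile(r"^HGNC:\d+$", re.IGNORECASE)),
    ("ensembl_gene", re.compile(r"^ENSG\d{11}(\.\d+)?$", re.IGNORECASE)),
    ("ensembl_transcript_or_protein", re.compile(r"^ENS[TP]\d{11}(\.\d+)?$", re.IGNORECASE)),
    ("refseq", re.compile(r"^[NX][MRP]_\d+(\.\d+)?$", re.IGNORECASE)),
    ("uniprot", re.compile(
        r"^(?:[OPQ][0-9][A-Z0-9]{3}[0-9]|[A-NR-Z][0-9](?:[A-Z][A-Z0-9]{2}[0-9]){1,2})$"
    )),
    ("entrez", re.compile(r"^\d+$")),
]


def split_cell(cell: str) -> list[str]:
    """Split a cell such as 'KRAS, NRAS' into trimmed, non-empty tokens."""
    return [part.strip() for part in re.split(r"[,;|]", cell) if part.strip()]


def detect_type(token: str) -> str:
    """Return the first matching identifier type, falling back to 'symbol'."""
    for label, pattern in PATTERNS:
        if pattern.match(token):
            return label
    return "symbol"


for token in split_cell(" KRAS, ENSG00000141510 ; 672 "):
    print(token, detect_type(token))
```

Anything that matches no pattern is treated as a candidate symbol, which is why date-corrupted values such as `2-Sep` need their own check before symbol lookup.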

Output files

Every run writes six files to --out:

File	Contents
`clean_table.xlsx` / `clean_table.csv`	confidently resolved rows
`ambiguous_rows.csv`	one-to-many / uncertain rows needing manual review
`failed_rows.csv`	unmatched and empty rows
`mapping_audit.csv`	every input → output, with full provenance (see below)
`methods_text.txt`	paste-ready methods paragraph (tool + HGNC version/date)

Columns (required schema)

input_value, detected_type, approved_symbol, hgnc_id, ensembl_gene_id, uniprot_id, entrez_id, refseq_id, match_status, warning, source_used, manual_review_required (plus source_row / source_column for traceability back to the original table).

match_status is one of: matched, matched_alias, matched_prev, recovered_excel (→ clean) · ambiguous (→ review) · unmatched, empty (→ failed).
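The status-to-file routing described above can be written down as a small lookup table. This is a sketch for readers who want to post-process the audit themselves: `BUCKET_BY_STATUS` and `bucket_rows` are hypothetical names, and the sample rows simply mirror the quickstart output.

```python
# match_status -> output bucket, as listed above.
BUCKET_BY_STATUS = {
    "matched": "clean",
    "matched_alias": "clean",
    "matched_prev": "clean",
    "recovered_excel": "clean",
    "ambiguous": "ambiguous",
    "unmatched": "failed",
    "empty": "failed",
}


def bucket_rows(rows):
    """Group audit rows into clean / ambiguous / failed by their match_status."""
    buckets = {"clean": [], "ambiguous": [], "failed": []}
    for row in rows:
        buckets[BUCKET_BY_STATUS[row["match_status"]]].append(row)
    return buckets


rows = [
    {"input_value": "TP53", "match_status": "matched"},
    {"input_value": "p53", "match_status": "matched_alias"},
    {"input_value": "Sep-7", "match_status": "recovered_excel"},
    {"input_value": "1-Mar", "match_status": "ambiguous"},
    {"input_value": "FOOBAR1", "match_status": "unmatched"},
]
buckets = bucket_rows(rows)

# Every input row ends up in exactly one bucket, so nothing is dropped.
assert sum(len(b) for b in buckets.values()) == len(rows)
print({name: len(b) for name, b in buckets.items()})
```

An unknown status raises `KeyError` here rather than being silently filed somewhere, which matches the spirit of the "never drops a row" rule.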

Every table also carries per-row provenance — matched_field (which HGNC field matched: symbol / alias_symbol / prev_symbol / ensembl_gene_id / uniprot_ids / entrez_id / refseq_accession / hgnc_id / excel_recovery), match_reason (human-readable), and candidate_count (1 for a clean hit, N for ambiguous, 0 for no match). mapping_audit.csv additionally records hgnc_dump_date and gene_tidy_version on every row for full reproducibility.

Source of truth & offline guarantee

Resolution runs against a static, bundled HGNC complete set (src/gene_tidy/data/hgnc_complete_set.tsv.gz), containing all ~45,000 Approved HGNC gene records. Symbols are matched against the approved symbol, alias_symbol, and prev_symbol fields; database identifiers are matched against the corresponding ID fields (ensembl_gene_id, uniprot_ids, entrez_id, refseq_accession, hgnc_id). The accompanying hgnc_version.json records the source URL, HGNC license (CC0), download date, release tag, and record count; the same provenance is printed by gene-tidy --version, written into every mapping_audit.csv row, and summarised in methods_text.txt.

To use a different / newer HGNC release, pass --hgnc-file path/to/hgnc_complete_set.txt, set the GENE_TIDY_HGNC_FILE environment variable, or regenerate the bundled dump with python tools/build_hgnc_data.py hgnc_complete_set.txt. A user-supplied file is filtered to status == Approved automatically.
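For a sense of what the Approved-only filter amounts to, here is a standard-library sketch over a tab-separated HGNC-style file. It assumes the file has a `status` column, as the HGNC complete set does; the `approved_records` helper is hypothetical, and the withdrawn row in the sample is made up purely to show the filter doing something.

```python
import csv
import io

# Tiny stand-in for an HGNC download. The "Entry Withdrawn" row is invented.
SAMPLE_TSV = (
    "hgnc_id\tsymbol\tstatus\n"
    "HGNC:11998\tTP53\tApproved\n"
    "HGNC:1100\tBRCA1\tApproved\n"
    "HGNC:0\tEXAMPLE_WITHDRAWN\tEntry Withdrawn\n"
)


def approved_records(handle):
    """Yield only the rows whose status column is exactly 'Approved'."""
    reader = csv.DictReader(handle, delimiter="\t")
    return [row for row in reader if row["status"] == "Approved"]


records = approved_records(io.StringIO(SAMPLE_TSV))
print([row["symbol"] for row in records])  # ['TP53', 'BRCA1']
```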

The package and its tests never require network access. (The test suite resolves against a tiny curated fixture in tests/fixtures/ for speed; the real bundled dump is exercised separately in tests/test_data_boundary.py.)

Real-world note: because the bundled data is the real HGNC set, genuine one-to-many cases surface honestly. For example SEPT2 is a previous symbol of SEPTIN2 and an alias of SEPTIN6, so gene-tidy reports it ambiguous rather than guessing.
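The SEPT2 case can be reproduced with a toy index. The sketch below is a simplified illustration, not the package's resolver: the `RECORDS` list, `build_index`, and `resolve` are hypothetical, only the SEPT2 and p53 relationships come from this README, and the three statuses are collapsed compared with gene-tidy's real `match_status` values. It also gives approved symbols no priority over aliases, which a real resolver might.

```python
from collections import defaultdict

# Toy records; the SEPT2 relationships are the ones described in the note above.
RECORDS = [
    {"symbol": "SEPTIN2", "prev_symbol": ["SEPT2"], "alias_symbol": []},
    {"symbol": "SEPTIN6", "prev_symbol": [], "alias_symbol": ["SEPT2"]},
    {"symbol": "TP53", "prev_symbol": [], "alias_symbol": ["p53"]},
]


def build_index(records):
    """Map every known name (approved, previous, alias) to the approved symbols using it."""
    index = defaultdict(set)
    for rec in records:
        names = [rec["symbol"], *rec["prev_symbol"], *rec["alias_symbol"]]
        for name in names:
            index[name.upper()].add(rec["symbol"])
    return index


def resolve(value, index):
    """Collect all candidates first, then decide; never pick one of several."""
    candidates = sorted(index.get(value.strip().upper(), set()))
    if len(candidates) == 1:
        status = "resolved"
    elif candidates:
        status = "ambiguous"
    else:
        status = "unmatched"
    return {"candidates": candidates, "candidate_count": len(candidates), "status": status}


index = build_index(RECORDS)
print(resolve("SEPT2", index))    # two candidates -> ambiguous
print(resolve("p53", index))      # one candidate -> resolved
print(resolve("FOOBAR1", index))  # no candidates -> unmatched
```

The key design point is that the decision is made on the full candidate set, so a one-to-many name cannot be resolved by whichever record happens to be indexed first.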

Colab notebook

Zero-setup, in-browser: upload a file → run → preview clean/failed/ambiguous rows → download a ZIP of all outputs. A bundled messy_example.xlsx lets you click Run and see results immediately.

notebooks/gene_tidy_colab.ipynb

Development

pip install -e ".[dev]"   # installs pytest, build, and twine
pytest                    # 116 tests, all offline

Test coverage: ID-type detection, column detection, resolver (alias / prev / Excel-corruption / ambiguity), input/output file handling, CLI, golden-output regression on the bundled example, and a data-boundary test that loads the real bundled HGNC complete set. Most tests use a small curated fixture (tests/fixtures/hgnc_subset.tsv) so the suite runs in seconds.

To refresh the bundled HGNC data (deterministic: the same input always produces a byte-identical .tsv.gz, and the run records raw_download_sha256 + bundled_tsv_gz_sha256 in hgnc_version.json):

python tools/build_hgnc_data.py path/to/hgnc_complete_set.txt   # from a pinned file
python tools/build_hgnc_data.py --download                      # or fetch current
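Byte-identical gzip output does not happen by default, because the gzip header normally embeds a timestamp and sometimes a filename. One common way to get reproducible archives in Python is shown below; whether tools/build_hgnc_data.py does exactly this is an assumption, and `deterministic_gzip` is a hypothetical helper.

```python
import gzip
import hashlib
import io


def deterministic_gzip(payload: bytes) -> bytes:
    """Compress bytes so that identical input gives identical output."""
    buffer = io.BytesIO()
    # mtime=0 and an empty filename keep run-specific metadata out of the gzip header.
    with gzip.GzipFile(filename="", fileobj=buffer, mode="wb", mtime=0) as gz:
        gz.write(payload)
    return buffer.getvalue()


tsv = b"hgnc_id\tsymbol\nHGNC:11998\tTP53\n"
first = deterministic_gzip(tsv)
second = deterministic_gzip(tsv)

print(first == second)
print(hashlib.sha256(first).hexdigest() == hashlib.sha256(second).hexdigest())
```

Identical bytes are only guaranteed for the same Python and zlib build, which is one reason to record the hash alongside the snapshot rather than assume it.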

Attribution & citing HGNC

gene-tidy resolves identifiers using data from the HUGO Gene Nomenclature Committee (HGNC).

Source: HGNC complete set (hgnc_complete_set.txt) from the HGNC download archive. The exact source URL, snapshot date, and SHA-256 hashes are recorded in src/gene_tidy/data/hgnc_version.json.
Snapshot bundled in this release: see downloaded_date and bundled_tsv_gz_sha256 in src/gene_tidy/data/hgnc_version.json (also printed by gene-tidy --version and written into every mapping_audit.csv / methods_text.txt).
License: HGNC data are released under a CC0 1.0 public-domain dedication, so they are free to redistribute; gene-tidy bundles a column-trimmed, Approved-only snapshot.
Recommendation: in your own methods/supplementary text, cite HGNC and state the retrieval month/year of the dump you used (e.g. "HGNC complete set, retrieved June 2026, via gene-tidy v0.1.0"). The exact date and hash are in hgnc_version.json and the generated methods_text.txt.

Please cite HGNC: Seal RL, et al. Genenames.org: the HGNC resources in 2023. Nucleic Acids Res. 2023;51(D1):D1003–D1009.

License

gene-tidy itself is MIT — see LICENSE. The bundled HGNC data is CC0 (see Attribution above).

Release history

0.1.1 (this version): Jun 21, 2026
0.1.0: Jun 21, 2026
