Gene extraction and bibliometric analysis from literature tables

LitGeneMap

LitGeneMap is a Python package for extracting genes from literature tables and performing bibliometric analysis.

It is designed around a simple input rule: any literature table with at least a title column and an abstract column can be used. By default, LitGeneMap analyzes the title, abstract, and keyword columns, then produces gene frequency tables, temporal metrics (when a year column is available), gene-gene co-occurrence tables, and network/module outputs.

Core design

  • Minimum required literature columns: title, abstract
  • Default analyzed text columns: title, abstract, keyword
  • Compatible with .xlsx, .csv, and .tsv
  • Works with bibliometrix/WoS-style exports, but is not limited to them
  • Supports either:
    • hgnc_complete_set.txt for human analyses
    • a custom mapping table with raw_term and standard_symbol for any species
  • Supports partial runs: frequency, cooccurrence, or full
  • Includes a built-in default blacklist for highly ambiguous terms

Installation

From local source

pip install -e .

From PyPI

pip install litgenemap

Quick start

Human HGNC, full analysis

litgenemap run --literature my_literature.xlsx --genes hgnc_complete_set.txt --gene-input hgnc --output results/

Custom species dictionary, full analysis

litgenemap run --literature my_literature.csv --genes rice_dictionary.csv --gene-input dictionary --output results_rice/

Frequency only

litgenemap run --literature my_literature.xlsx --genes hgnc_complete_set.txt --gene-input hgnc --analysis-level frequency --output results_freq/

Use a blacklist for ambiguous terms

litgenemap run --literature my_literature.xlsx --genes hgnc_complete_set.txt --gene-input hgnc --blacklist blacklist.txt --output results_clean/
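
When the same run is scripted from Python, the command can be assembled as an argument list. The sketch below uses only the flags shown in the commands above; the helper name build_run_command is hypothetical and is not part of LitGeneMap.

```python
from typing import List, Optional


def build_run_command(
    literature: str,
    genes: str,
    gene_input: str,
    output: str,
    analysis_level: Optional[str] = None,
    blacklist: Optional[str] = None,
) -> List[str]:
    """Assemble a `litgenemap run` argument list (hypothetical helper)."""
    cmd = [
        "litgenemap", "run",
        "--literature", literature,
        "--genes", genes,
        "--gene-input", gene_input,
    ]
    # Optional flags are only added when a value is given.
    if analysis_level is not None:
        cmd += ["--analysis-level", analysis_level]
    if blacklist is not None:
        cmd += ["--blacklist", blacklist]
    cmd += ["--output", output]
    return cmd


cmd = build_run_command(
    "my_literature.xlsx", "hgnc_complete_set.txt", "hgnc", "results_freq/",
    analysis_level="frequency",
)
print(" ".join(cmd))
```

Passing the list to subprocess.run(cmd, check=True) would execute it, provided litgenemap is installed and on PATH.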

Example dataset

A bundled example dataset is provided in example_data/ for quick validation of the pipeline and for reproducible demonstration purposes.

Included files:

  • example_data/demo_literature.csv: example literature records
  • example_data/demo_dictionary.csv: custom gene dictionary with alias-to-symbol normalization
  • example_data/demo_blacklist.txt: optional custom blacklist for filtering ambiguous terms

Run the full pipeline with the bundled example data:

litgenemap run --literature example_data/demo_literature.csv --genes example_data/demo_dictionary.csv --gene-input dictionary --analysis-level full --output demo_output

This example covers:

  • custom dictionary-based gene matching
  • alias normalization such as p53 -> TP53 and HER2 -> ERBB2
  • gene frequency analysis
  • gene-gene co-occurrence analysis
  • downstream network and module generation

Optional blacklist test:

litgenemap run --literature example_data/demo_literature.csv --genes example_data/demo_dictionary.csv --gene-input dictionary --analysis-level full --blacklist example_data/demo_blacklist.txt --output demo_output_blacklist

Input requirements

Minimum required literature columns

  • title
  • abstract

Default analyzed text columns

  • title
  • abstract
  • keyword

Optional metadata columns

  • year
  • doi
  • keywords_plus

Supported literature file formats

  • .xlsx
  • .csv
  • .tsv
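
For readers preparing input outside LitGeneMap, the two plain-text formats differ only in their delimiter. The following is a minimal sketch with the standard library, not LitGeneMap's own loader; read_text_table is a hypothetical name, and .xlsx is deliberately left to a spreadsheet reader.

```python
import csv
from pathlib import Path
from typing import Dict, List

# Delimiters for the plain-text formats; .xlsx needs a spreadsheet reader.
DELIMITERS = {".csv": ",", ".tsv": "\t"}


def read_text_table(path: str) -> List[Dict[str, str]]:
    """Read a .csv or .tsv literature table into a list of row dicts."""
    suffix = Path(path).suffix.lower()
    if suffix == ".xlsx":
        raise NotImplementedError("read .xlsx files with a spreadsheet library")
    if suffix not in DELIMITERS:
        raise ValueError(f"unsupported literature format: {suffix!r}")
    with open(path, newline="", encoding="utf-8") as handle:
        return list(csv.DictReader(handle, delimiter=DELIMITERS[suffix]))
```

Each returned row is a dict keyed by the header names found in the file, so column alias mapping can be applied afterwards.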

Automatic column alias mapping

LitGeneMap automatically maps common source column names when possible:

  • title <- title / TI / TI_raw
  • abstract <- abstract / AB / AB_raw
  • keyword <- keyword / keywords / author_keywords / DE / DE_raw
  • keywords_plus <- keywords_plus / ID
  • year <- year / PY
  • doi <- doi / DI

Gene input modes

1. --gene-input hgnc

Use a human HGNC raw table such as hgnc_complete_set.txt.

LitGeneMap will automatically:

  • read the HGNC file
  • keep approved genes by default
  • keep protein-coding genes by default
  • expand searchable terms from symbol, alias_symbol, and prev_symbol

Human gene data can be obtained from the HGNC website.
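
The steps above amount to filtering rows and then turning each kept row into several (raw_term, standard_symbol) pairs. The sketch below illustrates the idea and is not LitGeneMap's implementation. It assumes the HGNC columns status and locus_group with the values Approved and protein-coding gene, and multi-value fields separated by "|"; check these against the file you download. Apart from TP53 and p53, the sample rows are made up.

```python
import csv
import io
from typing import Dict, Iterable, List, Tuple


def expand_hgnc_terms(rows: Iterable[Dict[str, str]]) -> List[Tuple[str, str]]:
    """Return (raw_term, standard_symbol) pairs from HGNC-style rows."""
    pairs: List[Tuple[str, str]] = []
    for row in rows:
        # Default filters described above (column values are assumptions).
        if row.get("status") != "Approved":
            continue
        if row.get("locus_group") != "protein-coding gene":
            continue
        symbol = row["symbol"]
        terms = [symbol]
        for field in ("alias_symbol", "prev_symbol"):
            # Assumed format: multiple values separated by "|".
            values = (row.get(field) or "").split("|")
            terms += [v.strip() for v in values if v.strip()]
        for term in dict.fromkeys(terms):  # de-duplicate, keep order
            pairs.append((term, symbol))
    return pairs


# Made-up sample rows in a tab-separated layout (illustration only).
sample = (
    "symbol\tstatus\tlocus_group\talias_symbol\tprev_symbol\n"
    "TP53\tApproved\tprotein-coding gene\tp53\t\n"
    "GENEX\tApproved\tprotein-coding gene\tGX1|GX2\tOLDX\n"
    "GENEYP1\tApproved\tpseudogene\tGY\t\n"
)
pairs = expand_hgnc_terms(csv.DictReader(io.StringIO(sample), delimiter="\t"))
print(pairs)
# → [('TP53', 'TP53'), ('p53', 'TP53'), ('GENEX', 'GENEX'),
#    ('GX1', 'GENEX'), ('GX2', 'GENEX'), ('OLDX', 'GENEX')]
```

The resulting pairs have the same shape as the raw_term / standard_symbol table used by the dictionary mode.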

2. --gene-input dictionary

Use a custom mapping table for any species.

Minimum required columns:

  • raw_term
  • standard_symbol

Example:

raw_term,standard_symbol
TP53,TP53
p53,TP53
BRCA1,BRCA1

This mode is useful for:

  • non-human species
  • custom curated dictionaries
  • domain-specific controlled vocabularies
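
To make the role of the two columns concrete, here is one way a raw_term to standard_symbol table could be applied to text. It is a sketch under stated assumptions, not LitGeneMap's matcher: matching is case-sensitive, terms must not be embedded in a longer alphanumeric token, and the names build_matcher and find_genes are hypothetical.

```python
import re
from typing import Dict, Set


def build_matcher(term_to_symbol: Dict[str, str]) -> re.Pattern:
    """Compile one regex that matches any raw term as a whole token."""
    if not term_to_symbol:
        raise ValueError("dictionary is empty")
    # Longest terms first so a longer alias wins over a shorter prefix.
    terms = sorted(term_to_symbol, key=len, reverse=True)
    alternation = "|".join(re.escape(term) for term in terms)
    # Lookarounds reject matches embedded in longer alphanumeric tokens.
    return re.compile(rf"(?<![A-Za-z0-9])(?:{alternation})(?![A-Za-z0-9])")


def find_genes(text: str, term_to_symbol: Dict[str, str],
               pattern: re.Pattern) -> Set[str]:
    """Return the standard symbols of all terms found in the text."""
    return {term_to_symbol[m.group(0)] for m in pattern.finditer(text)}


dictionary = {"TP53": "TP53", "p53": "TP53", "BRCA1": "BRCA1"}
pattern = build_matcher(dictionary)
text = "Mutant p53 and BRCA1 loss in TP53-null cells; BRCA10 is not a match."
print(sorted(find_genes(text, dictionary, pattern)))
# → ['BRCA1', 'TP53']
```

Both p53 and TP53 collapse to the single symbol TP53, which is the alias normalization the dictionary is meant to provide.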

Default blacklist for ambiguous terms

LitGeneMap applies a built-in blacklist by default to reduce false positives caused by highly ambiguous short terms or common English words that may appear in HGNC aliases or custom dictionaries.

This is especially important for cases such as:

  • the common word "of" being matched as the alias OF and mapped to BRIP1
  • very short or common words producing inflated gene frequency or co-occurrence counts

Default behavior:

  • the built-in blacklist is applied automatically
  • --blacklist my_blacklist.txt adds your own blocked terms on top of the built-in blacklist
  • --no-default-blacklist disables the built-in blacklist

Recommended practice:

  • keep the default blacklist enabled for routine analyses
  • add your own blacklist for field-specific ambiguous terms
  • only disable the default blacklist when you explicitly want raw matching behavior
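
The interaction of the three options can be sketched as a set union that is then used to drop terms from the gene dictionary. This is an illustration only: the contents of the built-in blacklist are not listed here, so DEFAULT_BLACKLIST below is a one-item stand-in, and case-insensitive comparison is an assumption.

```python
from typing import Dict, Iterable, Optional, Set

# Stand-in for the built-in list; the real contents are not given in the text.
DEFAULT_BLACKLIST = {"OF"}


def effective_blacklist(user_terms: Optional[Iterable[str]] = None,
                        use_default: bool = True) -> Set[str]:
    """Combine the built-in list (unless disabled) with user-supplied terms."""
    blocked = {t.upper() for t in DEFAULT_BLACKLIST} if use_default else set()
    blocked.update(t.strip().upper() for t in (user_terms or []) if t.strip())
    return blocked


def filter_terms(term_to_symbol: Dict[str, str], blocked: Set[str]) -> Dict[str, str]:
    """Drop dictionary terms that appear in the blacklist."""
    return {t: s for t, s in term_to_symbol.items() if t.upper() not in blocked}


terms = {"OF": "BRIP1", "BRIP1": "BRIP1", "CAT": "CAT"}
print(filter_terms(terms, effective_blacklist()))
# → {'BRIP1': 'BRIP1', 'CAT': 'CAT'}
print(filter_terms(terms, effective_blacklist(["cat"])))
# → {'BRIP1': 'BRIP1'}
print(filter_terms(terms, effective_blacklist(["cat"], use_default=False)))
# → {'OF': 'BRIP1', 'BRIP1': 'BRIP1'}
```

The three calls correspond to the default behavior, --blacklist, and --blacklist combined with --no-default-blacklist.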

Recommended literature source

Literature tables exported from the R package bibliometrix are recommended.

However, LitGeneMap is not limited to bibliometrix output. Any tabular literature dataset containing at least title and abstract can be used.

Analysis levels

frequency

Outputs:

  • normalized literature table
  • article-gene hits
  • article-gene matrix
  • gene frequency table
  • temporal metrics when year is available
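
As a rough picture of how article-gene hits lead to a frequency table, the sketch below counts the number of distinct articles per gene and, where a year is present, per gene and year. Which temporal metrics LitGeneMap actually reports is not specified here, so the per-year count is only one plausible example; the row layout and function names are hypothetical.

```python
from collections import Counter, defaultdict
from typing import Dict, List, Optional, Tuple

Hit = Tuple[str, str, Optional[int]]  # (article_id, gene, year)

hits: List[Hit] = [
    ("a1", "TP53", 2020), ("a1", "TP53", 2020), ("a1", "BRCA1", 2020),
    ("a2", "TP53", 2021), ("a3", "ERBB2", None),
]


def gene_frequency(rows: List[Hit]) -> Counter:
    """Count distinct articles per gene (repeat mentions count once)."""
    unique = {(article, gene) for article, gene, _ in rows}
    return Counter(gene for _, gene in unique)


def yearly_counts(rows: List[Hit]) -> Dict[str, Dict[int, int]]:
    """Count distinct articles per gene and year, skipping missing years."""
    unique = {(a, g, y) for a, g, y in rows if y is not None}
    table: Dict[str, Counter] = defaultdict(Counter)
    for _, gene, year in unique:
        table[gene][year] += 1
    return {gene: dict(counts) for gene, counts in table.items()}


freq = gene_frequency(hits)
print(freq.most_common(1))
# → [('TP53', 2)]
print(yearly_counts(hits)["TP53"] == {2020: 1, 2021: 1})
# → True
```

ERBB2 appears in the frequency count but not in the yearly table because its only article has no year.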

cooccurrence

Adds to the frequency outputs:

  • gene-gene co-occurrence table
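
Gene-gene co-occurrence is conventionally counted as the number of articles in which two genes appear together. A minimal sketch of that counting, assuming each article has already been reduced to a set of standard symbols (the function name and input layout are hypothetical):

```python
from collections import Counter
from itertools import combinations
from typing import Dict, Set, Tuple


def cooccurrence(article_genes: Dict[str, Set[str]]) -> Counter:
    """Count, for each unordered gene pair, the articles mentioning both."""
    pairs: Counter = Counter()
    for genes in article_genes.values():
        # Sorting makes (A, B) and (B, A) the same key.
        for a, b in combinations(sorted(genes), 2):
            pairs[(a, b)] += 1
    return pairs


article_genes = {
    "a1": {"TP53", "BRCA1"},
    "a2": {"TP53", "BRCA1", "ERBB2"},
    "a3": {"TP53"},
}
counts = cooccurrence(article_genes)
print(counts[("BRCA1", "TP53")])
# → 2
```

Articles with a single gene, such as a3, contribute to frequency but not to co-occurrence.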

full

Adds to the cooccurrence outputs:

  • network edge table
  • module assignments
  • evidence scores
  • top genes by module
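
How LitGeneMap detects modules and computes evidence scores is not described here. Purely to illustrate what a module assignment over a co-occurrence network looks like, the sketch below groups genes into connected components after dropping weak edges. Connected components are a deliberately simple stand-in, not the package's method, and connected_modules and min_weight are hypothetical names.

```python
from collections import defaultdict
from typing import Dict, Iterable, Set, Tuple

Edge = Tuple[str, str, int]  # (gene_a, gene_b, co-occurrence count)


def connected_modules(edges: Iterable[Edge], min_weight: int = 1) -> Dict[str, int]:
    """Assign a module id to each gene via connected components."""
    graph: Dict[str, Set[str]] = defaultdict(set)
    for a, b, weight in edges:
        if weight >= min_weight:  # ignore edges below the threshold
            graph[a].add(b)
            graph[b].add(a)
    module_of: Dict[str, int] = {}
    next_id = 0
    for start in sorted(graph):  # sorted for reproducible module ids
        if start in module_of:
            continue
        module_of[start] = next_id
        stack = [start]
        while stack:  # depth-first traversal of one component
            node = stack.pop()
            for neighbor in graph[node]:
                if neighbor not in module_of:
                    module_of[neighbor] = next_id
                    stack.append(neighbor)
        next_id += 1
    return module_of


edges = [("BRCA1", "TP53", 2), ("BRCA1", "ERBB2", 1), ("EGFR", "KRAS", 3)]
modules = connected_modules(edges, min_weight=2)
print(sorted(modules.items()))
# → [('BRCA1', 0), ('EGFR', 1), ('KRAS', 1), ('TP53', 0)]
```

ERBB2 drops out of the example because its only edge falls below the threshold.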

Output files

Depending on the analysis level, LitGeneMap may produce:

  • articles_normalized.csv
  • article_gene_hits.csv
  • article_gene_matrix.csv
  • gene_frequency.csv
  • gene_cooccurrence.csv
  • gene_network_edges.csv
  • gene_modules.csv
  • gene_module_evidence_table.csv
  • top_genes_by_module.csv

When using HGNC raw input, LitGeneMap may also export intermediate cleaned gene tables.
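
After a run, a quick way to see which of the files listed above were written is to check the output directory. A small sketch using only the file names from the list; the helper name list_outputs is hypothetical.

```python
from pathlib import Path
from typing import Dict

# File names taken from the list above.
EXPECTED_OUTPUTS = [
    "articles_normalized.csv",
    "article_gene_hits.csv",
    "article_gene_matrix.csv",
    "gene_frequency.csv",
    "gene_cooccurrence.csv",
    "gene_network_edges.csv",
    "gene_modules.csv",
    "gene_module_evidence_table.csv",
    "top_genes_by_module.csv",
]


def list_outputs(output_dir: str) -> Dict[str, bool]:
    """Report which expected output files exist in the directory."""
    root = Path(output_dir)
    return {name: (root / name).is_file() for name in EXPECTED_OUTPUTS}
```

For example, list_outputs("demo_output") returns a dict mapping each file name to True or False; files that belong to a higher analysis level than the one requested are expected to be absent.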

Command-line help

litgenemap --help
litgenemap run --help

Release workflow

Build distributions

python -m pip install --upgrade build twine
python -m build

Upload to TestPyPI

twine upload --repository testpypi dist/*

Upload to PyPI

twine upload dist/*

License

MIT
