Gene extraction and bibliometric analysis from literature tables
LitGeneMap
LitGeneMap is a Python package for extracting genes from literature tables and performing bibliometric analysis.
It is designed around a simple input rule: any literature table containing at least title and abstract can be used. By default, LitGeneMap analyzes title, abstract, and keyword, then produces gene frequency tables, optional temporal metrics, gene-gene co-occurrence, and full network/module outputs.
Core design
- Minimum required literature columns: title, abstract
- Default analyzed text columns: title, abstract, keyword
- Compatible with .xlsx, .csv, and .tsv
- Works with bibliometrix/WoS-style exports, but is not limited to them
- Supports either:
  - hgnc_complete_set.txt for human analyses
  - a custom mapping table with raw_term and standard_symbol for any species
- Supports partial runs: frequency, cooccurrence, or full
- Includes a built-in default blacklist for highly ambiguous terms
Installation
From local source
pip install -e .
After PyPI release
pip install litgenemap
Quick start
Human HGNC, full analysis
litgenemap run --literature my_literature.xlsx --genes hgnc_complete_set.txt --gene-input hgnc --output results/
Custom species dictionary, full analysis
litgenemap run --literature my_literature.csv --genes rice_dictionary.csv --gene-input dictionary --output results_rice/
Frequency only
litgenemap run --literature my_literature.xlsx --genes hgnc_complete_set.txt --gene-input hgnc --analysis-level frequency --output results_freq/
Use a blacklist for ambiguous terms
litgenemap run --literature my_literature.xlsx --genes hgnc_complete_set.txt --gene-input hgnc --blacklist blacklist.txt --output results_clean/
Example dataset
A bundled example dataset is provided in example_data/ for quick validation of the pipeline and for reproducible demonstration purposes.
Included files:
- example_data/demo_literature.csv: example literature records
- example_data/demo_dictionary.csv: custom gene dictionary with alias-to-symbol normalization
- example_data/demo_blacklist.txt: optional custom blacklist for filtering ambiguous terms
Run the full pipeline with the bundled example data:
litgenemap run --literature example_data/demo_literature.csv --genes example_data/demo_dictionary.csv --gene-input dictionary --analysis-level full --output demo_output
This example covers:
- custom dictionary-based gene matching
- alias normalization such as p53 -> TP53 and HER2 -> ERBB2
- gene frequency analysis
- gene-gene co-occurrence analysis
- downstream network and module generation
Optional blacklist test:
litgenemap run --literature example_data/demo_literature.csv --genes example_data/demo_dictionary.csv --gene-input dictionary --analysis-level full --blacklist example_data/demo_blacklist.txt --output demo_output_blacklist
Input requirements
Minimum required literature columns
- title
- abstract
Default analyzed text columns
- title
- abstract
- keyword
Optional metadata columns
- year
- doi
- keywords_plus
Supported literature file formats
- .xlsx
- .csv
- .tsv
Automatic column alias mapping
LitGeneMap automatically maps common source column names when possible:
- title <- title / TI / TI_raw
- abstract <- abstract / AB / AB_raw
- keyword <- keyword / keywords / author_keywords / DE / DE_raw
- keywords_plus <- keywords_plus / ID
- year <- year / PY
- doi <- doi / DI
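The alias mapping above can be pictured as a lookup from canonical column names to accepted source names. The following is a minimal sketch, not LitGeneMap's actual code: `COLUMN_ALIASES`, `REQUIRED`, and `normalize_columns` are hypothetical names, and case-insensitive header matching is an assumption.

```python
# Hypothetical alias table mirroring the mappings listed above.
COLUMN_ALIASES = {
    "title": ["title", "TI", "TI_raw"],
    "abstract": ["abstract", "AB", "AB_raw"],
    "keyword": ["keyword", "keywords", "author_keywords", "DE", "DE_raw"],
    "keywords_plus": ["keywords_plus", "ID"],
    "year": ["year", "PY"],
    "doi": ["doi", "DI"],
}
REQUIRED = ("title", "abstract")


def normalize_columns(header):
    """Return {canonical name: source column}; the first matching alias wins."""
    # Assumption: header names are compared case-insensitively.
    present = {name.strip().lower(): name for name in header}
    mapping = {}
    for canonical, aliases in COLUMN_ALIASES.items():
        for alias in aliases:
            source = present.get(alias.lower())
            if source is not None:
                mapping[canonical] = source
                break
    missing = [col for col in REQUIRED if col not in mapping]
    if missing:
        raise ValueError(f"missing required columns: {missing}")
    return mapping


# A WoS-style header: two-letter field tags plus an unrelated column (AU).
wos_header = ["TI", "AB", "DE", "ID", "PY", "DI", "AU"]
print(normalize_columns(wos_header))
```

Columns that match no alias (such as AU here) are simply ignored, while a table lacking title or abstract is rejected, in line with the minimum input rule.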
Gene input modes
1. --gene-input hgnc
Use a human HGNC raw table such as hgnc_complete_set.txt.
LitGeneMap will automatically:
- read the HGNC file
- keep approved genes by default
- keep protein-coding genes by default
- expand searchable terms from symbol, alias_symbol, and prev_symbol
Human gene data can be obtained from the HGNC website.
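The HGNC steps listed above can be sketched roughly as follows. This is an illustration only, not the package's implementation: it assumes the tab-separated layout of hgnc_complete_set.txt with status, locus_group, and pipe-delimited alias_symbol/prev_symbol columns, and `expand_hgnc_terms` plus the three sample rows are made up for the example.

```python
import csv
import io

# Tiny illustrative sample in the assumed layout of hgnc_complete_set.txt.
SAMPLE = (
    "symbol\tstatus\tlocus_group\talias_symbol\tprev_symbol\n"
    "TP53\tApproved\tprotein-coding gene\tp53|LFS1\t\n"
    "ERBB2\tApproved\tprotein-coding gene\tHER2|NEU\tNGL\n"
    "MIR21\tApproved\tnon-coding RNA\tmiR-21\tMIRN21\n"
)


def expand_hgnc_terms(handle):
    """Return {search term: approved symbol} for approved protein-coding genes."""
    terms = {}
    for row in csv.DictReader(handle, delimiter="\t"):
        if row["status"] != "Approved":
            continue
        if row["locus_group"] != "protein-coding gene":
            continue
        symbol = row["symbol"]
        candidates = [symbol]
        # Assumption: multiple aliases are separated by "|" within one field.
        for field in ("alias_symbol", "prev_symbol"):
            candidates += [term for term in row[field].split("|") if term]
        for term in candidates:
            # setdefault keeps the first mapping if two genes share an alias.
            terms.setdefault(term, symbol)
    return terms


terms = expand_hgnc_terms(io.StringIO(SAMPLE))
print(terms)
```

The non-coding MIR21 row is dropped by the protein-coding filter, and every remaining alias or previous symbol points back to its approved symbol.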
2. --gene-input dictionary
Use a custom mapping table for any species.
Minimum required columns:
- raw_term
- standard_symbol
Example:
raw_term,standard_symbol
TP53,TP53
p53,TP53
BRCA1,BRCA1
This mode is useful for:
- non-human species
- custom curated dictionaries
- domain-specific controlled vocabularies
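To make the role of the two dictionary columns concrete, here is a minimal sketch of dictionary-based matching. It is not LitGeneMap's matcher: `build_pattern` and `find_genes` are hypothetical helpers, and case-sensitive, whole-token matching is an assumption.

```python
import re

# raw_term -> standard_symbol, as in the example table above.
dictionary = {"TP53": "TP53", "p53": "TP53", "BRCA1": "BRCA1"}


def build_pattern(raw_terms):
    """Compile one regex that matches any raw term as a whole token."""
    # Longest terms first so a longer alias wins over a shorter one.
    ordered = sorted(raw_terms, key=len, reverse=True)
    body = "|".join(re.escape(term) for term in ordered)
    # Lookarounds reject matches embedded in a longer alphanumeric token.
    return re.compile(rf"(?<![A-Za-z0-9])(?:{body})(?![A-Za-z0-9])")


def find_genes(text, dictionary, pattern):
    """Return the sorted standard symbols mentioned in text."""
    return sorted({dictionary[m.group(0)] for m in pattern.finditer(text)})


pattern = build_pattern(dictionary)
text = "Loss of p53 and BRCA1 function in tumours; TP53BP1 is unrelated."
print(find_genes(text, dictionary, pattern))
```

Here p53 is normalized to TP53, while TP53BP1 is not counted as a TP53 mention because the match must end at a token boundary.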
Default blacklist for ambiguous terms
LitGeneMap applies a built-in blacklist by default to reduce false positives caused by highly ambiguous short terms or common English words that may appear in HGNC aliases or custom dictionaries.
This is especially important for cases such as:
- OF being mapped to BRIP1
- very short or common words producing inflated gene frequency or co-occurrence counts
Default behavior:
- the built-in blacklist is applied automatically
- --blacklist my_blacklist.txt adds your own blocked terms on top of the built-in blacklist
- --no-default-blacklist disables the built-in blacklist
Recommended practice:
- keep the default blacklist enabled for routine analyses
- add your own blacklist for field-specific ambiguous terms
- only disable the default blacklist when you explicitly want raw matching behavior
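How the default and user blacklists combine can be sketched as a set union applied before matching. Everything here is illustrative: the entries in `DEFAULT_BLACKLIST` are placeholders rather than the package's real list, the one-term-per-line file format with `#` comments is an assumption, and so is the case-insensitive comparison.

```python
# Placeholder entries; the real built-in list ships with the package.
DEFAULT_BLACKLIST = {"OF", "AS", "FOR"}


def load_blacklist(lines):
    """Parse one term per line, skipping blanks and # comments (assumed format)."""
    terms = set()
    for line in lines:
        term = line.strip()
        if term and not term.startswith("#"):
            terms.add(term)
    return terms


def effective_blacklist(user_terms=(), use_default=True):
    """Union of built-in and user terms; use_default=False mimics --no-default-blacklist."""
    blocked = set(DEFAULT_BLACKLIST) if use_default else set()
    blocked.update(user_terms)
    return {term.upper() for term in blocked}


def filter_terms(term_to_symbol, blocked):
    """Drop search terms that are blacklisted (compared case-insensitively)."""
    return {t: s for t, s in term_to_symbol.items() if t.upper() not in blocked}


search_terms = {"OF": "BRIP1", "BRIP1": "BRIP1", "CAT": "CAT"}
user = load_blacklist(["# field-specific terms", "CAT", ""])

print(filter_terms(search_terms, effective_blacklist(user)))
print(filter_terms(search_terms, effective_blacklist(use_default=False)))
```

With both lists active only the unambiguous BRIP1 term survives; with the default list disabled and no user list, the OF alias is matched again.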
Recommended literature source
Literature tables exported from the R package bibliometrix are recommended.
However, LitGeneMap is not limited to bibliometrix output. Any tabular literature dataset containing at least title and abstract can be used.
Analysis levels
frequency
Outputs:
- normalized literature table
- article-gene hits
- article-gene matrix
- gene frequency table
- temporal metrics when year is available
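The relationship between article-gene hits, gene frequency, and temporal metrics can be illustrated with a small sketch. The exact metrics LitGeneMap reports are not spelled out here, so first year, last year, and per-year counts are one plausible reading; the `hits` records and both function names are hypothetical.

```python
from collections import Counter, defaultdict

# Hypothetical article-gene hits: (article id, year, gene symbol).
hits = [
    ("a1", 2019, "TP53"),
    ("a2", 2021, "TP53"),
    ("a3", 2021, "TP53"),
    ("a2", 2021, "BRCA1"),
    ("a4", None, "BRCA1"),  # no year: counted in frequency, skipped in temporal
]


def gene_frequency(hits):
    """Number of distinct articles mentioning each gene."""
    articles = defaultdict(set)
    for article, _year, gene in hits:
        articles[gene].add(article)
    return {gene: len(ids) for gene, ids in articles.items()}


def temporal_summary(hits):
    """Per-gene first year, last year, and article counts by year."""
    per_year = defaultdict(Counter)
    seen = set()
    for article, year, gene in hits:
        if year is None or (article, gene) in seen:
            continue
        seen.add((article, gene))
        per_year[gene][year] += 1
    return {
        gene: {
            "first_year": min(counts),
            "last_year": max(counts),
            "by_year": dict(sorted(counts.items())),
        }
        for gene, counts in per_year.items()
    }


print(gene_frequency(hits))
print(temporal_summary(hits))
```

Records without a year still contribute to the frequency table, which is why temporal metrics can be treated as optional.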
cooccurrence
Adds:
- gene-gene co-occurrence table
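Gene-gene co-occurrence is conventionally counted as the number of articles in which two genes appear together. The sketch below shows that idea under this assumption; `article_genes` and `cooccurrence` are hypothetical names, and the package's own counting rules may differ.

```python
from collections import Counter
from itertools import combinations

# Hypothetical article -> set of standard symbols found in it.
article_genes = {
    "a1": {"TP53", "BRCA1", "ERBB2"},
    "a2": {"TP53", "BRCA1"},
    "a3": {"ERBB2"},
}


def cooccurrence(article_genes):
    """Count, for each unordered gene pair, the articles mentioning both."""
    counts = Counter()
    for genes in article_genes.values():
        # Sorting makes (A, B) and (B, A) the same key.
        for pair in combinations(sorted(genes), 2):
            counts[pair] += 1
    return counts


for (gene_a, gene_b), n in sorted(cooccurrence(article_genes).items()):
    print(gene_a, gene_b, n)
```

Articles with a single gene, such as a3, add to that gene's frequency but contribute no pairs.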
full
Adds:
- network edge table
- module assignments
- evidence scores
- top genes by module
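Which module-detection algorithm LitGeneMap uses is not stated here. As a deliberately simple stand-in, the sketch below thresholds co-occurrence edges and treats each connected component as a module; the `edges` data, `min_weight` parameter, and `modules_from_edges` function are all hypothetical.

```python
from collections import defaultdict

# Hypothetical co-occurrence edges: (gene, gene) -> shared article count.
edges = {
    ("BRCA1", "TP53"): 2,
    ("ERBB2", "EGFR"): 3,
    ("BRCA1", "ERBB2"): 1,
}


def modules_from_edges(edges, min_weight=2):
    """Group genes into connected components of the thresholded graph."""
    graph = defaultdict(set)
    for (gene_a, gene_b), weight in edges.items():
        if weight >= min_weight:
            graph[gene_a].add(gene_b)
            graph[gene_b].add(gene_a)

    modules, seen = [], set()
    for start in sorted(graph):
        if start in seen:
            continue
        # Depth-first traversal collects everything reachable from start.
        stack, component = [start], set()
        while stack:
            node = stack.pop()
            if node in component:
                continue
            component.add(node)
            stack.extend(graph[node] - component)
        seen |= component
        modules.append(sorted(component))
    return modules


print(modules_from_edges(edges))
print(modules_from_edges(edges, min_weight=1))
```

Raising the threshold splits the network into tighter groups, while keeping weak edges merges everything into one module; real community-detection methods make this trade-off less abrupt.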
Output files
Depending on the analysis level, LitGeneMap may produce:
- articles_normalized.csv
- article_gene_hits.csv
- article_gene_matrix.csv
- gene_frequency.csv
- gene_cooccurrence.csv
- gene_network_edges.csv
- gene_modules.csv
- gene_module_evidence_table.csv
- top_genes_by_module.csv
When using HGNC raw input, LitGeneMap may also export intermediate cleaned gene tables.
Command-line help
litgenemap --help
litgenemap run --help
Release workflow
Build distributions
python -m pip install --upgrade build twine
python -m build
Upload to TestPyPI
twine upload --repository testpypi dist/*
Upload to PyPI
twine upload dist/*
License
MIT