Library for normalizing entities based on a dictionary
Entity Normalizer
Python tool for normalizing entities based on a dictionary.
Usage
This tool can be used as:
- a command line tool, by cloning this repository and running
pip install .
under the root of this source; you can then run main.py with the required parameters to process your file of entities.
- a Python package, by installing the package using
pip install EntityNormalizer
Input and output
The input file must contain one entity per line.
The output file will contain the normalized entities, again, one per line.
The dictionary file must be a comma-separated table file, i.e., csv.
If the entity does not produce any match in the dictionary, it will be normalized to [NO_MATCH].
If the entity is found in the dictionary but the normalization is empty, it will be normalized to [NO_NORM_FOUND].
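The rules above can be illustrated with a minimal sketch. It is not the library's implementation: the similarity measure is an assumption (the ratio from Python's difflib, scaled to 0-100), and the function names and sample rows are hypothetical.

```python
from difflib import SequenceMatcher

NO_MATCH = "[NO_MATCH]"
NO_NORM_FOUND = "[NO_NORM_FOUND]"


def similarity(a: str, b: str) -> float:
    """Return a 0-100 similarity score (assumed measure, not the library's)."""
    return 100.0 * SequenceMatcher(None, a, b).ratio()


def normalize_one(entity: str, rows: dict, threshold: float = 50) -> str:
    """Map one entity to its normalization via the best-scoring surface form."""
    if not rows:
        return NO_MATCH
    best = max(rows, key=lambda surface: similarity(entity, surface))
    if similarity(entity, best) < threshold:
        return NO_MATCH  # nothing in the dictionary is close enough
    # A match was found; an empty target cell yields the second sentinel.
    return rows[best] or NO_NORM_FOUND


# Hypothetical dictionary rows: surface form -> normalization
rows = {"headache": "Headache", "nausea": ""}
results = [normalize_one(e, rows, threshold=70) for e in ["headache", "nausea", "zzz"]]
print(results)
```

The two sentinel values are kept distinct so that a caller can tell a missing dictionary entry apart from an entry whose normalization column is blank.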
Command line usage
python main.py input output dictionary source target [--matching_threshold MATCHING_THRESHOLD] [--index]
Parameters
- input: Input file path [Required]
- output: Output file path [Required]
- dictionary: Normalization dictionary file path [Required]
- source: Surface form column from dictionary [Required]
- target: Normalization column from dictionary [Required]
- matching_threshold: Threshold of string similarity for the normalization to be accepted (default: 50) [Optional]
- index: Use column indexes instead of names [Optional]
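For readers who want to see how these parameters fit together, here is a sketch of an argparse parser that mirrors the documented synopsis. It is an illustration only, not the contents of main.py; the integer type of the threshold and the `build_parser` name are assumptions.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Declare the five positional arguments and two options listed above."""
    parser = argparse.ArgumentParser(
        description="Normalize entities based on a dictionary"
    )
    parser.add_argument("input", help="Input file path")
    parser.add_argument("output", help="Output file path")
    parser.add_argument("dictionary", help="Normalization dictionary file path")
    parser.add_argument("source", help="Surface form column from dictionary")
    parser.add_argument("target", help="Normalization column from dictionary")
    parser.add_argument(
        "--matching_threshold",
        type=int,  # assumed type; the documentation only gives the default of 50
        default=50,
        help="Threshold of string similarity for the normalization to be accepted",
    )
    parser.add_argument(
        "--index",
        action="store_true",
        help="Use column indexes instead of names",
    )
    return parser


# Parse an explicit argument list instead of sys.argv so the sketch runs anywhere.
args = build_parser().parse_args(
    ["in.txt", "out.txt", "dict.csv", "0", "2", "--index", "--matching_threshold", "80"]
)
print(args.source, args.target, args.index, args.matching_threshold)
```

With a parser shaped like this, source and target arrive as strings, so a program using --index would still need to convert them to integers itself.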
Example
- With column names:
python main.py data/input.txt data/output.txt data/dictionary.csv surface_form_col normalization_col --matching_threshold 50
- With integer column indexes:
python main.py data/input.txt data/output.txt data/dictionary.csv 0 2 --index --matching_threshold 80
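To make the difference between the two invocations concrete, the sketch below resolves the source and target columns either by header name or by zero-based position, using only the standard csv module. `load_mapping` and the sample table are hypothetical, and treating the first row as a header in both modes is an assumption.

```python
import csv
import io


def load_mapping(csv_text: str, source: str, target: str, use_index: bool = False) -> dict:
    """Build a surface-form -> normalization mapping from CSV text."""
    reader = csv.reader(io.StringIO(csv_text))
    header = next(reader)  # assumed: the first row always holds column names
    if use_index:
        # Columns are given as positions, e.g. "0" and "2"
        src, tgt = int(source), int(target)
    else:
        # Columns are given as header names
        src, tgt = header.index(source), header.index(target)
    return {row[src]: row[tgt] for row in reader if row}


table = (
    "surface_form_col,note,normalization_col\n"
    "headache,x,Headache\n"
    "nausea,y,Nausea\n"
)
by_name = load_mapping(table, "surface_form_col", "normalization_col")
by_index = load_mapping(table, "0", "2", use_index=True)
print(by_name == by_index)
```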
Python package usage
After installation, the normalize function can be invoked with the dictionary and a list of entities to produce a list of normalized entities.
Example
from EntityNormalizer import EntityDictionary, normalize
entities = ['entity1', 'entity2', 'entity3']
normalization_dictionary = EntityDictionary('data/dictionary.csv', 'surface_forms', 'normalizations')
normalized = normalize(entities, normalization_dictionary, matching_threshold=70)
print(normalized)
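Because unmatched entities are reported with sentinel strings, a caller may want to separate them from successful normalizations. The helper below is a sketch with a hypothetical name (`split_results`); it assumes the returned list has the same length and order as the input, which the description suggests but does not state outright. The sample `normalized` list stands in for real output.

```python
SENTINELS = {"[NO_MATCH]", "[NO_NORM_FOUND]"}


def split_results(entities, normalized):
    """Return (resolved mapping, list of entities that were not normalized)."""
    resolved, unresolved = {}, []
    for entity, norm in zip(entities, normalized):
        if norm in SENTINELS:
            unresolved.append(entity)
        else:
            resolved[entity] = norm
    return resolved, unresolved


entities = ["entity1", "entity2", "entity3"]
# Stand-in for the list returned by normalize(...)
normalized = ["Norm1", "[NO_MATCH]", "[NO_NORM_FOUND]"]
resolved, unresolved = split_results(entities, normalized)
print(resolved)
print(unresolved)
```

Entities that land in the unresolved list are natural candidates for a retry with a lower matching_threshold or for manual review.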
Bundled dictionaries
This library comes with a set of bundled dictionaries, which can be found under the resources folder:
- MedDic-CANCER-ADE-JA
- MedDic-CANCER-DRUG-JA
These are Japanese medical dictionaries developed for normalizing concepts commonly found in the analysis of adverse events caused by anticancer drugs. Please refer to this page for more information.
Convenience classes for loading these dictionaries are available in the Dictionaries module:
from EntityNormalizer import Dictionaries, normalize
entities = ['entity1', 'entity2', 'entity3']
# Load the dictionaries
cancer_ade = Dictionaries.MedDicCancerADE()
cancer_drug = Dictionaries.MedDicCancerDrug()
# Use the dictionaries
normalized_ade = normalize(entities, cancer_ade, matching_threshold=70)
normalized_drug = normalize(entities, cancer_drug, matching_threshold=70)
Both dictionaries use the columns 出現形 (Surface form) and [細分類] (Sub-classification) as source and target columns, respectively.
This can be changed by passing the corresponding parameters when creating the dictionary:
from EntityNormalizer import Dictionaries
cancer_ade = Dictionaries.MedDicCancerADE(source_column='customColumn', target_column='customColumn2')