
Library for normalizing entities based on a dictionary


Entity Normalizer

Python tool for normalizing entities based on a dictionary.

Usage

This tool can be used as:

  • a command line tool: clone this repository and run pip install . from the root of the source tree; you can then run main.py with the required parameters to process a file of entities.
  • a Python package: install it with pip install EntityNormalizer

Input and output

The input file must contain one entity per line. The output file will contain the normalized entities, also one per line.
The dictionary file must be a comma-separated values (CSV) file.

If the entity does not produce any match in the dictionary, it will be normalized to [NO_MATCH]. If the entity is found in the dictionary but the normalization is empty, it will be normalized to [NO_NORM_FOUND].
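
The two placeholder values can be illustrated with a minimal sketch. This is not the library's implementation: the similarity measure (difflib.SequenceMatcher scaled to 0-100), the helper name normalize_one, and the in-memory dictionary layout are all assumptions made for illustration.

```python
from difflib import SequenceMatcher

NO_MATCH = "[NO_MATCH]"
NO_NORM_FOUND = "[NO_NORM_FOUND]"


def normalize_one(entity, dictionary, matching_threshold=50):
    """Return the normalization of the best-matching surface form.

    `dictionary` maps surface forms to normalizations (hypothetical layout).
    The 0-100 difflib score is an assumed stand-in for the real metric.
    """
    best_form, best_score = None, -1.0
    for surface_form in dictionary:
        score = 100 * SequenceMatcher(None, entity, surface_form).ratio()
        if score > best_score:
            best_form, best_score = surface_form, score
    # No surface form is similar enough: the entity has no match.
    if best_form is None or best_score < matching_threshold:
        return NO_MATCH
    # A surface form matched, but its normalization cell is empty.
    return dictionary[best_form] or NO_NORM_FOUND


toy_dictionary = {"headache": "Headache", "nausea": ""}
results = [normalize_one(e, toy_dictionary) for e in ["headache", "nausea", "xyz"]]
print(results)
```

Here the first entity has a normalization, the second matches a surface form whose normalization is empty, and the third matches nothing above the threshold.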


Command line usage

python main.py input output dictionary source target [--matching_threshold MATCHING_THRESHOLD] [--index]

Parameters

  • input: Input file path [Required]
  • output: Output file path [Required]
  • dictionary: Normalization dictionary file path [Required]
  • source: Surface form column from dictionary [Required]
  • target: Normalization column from dictionary [Required]
  • matching_threshold: Threshold of string similarity for the normalization to be accepted (default: 50) [Optional]
  • index: Use column indexes instead of names [Optional]
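
The parameter list above maps naturally onto an argparse definition. The following is a sketch of how such an interface could be declared, not the actual contents of main.py; in particular, parsing the threshold as an integer is an assumption.

```python
import argparse


def build_parser():
    """Build a parser mirroring the documented parameters (illustrative only)."""
    parser = argparse.ArgumentParser(prog="main.py")
    parser.add_argument("input", help="Input file path")
    parser.add_argument("output", help="Output file path")
    parser.add_argument("dictionary", help="Normalization dictionary file path")
    parser.add_argument("source", help="Surface form column from dictionary")
    parser.add_argument("target", help="Normalization column from dictionary")
    # Assumption: the threshold is an integer on a 0-100 scale.
    parser.add_argument("--matching_threshold", type=int, default=50,
                        help="Minimum string similarity for a match")
    # A boolean flag: when present, source/target are column indexes.
    parser.add_argument("--index", action="store_true",
                        help="Use column indexes instead of names")
    return parser


args = build_parser().parse_args(
    ["in.txt", "out.txt", "dict.csv", "0", "2", "--index", "--matching_threshold", "80"]
)
print(args.source, args.target, args.index, args.matching_threshold)
```

With this layout, source and target are always positional, and --index only changes how their values are interpreted.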

Example

  • With column names:

    python main.py data/input.txt data/output.txt data/dictionary.csv surface_form_col normalization_col --matching_threshold 50

  • With integer column indexes:

    python main.py data/input.txt data/output.txt data/dictionary.csv 0 2 --index --matching_threshold 80
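
The difference between name-based and index-based column selection can be sketched with the standard csv module. The helper read_columns is hypothetical, and the sketch assumes that the dictionary has a header row and that indexes are zero-based; neither is confirmed by this document.

```python
import csv
import io


def read_columns(csv_text, source, target, use_index=False):
    """Return (surface form, normalization) pairs from CSV text."""
    rows = list(csv.reader(io.StringIO(csv_text)))
    header, body = rows[0], rows[1:]
    if use_index:
        # Interpret source/target as zero-based column positions.
        src, tgt = int(source), int(target)
    else:
        # Interpret source/target as header names.
        src, tgt = header.index(source), header.index(target)
    return [(row[src], row[tgt]) for row in body]


sample = "surface_form_col,notes,normalization_col\nheadache,common,Headache\n"
by_name = read_columns(sample, "surface_form_col", "normalization_col")
by_index = read_columns(sample, "0", "2", use_index=True)
print(by_name == by_index)
```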


Python package usage

After installation, the normalize function can be called with the dictionary and a list of entities to produce a list of normalized entities.

Example

from EntityNormalizer import EntityDictionary, normalize

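# Entities to normalize
entities = ['entity1', 'entity2', 'entity3']

# Load the dictionary: CSV path, source (surface form) column, target (normalization) column
normalization_dictionary = EntityDictionary('data/dictionary.csv', 'surface_forms', 'normalizations')
# Normalize, accepting only matches with a similarity of at least 70
normalized = normalize(entities, normalization_dictionary, matching_threshold=70)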

print(normalized)
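
To bridge the one-entity-per-line file format and the list-based normalize function, a small wrapper can read the input, normalize it, and write the output. The sketch below is an illustration only: normalize_file and fake_normalize are hypothetical names, and fake_normalize is a stand-in so that the example runs without the package installed.

```python
import tempfile
from pathlib import Path


def normalize_file(input_path, output_path, normalize_fn):
    """Read one entity per line, normalize, and write one result per line."""
    entities = Path(input_path).read_text(encoding="utf-8").splitlines()
    normalized = normalize_fn(entities)
    Path(output_path).write_text("\n".join(normalized) + "\n", encoding="utf-8")


def fake_normalize(entities):
    # Stand-in for the real normalizer: title-case each entity.
    return [entity.strip().title() for entity in entities]


with tempfile.TemporaryDirectory() as tmp:
    in_path = Path(tmp) / "input.txt"
    out_path = Path(tmp) / "output.txt"
    in_path.write_text("entity one\nentity two\n", encoding="utf-8")
    normalize_file(in_path, out_path, fake_normalize)
    output_lines = out_path.read_text(encoding="utf-8").splitlines()

print(output_lines)
```

In real use, normalize_fn could be a small function that calls normalize with a loaded dictionary and a chosen matching_threshold, as in the example above.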

Bundled dictionaries

This library comes with a set of bundled dictionaries, which can be found under the resources folder:

  • MedDic-CANCER-ADE-JA
  • MedDic-CANCER-DRUG-JA

These are Japanese medical dictionaries developed for normalizing concepts that commonly appear in analyses of adverse events caused by anticancer drugs. Please refer to this page for more information.

Convenience classes for loading these dictionaries are available in the Dictionaries module:

from EntityNormalizer import Dictionaries, normalize

entities = ['entity1', 'entity2', 'entity3']

# Load the dictionaries
cancer_ade = Dictionaries.MedDicCancerADE()
cancer_drug = Dictionaries.MedDicCancerDrug()

# Use the dictionaries
normalized_ade = normalize(entities, cancer_ade, matching_threshold=70)
normalized_drug = normalize(entities, cancer_drug, matching_threshold=70)

Both dictionaries use the columns 出現形 (Surface form) and [細分類] (Sub-classification) as source and target columns, respectively.

This can be changed by passing the corresponding parameters when creating the dictionary:

from EntityNormalizer import Dictionaries

cancer_ade = Dictionaries.MedDicCancerADE(source_column='customColumn', target_column='customColumn2')
