A Python implementation of the IAMsystem algorithm
iamsystem
A Python implementation of the IAMsystem algorithm, a fast dictionary-based approach to semantic annotation, also known as entity linking.
Installation
pip install iamsystem
Usage
You provide a list of keywords to detect in a document. You can add and combine abbreviations, normalization methods (lemmatization, stemming) and approximate string matching algorithms; the IAMsystem algorithm then performs the semantic annotation.
See the documentation for the configuration details.
Quick example
from iamsystem import Matcher

matcher = Matcher.build(
    keywords=["North America", "South America"],
    stopwords=["and"],
    abbreviations=[("amer", "America")],
    spellwise=[dict(measure="Levenshtein", max_distance=1)],
    w=2,
)
annots = matcher.annot_text(text="Northh and south Amer.")
for annot in annots:
    print(annot)
# Northh Amer 0 6;17 21 North America
# south Amer 11 21 South America
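The output of the quick example can be checked by hand. The standalone sketch below does not use iamsystem or spellwise, and the helper names `levenshtein` and `token_spans` are hypothetical, introduced only for illustration. It shows that the misspelled token `Northh` lies within an edit distance of 1 of `north`, and that the character offsets `0 6` and `17 21` in the first annotation correspond to the positions of `Northh` and `Amer` in the input text. The tokenization rule used here (runs of word characters) is an assumption, not necessarily the tokenizer the package uses.

```python
import re


def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance (insert, delete, substitute)."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,         # deletion
                current[j - 1] + 1,      # insertion
                previous[j - 1] + cost,  # substitution or match
            ))
        previous = current
    return previous[-1]


def token_spans(text: str):
    """Return (token, start, end) triples for each run of word characters."""
    return [(m.group(), m.start(), m.end()) for m in re.finditer(r"\w+", text)]


text = "Northh and south Amer."
spans = token_spans(text)
distance = levenshtein("northh", "north")

print(distance)   # 1
print(spans[0])   # ('Northh', 0, 6)
print(spans[-1])  # ('Amer', 17, 21)
```

A distance of 1 is within the `max_distance=1` setting of the quick example, which is consistent with `Northh` being accepted as a variant of `North`.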
Algorithm
The algorithm was developed in the context of a PhD thesis. It proposes a solution for quickly annotating documents using a large dictionary (> 300K keywords) and fuzzy matching algorithms. This package does not implement any string distance algorithm itself; it imports and leverages external libraries such as spellwise, pysimstring and nltk. Its algorithmic complexity is O(n log(m)), where n is the number of tokens in a document and m is the size of the dictionary. The formalization of the algorithm is available in this paper.
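To give an intuition for why a dictionary lookup can scale to a large number of keywords, here is a minimal sketch of exact matching with a token-level trie. It is an illustration under stated assumptions, not the package's implementation: the names `TrieNode`, `build_trie` and `annotate` are hypothetical, tokens are split on whitespace, and there is no fuzzy matching, stopword handling or window. The point is that each token is looked up in a small map of children instead of being compared against every keyword.

```python
from typing import Dict, List, Optional, Tuple


class TrieNode:
    def __init__(self) -> None:
        self.children: Dict[str, "TrieNode"] = {}
        self.keyword: Optional[str] = None  # set on nodes that end a keyword


def build_trie(keywords: List[str]) -> TrieNode:
    """Store each keyword as a path of lowercased tokens."""
    root = TrieNode()
    for keyword in keywords:
        node = root
        for token in keyword.lower().split():
            node = node.children.setdefault(token, TrieNode())
        node.keyword = keyword
    return root


def annotate(tokens: List[str], root: TrieNode) -> List[Tuple[int, int, str]]:
    """Return (first_token_index, last_token_index, keyword) for every match."""
    matches = []
    for start in range(len(tokens)):
        node = root
        for end in range(start, len(tokens)):
            node = node.children.get(tokens[end].lower())
            if node is None:
                break  # no keyword continues with this token
            if node.keyword is not None:
                matches.append((start, end, node.keyword))
    return matches


trie = build_trie(["North America", "South America", "America"])
tokens = "She moved from South America to North America".split()
matches = annotate(tokens, trie)
print(matches)
# [(3, 4, 'South America'), (4, 4, 'America'),
#  (6, 7, 'North America'), (7, 7, 'America')]
```

In this sketch the inner loop stops as soon as no keyword continues with the current token, so the work per starting position is bounded by the length of the longest keyword rather than by the size of the dictionary.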
The algorithm was initially developed in Java (https://github.com/scossin/IAMsystem). It has participated in several semantic annotation competitions in the medical field, where it obtained satisfactory results, including the best results in the Codiesp shared task. A dictionary-based model can achieve performance close to that of a transformer-based model when the task is simple or when the training set is small. Its main advantage is its speed, which allows a baseline to be generated quickly.
Citation
@article{cossin_iam_2018,
title = {{IAM} at {CLEF} {eHealth} 2018: {Concept} {Annotation} and {Coding} in {French} {Death} {Certificates}},
shorttitle = {{IAM} at {CLEF} {eHealth} 2018},
url = {http://arxiv.org/abs/1807.03674},
urldate = {2018-07-11},
journal = {arXiv:1807.03674 [cs]},
author = {Cossin, Sébastien and Jouhet, Vianney and Mougin, Fleur and Diallo, Gayo and Thiessard, Frantz},
month = jul,
year = {2018},
note = {arXiv: 1807.03674},
keywords = {Computer Science - Computation and Language},
}