iamsystem


A Python implementation of the IAMsystem algorithm, a fast dictionary-based approach for semantic annotation, a.k.a. entity linking.

Installation

pip install iamsystem

Usage

You provide a list of keywords you want to detect in a document. You can add and combine abbreviations, normalization methods (lemmatization, stemming) and approximate string matching algorithms. The IAMsystem algorithm then performs the semantic annotation.

See the documentation for the configuration details.

Quick example

from iamsystem import (
    Matcher,
    Abbreviations,
    SpellWiseWrapper,
    ESpellWiseAlgo,
)
matcher = Matcher()
# add a list of words to detect
matcher.add_labels(labels=["North America", "South America"])
matcher.add_stopwords(words=["and"])
# add a list of abbreviations (optional)
abbs = Abbreviations(name="common abbreviations")
abbs.add(short_form="amer", long_form="America", tokenizer=matcher)
matcher.add_fuzzy_algo(fuzzy_algo=abbs)
# add a string distance algorithm (optional)
levenshtein = SpellWiseWrapper(
    ESpellWiseAlgo.LEVENSHTEIN, max_distance=1
)
levenshtein.add_words(words=matcher.get_keywords_unigrams())
matcher.add_fuzzy_algo(fuzzy_algo=levenshtein)
# perform semantic annotation:
annots = matcher.annot_text(text="Northh and south Amer.", w=2)
for annot in annots:
    print(annot)
# Northh Amer	0 6;17 21	North America
# south Amer	11 21	South America
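
To see why `max_distance=1` lets "Northh" reach "North" while "Amer" still needs the abbreviation, here is a minimal plain-Python sketch of Levenshtein filtering. It is only an illustration: the package itself delegates string distances to spellwise, and the helper names `levenshtein` and `candidates`, as well as the lowercasing step, are assumptions made for this example.

```python
from typing import List


def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance between two strings."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            substitution = previous[j - 1] + (char_a != char_b)
            insertion = current[j - 1] + 1
            deletion = previous[j] + 1
            current.append(min(substitution, insertion, deletion))
        previous = current
    return previous[-1]


def candidates(token: str, vocabulary: List[str], max_distance: int = 1) -> List[str]:
    """Return the vocabulary words within max_distance edits of the token.

    Lowercasing both sides is an assumption of this sketch.
    """
    token = token.lower()
    return [w for w in vocabulary if levenshtein(token, w.lower()) <= max_distance]


vocabulary = ["north", "south", "america"]
print(candidates("Northh", vocabulary))  # ['north']
print(candidates("Amer", vocabulary))    # []
```

"Amer" is three insertions away from "america", so a distance threshold of 1 does not recover it. In the example above, that match comes from the abbreviation instead.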

Algorithm

The algorithm was developed in the context of a PhD thesis. It offers a way to quickly annotate documents using a large dictionary (> 300K keywords) and fuzzy matching algorithms. This package does not implement any string distance algorithm itself; it imports and leverages external libraries such as spellwise and nltk. Its algorithmic complexity is O(n log(m)), where n is the number of tokens in a document and m is the size of the dictionary. The formalization of the algorithm is available in this paper.
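
The general idea of dictionary-based annotation can be sketched with a trie keyed on tokens, so that each token of the document triggers only a few lookups regardless of how many keywords are stored. The sketch below is a simplified illustration, not the package's implementation. It assumes whitespace tokenization and lowercasing, uses hypothetical names (`TrieNode`, `build_trie`, `annotate`), and leaves out fuzzy matching, stopwords and the window parameter.

```python
from typing import Dict, List, Optional, Tuple


class TrieNode:
    def __init__(self) -> None:
        self.children: Dict[str, "TrieNode"] = {}
        self.label: Optional[str] = None  # set when a keyword ends at this node


def build_trie(labels: List[str]) -> TrieNode:
    """Store each keyword as a path of lowercase tokens."""
    root = TrieNode()
    for label in labels:
        node = root
        for token in label.lower().split():
            node = node.children.setdefault(token, TrieNode())
        node.label = label
    return root


def annotate(text: str, root: TrieNode) -> List[Tuple[int, int, str]]:
    """Return (first_token_index, last_token_index, label) for each longest match."""
    tokens = text.lower().split()
    annotations = []
    i = 0
    while i < len(tokens):
        node, j, last_match = root, i, None
        # Walk the trie as long as the next token continues a known keyword.
        while j < len(tokens) and tokens[j] in node.children:
            node = node.children[tokens[j]]
            if node.label is not None:
                last_match = (i, j, node.label)
            j += 1
        if last_match is not None:
            annotations.append(last_match)
            i = last_match[1] + 1  # resume after the matched keyword
        else:
            i += 1
    return annotations


trie = build_trie(["North America", "South America", "America"])
print(annotate("she moved from south america to north america", trie))
# [(3, 4, 'South America'), (6, 7, 'North America')]
```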

The algorithm was initially developed in Java (https://github.com/scossin/IAMsystem) and has taken part in several semantic annotation competitions in the medical domain, where it obtained very satisfactory results.

Citation

@article{cossin_iam_2018,
	title = {{IAM} at {CLEF} {eHealth} 2018: {Concept} {Annotation} and {Coding} in {French} {Death} {Certificates}},
	shorttitle = {{IAM} at {CLEF} {eHealth} 2018},
	url = {http://arxiv.org/abs/1807.03674},
	urldate = {2018-07-11},
	journal = {arXiv:1807.03674 [cs]},
	author = {Cossin, Sébastien and Jouhet, Vianney and Mougin, Fleur and Diallo, Gayo and Thiessard, Frantz},
	month = jul,
	year = {2018},
	note = {arXiv: 1807.03674},
	keywords = {Computer Science - Computation and Language},
}

Changelog

0.1.1

  • Initial release
