Utility for dictionary-based named entity recognition

pilsner

A Python library for dictionary-based named entity recognition

1. Purpose

This library is a Python implementation of a toolkit for dictionary-based named entity recognition. It is intended to store any thesaurus in a trie-like structure and to identify any of the stored synonyms in a string.
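To make the idea concrete, here is a minimal pure-Python sketch of storing synonyms in a trie and spotting them in a string. It is an illustration of the general technique only, not pilsner's internal data structure; the names build_trie and find_entities and the '_ids' marker key are hypothetical, and the sketch ignores word boundaries and normalization.

```python
# Illustrative sketch only; not pilsner's internal implementation.

def build_trie(synonyms):
    """Store (entity_id, synonym) pairs character by character in nested dicts."""
    root = {}
    for entity_id, synonym in synonyms:
        node = root
        for char in synonym:
            node = node.setdefault(char, {})
        # '_ids' is a hypothetical marker key; single characters never collide with it
        node.setdefault('_ids', set()).add(entity_id)
    return root


def find_entities(trie, text):
    """Return {(begin, end): entity_ids}, taking the longest match at each position."""
    found = {}
    position = 0
    while position < len(text):
        node = trie
        longest = None
        for end in range(position, len(text)):
            node = node.get(text[end])
            if node is None:
                break
            if '_ids' in node:
                longest = (end + 1, node['_ids'])
        if longest is None:
            position += 1
        else:
            found[(position, longest[0])] = longest[1]
            position = longest[0]
    return found


thesaurus = [('E1', 'acetylsalicylic acid'), ('E1', 'aspirin'), ('E2', 'acid')]
trie = build_trie(thesaurus)
matches = find_entities(trie, 'aspirin is acetylsalicylic acid')
```

Here matches maps character spans to sets of entity IDs, so both 'aspirin' and 'acetylsalicylic acid' resolve to the same entity.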

2. Installation and dependencies

pip install pilsner

pilsner is tested in Python 3.6, 3.7, and 3.8.

The only dependency is the sic package. While it can be installed automatically when pilsner is installed, installing sic manually beforehand might also be considered (see the benchmark of the cythonized vs. pure Python implementation in the sic documentation, https://pypi.org/project/sic/).

3. Diagram

pilsner consists of two major components: Model and Utility. The Model class provides storage for the dictionary and string normalization rules, as well as low-level methods for populating this storage. The Utility class provides high-level methods for storing data in and retrieving data from a Model instance.

Diagram

4. Usage

import pilsner

4.1. Initialize model

  • To initialize an empty model:
m = pilsner.Model()
  • To specify the path to a temporary database for an empty model:
m = pilsner.Model(storage_location='path/to/database.file')
  • To create an empty model that uses a database created in memory rather than on disk:
m = pilsner.Model(storage_location=':memory:')
  • To create an empty model that does not store any attributes in a database at all:
m = pilsner.Model(simple=True)

If the database is created in memory, the model cannot be saved to disk later; it can only be used in the current session.

  • To load a model from disk:
m = pilsner.Model(filename='path/to/model')

For more on how a model is saved to and loaded from disk, see 4.6. Save model and 4.7. Load model.

4.2. Add string normalization units

  • Depending on the dictionary and the nature of the text to be parsed, string normalization might not be required at all; in that case, nothing needs to be done here.
  • Without string normalization, synonyms from the dictionary are stored as they are, and the recognizer looks them up case-sensitively.
  • To add a single normalization unit:
# Assuming m is pilsner.Model instance:
m.add_normalizer(
    normalizer_name='normalizer_tag',
    filename='path/to/normalizer_config.xml'
)

String normalization is performed by the sic component. See the sic documentation at https://pypi.org/project/sic/ to learn how to design a normalizer config.

  • A model can embed more than one normalization unit.
  • The default normalization unit for the model is the one added first, or the most recently added one with the default parameter set to True.
  • Having multiple normalization units in one model makes sense when the stored dictionary contains synonyms of different kinds that should be normalized in different ways (for example, abbreviations probably should not be normalized at all, while other synonyms might include tokens or punctuation marks that should not affect entity recognition). For that purpose, the Model class includes the normalizer_map dict, which maps values in a specific dictionary field that designates how a synonym should be normalized (the tokenizer field, or tokenizer column) to names of added normalization units:
# Assuming m is pilsner.Model instance:
m.normalizer_map = {
    'synonym_type_1': 'normalizer_1',
    'synonym_type_2': 'normalizer_2'
}

The snippet above instructs pilsner to normalize synonyms that have the value synonym_type_1 in the tokenizer column with the normalizer_1 normalization unit, and synonyms that have the value synonym_type_2 in the tokenizer column with the normalizer_2 normalization unit. For more about fields in a dictionary, see 4.4. Define dictionary.
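The mapping logic can be pictured with a small stand-alone sketch. It does not call pilsner or sic: the two lambda functions are hypothetical stand-ins for real normalization units, DEFAULT_NORMALIZER and normalize_synonym are illustrative names, and the fallback to the default unit for unmapped values follows the behaviour described in 4.4. Define dictionary.

```python
# Hypothetical stand-ins for sic-based normalization units (assumption, not pilsner API).
normalizers = {
    'normalizer_1': lambda s: ' '.join(s.lower().split()),  # lowercase, collapse whitespace
    'normalizer_2': lambda s: s,                            # leave abbreviations untouched
}

# Same shape as the normalizer_map shown above: tokenizer-column value -> unit name.
normalizer_map = {
    'synonym_type_1': 'normalizer_1',
    'synonym_type_2': 'normalizer_2',
}

# Assumed default unit: the one added first.
DEFAULT_NORMALIZER = 'normalizer_1'


def normalize_synonym(synonym, tokenizer_value):
    """Pick the unit mapped to the tokenizer value, or the default, and apply it."""
    unit_name = normalizer_map.get(tokenizer_value, DEFAULT_NORMALIZER)
    return normalizers[unit_name](synonym)


full_name = normalize_synonym('Acetylsalicylic  Acid', 'synonym_type_1')
abbreviation = normalize_synonym('ASA', 'synonym_type_2')
unmapped = normalize_synonym('Aspirin', 'some_other_type')
```

With these stand-ins, the full name is lowercased, the abbreviation is left as it is, and the unmapped type falls back to the default unit.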

4.3. Initialize utility

  • To load a dictionary into a Model instance, as well as to parse text, a Utility instance is required:
r = pilsner.Utility()

4.4. Define dictionary

  • The source dictionary for pilsner must be a delimited text file.
  • Along with the source dictionary, specifications of the columns (fields) must be provided as a list where each item corresponds to a column (from left to right). Each item in this list must be a dict object with string keys name, include, delimiter, id_flag, normalizer_flag, and value_flag, so that:
    • field['name'] is a string with the column title;
    • field['include'] is a boolean that must be set to True for the column to be included in the model, otherwise False;
    • field['delimiter'] is a string used to split a single cell into a list of values if the column holds concatenated lists rather than individual values;
    • field['id_flag'] is a boolean that must be set to True if the column is to be used for grouping synonyms (generally, the entity ID is such a column), otherwise False;
    • field['normalizer_flag'] is a boolean that must be set to True if the column indicates which normalization unit must be applied to this particular synonym, otherwise False;
    • field['value_flag'] is a boolean that must be set to True if the column holds synonyms that are to be looked up when parsing a text, otherwise False.

If the dictionary has a column flagged with normalizer_flag, the synonym in each row will be normalized with the string normalization unit whose name is mapped to the value in this column by the pilsner.Model.normalizer_map dict. If the value is not among the pilsner.Model.normalizer_map keys, the default normalization unit will be used.
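As an illustration, the field specifications for a hypothetical four-column file (synonym type, entity ID, label, comment) might look like the sketch below. The column names, the make_field and check_fields helpers, and the use of None as the delimiter for columns that hold single values are assumptions made for this example, not something the text above prescribes.

```python
REQUIRED_KEYS = {'name', 'include', 'delimiter', 'id_flag', 'normalizer_flag', 'value_flag'}


def make_field(name, include=True, delimiter=None,
               id_flag=False, normalizer_flag=False, value_flag=False):
    """Hypothetical helper: build one column specification with all six keys."""
    return {
        'name': name,
        'include': include,
        'delimiter': delimiter,  # assumption: None when cells hold single values
        'id_flag': id_flag,
        'normalizer_flag': normalizer_flag,
        'value_flag': value_flag,
    }


def check_fields(fields):
    """Raise ValueError if any column specification lacks a required key."""
    for position, field in enumerate(fields):
        missing = REQUIRED_KEYS - set(field)
        if missing:
            raise ValueError('column %d lacks keys: %s' % (position, sorted(missing)))
    return True


# One item per column, from left to right.
fields = [
    make_field('type', normalizer_flag=True),  # selects the normalization unit
    make_field('entity_id', id_flag=True),     # groups synonyms of one entity
    make_field('label', value_flag=True),      # synonyms to look up in text
    make_field('comment', include=False),      # ignored by the model
]
check_fields(fields)
```

Building the list through one helper keeps every item complete, so a missing key is caught before the dictionary is compiled.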

4.5. Compile model

  • To store a dictionary in a Model instance, the compile_model method of a Utility instance must be called with the following required parameters:
    • model: reference to an initialized Model instance;
    • filename: string with the path and filename of the source dictionary;
    • fields: list of dict objects with definitions of columns (see 4.4. Define dictionary);
    • word_separator: string defining what is to be considered a word separator (generally, it should be whitespace);
    • column_separator: string defining what is to be considered a column separator (e.g. \t for a tab-delimited file);
    • column_enclosure: string defining what is to be stripped from a cell after the row has been split into columns (typically, it should be \n so that the newline character is trimmed from the rightmost column).
# Assuming m is pilsner.Model instance and r is pilsner.Utility instance:
r.compile_model(
    model=m,
    filename='path/to/dictionary_in_a_text_file.txt',
    fields=fields,
    word_separator=' ',
    column_separator='\t',
    column_enclosure='\n'
)
  • To review optional parameters, see comments in the code.
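The roles of column_separator, column_enclosure, and a per-field delimiter can be sketched in plain Python as follows. This is only an approximation of the described behaviour under stated assumptions: split_row and split_cell are hypothetical helpers, not pilsner functions, and str.strip is used here as one plausible reading of "stripped away from cell".

```python
def split_row(line, column_separator='\t', column_enclosure='\n'):
    """Split a raw line into cells, then strip the enclosure from each cell."""
    return [cell.strip(column_enclosure) for cell in line.split(column_separator)]


def split_cell(cell, delimiter):
    """Split a cell holding a concatenated list; keep single values as one item."""
    return cell.split(delimiter) if delimiter else [cell]


row = split_row('E1\taspirin|ASA\n')
synonyms = split_cell(row[1], '|')
entity_ids = split_cell(row[0], None)
```

In this sketch the trailing newline is removed from the rightmost cell, and the second column is expanded into two synonyms.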

4.6. Save model

  • If a Model instance has a compiled dictionary, and if the database location for the Model instance is not explicitly set to ':memory:', the data that the instance holds can be saved to disk:
# Assuming m is pilsner.Model instance
m.save('path/to/model_name')
  • The snippet above will write the following files:
    • path/to/model_name.attributes: database with attributes (fields from the dictionary that are not synonyms) - will only be written if Model instance is not created with simple=True parameter;
    • path/to/model_name.keywords: keywords used for disambiguation;
    • path/to/model_name.normalizers: string normalization units;
    • path/to/model_name.0.dictionary: trie with synonyms;
    • path/to/model_name.<N>.dictionary: additional tries with synonyms (<N> being integer number of a trie) in case more than one trie was created (see comments in the code - pilsner.Utility.compile_model method, item_limit parameter).

4.7. Load model

  • To initialize a new Model instance using previously saved data:
m = pilsner.Model(filename='path/to/model_name')
  • Alternatively, data can be loaded into a previously initialized Model instance:
m = pilsner.Model()
m.load('path/to/model_name')
  • In both cases, the program will look for the following files:
    • path/to/model_name.attributes: database with attributes (fields from the dictionary that are not synonyms) - if not found, the Model instance will work as if it had been initialized with the simple=True parameter, meaning that no attributes other than primary IDs can be processed;
    • path/to/model_name.keywords: keywords used for disambiguation;
    • path/to/model_name.normalizers: string normalization units;
    • path/to/model_name.<N>.dictionary: tries with synonyms (<N> being integer).
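Before loading, it can be handy to check which of these files are actually present. The sketch below uses only the standard library; list_model_files is a hypothetical helper (not part of pilsner) that relies solely on the file naming listed above.

```python
import glob
import os


def list_model_files(model_path):
    """Report which files of a saved model exist for the given base path."""
    found = {
        suffix: os.path.exists(model_path + '.' + suffix)
        for suffix in ('attributes', 'keywords', 'normalizers')
    }
    # One or more tries: <model_path>.<N>.dictionary
    found['dictionary'] = sorted(glob.glob(glob.escape(model_path) + '.*.dictionary'))
    return found


status = list_model_files('path/to/model_name')
```

If status['attributes'] is False while the other files exist, the loaded model would be expected to behave as described above for simple=True.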

4.8. Parse string

  • To parse a string without filtering out any synonyms and to output all attributes of the spotted entities:
# Assuming m is pilsner.Model instance, r is pilsner.Utility instance,
# and text_to_parse is string to parse
parsed = r.parse(
    model=m,
    source_string=text_to_parse
)
  • The output will be a dict object where keys are tuples with the location of a spotted entity in the string (begin, end) and values are dicts with attributes associated with the identified entity ({'attribute_name': {attribute_values}}).
  • To ignore an entity by its label rather than by some of its attributes, the compiled model can be adjusted using the pilsner.Utility.ignore_node() method:
# Assuming m is pilsner.Model instance, r is pilsner.Utility instance
r.ignore_node(
    model=m,
    label='irrelevant substring'
)
# substring 'irrelevant substring' will not be found by pilsner.Utility.parse()
# even if it is present in the model
  • For details about optional parameters, see comments in the code - pilsner.Utility.parse() method.
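As a sketch of how the output described above might be consumed, the snippet below walks over a hand-written dict of the same shape. The sample data, the attribute name entity_id, and the spotted_entities helper are hypothetical, and the code assumes that end is exclusive (usable directly as a slice bound), which the text above does not state.

```python
# Hand-written sample shaped like the described output; not produced by pilsner.
text_to_parse = 'aspirin is acetylsalicylic acid'
parsed = {
    (11, 31): {'entity_id': {'E1'}},
    (0, 7): {'entity_id': {'E1'}},
}


def spotted_entities(parsed, source_string):
    """Return (substring, begin, end, attributes) tuples ordered by position."""
    return [
        (source_string[begin:end], begin, end, attributes)
        for (begin, end), attributes in sorted(parsed.items())
    ]


rows = spotted_entities(parsed, text_to_parse)
for substring, begin, end, attributes in rows:
    print(substring, begin, end, attributes)
```

Sorting the keys gives the entities in reading order, which is convenient when annotating or highlighting the source text.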

5. Example

Everything described above is put together in the example code; see the /misc/example/ directory in the project's repository.
