Find personally identifiable information in German texts using NER and rule-based matching.

NERwhal

A multi-lingual suite for named-entity recognition in Python.



Disclaimer: This is a prototype. Do not use for anything critical.

Description

NERwhal's mission is to make defining custom recognizers for different NER approaches as easy as possible. To achieve this, different NER backends are implemented behind a unified API. Each recognizer is based on one of the backends. Users can detect named entities by implementing custom recognizers for one or more of the backends.

Check out our blog post about NERwhal on Medium.

Powerful NER backends

NERwhal makes use of some of the most powerful NER platforms out there:

  • Regular expressions: Using regular expressions you can define a named entity as a set of strings.
  • Entity Ruler: spaCy’s Entity Ruler lets you define patterns for sequences of tokens. (spaCy is also used for tokenization)
  • FlashText: The FlashText Algorithm can search texts very efficiently for long lists of keywords.
  • Deep Learning: The Stanza library and models (which provide state-of-the-art results for NER in many languages) power NERwhal's statistical recognition. Currently, Stanza supports NER for 8 languages.
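To illustrate the regular expression approach from the list above, here is a minimal sketch that uses only Python's standard `re` module. It does not use NERwhal's own API: the pattern `EMAIL_PATTERN` and the helper `find_matches` are hypothetical names chosen for this example, and the pattern is deliberately simple.

```python
import re

# Illustrative pattern only; it is intentionally simple and is not taken from NERwhal.
EMAIL_PATTERN = re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+")


def find_matches(text, pattern, tag):
    """Return (start_char, end_char, tag, text) tuples for every regex match."""
    return [(m.start(), m.end(), tag, m.group()) for m in pattern.finditer(text)]


matches = find_matches(
    "Ich heiße Luke und meine E-Mail ist luke@skywalker.com.",
    EMAIL_PATTERN,
    "EMAIL",
)
print(matches)
```

The pattern describes the set of strings that count as an entity, and each match carries the character span at which it was found.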

Smart combination of the results

The suite can combine the results of these methods to improve the overall result. For example, a match with a higher score can override a lower-scored one, and if the same entity is identified several times, its confidence score can be increased.
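The two combination rules described above could be sketched as follows. This is not NERwhal's implementation: the `Entity` class, the `combine` function, and in particular the boost formula (treating repeated detections as independent evidence) are assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Entity:
    """Hypothetical, simplified entity record for this sketch."""
    start_char: int
    end_char: int
    tag: str
    score: float


def combine(entities):
    # Rule 1: the same span and tag found several times raises the confidence.
    merged = {}
    for ent in entities:
        key = (ent.start_char, ent.end_char, ent.tag)
        if key in merged:
            prev = merged[key].score
            # Assumed boost rule: combine scores like independent evidence.
            merged[key] = Entity(*key, 1 - (1 - prev) * (1 - ent.score))
        else:
            merged[key] = ent

    # Rule 2: among overlapping matches, the higher score wins.
    kept = []
    for ent in sorted(merged.values(), key=lambda e: e.score, reverse=True):
        overlaps = any(
            ent.start_char < k.end_char and k.start_char < ent.end_char for k in kept
        )
        if not overlaps:
            kept.append(ent)
    return sorted(kept, key=lambda e: e.start_char)


result = combine([
    Entity(10, 14, "PER", 0.8),
    Entity(10, 14, "PER", 0.5),   # duplicate detection of the same span
    Entity(12, 20, "LOC", 0.6),   # overlaps with the PER match
])
print(result)
```

In this example the duplicate PER detection raises the score of that entity, and the overlapping LOC match is dropped because its score is lower.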

Context words

Each recognizer can define a list of context words that may occur in the context of named entities. If a context word is found in the same sentence as the entity, the confidence score is increased.
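A rough sketch of the context word idea is shown below. The function `boost_with_context`, the fixed `boost` of 0.1, and the naive sentence splitting are all assumptions for this example; NERwhal's actual sentence detection and scoring may differ.

```python
import re


def boost_with_context(text, start_char, end_char, score, context_words, boost=0.1):
    """Raise the score if a context word occurs in the entity's sentence."""
    # Naive sentence boundaries (assumption): whitespace after ., ! or ?
    sentence_start = 0
    for m in re.finditer(r"(?<=[.!?])\s+", text):
        if m.end() <= start_char:
            sentence_start = m.end()      # boundary lies before the entity
        elif m.start() >= end_char:
            sentence = text[sentence_start:m.start()]
            break                         # first boundary after the entity
    else:
        sentence = text[sentence_start:]  # entity is in the last sentence

    tokens = set(re.findall(r"[\w-]+", sentence.lower()))
    if any(word.lower() in tokens for word in context_words):
        return min(1.0, score + boost)
    return score


text = "Meine E-Mail ist luke@skywalker.com. Das Wetter ist schön."
start = text.index("luke@skywalker.com")
end = start + len("luke@skywalker.com")

boosted = boost_with_context(text, start, end, 0.8, ["E-Mail"])
unchanged = boost_with_context(text, start, end, 0.8, ["Wetter"])
print(boosted, unchanged)
```

Here "E-Mail" appears in the same sentence as the entity, so the score is raised, while "Wetter" only appears in the following sentence and leaves the score as it is.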

Integrated recognizers

NERwhal follows the philosophy that recognizers are specific to the language, use case, and requirements. The recommended way to use it is to define your own custom recognizers. Still, to exemplify its usage and to help you bootstrap your own recognition suite, some example recognizers are implemented in nerwhal/integrated_recognizers. Please refer to each recognizer's PyDoc for more information, and keep in mind that none of these recognizers will catch all occurrences of their category and that they may produce false positives.

Installation

NERwhal can be installed from PyPI and has to be installed in a virtual environment (venv or conda, for instance):

pip install nerwhal

Usage

To recognize named entities, pass a text and a config object to the recognize function. The config object selects the recognizers to be used.

>>> from nerwhal import recognize, Config
>>>
>>> config = Config(language="de", use_statistical_ner=True, recognizer_paths=["nerwhal/integrated_recognizers/email_recognizer.py"])
>>>
>>> recognize("Ich heiße Luke und meine E-Mail ist luke@skywalker.com.", config=config, return_tokens=True)
{
    'tokens': [
        Token(text='Ich', has_ws=True, br_count=0, start_char=0, end_char=3),
        Token(text='heiße', has_ws=True, br_count=0, start_char=4, end_char=9),
        ...
        Token(text='.', has_ws=False, br_count=0, start_char=54, end_char=55)
    ],
    'ents': [
        NamedEntity(start_char=10, end_char=14, tag='PER', text='Luke', score=0.8, recognizer='StanzaNerBackend', start_tok=2, end_tok=3),
        NamedEntity(start_char=36, end_char=54, tag='EMAIL', text='luke@skywalker.com', score=0.95, recognizer='EmailRecognizer', start_tok=7, end_tok=8)
    ]
}

Implementing custom recognizers

To implement a custom recognizer, you have to implement one of the interfaces in recognizer_bases. For examples, see the recognizers in integrated_recognizers.
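The general shape of such a recognizer might look like the sketch below. Note that `RegexRecognizerSketch` is a hypothetical stand-in defined here so that the example is self-contained. It is not the interface from recognizer_bases, whose class names, attributes, and methods may differ, so check the integrated recognizers for the real thing.

```python
import re
from abc import ABC, abstractmethod


class RegexRecognizerSketch(ABC):
    """Hypothetical stand-in for a regex-based recognizer interface."""

    TAG = None    # entity tag assigned to every match
    SCORE = None  # confidence score assigned to every match

    @property
    @abstractmethod
    def regex(self):
        """Regular expression that describes the entity."""

    def find(self, text):
        return [
            {
                "start_char": m.start(),
                "end_char": m.end(),
                "tag": self.TAG,
                "score": self.SCORE,
                "text": m.group(),
            }
            for m in re.finditer(self.regex, text)
        ]


class PostalCodeRecognizer(RegexRecognizerSketch):
    """Example recognizer: five-digit numbers as German postal codes."""

    TAG = "PLZ"
    SCORE = 0.6  # assumed low score, since any five-digit number matches

    @property
    def regex(self):
        return r"\b\d{5}\b"


found = PostalCodeRecognizer().find("Ich wohne in 10115 Berlin.")
print(found)
```

The point of this design is that a concrete recognizer only declares its tag, score, and pattern, while the matching logic lives in the shared base.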

Development

Install requirements

You can install all (production and development) requirements using:

pip install -r requirements.txt

Install the pre-commit hooks

This repository uses git hooks to validate code quality and formatting.

pre-commit install
git config --bool flake8.strict true  # Makes the commit fail if flake8 reports an error

To run the hooks:

pre-commit run --all-files

Testing

Run all tests with:

pytest --cov-report term --cov=nerwhal

To skip tests that require the download of Stanza models run:

pytest -m "not stanza"

How to contact us

For usage questions, bugs, or suggestions, please file a GitHub issue. If you would like to contribute or have other questions, please email hello@openredact.org.

License

MIT License
