
iKnow Natural Language Processing engine

iKnow

iKnow is a library for Natural Language Processing that identifies entities (phrases) and their semantic context in natural language text in English, German, Dutch, French, Spanish, Portuguese, Swedish, Russian, Ukrainian, Czech and Japanese. It was originally developed by i.Know in Belgium, which InterSystems acquired in 2010 to embed the technology in its Caché and IRIS Data Platform products. InterSystems published the iKnow engine as open source in 2020.

Getting started with iKnow

This readme file has everything you need to get started, but make sure you click through to the wiki for more details on any of these subjects.

Using iKnow

From Python

The easiest way to see for yourself what iKnow does with text is by giving it a try! Thanks to our Python interface, that only takes two simple steps:

  1. Use pip to install the iknowpy module as follows:

    pip install iknowpy
    
  2. From your Python prompt, instantiate the engine and start indexing:

    import iknowpy
    
    engine = iknowpy.iKnowEngine()
    
    # show supported languages
    print(engine.get_languages_set())
    
    # index some text
    text = 'This is a test of the Python interface to the iKnow engine.'
    engine.index(text, 'en')
    
    # print the raw results
    print(engine.m_index)
    
    # or make it a little nicer
    for s in engine.m_index['sentences']:
        for e in s['entities']:
            print('<'+e['type']+'>'+e['index']+'</'+e['type']+'>', end=' ')
        print('\n')
    

If you are looking for another programming language or interface, check out the other APIs. For more on the Python interface, move on to the Getting Started section in the wiki!
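As a small follow-up to the loop above, the sketch below wraps the same formatting in a reusable function. It relies only on the `sentences`, `entities`, `type` and `index` keys used in the example. The `render_sentences` name and the hand-written `sample_index` dictionary (including its type labels) are illustrative assumptions that stand in for a real `engine.m_index`, so the snippet runs without the engine installed.

```python
def render_sentences(index):
    """Return one tagged string per sentence, e.g. '<Concept>text</Concept> ...'."""
    lines = []
    for sentence in index['sentences']:
        tagged = [
            f"<{e['type']}>{e['index']}</{e['type']}>"
            for e in sentence['entities']
        ]
        lines.append(' '.join(tagged))
    return lines


# Hypothetical stand-in for engine.m_index; the values are illustrative only
# and were not produced by the engine.
sample_index = {
    'sentences': [
        {'entities': [
            {'type': 'Concept', 'index': 'this'},
            {'type': 'Relation', 'index': 'is'},
            {'type': 'Concept', 'index': 'a test'},
        ]},
    ],
}

for line in render_sentences(sample_index):
    print(line)
```

With the engine installed, passing `engine.m_index` instead of `sample_index` should give the same kind of output as the loop in step 2.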

From C++

The main C++ API file is engine.h, defining the class iKnowEngine with the main entry point:

index(TextSource, language)

After indexing, all data is stored in iknowdata::Text_Source m_index. "iknowdata" is the namespace used for all classes that contain output data. For more details, please refer to the API overview on the wiki.

From InterSystems IRIS

For many years, the iKnow engine has been available as an embedded service on the InterSystems IRIS Data Platform. The obvious advantage of packaging it with a database is that indexing results from many documents can be stored in a single repository, enabling corpus-wide analytics through practical APIs. See the iKnow documentation for IRIS or browse the InterSystems Developer Community's articles on setting up an iKnow domain, browsing it and using iFind (iKnow-powered text search).

The InterSystems IRIS Community Edition is available from Docker Hub free of charge.

From Different Platforms

Since version 1.3, a C interface has been available, enabling communication with the iKnow engine in a JSON-encoded request/response style:

const char* j_response;
iknow_json(R"({"method" : "index", "language" : "en", "text_source" : "Hello World"})", &j_response);

Most API functionality is available in a serialized JSON format.
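For callers on other platforms, the request string can be assembled with any JSON library rather than written by hand. The Python sketch below builds the same `index` request shown in the C example, using only the field names that appear there (`method`, `language`, `text_source`). The `build_index_request` helper is a hypothetical name, and the sketch deliberately stops short of calling `iknow_json` or assuming anything about the response format.

```python
import json


def build_index_request(text, language='en'):
    """Serialize an 'index' request using the field names from the C example."""
    return json.dumps({
        'method': 'index',
        'language': language,
        'text_source': text,
    })


request = build_index_request('Hello World')
print(request)

# Round-trip check: the string parses back into the same fields.
assert json.loads(request)['method'] == 'index'
```

Building the string through a serializer also takes care of escaping quotes and non-ASCII characters in the text to be indexed.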

Understanding iKnow

Entities

iKnow identifies phrase boundaries that define Entities based entirely on the syntactic structure of the sentences, rather than relying on an upfront dictionary or pretrained model. This makes iKnow well-suited for initial exploration of a new corpus. iKnow Entities are not Named Entities in the NER sense, but rather the word groups that need to be considered together, each representing in its entirety a concept or relationship as coined by the text author. The following examples show why this phrase level matters for fully capturing what the author meant:

iKnow Entity                      Meaning
Dopamine                          small molecule
Dopamine receptor                 drug target
Dopamine receptor antagonist      chemical drug
Dopamine receptor gene            gene, molecular sequence
Dopamine receptor gene mutation   physiological process

iKnow will label every entity with a simple role that is either concept (usually corresponding to Noun Phrases in POS lingo) or relation (verbs, prepositions, ...). Typical stop words that have little meaning of their own get categorized as PathRelevant (e.g. pronouns) or NonRelevant parts, depending on whether they play a role in the sentence structure or are just linguistic fodder.

In the following sample sentence, we've highlighted concepts, relations and PathRelevants separately.

Belgian geuze is well-known across the continent for its delicate balance.
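To make the role labels more concrete, the sketch below groups a list of entities by role. The entity list is hand-labelled for illustration and is not engine output: both the split of the sample sentence into phrases and the label strings (`Concept`, `Relation`, `PathRelevant`) are assumptions, as is the `group_by_role` helper name. Real results may differ.

```python
from collections import defaultdict


def group_by_role(entities):
    """Map each role label to the list of entity texts carrying that label."""
    groups = defaultdict(list)
    for entity in entities:
        groups[entity['type']].append(entity['index'])
    return dict(groups)


# Hand-labelled guess at the sample sentence (NOT produced by the engine).
sample_entities = [
    {'type': 'Concept', 'index': 'belgian geuze'},
    {'type': 'Relation', 'index': 'is well-known across'},
    {'type': 'Concept', 'index': 'the continent'},
    {'type': 'Relation', 'index': 'for'},
    {'type': 'PathRelevant', 'index': 'its'},
    {'type': 'Concept', 'index': 'delicate balance'},
]

for role, texts in group_by_role(sample_entities).items():
    print(role, texts)
```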


CRC's

As of v1.4, the iKnow engine also produces Concept-Relation-Concept clusters (aka CRC's).
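The idea of a Concept-Relation-Concept cluster can be illustrated with a deliberately naive sliding window over a sentence's entities. This is not the engine's algorithm, which this readme does not describe. The `naive_crcs` function, the label strings and the sample entities are all hypothetical.

```python
def naive_crcs(entities):
    """Collect (concept, relation, concept) triples from adjacent entities.

    A toy approximation: it only looks at three directly adjacent entities
    and ignores everything the real engine may take into account.
    """
    triples = []
    for left, middle, right in zip(entities, entities[1:], entities[2:]):
        roles = (left['type'], middle['type'], right['type'])
        if roles == ('Concept', 'Relation', 'Concept'):
            triples.append((left['index'], middle['index'], right['index']))
    return triples


# Hypothetical, hand-labelled entities (not engine output).
sample_entities = [
    {'type': 'Concept', 'index': 'patient'},
    {'type': 'Relation', 'index': 'reported'},
    {'type': 'Concept', 'index': 'chest pain'},
    {'type': 'Relation', 'index': 'after'},
    {'type': 'Concept', 'index': 'exercise'},
]

for triple in naive_crcs(sample_entities):
    print(triple)
```

Note that the middle concept takes part in two triples, once as the tail of the first and once as the head of the second.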


Attributes

Beyond this simple phrase recognition, iKnow also captures the context of these entities through semantic attributes. Attributes label spans (of entities) within a sentence that share a semantic context. Most attributes start from a marker term and are then, through linguistic rules, expanded left and right as appropriate per the syntactic structure of the sentence. iKnow's main contribution is in this fine-grained expansion, which has been shown to be more accurate than many ML-based techniques.

iKnow supports the following attribute types:

  • Negation: iKnow tags all entities participating in a negation, as opposed to an (implied) affirmative context.

    After discussing his nausea, the [patient didn't report suffering from chest pain, shortness of breath or tickling].

  • Sentiment: based on a user-supplied list of marker terms, iKnow will identify spans with either a positive or negative sentiment (through separate attributes). Overlapping negation attributes will reverse the sentiment in some language models.

    [I liked the striped pyjamas], but the [slippers didn't really fit with it].

  • Measurements, Time, Frequency and Duration: all entities "participating" in an expression of something measurable or time-related will be tagged, enabling efficient recognition of facts in long stretches of natural language text.

    Upon exam [two weeks ago] the [patient's weight was 146.5 pounds].

  • Certainty: this attribute is a work in progress. See the corresponding wiki section for more details.

Some attributes are not available for all languages yet. See the wiki section for more details.
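The interaction between sentiment and negation described above can be pictured with a toy model in which each attribute covers an inclusive range of entity positions within a sentence. The sketch below flips a sentiment's polarity when its span overlaps a negation span. The span representation and both function names are assumptions made for illustration, not the engine's data model, and as noted above the reversal only applies in some language models.

```python
def spans_overlap(a, b):
    """True when two inclusive (start, end) position ranges share a position."""
    return a[0] <= b[1] and b[0] <= a[1]


def effective_sentiment(sentiment_span, polarity, negation_spans):
    """Flip 'positive'/'negative' when the sentiment span overlaps a negation."""
    if any(spans_overlap(sentiment_span, n) for n in negation_spans):
        return 'negative' if polarity == 'positive' else 'positive'
    return polarity


# Hypothetical spans: one negation covering entity positions 4 through 7.
negations = [(4, 7)]

print(effective_sentiment((0, 2), 'positive', negations))  # no overlap: positive
print(effective_sentiment((5, 6), 'positive', negations))  # overlap: negative
```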

How it works

Some InterSystems-era resources on how iKnow works:

Read more...

Building the iKnow Engine

The source code for the iKnow engine is written in C++ and includes .sln files for building with Microsoft Visual Studio 2019 Community Edition and Makefiles for building in Linux/Unix.

Please refer to this wiki page for more on the overall build process.

Contributing to iKnow

You are welcome to contribute to iKnow's engine code and language models. Check out the Wiki for more details on how they work and the Issues and Projects sections for any particular work on the horizon.


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distributions

No source distribution files are available for this release.

Built Distributions

iknowpy-1.5.3-cp38.cp39.cp310.cp311-cp38.cp39.cp310.cp311-macosx_10_9_universal2.whl (79.2 MB)

CPython 3.8, 3.9, 3.10, 3.11; macOS 10.9+ universal2 (ARM64, x86-64)

iknowpy-1.5.3-cp37.cp38.cp39.cp310.cp311-cp37m.cp38.cp39.cp310.cp311-win_amd64.whl (37.9 MB)

CPython 3.7m, 3.8, 3.9, 3.10, 3.11; Windows x86-64

iknowpy-1.5.3-cp37.cp38.cp39.cp310.cp311-cp37m.cp38.cp39.cp310.cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (42.8 MB)

CPython 3.7m, 3.8, 3.9, 3.10, 3.11; manylinux: glibc 2.17+ x86-64

iknowpy-1.5.3-cp37.cp38.cp39.cp310.cp311-cp37m.cp38.cp39.cp310.cp311-manylinux_2_17_ppc64le.manylinux2014_ppc64le.whl (48.1 MB)

CPython 3.7m, 3.8, 3.9, 3.10, 3.11; manylinux: glibc 2.17+ ppc64le

iknowpy-1.5.3-cp37.cp38.cp39.cp310.cp311-cp37m.cp38.cp39.cp310.cp311-manylinux_2_17_aarch64.manylinux2014_aarch64.whl (42.5 MB)

CPython 3.7m, 3.8, 3.9, 3.10, 3.11; manylinux: glibc 2.17+ ARM64

iknowpy-1.5.3-cp37-cp37m-macosx_10_9_x86_64.whl (39.4 MB)

CPython 3.7m; macOS 10.9+ x86-64

File hashes

iknowpy-1.5.3-cp38.cp39.cp310.cp311-cp38.cp39.cp310.cp311-macosx_10_9_universal2.whl

    SHA256       7df000a226b7bbc411c080e14388dccd1880118f1a7762415dee1d139fac7d2f
    MD5          5b7a4377077dc8a45cc065128cd76ca8
    BLAKE2b-256  193ac8043f798bbdd7f2b868f1014bd4d7a9bd121c33d76ec9ecf45b396dd1ce

iknowpy-1.5.3-cp37.cp38.cp39.cp310.cp311-cp37m.cp38.cp39.cp310.cp311-win_amd64.whl

    SHA256       7f8efd8334e704f30e391fb5f4904fdb3b71aed2b81e0e75a2a458ee0ef45462
    MD5          e961f3c3979a13e49cbfe6cb1f216cfe
    BLAKE2b-256  8b7770314e97bf6c9d5bf6cd841d20b5d23c32a064618501b0cf2801ff207148

iknowpy-1.5.3-cp37.cp38.cp39.cp310.cp311-cp37m.cp38.cp39.cp310.cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl

    SHA256       10ee4a169f139d15b9684ae3f2a3b3ec2484dd63ceebc8431b95969edc9d5963
    MD5          56a8870999d0de56dfae0d2167f6a7a8
    BLAKE2b-256  69cc495c3a29e56f77dd861978b9bf2f29df9cdabefa8d67c658d6474a9e7542

iknowpy-1.5.3-cp37.cp38.cp39.cp310.cp311-cp37m.cp38.cp39.cp310.cp311-manylinux_2_17_ppc64le.manylinux2014_ppc64le.whl

    SHA256       30c3987aa0f97af66800b2e15165204b3d612d095176ca9ee2ad78da3b0ae98f
    MD5          86bbda480dcc6b5dfdfd6ed81eb8bc5e
    BLAKE2b-256  e747d0ebeb8b65c338a8fbd5b0049849422be1ea51531a0fc551aa022d4d7eee

iknowpy-1.5.3-cp37.cp38.cp39.cp310.cp311-cp37m.cp38.cp39.cp310.cp311-manylinux_2_17_aarch64.manylinux2014_aarch64.whl

    SHA256       df66196cd0ce54817baf62a5b325b9532c25f8d17ef670f05304be9803fc36c3
    MD5          e55df084f5b0f6476a2a2a894aff6233
    BLAKE2b-256  95034535bea1663e93c29912f6e9c13bf4c034360dd5a4ea06a0f5fea8ed064a

iknowpy-1.5.3-cp37-cp37m-macosx_10_9_x86_64.whl

    SHA256       e272f6b87f44f5eb7055abace26c10b67a6aaeda8ee807cc14f6ad0308ce830a
    MD5          bed8f57e7927470778465186c9138666
    BLAKE2b-256  71505cb8560c6944ccf9374152e2a9d6743759cbcd944138722e90ad7b98adce
