spacy-ann-linker

spaCy ANN Linker, a pipeline component for generating spaCy KnowledgeBase Alias Candidates for Entity Linking.

Documentation: https://microsoft.github.io/spacy-ann-linker

Source Code: https://github.com/microsoft/spacy-ann-linker

spaCy ANN Linker is a spaCy pipeline component that generates alias candidates for the spaCy entities in doc.ents. It also provides an optional interface for linking ambiguous aliases based on the description of each entity.

The key features are:

Easy spaCy Integration: spaCy ANN Linker provides completely serializable spaCy pipeline components that integrate directly into your existing spaCy model.
CLI for simple Index Creation: Run spacy_ann create_index with your data to create an Approximate Nearest Neighbors index, build an ann_linker pipeline component, and save a spaCy model.
Built-in Web API for easy deployment and batch Entity Linking queries.

Requirements

Python 3.6+

spaCy ANN Linker is a convenient wrapper built on a few comprehensive, high-performing packages.

Installation

$ pip install spacy-ann-linker
---> 100%
Successfully installed spacy-ann-linker

Data Prerequisites

To use spaCy ANN Linker you need pre-existing Knowledge Base data. It expects the data as two JSONL files in the same directory:

kb_dir
│   aliases.jsonl
│   entities.jsonl

To test the package, you can use the example data in examples/tutorial/data:

examples/tutorial/data
│   aliases.jsonl
│   entities.jsonl

entities.jsonl Record Format

{"id": "Canonical Entity Id", "description": "Entity Description used for Disambiguation"}

Example data

{"id": "a1", "description": "Machine learning (ML) is the scientific study of algorithms and statistical models..."}
{"id": "a2", "description": "ML (\"Meta Language\") is a general-purpose functional programming language. It has roots in Lisp, and has been characterized as \"Lisp with types\"."}
{"id": "a3", "description": "Natural language processing (NLP) is a subfield of linguistics, computer science, information engineering, and artificial intelligence concerned with the interactions between computers and human (natural) languages, in particular how to program computers to process and analyze large amounts of natural language data."}
{"id": "a4", "description": "Neuro-linguistic programming (NLP) is a pseudoscientific approach to communication, personal development, and psychotherapy created by Richard Bandler and John Grinder in California, United States in the 1970s."}
...

aliases.jsonl Record Format

{"alias": "alias string", "entities": ["list", "of", "entity", "ids"], "probabilities": [0.5, 0.5]}

Example data

{"alias": "ML", "entities": ["a1", "a2"], "probabilities": [0.5, 0.5]}
{"alias": "Machine learning", "entities": ["a1"], "probabilities": [1.0]}
{"alias": "Meta Language", "entities": ["a2"], "probabilities": [1.0]}
{"alias": "NLP", "entities": ["a3", "a4"], "probabilities": [0.5, 0.5]}
{"alias": "Natural language processing", "entities": ["a3"], "probabilities": [1.0]}
{"alias": "Neuro-linguistic programming", "entities": ["a4"], "probabilities": [1.0]}
...
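
Before building an index, it can help to sanity-check the two files against the record formats above. The following is a minimal sketch using only the Python standard library. The helper names write_jsonl and validate_kb_dir are hypothetical and not part of spacy-ann-linker, and the checks (matching list lengths, known entity ids, probabilities summing to at most 1) are assumptions about what a well-formed Knowledge Base directory should satisfy rather than documented behavior of the package.

```python
import json
import tempfile
from pathlib import Path


def write_jsonl(path, records):
    """Write one JSON object per line (hypothetical helper)."""
    with Path(path).open("w", encoding="utf-8") as f:
        for record in records:
            f.write(json.dumps(record) + "\n")


def validate_kb_dir(kb_dir):
    """Return a list of problems found in entities.jsonl and aliases.jsonl."""
    kb_dir = Path(kb_dir)
    problems = []

    # Collect the entity ids declared in entities.jsonl.
    entity_ids = set()
    with (kb_dir / "entities.jsonl").open(encoding="utf-8") as f:
        for n, line in enumerate(f, start=1):
            if not line.strip():
                continue
            record = json.loads(line)
            if "id" not in record or "description" not in record:
                problems.append(f"entities.jsonl:{n}: missing 'id' or 'description'")
                continue
            entity_ids.add(record["id"])

    # Check every alias record against the declared entities.
    with (kb_dir / "aliases.jsonl").open(encoding="utf-8") as f:
        for n, line in enumerate(f, start=1):
            if not line.strip():
                continue
            record = json.loads(line)
            entities = record.get("entities", [])
            probs = record.get("probabilities", [])
            if len(entities) != len(probs):
                problems.append(f"aliases.jsonl:{n}: entities/probabilities length mismatch")
            unknown = [e for e in entities if e not in entity_ids]
            if unknown:
                problems.append(f"aliases.jsonl:{n}: unknown entity ids {unknown}")
            # Assumption: prior probabilities for one alias should not exceed 1.
            if sum(probs) > 1.0 + 1e-9:
                problems.append(f"aliases.jsonl:{n}: probabilities sum to more than 1")
    return problems


# Demo with a subset of the example data, written to a temporary directory.
entities = [
    {"id": "a1", "description": "Machine learning (ML) is the scientific study of algorithms..."},
    {"id": "a2", "description": "ML (\"Meta Language\") is a general-purpose functional programming language."},
]
aliases = [
    {"alias": "ML", "entities": ["a1", "a2"], "probabilities": [0.5, 0.5]},
    {"alias": "Machine learning", "entities": ["a1"], "probabilities": [1.0]},
]
with tempfile.TemporaryDirectory() as tmp:
    write_jsonl(Path(tmp) / "entities.jsonl", entities)
    write_jsonl(Path(tmp) / "aliases.jsonl", aliases)
    problems = validate_kb_dir(tmp)
print(problems)
```

An empty list means no problems were found by these particular checks; it does not guarantee that create_index will accept the data.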

spaCy prerequisites

If you don't have a pretrained spaCy model, download one now. The model needs to have vectors, so download a model bigger than en_core_web_sm.

$ spacy download en_core_web_md
---> 100%
Successfully installed en_core_web_md

Usage

Once you have your data, and a spaCy model with vectors, compute the nearest neighbors index for your Aliases.

Run the create_index help command to understand the required arguments.

$ spacy_ann create_index --help 
Usage: spacy_ann create_index [OPTIONS] MODEL KB_DIR OUTPUT_DIR

  Create an ApproxNearestNeighborsLinker based on the Character N-Gram TF-
  IDF vectors for aliases in a KnowledgeBase

  model (str): spaCy language model directory or name to load kb_dir (Path):
  path to the directory with kb entities.jsonl and aliases.jsonl files
  output_dir (Path): path to output_dir for spaCy model with ann_linker pipe

  kb File Formats

  e.g. entities.jsonl

  {"id": "a1", "description": "Machine learning (ML) is the scientific study
  of algorithms and statistical models..."} {"id": "a2", "description": "ML
  ("Meta Language") is a general-purpose functional programming language. It
  has roots in Lisp, and has been characterized as "Lisp with types"."}

  e.g. aliases.jsonl {"alias": "ML", "entities": ["a1", "a2"],
  "probabilities": [0.5, 0.5]}

Options:
  --new-model-name TEXT
  --cg-threshold FLOAT
  --n-iter INTEGER
  --verbose / --no-verbose
  --install-completion      Install completion for the current shell.
  --show-completion         Show completion for the current shell, to copy it
                            or customize the installation.
  --help                    Show this message and exit.

Now provide the required arguments. This example uses the tutorial data, but at this step you should use your own. The create_index command runs a few steps, and you should see output like the following.

$ spacy_ann create_index en_core_web_md examples/tutorial/data examples/tutorial/models

// The create_index command runs a few steps

// Load the model passed as the first positional argument (en_core_web_md)
===================== Load Model ======================
⠹ Loading model en_core_web_md✔ Done.
ℹ 0 entities without a description

// Train an EntityEncoder on the descriptions of each Entity
================= Train EntityEncoder =================
⠸ Starting training EntityEncoder✔ Done Training

// Apply the EntityEncoder to get the final vectors for each entity
================= Apply EntityEncoder =================
⠙ Applying EntityEncoder to descriptions✔ Finished, embeddings created
✔ Done adding entities and aliases to kb

// Create Nearest Neighbors index from the Aliases in kb_dir/aliases.jsonl
================== Create ANN Index ===================
Fitting tfidf vectorizer on 6 aliases
Fitting and saving vectorizer took 0.012949 seconds
Finding empty (all zeros) tfidf vectors
Deleting 2/6 aliases because their tfidf is empty
Fitting ann index on 4 aliases

0%   10   20   30   40   50   60   70   80   90   100%
|----|----|----|----|----|----|----|----|----|----|
***************************************************
Fitting ann index took 0.030826 seconds
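
To make the idea behind the last step concrete, here is a small pure-Python sketch of matching a mention against aliases using character n-gram TF-IDF vectors, as named in the create_index help text. It is an illustration under stated assumptions, not the package's implementation: the function names, the trigram size, the smoothed IDF formula, and the exact brute-force cosine scan (used here in place of a real approximate nearest neighbors index) are all choices made for this example.

```python
import math
from collections import Counter


def char_ngrams(text, n=3):
    """Lowercased character n-grams of the space-padded text."""
    padded = f" {text.lower()} "
    return [padded[i:i + n] for i in range(len(padded) - n + 1)]


def normalise(vec):
    """Scale a sparse vector to unit length; empty if it has no weight."""
    norm = math.sqrt(sum(w * w for w in vec.values()))
    return {g: w / norm for g, w in vec.items()} if norm else {}


def fit_tfidf(aliases, n=3):
    """Return (idf, vectors) with one unit-length sparse vector per alias."""
    docs = [Counter(char_ngrams(a, n)) for a in aliases]
    # Document frequency: in how many aliases each n-gram occurs.
    df = Counter(gram for doc in docs for gram in doc)
    total = len(docs)
    # Smoothed IDF (an assumption for this sketch).
    idf = {g: math.log((1 + total) / (1 + c)) + 1.0 for g, c in df.items()}
    vectors = [normalise({g: tf * idf[g] for g, tf in d.items()}) for d in docs]
    return idf, vectors


def embed(text, idf, n=3):
    """Vectorise a query, ignoring n-grams never seen while fitting."""
    counts = Counter(char_ngrams(text, n))
    return normalise({g: tf * idf[g] for g, tf in counts.items() if g in idf})


def nearest_aliases(query, aliases, idf, vectors, k=3):
    """Exact cosine search over all aliases (a stand-in for an ANN index)."""
    q = embed(query, idf)
    scored = []
    for alias, vec in zip(aliases, vectors):
        score = sum(w * vec.get(g, 0.0) for g, w in q.items())
        if score > 0:
            scored.append((score, alias))
    scored.sort(key=lambda item: (-item[0], item[1]))
    return [(alias, round(score, 3)) for score, alias in scored[:k]]


aliases = [
    "ML",
    "Machine learning",
    "Meta Language",
    "NLP",
    "Natural language processing",
    "Neuro-linguistic programming",
]
idf, vectors = fit_tfidf(aliases)

# A misspelled mention still shares most trigrams with the right alias.
candidates = nearest_aliases("Machine learnin", aliases, idf, vectors)
print(candidates[0][0])
```

Character n-grams make the lookup tolerant of typos and small spelling variations, which is why a truncated mention can still retrieve the intended alias. A real index would replace the linear scan with an approximate search structure so that lookups stay fast on large alias sets.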

Using the saved model

Now that you have a trained spaCy ANN Linker component you can load the saved model from output_dir and run it just like you would any normal spaCy model.

import spacy

# Load the spaCy model from the output_dir you used
# from the create_index command
model_dir = "examples/tutorial/models/ann_linker"
nlp = spacy.load(model_dir)

# The NER component of the en_core_web_md model doesn't actually
# recognize the aliases as entities so we'll add a 
# spaCy EntityRuler component for now to extract them.
ruler = nlp.create_pipe('entity_ruler')
patterns = [
    {"label": "SKILL", "pattern": alias}
    for alias in nlp.get_pipe('ann_linker').kb.get_alias_strings()
]
ruler.add_patterns(patterns)
nlp.add_pipe(ruler, before="ann_linker")

doc = nlp("NLP is a subset of Machine learning.")

print([(e.text, e.label_, e.kb_id_) for e in doc.ents])

# Outputs:
# [('NLP', 'SKILL', 'a3'), ('Machine learning', 'SKILL', 'a1')]
#
# In our entities.jsonl file
# a3 => Natural Language Processing
# a1 => Machine learning

License

This project is licensed under the terms of the MIT license.
