spaCy ANN Linker, a pipeline component for generating spaCy KnowledgeBase Alias Candidates for Entity Linking.
Documentation: https://microsoft.github.io/spacy-ann-linker
Source Code: https://github.com/microsoft/spacy-ann-linker
spaCy ANN Linker is a spaCy pipeline component for generating alias candidates for the spaCy entities in doc.ents. It provides an optional interface for linking ambiguous aliases based on descriptions for each entity.
The key features are:
- Easy spaCy Integration: spaCy ANN Linker provides completely serializable spaCy pipeline components that integrate directly into your existing spaCy model.
- CLI for simple Index Creation: Run spacy_ann create_index with your data to create an Approximate Nearest Neighbors index, make an ann_linker pipeline component, and save a spaCy model.
- Built in Web API for easy deployment and Batch Entity Linking queries
Requirements
Python 3.6+
spaCy ANN Linker is a convenient wrapper built on a few comprehensive, high-performing packages.
Installation
$ pip install spacy-ann-linker
---> 100%
Successfully installed spacy-ann-linker
Data Prerequisites
To use spaCy ANN Linker you need pre-existing Knowledge Base data. spaCy ANN Linker expects the data to exist in 2 JSONL files together in a directory:
kb_dir
│ aliases.jsonl
│ entities.jsonl
For testing the package, you can use the example data in examples/tutorial/data
examples/tutorial/data
│ aliases.jsonl
│ entities.jsonl
entities.jsonl Record Format
{"id": "Canonical Entity Id", "description": "Entity Description used for Disambiguation"}
Example data
{"id": "a1", "description": "Machine learning (ML) is the scientific study of algorithms and statistical models..."}
{"id": "a2", "description": "ML (\"Meta Language\") is a general-purpose functional programming language. It has roots in Lisp, and has been characterized as \"Lisp with types\"."}
{"id": "a3", "description": "Natural language processing (NLP) is a subfield of linguistics, computer science, information engineering, and artificial intelligence concerned with the interactions between computers and human (natural) languages, in particular how to program computers to process and analyze large amounts of natural language data."}
{"id": "a4", "description": "Neuro-linguistic programming (NLP) is a pseudoscientific approach to communication, personal development, and psychotherapy created by Richard Bandler and John Grinder in California, United States in the 1970s."}
...
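If your entity descriptions live in a Python structure rather than a file, the record format above can be produced with the standard library. This is a minimal sketch, not part of the spacy_ann package: the helper name write_jsonl and the output path are hypothetical, and only the "id" and "description" keys come from the format shown above.

```python
import json
from pathlib import Path


def write_jsonl(path, records):
    """Write one JSON object per line (JSONL); hypothetical helper, not a spacy_ann API."""
    path = Path(path)
    path.parent.mkdir(parents=True, exist_ok=True)
    with path.open("w", encoding="utf-8") as f:
        for record in records:
            # ensure_ascii=False keeps non-ASCII descriptions readable in the file.
            f.write(json.dumps(record, ensure_ascii=False) + "\n")
    return path


if __name__ == "__main__":
    entities = [
        {"id": "a1", "description": "Machine learning (ML) is the scientific study of algorithms and statistical models..."},
        {"id": "a2", "description": "ML (\"Meta Language\") is a general-purpose functional programming language."},
    ]
    # "kb_dir" mirrors the directory layout described earlier.
    write_jsonl("kb_dir/entities.jsonl", entities)
```

The same helper can write aliases.jsonl, since both files are plain JSONL.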
aliases.jsonl Record Format
{"alias": "alias string", "entities": ["list", "of", "entity", "ids"], "probabilities": [0.5, 0.5]}
Example data
{"alias": "ML", "entities": ["a1", "a2"], "probabilities": [0.5, 0.5]}
{"alias": "Machine learning", "entities": ["a1"], "probabilities": [1.0]}
{"alias": "Meta Language", "entities": ["a2"], "probabilities": [1.0]}
{"alias": "NLP", "entities": ["a3", "a4"], "probabilities": [0.5, 0.5]}
{"alias": "Natural language processing", "entities": ["a3"], "probabilities": [1.0]}
{"alias": "Neuro-linguistic programming", "entities": ["a4"], "probabilities": [1.0]}
...
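Because the two files reference each other by entity id, it can help to sanity-check them before building an index. The sketch below uses only the standard library and is not part of spacy_ann; the function names read_jsonl and validate_kb are hypothetical. The checks themselves are assumptions: they mirror the constraints spaCy's own KnowledgeBase places on aliases (one probability per entity, probabilities summing to at most 1), which this package may or may not enforce in the same way.

```python
import json
from pathlib import Path


def read_jsonl(path):
    """Read a JSONL file into a list of dicts, skipping blank lines."""
    with Path(path).open(encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]


def validate_kb(entities, aliases, tol=1e-6):
    """Return a list of problems found; an empty list means the data looks consistent.

    Assumed rules (not confirmed by the spacy_ann docs):
      * every entity id listed for an alias exists in entities
      * each alias has exactly one probability per entity
      * the probabilities for an alias sum to at most 1
    """
    problems = []
    entity_ids = {e["id"] for e in entities}
    for record in aliases:
        alias = record["alias"]
        ents = record["entities"]
        probs = record["probabilities"]
        if len(ents) != len(probs):
            problems.append(
                f"{alias!r}: {len(ents)} entities but {len(probs)} probabilities"
            )
        for ent_id in ents:
            if ent_id not in entity_ids:
                problems.append(f"{alias!r}: unknown entity id {ent_id!r}")
        if sum(probs) > 1.0 + tol:
            problems.append(f"{alias!r}: probabilities sum to more than 1")
    return problems


if __name__ == "__main__":
    kb_dir = Path("examples/tutorial/data")
    issues = validate_kb(
        read_jsonl(kb_dir / "entities.jsonl"),
        read_jsonl(kb_dir / "aliases.jsonl"),
    )
    print("\n".join(issues) if issues else "KB data looks consistent")
```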
spaCy prerequisites
If you don't have a pretrained spaCy model, download one now. The model needs to have vectors, so download a model bigger than en_core_web_sm:
$ spacy download en_core_web_md
---> 100%
Successfully installed en_core_web_md
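To confirm that a downloaded model actually ships with vectors, you can inspect its vocab after loading. This is a rough sketch rather than anything spacy_ann provides: the function names are hypothetical, and it assumes that a model without a vector table reports a vectors_length of 0 on nlp.vocab.

```python
def has_word_vectors(nlp):
    """Return True if the pipeline's vocab carries a non-empty vector table.

    Assumption: models without vectors report vocab.vectors_length == 0.
    """
    return nlp.vocab.vectors_length > 0


def load_checked(model_name="en_core_web_md"):
    """Load a spaCy model and fail early if it has no vectors (hypothetical helper)."""
    # Imported here so has_word_vectors stays usable on any loaded pipeline object.
    import spacy

    nlp = spacy.load(model_name)
    if not has_word_vectors(nlp):
        raise ValueError(
            f"{model_name} has no word vectors; use a larger model such as en_core_web_md"
        )
    return nlp


if __name__ == "__main__":
    nlp = load_checked("en_core_web_md")
    print("vector width:", nlp.vocab.vectors_length)
```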
Follow the Tutorial
Once you have completed the Data and spaCy prerequisites, follow along with the Tutorial for a step-by-step guide to using the spacy_ann package.
License
This project is licensed under the terms of the MIT license.