Few-shot Named Entity Recognition
Implemented by sayef.
UPDATES
- The training script is now available.
- Pairwise query and support examples are no longer required. See the example usage below for details.
- Added a sample dataset and links to the converted ontonotes5 training and validation datasets (see the dataset preparation section below).
Overview
The FSNER model was proposed in Example-Based Named Entity Recognition by Morteza Ziyadi, Yuting Sun, Abhishek Goswami, Jade Huang, and Weizhu Chen. It uses a train-free few-shot learning approach, inspired by question answering, to identify entity spans in a new and unseen domain.
Abstract
We present a novel approach to named entity recognition (NER) in the presence of scarce data that we call example-based NER. Our train-free few-shot learning approach takes inspiration from question-answering to identify entity spans in a new and unseen domain. In comparison with the current state-of-the-art, the proposed method performs significantly better, especially when using a low number of support examples.
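At a high level, the model embeds the query and the support examples with the same encoder and scores each query token against the entity boundary tokens marked in the supports, yielding start and end probabilities for candidate spans. Below is a rough sketch of that scoring idea; the function name, normalization, and pooling choices are assumptions for illustration, not the library's exact implementation:

```python
import torch

def span_boundary_scores(query_emb, support_start_emb, support_end_emb):
    """Illustrative sketch only, NOT the exact FSNER code.

    query_emb: [q_len, dim] contextual embeddings of the query tokens.
    support_start_emb / support_end_emb: [n_support, dim] embeddings of the
    tokens that open/close the [E] ... [/E] spans in the support examples.
    """
    # Normalize each query token's similarity to all support boundary
    # tokens, then keep the best match as that token's boundary score.
    start_sim = torch.softmax(query_emb @ support_start_emb.T, dim=-1)
    end_sim = torch.softmax(query_emb @ support_end_emb.T, dim=-1)
    p_start = start_sim.max(dim=-1).values  # [q_len]
    p_end = end_sim.max(dim=-1).values      # [q_len]
    return p_start, p_end
```

Spans whose start and end scores both clear a threshold are then read off as entities of the corresponding support set's type.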
Model Training Details
| identifier | epochs | datasets |
|---|---|---|
| sayef/fsner-bert-base-uncased | 25 | ontonotes5, conll2003, wnut2017, mit_movie_trivia, mit_restaurant and fin (Alvarado et al.) |
Installation and Example Usage
You can use the FSNER model in three ways:

1. Install directly from PyPI: `pip install fsner` and import the model as shown in the code example below, or
2. Install from source: `pip install .` and import the model as shown in the code example below, or
3. Clone the repo, add the absolute path of the `fsner/src` directory to your PYTHONPATH, and import the model as shown in the code example below.
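The end-to-end example below tokenizes a few query sentences and a support set (one list of marked examples per entity type), predicts start/end token scores, and extracts the entity spans for each entity type.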
```python
import json
from fsner import FSNERModel, FSNERTokenizerUtils, pretty_embed
query_texts = [
"Does Luke's serve lunch?",
"Chang does not speak Taiwanese very well.",
"I like Berlin."
]
# Each list in support_texts holds the examples of one entity type.
# Wrap each entity with [E] and [/E] in the examples.
# Each sentence should contain only one [E] ... [/E] pair.
support_texts = {
"Restaurant": [
"What time does [E] Subway [/E] open for breakfast?",
"Is there a [E] China Garden [/E] restaurant in newark?",
"Does [E] Le Cirque [/E] have valet parking?",
"Is there a [E] McDonalds [/E] on main street?",
"Does [E] Mike's Diner [/E] offer huge portions and outdoor dining?"
],
"Language": [
"Although I understood no [E] French [/E] in those days , I was prepared to spend the whole day with Chien - chien .",
"like what the hell 's that called in [E] English [/E] ? I have to register to be here like since I 'm a foreigner .",
"So , I 'm also working on an [E] English [/E] degree because that 's my real interest .",
"Al - Jazeera TV station , established in November 1996 in Qatar , is an [E] Arabic - language [/E] news TV station broadcasting global news and reports nonstop around the clock .",
"They think it 's far better for their children to be here improving their [E] English [/E] than sitting at home in front of a TV . \"",
"The only solution seemed to be to have her learn [E] French [/E] .",
"I have to read sixty pages of [E] Russian [/E] today ."
]
}
device = 'cpu'
tokenizer = FSNERTokenizerUtils("sayef/fsner-bert-base-uncased")
queries = tokenizer.tokenize(query_texts).to(device)
supports = tokenizer.tokenize(list(support_texts.values())).to(device)
model = FSNERModel("sayef/fsner-bert-base-uncased")
model.to(device)
p_starts, p_ends = model.predict(queries, supports)
# One can prepare supports once and reuse multiple times with different queries
# ------------------------------------------------------------------------------
# start_token_embeddings, end_token_embeddings = model.prepare_supports(supports)
# p_starts, p_ends = model.predict(queries, start_token_embeddings=start_token_embeddings,
# end_token_embeddings=end_token_embeddings)
output = tokenizer.extract_entity_from_scores(query_texts, queries, p_starts, p_ends,
entity_keys=list(support_texts.keys()), thresh=0.50)
print(json.dumps(output, indent=2))
# pretty_embed renders the extracted entities with spaCy's displacy,
# so install the spacy package first
pretty_embed(query_texts, output, list(support_texts.keys()))
```
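As the comments above note, support embeddings can be prepared once and reused across query batches. A minimal sketch using the same calls shown above (the new query text is illustrative):

```python
# Prepare the support embeddings once ...
start_emb, end_emb = model.prepare_supports(supports)

# ... then reuse them for any number of query batches.
new_query_texts = ["Is there a Pizza Hut close by?"]  # illustrative query
new_queries = tokenizer.tokenize(new_query_texts).to(device)
p_starts, p_ends = model.predict(new_queries,
                                 start_token_embeddings=start_emb,
                                 end_token_embeddings=end_emb)
output = tokenizer.extract_entity_from_scores(new_query_texts, new_queries, p_starts, p_ends,
                                              entity_keys=list(support_texts.keys()), thresh=0.50)
```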
Dataset preparation
- Convert your dataset into the following format. Suppose we have a dataset file train.json like the following.
- Each list holds the examples of one entity type.
- Wrap each entity with [E] and [/E] in the examples.
- Each example should contain only one [E] ... [/E] pair (a sketch for converting BIO-tagged data appears after the trainer command below).
```json
{
"CARDINAL_NUMBER": [
"Washington , cloudy , [E] 2 [/E] to 6 degrees .",
"New Dehli , sunny , [E] 6 [/E] to 19 degrees .",
"Well this is number [E] two [/E] .",
"....."
],
"LANGUAGE": [
"They do n't have the Quicken [E] Dutch [/E] version ?",
"they learned a lot of [E] German [/E] .",
"and then [E] Dutch [/E] it 's Mifrau",
"...."
],
"MONEY": [
"Per capita personal income ranged from $ [E] 11,116 [/E] in Mississippi to $ 23,059 in Connecticut ... .",
"The trade surplus was [E] 582 million US dollars [/E] .",
"It settled with a loss of 4.95 cents at $ [E] 1.3210 [/E] a pound .",
"...."
]
}
```
- The converted ontonotes5 dataset can be found here:
- Then the trainer script can be used to train/evaluate your fsner model:
```bash
fsner trainer --pretrained-model bert-base-uncased --mode train --train-data train.json --val-data val.json \
    --train-batch-size 6 --val-batch-size 6 --n-examples-per-entity 10 --neg-example-batch-ratio 1/3 \
    --max-epochs 25 --device gpu --gpus -1 --strategy ddp
```
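If your data is in token-level BIO format, a small script can produce the required [E] ... [/E] examples. Below is a minimal sketch assuming (tokens, tags) input pairs; the function name and input layout are illustrative, not part of fsner:

```python
import json
from collections import defaultdict

def bio_to_fsner(sentences):
    """sentences: list of (tokens, tags) pairs with BIO tags like B-MONEY/I-MONEY/O.

    Returns {entity_type: [examples]} with each entity wrapped in [E] ... [/E].
    """
    examples = defaultdict(list)
    for tokens, tags in sentences:
        # Collect (start, end, type) spans from the BIO tags.
        spans, start = [], None
        for i, tag in enumerate(tags + ["O"]):  # sentinel flushes a trailing span
            if start is not None and not tag.startswith("I-"):
                spans.append((start, i, tags[start][2:]))
                start = None
            if tag.startswith("B-"):
                start = i
        # Emit one example per span, so each example has exactly one [E] ... [/E] pair.
        for s, e, etype in spans:
            marked = tokens[:s] + ["[E]"] + tokens[s:e] + ["[/E]"] + tokens[e:]
            examples[etype].append(" ".join(marked))
    return dict(examples)

sentences = [(["The", "trade", "surplus", "was", "582", "million", "US", "dollars", "."],
              ["O", "O", "O", "O", "B-MONEY", "I-MONEY", "I-MONEY", "I-MONEY", "O"])]
with open("train.json", "w") as f:
    json.dump(bio_to_fsner(sentences), f, indent=2)
```

Sentences containing several entities yield one example per entity, which keeps each example to a single [E] ... [/E] pair as required above.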