
Toolkit for recognising named entities through structured labeling


# entity-recognition

## Intro
Framework for doing NER and other types of entity recognition in Python.

Baseline feature extraction relies on Brown clusters and typical NER features, similar to Ratinov & Roth 2009. We use CRFsuite and try to keep things modular and simple, so you're not limited to just NEs - parts of speech, MWEs, temporal expressions and the whole smørrebrød of entity classes are up for grabs with this framework; simply adjust training data and labels to taste.
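
As a rough illustration of the kind of baseline features described above, the sketch below combines a few typical word-shape features with Brown cluster path prefixes. The function name, feature names, prefix lengths and the toy cluster table are all assumptions made for this example, not this package's actual feature set.

```python
def token_features(tokens, i, clusters, prefix_lengths=(4, 6, 10)):
    """Return a feature dict for tokens[i] (hypothetical feature names)."""
    word = tokens[i]
    feats = {
        "bias": 1.0,
        "lower": word.lower(),
        "is_title": word.istitle(),
        "is_upper": word.isupper(),
        "has_digit": any(ch.isdigit() for ch in word),
        "suffix3": word[-3:],
    }
    # Brown cluster paths are bit strings; shorter prefixes give coarser clusters.
    path = clusters.get(word.lower())
    if path is not None:
        for n in prefix_lengths:
            feats["brown_%d" % n] = path[:n]
    # Context: neighbouring words, with sentence boundary markers.
    feats["prev"] = tokens[i - 1].lower() if i > 0 else "<s>"
    feats["next"] = tokens[i + 1].lower() if i < len(tokens) - 1 else "</s>"
    return feats


# Toy cluster table, made up for illustration only.
toy_clusters = {"natalie": "1110100110", "portman": "1110100111"}
sentence = ["Quick", ",", "someone", "photoshop", "Natalie", "Portman", "!"]
features = [token_features(sentence, i, toy_clusters) for i in range(len(sentence))]
```

Each dict in `features` could then be handed to a CRF trainer as the observation for one token.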

## Features
* Pluggable feature generation
* Support for Reddit/Twitter JSON formats
* State-of-the-art Twitter NER performance out-of-the-box
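
To give a feel for what pluggable feature generation can look like, here is a minimal sketch in which each generator is a plain function and the pipeline is just a list of them. The function names and the key-prefixing scheme are hypothetical and are not taken from this package's code.

```python
def shape_features(tokens, i):
    """Capitalisation features for tokens[i]."""
    word = tokens[i]
    return {"is_title": word.istitle(), "is_upper": word.isupper()}


def affix_features(tokens, i):
    """Two-character prefix and suffix of the lowercased token."""
    word = tokens[i].lower()
    return {"prefix2": word[:2], "suffix2": word[-2:]}


def extract(tokens, generators):
    """Run every generator over every token and merge the feature dicts."""
    sequence = []
    for i in range(len(tokens)):
        feats = {}
        for generate in generators:
            # Prefix keys with the generator name to avoid collisions.
            for key, value in generate(tokens, i).items():
                feats[generate.__name__ + ":" + key] = value
        sequence.append(feats)
    return sequence


# Swapping generators in or out only changes this list.
pipeline = [shape_features, affix_features]
result = extract(["Natalie", "Portman"], pipeline)
```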

## Running
To get started, run `./train_tagger.py --help`

Toy data is in the top-level `datasets/` directory.

To tag data with a trained model, `./run_tagger.py --help` is your friend. Remember to keep the Brown clusters around!

For example, to learn a model from the Ritter NER CoNLL data, and then apply it to some Reddit JSON, try this:

    $ ./train_tagger.py -f datasets/ritter.ner.conll \
        --clusters brown_paths/gha.250M-c2000.paths \
        --model ritter.socmed.crfsuite.model
    $ ./run_tagger.py -f datasets/RC_2013-04.1000.json \
        -c brown_paths/gha.250M-c2000.paths \
        --model ritter.socmed.crfsuite.model \
        --json --json-text body --stdout
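
If you want to inspect CoNLL-style training data before feeding it to the trainer, something like the following sketch may help. It assumes one token per line with the label in the last whitespace-separated column and a blank line between sentences; check this against your own files, as the exact column layout is an assumption here. `read_conll` is a hypothetical helper, not part of this package.

```python
import io


def read_conll(lines):
    """Yield (tokens, labels) pairs from an iterable of CoNLL-style lines."""
    tokens, labels = [], []
    for line in lines:
        line = line.strip()
        if not line:
            # A blank line closes the current sentence, if there is one.
            if tokens:
                yield tokens, labels
                tokens, labels = [], []
            continue
        parts = line.split()
        tokens.append(parts[0])   # first column: the token
        labels.append(parts[-1])  # last column: the label
    if tokens:
        yield tokens, labels


# Made-up sample; the label names are illustrative.
sample = io.StringIO("Natalie\tB-person\nPortman\tI-person\n!\tO\n\nHello\tO\n")
sentences = list(read_conll(sample))
```

To read a real file, pass an open file handle instead of the `io.StringIO` sample.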

A top-level `entity_texts` field is added to each record, containing the extracted entities. For example:

    {
      "archived": true,
      "author": "walrusboy",
      "author_flair_css_class": null,
      "author_flair_text": null,
      "body": "Quick, someone photoshop Natalie Portman!",
      "controversiality": 0,
      "created_utc": "1364774484",
      "distinguished": null,
      "downs": 0,
      "edited": false,
      "entity_texts": ["Natalie Portman"],
      "gilded": 0,
      "id": "c95zmil",
      "link_id": "t3_1bddiw",
      "name": "t1_c95zmil",
      "parent_id": "t3_1bddiw",
      "removal_reason": null,
      "retrieved_on": 1431716826,
      "score": 1,
      "score_hidden": false,
      "subreddit": "pics",
      "subreddit_id": "t5_2qh0u",
      "ups": 1
    }
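
Downstream, the tagged output can be consumed with nothing more than the standard library. The sketch below counts entity mentions, assuming the tagger's output holds one JSON object per line (an assumption about the output layout; adjust if yours differs). `count_entities` is a hypothetical helper name.

```python
import json
from collections import Counter


def count_entities(lines):
    """Count how often each extracted entity string occurs."""
    counts = Counter()
    for line in lines:
        line = line.strip()
        if not line:
            continue
        record = json.loads(line)
        # Records without the field are treated as having no entities.
        counts.update(record.get("entity_texts", []))
    return counts


# Two abbreviated records standing in for real tagger output.
tagged = [
    '{"body": "Quick, someone photoshop Natalie Portman!", "entity_texts": ["Natalie Portman"]}',
    '{"body": "no entities here", "entity_texts": []}',
]
entity_counts = count_entities(tagged)
```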

## Dependencies
At least:

* Python 3
* NLTK
* python-crfsuite (imported as `pycrfsuite`)
* scikit-learn (imported as `sklearn`)
* scipy
* numpy

Check you're using Python 3, with `python -V` (THAT'S A BIG V). Next, try something like:

    $ sudo python3 -m pip install -U pip
    $ sudo pip3 install numpy scipy scikit-learn python-crfsuite nltk

Then go for two cups of tea / one brief fika, after troubleshooting errors. If you get super stuck, sometimes it helps to try your distribution's Python 3 packages for numpy and scipy, and then upgrade them with something like:

    $ sudo pip3 install -U numpy
    $ sudo pip3 install -U scipy
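
A quick way to see which dependencies are still missing is a small check like the one below. It is only a sketch: the module list mirrors the dependencies named above, using import names rather than pip package names, and `missing_modules` is a hypothetical helper.

```python
import importlib.util
import sys

# Import names, which differ from some pip package names
# (python-crfsuite -> pycrfsuite, scikit-learn -> sklearn).
REQUIRED = ["nltk", "pycrfsuite", "sklearn", "scipy", "numpy"]


def missing_modules(names):
    """Return the subset of top-level module names that cannot be found."""
    return [name for name in names if importlib.util.find_spec(name) is None]


if sys.version_info[0] < 3:
    raise SystemExit("Python 3 is required")

missing = missing_modules(REQUIRED)
print("Missing modules:", ", ".join(missing) if missing else "none")
```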

## Hints and tips

If you use Brown clusters (and we recommend them!), this system expects cluster paths in binary branch format - à la `wcluster` - as opposed to base 10 paths, like from `JCLUSTER`. If you're not sure how many Brown clusters to use, check out our 3D interactive [guide to tuning Brown clustering](http://www.derczynski.com/sheffield/brown-tuning/).
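
To check that a paths file really is in binary branch format, a sketch along these lines may be useful. It assumes the tab-separated layout that `wcluster` typically writes, with the bit-string path in the first column and the word in the second; `load_brown_paths` is a hypothetical helper and not part of this package.

```python
import io


def load_brown_paths(lines):
    """Map each word to its binary cluster path, rejecting non-binary paths."""
    clusters = {}
    for line_no, line in enumerate(lines, start=1):
        line = line.rstrip("\n")
        if not line:
            continue
        fields = line.split("\t")
        if len(fields) < 2:
            continue  # skip malformed lines
        path, word = fields[0], fields[1]
        # Binary branch paths contain only the characters 0 and 1.
        if set(path) - {"0", "1"}:
            raise ValueError(
                "line %d: path %r is not a binary branch string" % (line_no, path)
            )
        clusters[word] = path
    return clusters


# Made-up sample in the assumed path<TAB>word<TAB>count layout.
sample = io.StringIO("0010\tthe\t5021\n0011\ta\t3310\n110100\tportman\t12\n")
clusters = load_brown_paths(sample)
```

A base 10 path such as `42` raises a `ValueError` here, which is a cheap way to catch the wrong format before training.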

## Reference
If you use this work, please cite our paper:

> Leon Derczynski, Isabelle Augenstein, Kalina Bontcheva (2015)<br />
> USFD: Twitter NER with Drift Compensation and Linked Data<br />
> Proceedings of the ACL Workshop on Noisy User-generated Text (W-NUT)<br />
> [[Paper]](https://aclweb.org/anthology/W/W15/W15-4306.pdf) [[bib]](https://aclweb.org/anthology/W/W15/W15-4306.bib)

Tools under active development until at least 2019 as part of the PHEME and COMRADES EU projects: www.pheme.eu

