Toolkit for recognising named entities through structured labeling

# entity-recognition

## Intro
Framework for doing NER and other types of entity recognition in Python.

Baseline feature extraction relies on Brown clusters and typical NER features, similar to Ratinov & Roth (2009). We use CRFsuite and try to keep things modular and simple, so you're not limited to just NEs: parts of speech, MWEs, temporal expressions and the whole smørrebrød of entity classes are up for grabs with this framework. Simply adjust training data and labels to taste.

## Features
* Pluggable feature generation
* Support for Reddit/Twitter JSON formats
* State-of-the-art Twitter NER performance out-of-the-box
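
As a rough illustration of what pluggable feature generation can look like, here is a minimal sketch in plain Python. The function names (`shape_features`, `context_features`, `extract`) are hypothetical and are not this toolkit's API; the idea is only that each generator is an independent function whose output is merged into one feature dict per token.

```python
from typing import Callable, Dict, List

# Hypothetical type: a generator maps (tokens, position) to a feature dict.
FeatureFn = Callable[[List[str], int], Dict[str, object]]


def shape_features(tokens: List[str], i: int) -> Dict[str, object]:
    """Surface-form features of the token at position i."""
    tok = tokens[i]
    return {
        "lower": tok.lower(),
        "is_title": tok.istitle(),
        "is_upper": tok.isupper(),
        "has_digit": any(c.isdigit() for c in tok),
        "suffix3": tok[-3:],
    }


def context_features(tokens: List[str], i: int) -> Dict[str, object]:
    """Lowercased neighbouring tokens, with sentence-boundary markers."""
    prev_tok = tokens[i - 1].lower() if i > 0 else "<s>"
    next_tok = tokens[i + 1].lower() if i < len(tokens) - 1 else "</s>"
    return {"prev": prev_tok, "next": next_tok}


def extract(tokens: List[str], generators: List[FeatureFn]) -> List[Dict[str, object]]:
    """Run every generator on every token and merge the results."""
    rows = []
    for i in range(len(tokens)):
        feats: Dict[str, object] = {}
        for gen in generators:
            feats.update(gen(tokens, i))
        rows.append(feats)
    return rows


tokens = ["Quick", ",", "someone", "photoshop", "Natalie", "Portman", "!"]
features = extract(tokens, [shape_features, context_features])
```

Adding or removing a feature family then amounts to editing the list passed to `extract`.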

## Running
To get started, run `./train_tagger.py --help`

Toy data lives in the top-level `datasets/` directory.
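
If you want to inspect the CoNLL-style training files yourself, a reader along these lines may help. This is a sketch that assumes the common layout of one token per line, whitespace-separated columns with the label last, and blank lines between sentences. The `read_conll` helper is hypothetical and not part of this toolkit, and the sample labels are illustrative only.

```python
import io
from typing import Iterator, List, TextIO, Tuple


def read_conll(handle: TextIO) -> Iterator[List[Tuple[str, str]]]:
    """Yield sentences as lists of (token, label) pairs.

    Assumes the token is the first column and the label is the last.
    """
    sentence: List[Tuple[str, str]] = []
    for line in handle:
        line = line.strip()
        if not line:
            # A blank line closes the current sentence.
            if sentence:
                yield sentence
                sentence = []
            continue
        parts = line.split()
        sentence.append((parts[0], parts[-1]))
    if sentence:
        # The file may not end with a blank line.
        yield sentence


sample = io.StringIO("Natalie\tB-person\nPortman\tI-person\nrocks\tO\n\nHello\tO\n")
sentences = list(read_conll(sample))
```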

Should you like to tag data with a trained model, `./run_tagger.py --help` is your friend. Remember to keep the Brown clusters around!

For example, to learn a model from the Ritter NER CoNLL data, and then apply it to some Reddit JSON, try this:

    $ ./train_tagger.py -f datasets/ritter.ner.conll \
        --clusters brown_paths/gha.250M-c2000.paths \
        --model ritter.socmed.crfsuite.model
    $ ./run_tagger.py -f datasets/RC_2013-04.1000.json \
        -c brown_paths/gha.250M-c2000.paths \
        --model ritter.socmed.crfsuite.model \
        --json --json-text body --stdout

A top-level `entity_texts` field is added to each record, containing the extracted entities. For example:

    {
      "archived": true,
      "author": "walrusboy",
      "author_flair_css_class": null,
      "author_flair_text": null,
      "body": "Quick, someone photoshop Natalie Portman!",
      "controversiality": 0,
      "created_utc": "1364774484",
      "distinguished": null,
      "downs": 0,
      "edited": false,
      "entity_texts": ["Natalie Portman"],
      "gilded": 0,
      "id": "c95zmil",
      "link_id": "t3_1bddiw",
      "name": "t1_c95zmil",
      "parent_id": "t3_1bddiw",
      "removal_reason": null,
      "retrieved_on": 1431716826,
      "score": 1,
      "score_hidden": false,
      "subreddit": "pics",
      "subreddit_id": "t5_2qh0u",
      "ups": 1
    }
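
To work with the tagged output downstream, something like the following may be enough. It is a sketch that assumes the tagger's output was saved with one JSON object per line, as in the Reddit dumps. The `count_entities` helper is hypothetical; only the `entity_texts` field name comes from the example above.

```python
import io
import json
from collections import Counter
from typing import Iterable


def count_entities(lines: Iterable[str]) -> Counter:
    """Count how often each extracted entity string occurs."""
    counts: Counter = Counter()
    for line in lines:
        line = line.strip()
        if not line:
            continue
        record = json.loads(line)
        # Records without the field are treated as having no entities.
        counts.update(record.get("entity_texts", []))
    return counts


# Stand-in for a file of tagged records (assumed one JSON object per line).
tagged = io.StringIO(
    '{"body": "Quick, someone photoshop Natalie Portman!", '
    '"entity_texts": ["Natalie Portman"]}\n'
    '{"body": "no entities here", "entity_texts": []}\n'
)
entity_counts = count_entities(tagged)
```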

## Dependencies
At least:

* Python 3
* NLTK
* python-crfsuite (imported as `pycrfsuite`)
* scikit-learn (imported as `sklearn`)
* scipy
* numpy

Check you're using Python 3, with `python -V` (THAT'S A BIG V). Next, try something like:

    $ sudo python3 -m pip install -U pip
    $ sudo pip3 install numpy scipy scikit-learn python-crfsuite nltk

Then go for two cups of tea / one brief fika, after troubleshooting errors. If you get super stuck, sometimes it helps to try your distribution's Python 3 packages for numpy and scipy, and then upgrade them with something like:

    $ sudo pip3 install -U numpy
    $ sudo pip3 install -U scipy

## Hints and tips

If you use Brown clusters (and we recommend them!), this system expects cluster paths in binary branch format, à la `wcluster`, as opposed to base 10 paths like those from `JCLUSTER`. If you're not sure how many Brown clusters to use, check out our 3D interactive [guide to tuning Brown clustering](http://www.derczynski.com/sheffield/brown-tuning/).
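
To make the binary branch format concrete, here is a sketch of loading a paths file and turning a word's bit string into prefix features. It assumes tab-separated lines of the form path, word, count, which is how we understand `wcluster` output to be laid out. The helper names and the prefix lengths are illustrative choices, not necessarily what this toolkit does internally.

```python
import io
from typing import Dict, Sequence, TextIO


def load_brown_paths(handle: TextIO) -> Dict[str, str]:
    """Map each word to its binary cluster path (a string of 0s and 1s)."""
    word_to_path: Dict[str, str] = {}
    for line in handle:
        parts = line.rstrip("\n").split("\t")
        if len(parts) < 2:
            continue  # skip blank or malformed lines
        path, word = parts[0], parts[1]
        if not path or set(path) - {"0", "1"}:
            # Anything other than 0/1 suggests a non-binary path format.
            raise ValueError(f"not a binary branch path: {path!r}")
        word_to_path[word] = path
    return word_to_path


def prefix_features(path: str, lengths: Sequence[int] = (4, 6, 10, 20)) -> Dict[str, str]:
    """Path prefixes at several depths; short paths are used whole."""
    return {f"brown_{n}": path[:n] for n in lengths}


# Made-up sample lines in the assumed path<TAB>word<TAB>count layout.
sample = io.StringIO("0010110\tportman\t12\n0010111\tnatalie\t9\n")
paths = load_brown_paths(sample)
portman_features = prefix_features(paths["portman"])
```

Shorter prefixes group a word with coarser clusters higher up the tree, and longer prefixes with finer ones, which is why several depths are usually emitted side by side.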

## Reference
If you use this work, please cite our paper:

> Leon Derczynski, Isabelle Augenstein, Kalina Bontcheva (2015)<br />
> USFD: Twitter NER with Drift Compensation and Linked Data<br />
> Proceedings of the ACL Workshop on Noisy User-generated Text (W-NUT)<br />
> [[Paper]](https://aclweb.org/anthology/W/W15/W15-4306.pdf) [[bib]](https://aclweb.org/anthology/W/W15/W15-4306.bib)

Tools under active development until at least 2019 as part of the PHEME and COMRADES EU projects: www.pheme.eu
