# entity-recognition
## Intro
A framework for NER and other kinds of entity recognition in Python.
Baseline feature extraction relies on Brown clusters and typical NER features, similar to Ratinov & Roth (2009). We use CRFsuite and try to keep things modular and simple, so you're not limited to just NEs: parts of speech, MWEs, temporal expressions and the whole smørrebrød of entity classes are up for grabs with this framework; simply adjust training data and labels to taste.
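Under the hood it's plain `pycrfsuite`, so the train/tag cycle follows the usual CRFsuite pattern. A rough sketch for orientation (the toy feature and label sequences below are made up for illustration; the package's own scripts handle all of this for you):

```python
import pycrfsuite

# Toy data: one sentence, one feature list per token, one label per token.
xseq = [['w=quick', 'shape=x'], ['w=natalie', 'shape=Xx'], ['w=portman', 'shape=Xx']]
yseq = ['O', 'B-person', 'I-person']

# Train a CRF model and write it to disk.
trainer = pycrfsuite.Trainer(verbose=False)
trainer.append(xseq, yseq)
trainer.train('toy.crfsuite.model')

# Load the model back and tag a sequence.
tagger = pycrfsuite.Tagger()
tagger.open('toy.crfsuite.model')
print(tagger.tag(xseq))  # e.g. ['O', 'B-person', 'I-person']
```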
## Features
* Pluggable feature generation (see the sketch after this list)
* Support for Reddit/Twitter JSON formats
* State-of-the-art Twitter NER performance out-of-the-box
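"Pluggable feature generation" means a feature extractor is just a function from a token in context to a list of CRFsuite feature strings. A minimal sketch of the idea, where the function name, signature and feature set are illustrative assumptions rather than this package's actual API:

```python
def token_features(sent, i, brown_paths):
    """Illustrative feature strings for token i of tokenised sentence sent."""
    word = sent[i]
    feats = [
        'w=' + word.lower(),
        'suffix3=' + word[-3:],
        'shape=' + ('Xx' if word[:1].isupper() else 'x'),
    ]
    # Brown cluster path prefixes, Ratinov & Roth style.
    path = brown_paths.get(word.lower())
    if path:
        for k in (4, 6, 10):
            feats.append('brown-%d=%s' % (k, path[:k]))
    if i == 0:
        feats.append('BOS')
    return feats
```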
## Running
To get started, run `./train_tagger.py --help`.
Toy data lives in the top-level `datasets/` directory.
Should you like to tag data with your model, `./run_tagger.py --help` is your friend. Remember to keep the Brown clusters around - you'll need the same paths file again at tagging time!
For example, to learn a model from the Ritter NER CoNLL data, and then apply it to some Reddit JSON, try this:
```
$ ./train_tagger.py -f datasets/ritter.ner.conll \
    --clusters brown_paths/gha.250M-c2000.paths \
    --model ritter.socmed.crfsuite.model

$ ./run_tagger.py -f datasets/RC_2013-04.1000.json \
    -c brown_paths/gha.250M-c2000.paths \
    --model ritter.socmed.crfsuite.model \
    --json --json-text body --stdout
```
An "entity_texts" top-level field is added, containing extracted entities. For example:
```json
{
    "archived": true,
    "author": "walrusboy",
    "author_flair_css_class": null,
    "author_flair_text": null,
    "body": "Quick, someone photoshop Natalie Portman!",
    "controversiality": 0,
    "created_utc": "1364774484",
    "distinguished": null,
    "downs": 0,
    "edited": false,
    "entity_texts": ["Natalie Portman"],
    "gilded": 0,
    "id": "c95zmil",
    "link_id": "t3_1bddiw",
    "name": "t1_c95zmil",
    "parent_id": "t3_1bddiw",
    "removal_reason": null,
    "retrieved_on": 1431716826,
    "score": 1,
    "score_hidden": false,
    "subreddit": "pics",
    "subreddit_id": "t5_2qh0u",
    "ups": 1
}
```
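Since the tagger emits one JSON object per line, downstream processing is one `json.loads` per line. A minimal sketch, assuming the tagger's stdout was redirected to a file (the filename here is made up):

```python
import json

# Each line is one Reddit comment, now carrying an "entity_texts" field.
with open('tagged_output.json', encoding='utf-8') as f:
    for line in f:
        comment = json.loads(line)
        for entity in comment.get('entity_texts', []):
            print(comment['id'], entity)
```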
## Dependencies
You'll need at least:
* Python 3
* NLTK
* pycrfsuite
* sklearn
* scipy
* numpy
Check you're using Python 3, with `python -V` (THAT'S A BIG V). Next, try something like:
```
$ sudo easy_install3 -U pip
$ sudo pip3 install numpy scipy sklearn python-crfsuite nltk
```
Then go for two cups of tea / one brief fika while you troubleshoot any errors. If you get super stuck, it sometimes helps to install your distribution's Python 3 packages for numpy and scipy first, and then upgrade them with something like:
```
$ sudo pip3 install -U numpy
$ sudo pip3 install -U scipy
```
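Once everything is in place, a throwaway check that all the dependencies are visible to Python 3:

```python
# Run under python3; a missing dependency raises ImportError here.
import nltk, numpy, pycrfsuite, scipy, sklearn
print('all dependencies import OK')
```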
## Hints and tips
If you use Brown clusters (and we recommend them!), this system expects cluster paths in binary branch format - à la `wcluster` - as opposed to base 10 paths, like from `JCLUSTER`. If you're not sure how many Brown clusters to use, check out our 3D interactive [guide to tuning Brown clustering](http://www.derczynski.com/sheffield/brown-tuning/).
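For reference, `wcluster`-style paths files are tab-separated lines of binary path, word, and count. A minimal loading sketch under that assumption (the function is illustrative, not part of this package):

```python
def load_brown_paths(filename):
    """Map word -> binary branch path, e.g. 'photoshop' -> '00101110'."""
    paths = {}
    with open(filename, encoding='utf-8') as f:
        for line in f:
            bits, word, _count = line.rstrip('\n').split('\t')
            paths[word] = bits
    return paths
```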
## Reference
If you use this work, please cite our paper:
> Leon Derczynski, Isabelle Augenstein, Kalina Bontcheva (2015)<br />
> USFD: Twitter NER with Drift Compensation and Linked Data<br />
> Proceedings of the ACL Workshop on Noisy User-generated Text (W-NUT)<br />
> [[Paper]](https://aclweb.org/anthology/W/W15/W15-4306.pdf) [[bib]](https://aclweb.org/anthology/W/W15/W15-4306.bib)
These tools are under active development until at least 2019, as part of the PHEME and COMRADES EU projects: www.pheme.eu