A Neuro-net ToPonym Recognition model

NeuroTPR

Overall description

NeuroTPR is a toponym recognition model designed for extracting locations from social media messages. It is based on a general Bidirectional Long Short-Term Memory network (BiLSTM) with a number of additional features, such as double layers of character embeddings, GloVe word embeddings, and ELMo contextualized word embeddings.

The goal of this model is to improve the accuracy of toponym recognition in social media messages that contain various language irregularities, such as informal sentence structures, inconsistent upper and lower case (e.g., “there is a HUGE fire near camino and springbrook rd”), name abbreviations (e.g., “bsu” for “Boise State University”), and misspellings. In particular, NeuroTPR is designed to extract fine-grained locations such as streets, natural features, facilities, points of interest (POIs), and administrative units. We tested NeuroTPR in the application context of disaster response, using a dataset of tweets from Hurricane Harvey in 2017.

More details can be found in our paper: Wang, J., Hu, Y., & Joseph, K. (2020): NeuroTPR: A Neuro-net ToPonym Recognition model for extracting locations from social media messages. Transactions in GIS, 24(3), 719-735.

Figure 1. The overall architecture of NeuroTPR

Repository organization

"HarveyTweet" folder: This folder contains the Harvey2017 dataset with 1,000 human-annotated tweets.
"Model" folder: This folder contains the Python source code for using the trained NeuroTPR model or retraining NeuroTPR on your own data.
"WikiDataHelper" folder: This folder contains the Python source code for building an annotated dataset from Wikipedia for training NeuroTPR.
"training_data" folder: This folder contains the three datasets used for training NeuroTPR: Wikipedia3000, WNUT2017, and 50 optional tweets from Hurricane Harvey. Wikipedia3000 was automatically constructed from 3,000 Wikipedia articles using our proposed workflow (see the "WikiDataHelper" folder for details); WNUT2017 contains 599 tweets selected from the original dataset; and the 50 optional tweets are crisis-related tweets from the Hurricane Harvey Twitter Dataset that contain door-number addresses or street names.

Test datasets and model performance

NeuroTPR was tested on three different datasets, which are:

HarveyTweet: 1,000 human-annotated tweets from Hurricane Harvey in 2017. This dataset is available in the "HarveyTweet" folder.
GeoCorpora: 1,689 human-annotated tweets from the GeoCorpora Project.
Ju2016: 5,000 short sentences collected from Web pages and automatically annotated. This dataset is available from the EUPEG project.

We tested NeuroTPR using the benchmarking platform EUPEG. The performance of NeuroTPR on the three datasets is presented in the table below:

Corpus	Precision	Recall	F-score
HarveyTweet	0.787	0.678	0.728
GeoCorpora	0.800	0.761	0.780
Ju2016	-	0.821	-
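
The last column of the table is consistent with the balanced F-score (the harmonic mean of precision and recall). The short check below is only a sketch under that assumption; the helper name `f_score` is illustrative and is not part of the NeuroTPR code.

```python
def f_score(precision, recall):
    """Balanced F-score: harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# (precision, recall, reported F-score) copied from the table above.
reported = {
    "HarveyTweet": (0.787, 0.678, 0.728),
    "GeoCorpora": (0.800, 0.761, 0.780),
}

for corpus, (p, r, f) in reported.items():
    # Print the recomputed value next to the reported one.
    print(corpus, round(f_score(p, r), 3), f)
```

Ju2016 is left out because only recall is reported for that corpus.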

Use the trained NeuroTPR for toponym recognition

Input is a single raw tweet: use the function geoparse(text) from the Model/geoparse.py file.

Input Example: "The City of Dallas has now opened Shelter #3 at Samuel Grand Recreation Center, 6200 E. Grand Ave. #HurricaneHarvey"

Model output (JSON): [{"location_name": "City of Dallas", "start_idx": 5, "end_idx": 18}, {"location_name": "Samuel Grand Recreation Center", "start_idx": 50, "end_idx": 79}]
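
As a minimal sketch of consuming this output, the snippet below parses the sample result shown above and collects the recognized names. It assumes the output is available as a JSON string; if geoparse(text) already returns a Python list, the json.loads step can be skipped. The helper `toponym_names` is hypothetical, and the sketch reads location_name rather than slicing the tweet with start_idx and end_idx, since the exact offset convention is not stated here.

```python
import json

# Sample model output copied from the example above.
raw_output = (
    '[{"location_name": "City of Dallas", "start_idx": 5, "end_idx": 18}, '
    '{"location_name": "Samuel Grand Recreation Center", "start_idx": 50, "end_idx": 79}]'
)


def toponym_names(output_json):
    """Return the recognized location names, in order of appearance."""
    records = json.loads(output_json)
    return [record["location_name"] for record in records]


names = toponym_names(raw_output)
print(names)
```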

Input is a tweet dataset saved in CoNLL2003 format:

    python3 Model/geoparse_dataset.py

Output: toponym-name1,,start-index,,end-index||toponym-name2,,start-index,,end-index||...
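
A line in this format can be split back into structured records with a few lines of Python. This is a sketch only: the names `Toponym` and `parse_toponym_line` are hypothetical, and the example line is built for illustration from the single-tweet example rather than taken from actual script output.

```python
from typing import List, NamedTuple


class Toponym(NamedTuple):
    name: str
    start: int
    end: int


def parse_toponym_line(line: str) -> List[Toponym]:
    """Split 'name,,start,,end||name,,start,,end' into Toponym records."""
    toponyms = []
    for record in line.strip().split("||"):
        if not record:
            continue  # skip empty pieces, e.g. from a trailing '||'
        # Split from the right so a name containing ',,' stays intact.
        name, start, end = record.rsplit(",,", 2)
        toponyms.append(Toponym(name, int(start), int(end)))
    return toponyms


# Illustrative line assembled from the single-tweet example.
example = "City of Dallas,,5,,18||Samuel Grand Recreation Center,,50,,79"
parsed = parse_toponym_line(example)
print(parsed)
```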

Retrain NeuroTPR using your own data

If you wish to retrain NeuroTPR using your own data, you first need to add part-of-speech (POS) features to your annotated dataset in CoNLL2003 format. You can use the following Python script to add POS features with NLTK.

    python3 Model/add_lin_features.py
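
To illustrate the idea of this step, the sketch below inserts a POS column into CoNLL-style lines. It does not reproduce Model/add_lin_features.py: the column layout (token first, label last, blank line between sentences), the BIO-style labels in the sample, and the names `add_pos_column` and `tagger` are all assumptions made for illustration.

```python
from typing import Callable, Iterable, List, Sequence, Tuple

# A tagger maps a list of tokens to a list of (token, POS tag) pairs.
Tagger = Callable[[Sequence[str]], List[Tuple[str, str]]]


def add_pos_column(lines: Iterable[str], tagger: Tagger) -> List[str]:
    """Rewrite 'token ... label' lines as 'token POS label', sentence by sentence."""
    out: List[str] = []
    sentence: List[Tuple[str, str]] = []  # (token, label) rows of the current sentence

    def flush() -> None:
        if sentence:
            tokens = [tok for tok, _ in sentence]
            tags = [tag for _, tag in tagger(tokens)]
            for (tok, label), tag in zip(sentence, tags):
                out.append(f"{tok} {tag} {label}")
            sentence.clear()

    for line in lines:
        parts = line.split()
        if not parts:  # a blank line ends the current sentence
            flush()
            out.append("")
        else:
            sentence.append((parts[0], parts[-1]))
    flush()
    return out


# Stand-in tagger so the sketch runs without NLTK data (assumption).
def dummy_tagger(tokens):
    return [(tok, "NN") for tok in tokens]


sample = ["near O", "camino B-LOC", "", "bsu B-LOC"]
tagged = add_pos_column(sample, dummy_tagger)
print(tagged)
```

With NLTK installed and its tagger data downloaded, nltk.pos_tag can be passed as the tagger, since it takes a list of tokens and returns (token, tag) pairs.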

To train NeuroTPR, you need to:

Set up the file path to load word embeddings and training data;
Set up the file path to save the trained model;
Tune the key hyper-parameters of NeuroTPR.

    python3 Model/train.py

Please see the detailed comments in our source code for how to change these settings.

Project dependencies:

Python 3.6+ and a recent version of NumPy
NLTK 3.5
Keras 2.3.0
TensorFlow 1.8.0+
Keras-contrib (https://github.com/keras-team/keras-contrib)
TensorFlow Hub (https://www.tensorflow.org/hub)
Any remaining dependencies should be installed alongside the five major libraries above.
