Detect & extract locations from text or a URL, and find relationships among locations

Project description

locationtagger

version 0.0.1

Detect and extract locations (Countries, Regions/States & Cities) from text or URL. Also, find relationships among countries, regions & cities.


About Project

In the field of Natural Language Processing, many algorithms have been derived for different types of syntactic & semantic analysis of textual data. NER (Named Entity Recognition) is one of the most frequently needed tasks in real-world text-mining problems; it follows grammar-based rules & statistical modelling approaches. An entity extracted by NER can be the name of a person, place, organization or product. locationtagger goes a step further by tagging & filtering out place names (locations) amongst all the entities found with NER.
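As a rough illustration of the underlying idea (not locationtagger's internals), a plain spaCy NER pass can label entities and keep only the location-like ones; this is the kind of filtering the package builds on. The sample sentence and the en_core_web_sm model are assumptions for the sketch:

import spacy

# Assumes the small English model has been downloaded:
#   python -m spacy download en_core_web_sm
nlp = spacy.load("en_core_web_sm")

doc = nlp("Unlike India and Japan, the advisory covers Minnesota and Wisconsin.")

# Keep only entities labelled as geopolitical entities or locations
places = [ent.text for ent in doc.ents if ent.label_ in ("GPE", "LOC")]
print(places)  # e.g. ['India', 'Japan', 'Minnesota', 'Wisconsin']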

The approach followed is shown in the diagram below:

Approach diagram: https://github.com/kaushiksoni10/locationtagger/blob/master/locationtagger/data/diagram.jpg?raw=true


Install and Setup

(Environment: Python >= 3.5)

Install the package using pip:

pip install locationtagger

Before installing the package, though, we need to install some useful libraries, listed below (a single pip command covering all of them is sketched after the list):

nltk

spacy

newspaper3k

pycountry
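If they are not already present, they can typically be installed together with pip (a minimal sketch; these are the package names as published on PyPI):

pip install nltk spacy newspaper3k pycountry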

After installing these packages, some important nltk & spacy modules need to be downloaded using the commands given in /locationtagger/bin/locationtagger-nltk-spacy, run in an IPython shell or a Jupyter notebook.
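That file in the repo is the authoritative list; as a hedged sketch, the typical downloads for this kind of NER pipeline look roughly like the following (the specific resource names here are assumptions, not copied from the file):

import nltk
import spacy

# Common NLTK resources for tokenization, POS tagging and NE chunking
# (assumed set; see locationtagger/bin/locationtagger-nltk-spacy for the real list)
nltk.download('punkt')
nltk.download('averaged_perceptron_tagger')
nltk.download('maxent_ne_chunker')
nltk.download('words')

# English model for spaCy's NER
spacy.cli.download('en_core_web_sm')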


Usage

After proper installation of the package, import the module and pass some text or a URL as input:

Text as input

import locationtagger

text = "Unlike India and Japan, A winter weather advisory remains in effect through 5 PM along and east of a line from Blue Earth, to Red Wing line in Minnesota and continuing to along an Ellsworth, to Menomonie, and Chippewa Falls line in Wisconsin."

entities = locationtagger.find_locations(text = text)


Now we can grab all the place names present in the above text:

entities.countries

['India', 'Japan']

entities.regions

['Minnesota', 'Wisconsin']

entities.cities

['Ellsworth', 'Red Wing', 'Blue Earth', 'Chippewa Falls', 'Menomonie']


Apart from the places extracted from the text, we can also find the countries that these extracted cities and regions belong to:

entities.country_regions

{'United States': ['Minnesota', 'Wisconsin']}

entities.country_cities

{'United States': ['Ellsworth', 'Red Wing', 'Blue Earth', 'Chippewa Falls', 'Menomonie']}


Since "United States" is a country but not present in the text still came from the relations to the cities & regions present in the text, we can find it in other_countries,

entities.other_countries

['United States']


If we want to dig deeper into the cities found in the text, we can find which regions in the world they may fall in:

entities.region_cities

{'Maine': ['Ellsworth'], 'Minnesota': ['Red Wing', 'Blue Earth'], 'Wisconsin': ['Ellsworth', 'Chippewa Falls', 'Menomonie'], 'Pennsylvania': ['Ellsworth'], 'Michigan': ['Ellsworth'], 'Illinois': ['Ellsworth'], 'Kansas': ['Ellsworth'], 'Iowa': ['Ellsworth']}


These regions go into other_regions, since they are not present in the original text:

entities.other_regions

['Maine', 'Minnesota', 'Wisconsin', 'Pennsylvania', 'Michigan', 'Illinois', 'Kansas', 'Iowa']


Most of the words that nltk & spacy grab from the original text as named entities end up in cities, regions & countries. The remaining words (not recognized as place names) are stored in other:

entities.other

['winter', 'PM', 'Chippewa']

URL as input

Similarly, it can grab places from URLs too:

URL = 'https://edition.cnn.com/2020/01/14/americas/staggering-number-of-human-rights-defenders-killed-in-colombia-the-un-says/index.html'
entities2 = locationtagger.find_locations(url = URL)


The outputs we get, starting with countries:

entities2.countries

['Switzerland', 'Colombia']


regions:

entities2.regions

['Geneva']


and cities:

entities2.cities

['Geneva', 'Colombia']


Now, if we want to check how many times a place has been mentioned, or which places are mentioned most often across the whole page at the URL, we can get an idea of which location the page is mainly talking about.

Hence, the most commonly mentioned countries:

entities2.country_mentions

[('Colombia', 3), ('Switzerland', 1), ('United States', 1), ('Mexico', 1)]


and the most commonly mentioned cities:

entities2.city_mentions

[('Colombia', 3), ('Geneva', 1)]
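Since country_mentions and city_mentions come back as (place, count) pairs, a simple way to pick the page's dominant location is to take the highest-count entry. A minimal sketch, assuming the list is non-empty:

# entities2.country_mentions is a list of (place, count) pairs,
# e.g. [('Colombia', 3), ('Switzerland', 1), ...]
top_country, top_count = max(entities2.country_mentions, key=lambda pair: pair[1])
print(top_country, top_count)  # Colombia 3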


Credits

locationtagger uses data from the following source for country, region & city lookups:

GeoLite2 free downloadable database

Apart from the well-known NLP libraries NLTK & spaCy, locationtagger uses the following very useful libraries (a rough sketch of how they are typically used follows the list):

pycountry

newspaper3k
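As a hedged illustration (not locationtagger's internals): newspaper3k is the kind of library used to pull article text out of a URL, and pycountry provides canonical country lookups. A minimal sketch of both, reusing the CNN URL from the example above:

from newspaper import Article
import pycountry

# Fetch and parse an article, then work with its plain text
article = Article('https://edition.cnn.com/2020/01/14/americas/staggering-number-of-human-rights-defenders-killed-in-colombia-the-un-says/index.html')
article.download()
article.parse()
text = article.text

# Look up a canonical country record by name
country = pycountry.countries.lookup('Colombia')
print(country.name, country.alpha_2)  # Colombia CO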



Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

locationtagger-0.0.1.tar.gz (1.5 MB)

Uploaded Source

Built Distribution

locationtagger-0.0.1-py3-none-any.whl (1.6 MB)

Uploaded Python 3

File details

Details for the file locationtagger-0.0.1.tar.gz.

File metadata

  • Download URL: locationtagger-0.0.1.tar.gz
  • Upload date:
  • Size: 1.5 MB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/3.1.1 pkginfo/1.5.0.1 requests/2.21.0 setuptools/45.1.0 requests-toolbelt/0.9.1 tqdm/4.31.1 CPython/3.7.3

File hashes

Hashes for locationtagger-0.0.1.tar.gz

  • SHA256: f87a16a076341eceb827f1b375b714386d7b886b3fec449a9464a972955325a8
  • MD5: 9fbcff56b16e62baa52ce21a7f24ae37
  • BLAKE2b-256: 48fb9b7a5f874fe5d54b6de8aeb9ff0fd86e720a38716e7c444b1f3683601ba8


File details

Details for the file locationtagger-0.0.1-py3-none-any.whl.

File metadata

  • Download URL: locationtagger-0.0.1-py3-none-any.whl
  • Upload date:
  • Size: 1.6 MB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/3.1.1 pkginfo/1.5.0.1 requests/2.21.0 setuptools/45.1.0 requests-toolbelt/0.9.1 tqdm/4.31.1 CPython/3.7.3

File hashes

Hashes for locationtagger-0.0.1-py3-none-any.whl

  • SHA256: e6e653f8f298f66cfc6e1c3dc391d9139090a5d551a6ef5eaa61f3f017fce90d
  • MD5: 1e753e7e13dbc789373088aefe660542
  • BLAKE2b-256: b9398605ba7c1729160e98727b85d98239d234772799a766293c37a53ec42724

