Extract countries, regions and cities from a URL or text

geograpy3

Join the discussion at https://github.com/somnathrakshit/geograpy3/discussions

geograpy3 is a fork of geograpy2, which is itself a fork of geograpy. It inherits most of geograpy2 but solves several problems, such as support for UTF-8, place names with multiple words, and confusion over homonyms. Unlike geograpy2, geograpy3 is compatible with Python 3.

Since geograpy3 0.0.2, cities, countries and regions are matched against a database derived from the corresponding Wikidata entries.

What it is

geograpy extracts place names from a URL or text, and adds context to those names -- for example distinguishing between a country, region or city.

The extraction is a two-step process. The first step is a Natural Language Processing task that analyzes a text for potential mentions of geographic locations. In the second step, the words that represent such locations are looked up using the Locator.

If you already know that your content has geographic information you might want to use the Locator interface directly.
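The two-step process described above can be pictured with a toy pipeline. This is only a sketch and not geograpy3's implementation: `GAZETTEER`, `find_candidates` and `look_up` are hypothetical names, a regular expression stands in for the NLTK-based recognition, and a small dictionary stands in for the Wikidata-derived database.

```python
import re

# Hypothetical miniature gazetteer standing in for the Wikidata-derived database.
GAZETTEER = {
    "France": "country",
    "Germany": "country",
    "Texas": "region",
    "Paris": "city",
    "Berlin": "city",
}


def find_candidates(text):
    """Step 1 (stand-in for the NLP task): collect capitalized words."""
    return re.findall(r"\b[A-Z][a-z]+\b", text)


def look_up(candidates):
    """Step 2 (stand-in for the Locator): classify candidates via the gazetteer."""
    found = {"country": [], "region": [], "city": [], "other": []}
    for name in candidates:
        kind = GAZETTEER.get(name, "other")
        if name not in found[kind]:
            found[kind].append(name)
    return found


result = look_up(find_candidates("Paris is the capital of France. Berlin is in Germany."))
print(result)
```

Keeping the two steps separate is what makes it possible to skip the first one when the input is already known to be a place name.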

Examples/Tutorial

Install & Setup

Grab the package using pip (this will take a few minutes)

pip install geograpy3

geograpy3 uses NLTK for entity recognition, so you'll also need to download the models we're using. Fortunately there's a command that'll take care of this for you.

geograpy-nltk

Command Line Usage

geograpy3 provides a command-line interface for extracting geographic information from text and URLs, and for locating cities.

Extract places from text

geograpy -t "Paris is the capital of France. Berlin is in Germany."

Output:

Countries: ['Germany', 'France']
Regions: []
Cities: ['Paris', 'Berlin']
Other: []
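If you want to consume this output from another script, a small parser may be enough. The sketch below assumes the output has exactly the shape shown above (a label, a colon, then a Python list literal); `parse_cli_output` is a hypothetical helper and not part of geograpy3.

```python
import ast


def parse_cli_output(output):
    """Turn 'Label: [..]' lines into a dict keyed by the lowercased label."""
    parsed = {}
    for line in output.splitlines():
        label, sep, value = line.partition(":")
        if not sep:
            continue  # skip blank lines and lines without a label
        # literal_eval only accepts Python literals, so no code is executed.
        parsed[label.strip().lower()] = ast.literal_eval(value.strip())
    return parsed


sample = """Countries: ['Germany', 'France']
Regions: []
Cities: ['Paris', 'Berlin']
Other: []"""
print(parse_cli_output(sample))
```

In practice the Python API described under Basic Usage is the more robust route, since it does not depend on the printed format.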

Extract places from a URL

geograpy -u https://en.wikipedia.org/wiki/2012_Summer_Olympics_torch_relay

Locate a city with disambiguation

geograpy -l "Paris, Texas"

Output:

Paris (US-TX(Texas) - US(United States of America))

The locator disambiguates between cities with the same name based on region and country context.
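To illustrate the idea of context-based disambiguation, here is a minimal sketch. It is not the Locator's actual algorithm: `CityRecord`, `RECORDS` and `locate` are hypothetical, and the scoring (count how many context tokens match the region or country) is a simplification chosen for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CityRecord:
    name: str
    region: str
    country: str


# Hypothetical records for two cities that share a name.
RECORDS = [
    CityRecord("Paris", "Île-de-France", "France"),
    CityRecord("Paris", "Texas", "United States of America"),
]


def locate(query, records=RECORDS):
    """Pick the record whose region/country best matches the query context."""
    parts = [p.strip() for p in query.split(",") if p.strip()]
    if not parts:
        return None
    name, context = parts[0].lower(), {p.lower() for p in parts[1:]}
    candidates = [r for r in records if r.name.lower() == name]
    if not candidates:
        return None
    # max() keeps the first candidate on ties, so a bare name falls back
    # to the order of the record list.
    return max(candidates, key=lambda r: len(context & {r.region.lower(), r.country.lower()}))


print(locate("Paris, Texas"))
```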

Recreate the database

geograpy -db

This downloads and recreates the location database from Wikidata.

All CLI options

geograpy -h

Options:

  • -u URL, --url URL - extract places from the given URL
  • -t TEXT, --text TEXT - extract places from the given text
  • -l LOCATION, --location LOCATION - locate a city (e.g. 'Paris, Texas')
  • -db, --recreateDatabase - recreate the database
  • -cm, --correctSpelling - correct typical misspellings
  • -d, --debug - show debug information
  • -V, --version - show program version
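For readers wrapping the tool in their own scripts, the option list above maps naturally onto an `argparse` declaration. The following is a sketch that mirrors the documented flags; it is not geograpy3's actual source, and `build_parser` and the version placeholder are assumptions made for the example.

```python
import argparse


def build_parser():
    """Declare a parser with the same flags as listed above (illustrative only)."""
    parser = argparse.ArgumentParser(prog="geograpy")
    parser.add_argument("-u", "--url", help="extract places from the given URL")
    parser.add_argument("-t", "--text", help="extract places from the given text")
    parser.add_argument("-l", "--location", help="locate a city (e.g. 'Paris, Texas')")
    parser.add_argument("-db", "--recreateDatabase", action="store_true",
                        help="recreate the database")
    parser.add_argument("-cm", "--correctSpelling", action="store_true",
                        help="correct typical misspellings")
    parser.add_argument("-d", "--debug", action="store_true",
                        help="show debug information")
    # The version string is a placeholder, not the real package version.
    parser.add_argument("-V", "--version", action="version",
                        version="%(prog)s (version placeholder)")
    return parser


args = build_parser().parse_args(["-t", "Paris is the capital of France.", "-d"])
print(args.text, args.debug, args.recreateDatabase)
```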

Getting the source code

git clone https://github.com/somnathrakshit/geograpy3
cd geograpy3
scripts/install

Basic Usage

Import the module, give some text or a URL, and presto.

import geograpy
url = 'https://en.wikipedia.org/wiki/2012_Summer_Olympics_torch_relay'
places = geograpy.get_geoPlace_context(url=url)

Now you have access to information about all the places mentioned in the linked article.

  • places.countries contains a list of country names
  • places.regions contains a list of region names
  • places.cities contains a list of city names
  • places.other lists everything that wasn't clearly a country, region or city

Note that the other list can be useful for shorter texts, to pull out information such as street names and points of interest. At the moment it is a bit messy for longer texts that contain adjectival forms of proper nouns (like "Russian" instead of "Russia").
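A small helper can make these four lists easier to inspect. The sketch below only relies on the attribute names listed above; `summarize` is a hypothetical function, and a `SimpleNamespace` stands in for the object returned by `geograpy.get_geoPlace_context` so the example runs without network access or the NLTK models.

```python
from types import SimpleNamespace


def summarize(places):
    """Return de-duplicated, sorted names for each of the four attributes."""
    return {
        attr: sorted(set(getattr(places, attr)))
        for attr in ("countries", "regions", "cities", "other")
    }


# Stand-in for the result of geograpy.get_geoPlace_context(url=url).
stub = SimpleNamespace(
    countries=["France", "Germany", "France"],
    regions=[],
    cities=["Paris", "Berlin"],
    other=["Russian"],
)
print(summarize(stub))
```

With geograpy3 installed, passing the real `places` object instead of `stub` should work the same way, assuming each attribute is a list of strings as described.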

But Wait, There's More

In addition to listing the names of discovered places, you'll also get some information about the relationships between places.

  • places.country_regions - regions broken down by country
  • places.country_cities - cities broken down by country
  • places.address_strings - city, region, country strings useful for geocoding
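To give a feel for what these relationship structures could look like, here is a sketch built from hand-written rows. The exact container types geograpy3 returns are not stated here, so the dictionaries and list below are an assumption; `ROWS` and `relate` are hypothetical names.

```python
from collections import defaultdict

# Hypothetical (city, region, country) rows, as a lookup step might produce.
ROWS = [
    ("Cleveland", "Ohio", "United States"),
    ("Paris", "Texas", "United States"),
    ("Munich", "Bavaria", "Germany"),
]


def relate(rows):
    """Group regions and cities by country and build geocodable strings."""
    country_regions = defaultdict(list)
    country_cities = defaultdict(list)
    address_strings = []
    for city, region, country in rows:
        if region not in country_regions[country]:
            country_regions[country].append(region)
        if city not in country_cities[country]:
            country_cities[country].append(city)
        address_strings.append(f"{city}, {region}, {country}")
    return dict(country_regions), dict(country_cities), address_strings


country_regions, country_cities, address_strings = relate(ROWS)
print(country_regions)
print(country_cities)
print(address_strings)
```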

Last But Not Least

While a text might mention many places, it's probably focused on one or two, so geograpy3 also breaks down countries, regions and cities by number of mentions.

  • places.country_mentions
  • places.region_mentions
  • places.city_mentions

Each of these returns a list of tuples. The first item in the tuple is the place name and the second item is the number of mentions. For example:

[('Russian Federation', 14), ('Ukraine', 11), ('Lithuania', 1)]
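A list of (name, count) tuples like this is the shape that `collections.Counter.most_common()` produces. The sketch below shows how such a list could be built and how to read off the most-mentioned place; `count_mentions` is a hypothetical helper and the input list is made up for the example, not geograpy3 output.

```python
from collections import Counter


def count_mentions(names):
    """Return (name, count) tuples ordered from most to least mentioned."""
    return Counter(names).most_common()


mentions = count_mentions([
    "Ukraine", "Russian Federation", "Ukraine",
    "Lithuania", "Russian Federation", "Russian Federation",
])
print(mentions)

# The first tuple is the place the text focuses on most.
top_name, top_count = mentions[0]
print(top_name, top_count)
```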

If You're Really Serious

You can of course use each of Geograpy's modules on their own. For example:

from geograpy import extraction

e = extraction.Extractor(url='https://en.wikipedia.org/wiki/2012_Summer_Olympics_torch_relay')
e.find_geoEntities()

# You can now access all of the places found by the Extractor
print(e.places)

Place context is handled in the places module. For example:

from geograpy import places

pc = places.PlaceContext(['Cleveland', 'Ohio', 'United States'])

# Each set_ method populates the matching attribute.
pc.set_countries()
print(pc.countries)  # ['United States']

pc.set_regions()
print(pc.regions)  # ['Ohio']

pc.set_cities()
print(pc.cities)  # ['Cleveland']

print(pc.address_strings)  # ['Cleveland, Ohio, United States']

And of course all of the other information shown above (country_regions etc) is available after the corresponding set_ method is called.

Stackoverflow

Credits

geograpy3 uses the following excellent libraries:

  • NLTK for entity recognition
  • newspaper4k for text extraction from HTML
  • jellyfish for fuzzy text match
  • pylodstorage for storage and retrieval of tabular data from SQL and SPARQL sources

geograpy3 uses the following data sources:

Hat tip to Chris Albon for the name.
