
A geoparsing library for English texts


Geoparser

Geoparser is a Python library for geoparsing English texts. It leverages spaCy for toponym recognition and fine-tuned SentenceTransformer models for toponym resolution.

Installation

Install Geoparser using pip:

pip install geoparser

Dependencies

Geoparser depends on the following Python libraries:

These dependencies are installed automatically when you install Geoparser with pip.

Download Required Data

After installation, run the following command to download the files Geoparser needs in order to function:

python -m geoparser download

This command will download the following resources and set up a SQLite database for the GeoNames data:

These files are temporarily stored in your system's user-specific data directory during the database setup. Once the database has been populated with the data, the original files are automatically deleted to free up space. The database is then stored in this location:

  • Windows: C:\Users\<Username>\AppData\Local\geoparser\geonames.db
  • macOS: ~/Library/Application Support/geoparser/geonames.db
  • Linux: ~/.local/share/geoparser/geonames.db
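
To check from a script whether the database has already been set up, the default locations listed above can be rebuilt with the standard library. This is only a sketch: the helper name expected_db_path is hypothetical and not part of Geoparser, and it assumes the default locations shown above, ignoring any environment overrides such as XDG_DATA_HOME.

```python
import sys
from pathlib import Path


def expected_db_path(platform: str = sys.platform) -> Path:
    """Return the default geonames.db location listed above (hypothetical helper)."""
    home = Path.home()
    if platform.startswith("win"):
        base = home / "AppData" / "Local"
    elif platform == "darwin":
        base = home / "Library" / "Application Support"
    else:
        # Assumption: every other platform follows the Linux layout.
        base = home / ".local" / "share"
    return base / "geoparser" / "geonames.db"


db_path = expected_db_path()
status = "found" if db_path.exists() else "not found; run the download command"
print(f"{db_path} ({status})")
```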

Please ensure you have enough disk space available. The final size of the downloaded GeoNames data will be approximately 3.2 GB, increasing temporarily to 5.5 GB during the download and setup process.
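
If you want to verify the available disk space before starting the download, something along these lines may help. It is a sketch using only the standard library: has_enough_space is a hypothetical helper, and the 5.5 GB peak figure is read as decimal gigabytes, which is an assumption.

```python
import shutil
from pathlib import Path

# Peak usage stated above; assumes decimal gigabytes (10**9 bytes).
REQUIRED_BYTES = int(5.5 * 1000**3)


def has_enough_space(path: Path, required: int = REQUIRED_BYTES) -> bool:
    """Return True if the filesystem holding `path` has at least `required` bytes free."""
    # The target directory may not exist yet, so walk up to the nearest existing parent.
    probe = path
    while not probe.exists():
        probe = probe.parent
    return shutil.disk_usage(probe).free >= required


# Check the home directory, which contains all three default locations.
print(has_enough_space(Path.home()))
```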

Usage

Instantiating the Geoparser

To use Geoparser, you need to instantiate an object of the Geoparser class. You can specify which spaCy and transformer models to use, optimising for either accuracy or speed. By default, the library uses accuracy-optimised models:

from geoparser import Geoparser

geo = Geoparser()

Default configuration:

geo = Geoparser(spacy_model='en_core_web_trf', transformer_model='dguzh/geo-all-distilroberta-v1')

For faster performance, you can opt for the smaller models:

geo = Geoparser(spacy_model='en_core_web_sm', transformer_model='dguzh/geo-all-MiniLM-L6-v2')

You can mix and match these models depending on your specific needs. Note that the SentenceTransformer models dguzh/geo-all-distilroberta-v1 and dguzh/geo-all-MiniLM-L6-v2 are preliminary versions. Future updates aim to refine these models to improve the accuracy of toponym disambiguation.
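
One way to keep the mix-and-match choices in one place is a small lookup table. The model names below are the ones given above; the preset labels "accurate" and "fast" and the helper model_kwargs are illustrative names, not part of Geoparser.

```python
# Model names come from the text above; the preset labels are illustrative only.
MODEL_PRESETS = {
    "accurate": {
        "spacy_model": "en_core_web_trf",
        "transformer_model": "dguzh/geo-all-distilroberta-v1",
    },
    "fast": {
        "spacy_model": "en_core_web_sm",
        "transformer_model": "dguzh/geo-all-MiniLM-L6-v2",
    },
}


def model_kwargs(spacy: str = "accurate", transformer: str = "accurate") -> dict:
    """Build keyword arguments for Geoparser(...) from two preset labels (hypothetical helper)."""
    return {
        "spacy_model": MODEL_PRESETS[spacy]["spacy_model"],
        "transformer_model": MODEL_PRESETS[transformer]["transformer_model"],
    }


# Fast toponym recognition combined with the more accurate resolution model.
kwargs = model_kwargs(spacy="fast", transformer="accurate")
print(kwargs)
# Usage, with Geoparser installed: geo = Geoparser(**kwargs)
```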

Parsing Texts

Geoparser is optimised for parsing large collections of texts at once. Pass a list of strings to the parse method:

docs = geo.parse(["Sample text 1", "Sample text 2", "Sample text 3"])

The parse method returns a list of Document objects, where each Document contains a list of Toponym objects. Each Toponym that is successfully geocoded will have a corresponding Location object with the following attributes:

  • geonameid: The unique identifier for the place in the GeoNames database.
  • name: The name of the geographical location.
  • admin2_geonameid: The GeoNames identifier for the second-level administrative division.
  • admin2_name: The name of the second-level administrative division.
  • admin1_geonameid: The GeoNames identifier for the first-level administrative division.
  • admin1_name: The name of the first-level administrative division.
  • country_geonameid: The GeoNames identifier for the country.
  • country_name: The name of the country.
  • feature_name: The type of geographical feature (e.g., mountain, lake).
  • latitude: The latitude of the location.
  • longitude: The longitude of the location.
  • elevation: The elevation of the location in meters.
  • population: The population of the location.
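
The attributes above can be flattened into a dictionary, for example to write results to a CSV file. The sketch below assumes the fields are exposed through ordinary attribute access, which the list above suggests but does not confirm. location_to_row is a hypothetical helper, and a stand-in object with placeholder values is used so the snippet runs without Geoparser installed.

```python
from types import SimpleNamespace

# Field names as listed above.
LOCATION_FIELDS = [
    "geonameid", "name",
    "admin2_geonameid", "admin2_name",
    "admin1_geonameid", "admin1_name",
    "country_geonameid", "country_name",
    "feature_name", "latitude", "longitude",
    "elevation", "population",
]


def location_to_row(location) -> dict:
    """Copy the listed fields into a dict; fields missing on the object become None."""
    return {field: getattr(location, field, None) for field in LOCATION_FIELDS}


# Stand-in for a Location object; all values are placeholders, not real data.
fake_location = SimpleNamespace(
    name="Example Town", country_name="Exampleland", latitude=0.0, longitude=0.0
)
row = location_to_row(fake_location)
print(row["name"], row["latitude"], row["longitude"])
```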

Geocoding Scope

You can limit the scope of geocoding by specifying one or more countries and GeoNames feature classes. Geoparser then only considers candidate locations that lie within the specified countries and belong to the specified feature classes. To do so, pass the country_filter and feature_filter parameters to the parse method:

docs = geo.parse(texts, country_filter=['CH', 'DE', 'AT'], feature_filter=['A', 'P'])
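
Since a mistyped code could silently narrow the results, it may be worth checking the filter values before calling parse. The sketch below lists the nine GeoNames feature classes and assumes, based on the example above, that countries are given as two-letter uppercase codes. check_filters is a hypothetical helper; whether Geoparser performs its own validation is not stated here.

```python
# The nine feature classes defined by GeoNames.
GEONAMES_FEATURE_CLASSES = {
    "A": "countries, states, regions",
    "H": "streams, lakes",
    "L": "parks, areas",
    "P": "cities, villages",
    "R": "roads, railroads",
    "S": "spots, buildings, farms",
    "T": "mountains, hills, rocks",
    "U": "undersea features",
    "V": "forests, heaths",
}


def check_filters(country_filter, feature_filter) -> None:
    """Raise ValueError if a filter value looks malformed (hypothetical helper)."""
    # Assumption: country codes are two uppercase letters, as in the example above.
    bad_countries = [
        c for c in country_filter
        if not (len(c) == 2 and c.isalpha() and c.isupper())
    ]
    bad_features = [f for f in feature_filter if f not in GEONAMES_FEATURE_CLASSES]
    if bad_countries or bad_features:
        raise ValueError(
            f"invalid country codes: {bad_countries}; invalid feature classes: {bad_features}"
        )


check_filters(["CH", "DE", "AT"], ["A", "P"])  # passes without raising
```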

Example

Here's an example illustrating how the library might be used:

from geoparser import Geoparser

# Load the default (accuracy-optimised) models
geo = Geoparser()

texts = [
    "Zurich is a city rich in history.",
    "Geneva is known for its role in international diplomacy.",
    "Munich is famous for its annual Oktoberfest celebration.",
]

# texts is already a list of strings, so pass it directly
docs = geo.parse(texts)

# Print each toponym that was successfully geocoded
for doc in docs:
    for toponym in doc.toponyms:
        if toponym.location:
            print(toponym, "->", toponym.location)

Download files

Version 0.1.6 is available as a source distribution, geoparser-0.1.6.tar.gz (12.2 kB), and as a built distribution for Python 3, geoparser-0.1.6-py3-none-any.whl (11.5 kB).
