A geoparsing library for English texts

Geoparser

Geoparser is a Python library for geoparsing unstructured texts. It employs spaCy for toponym recognition and fine-tuned SentenceTransformer models for toponym resolution.

Installation

Install Geoparser using pip:

pip install geoparser

Dependencies

Geoparser's Python dependencies are installed automatically when you install it with pip.

GPU support: Geoparser runs considerably faster on a GPU. If you have a CUDA-enabled GPU, you can use it both for toponym recognition with spaCy's transformer models and for toponym resolution with the SentenceTransformer models. To do so, install PyTorch with CUDA support and the GPU-enabled version of spaCy.
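
Before enabling GPU processing, it can help to confirm that PyTorch actually sees a CUDA device. The following is a minimal sketch; the helper name cuda_available is hypothetical, and it deliberately reports False when PyTorch is not installed instead of raising an error.

```python
import importlib.util


def cuda_available() -> bool:
    """Return True only if PyTorch is installed and reports a usable CUDA device."""
    # Avoid an ImportError on machines where PyTorch is not installed.
    if importlib.util.find_spec("torch") is None:
        return False
    import torch

    return torch.cuda.is_available()


if __name__ == "__main__":
    print("CUDA available:", cuda_available())
```

spaCy offers a comparable check through spacy.prefer_gpu(), which returns whether a GPU was activated.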

Download Required Data

Before you can use Geoparser, you need to download a spaCy model and the gazetteer data:

spaCy Models

You should manually download the desired spaCy model based on your specific needs. For example, to download the default recommended model for English texts, run:

python -m spacy download en_core_web_trf

For an overview of available spaCy models, visit the spaCy models documentation.

Gazetteer Data

Geoparser uses gazetteer data to resolve toponyms to geographic locations. The default gazetteer is GeoNames, and it can be set up with the following command:

python -m geoparser download geonames

This command downloads the required files from GeoNames and sets up a SQLite database with the data needed for geoparsing.

These files are temporarily stored in your system's user-specific data directory during the database setup. Once the database has been populated with the data, the original files are automatically deleted to free up space. The database is then stored in this location:

  • Windows: C:\Users\<Username>\AppData\Local\geoparser\geonames.db
  • macOS: ~/Library/Application Support/geoparser/geonames.db
  • Linux: ~/.local/share/geoparser/geonames.db
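
To locate the database from a script, the paths listed above can be reproduced with the standard library. This is only a sketch that mirrors the listed locations; the function name expected_db_path is hypothetical, and it does not call into Geoparser, so the library remains the authority on where the file really lives.

```python
import platform
from pathlib import Path, PurePosixPath, PureWindowsPath


def expected_db_path(system: str, home: str):
    """Build the documented database path for a platform.system() value and a home directory."""
    if system == "Windows":
        return PureWindowsPath(home) / "AppData" / "Local" / "geoparser" / "geonames.db"
    if system == "Darwin":  # macOS
        return PurePosixPath(home) / "Library" / "Application Support" / "geoparser" / "geonames.db"
    # Everything else is treated like Linux in this sketch.
    return PurePosixPath(home) / ".local" / "share" / "geoparser" / "geonames.db"


if __name__ == "__main__":
    print(expected_db_path(platform.system(), str(Path.home())))
```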

Please ensure you have enough disk space available. The final size of the downloaded GeoNames data will be approximately 3.2 GB, increasing temporarily to around 5 GB during the download and setup process.
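
A quick free-space check before starting the download might look like the sketch below. The function names are hypothetical, the 5 GB default comes from the figure above, and 1 GB is taken to mean 10^9 bytes, which is an assumption. Pass a directory that already exists, such as your home directory, because the data directory may not have been created yet.

```python
import shutil


def free_space_gb(path: str = ".") -> float:
    """Free space on the volume containing `path`, in GB (10**9 bytes)."""
    return shutil.disk_usage(path).free / 1e9


def has_room_for_geonames(path: str = ".", required_gb: float = 5.0) -> bool:
    """True if the volume has at least `required_gb` GB free."""
    return free_space_gb(path) >= required_gb


if __name__ == "__main__":
    print(f"Free space: {free_space_gb():.1f} GB, enough: {has_room_for_geonames()}")
```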

Note: The library currently only supports the GeoNames gazetteer, but the framework allows for future extensions with other knowledge bases.

Usage

Instantiating the Geoparser

To use Geoparser, instantiate an object of the Geoparser class with optional specifications for the spaCy model, transformer model, and gazetteer. By default, the library uses an accuracy-optimised configuration:

from geoparser import Geoparser

geo = Geoparser()

Default configuration:

geo = Geoparser(spacy_model='en_core_web_trf', transformer_model='dguzh/geo-all-distilroberta-v1', gazetteer='geonames')

For faster performance, you may opt for more lightweight models:

geo = Geoparser(spacy_model='en_core_web_sm', transformer_model='dguzh/geo-all-MiniLM-L6-v2', gazetteer='geonames')

You can mix and match these models depending on your specific needs. Note that the SentenceTransformer models dguzh/geo-all-distilroberta-v1 and dguzh/geo-all-MiniLM-L6-v2 are preliminary versions. Future updates aim to refine these models to improve the accuracy of toponym disambiguation.
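
One way to keep the two documented configurations at hand and still mix and match individual models is a small table of presets. This is a sketch: the names PRESETS and geoparser_kwargs are hypothetical and not part of Geoparser; only the keyword arguments and model names are taken from the examples above.

```python
PRESETS = {
    "accurate": {
        "spacy_model": "en_core_web_trf",
        "transformer_model": "dguzh/geo-all-distilroberta-v1",
        "gazetteer": "geonames",
    },
    "fast": {
        "spacy_model": "en_core_web_sm",
        "transformer_model": "dguzh/geo-all-MiniLM-L6-v2",
        "gazetteer": "geonames",
    },
}


def geoparser_kwargs(preset: str = "accurate", **overrides) -> dict:
    """Return keyword arguments for Geoparser(), optionally overriding single entries."""
    if preset not in PRESETS:
        raise ValueError(f"unknown preset: {preset!r}")
    kwargs = dict(PRESETS[preset])  # copy so the preset table stays untouched
    unknown = set(overrides) - set(kwargs)
    if unknown:
        raise TypeError(f"unexpected arguments: {sorted(unknown)}")
    kwargs.update(overrides)
    return kwargs


# Example: fast resolution model combined with the accurate spaCy pipeline.
# geo = Geoparser(**geoparser_kwargs("fast", spacy_model="en_core_web_trf"))
```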

Parsing Texts

Geoparser is optimised for parsing large collections of texts at once. To perform parsing, supply a list of strings to the parse method. This method processes the input and returns a list of GeoDoc objects, each containing identified and resolved toponyms:

docs = geo.parse(["Sample text 1", "Sample text 2", "Sample text 3"])

The GeoDoc class extends spaCy's Doc class, inheriting all its functionalities. You can access the toponyms identified in each document through GeoDoc.toponyms, which returns a tuple of GeoSpan objects representing the toponyms in the document. The GeoSpan class is an extension of spaCy's Span class and inherits all its functionalities:

for doc in docs:
    for toponym in doc.toponyms:
        print(toponym, toponym.start_char, toponym.end_char)

Each toponym is resolved to its corresponding geographical location, which can be accessed using GeoSpan.location. This returns a dictionary with geographic data sourced from the gazetteer:

for doc in docs:
    for toponym in doc.toponyms:
        if toponym.location:
            print(toponym, toponym.location['geonameid'], toponym.location['latitude'], toponym.location['longitude'])

Example of a location dictionary using the GeoNames gazetteer:

{
    'geonameid': 2867714,
    'name': 'Munich',
    'admin2_geonameid': 2861322,
    'admin2_name': 'Upper Bavaria',
    'admin1_geonameid': 2951839,
    'admin1_name': 'Bavaria',
    'country_geonameid': 2921044,
    'country_name': 'Germany',
    'feature_name': 'seat of a first-order administrative division',
    'latitude': 48.13743,
    'longitude': 11.57549,
    'elevation': None,
    'population': 1260391
}

The certainty of the toponym resolution predictions can be retrieved using the GeoSpan.score property. Users may choose to only consider predictions above a certain threshold as valid.
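
Filtering by score could be sketched as follows. The helper confident_toponyms and the threshold of 0.7 are hypothetical; the scale of the score is not documented here, so a suitable cut-off has to be found empirically. The sketch also assumes that an unresolved toponym may have a score of None. Stand-in objects are used so the snippet runs without a gazetteer; in practice you would pass doc.toponyms.

```python
from types import SimpleNamespace


def confident_toponyms(toponyms, threshold):
    """Keep only toponyms whose resolution score reaches the threshold."""
    return [t for t in toponyms if t.score is not None and t.score >= threshold]


# Stand-ins that only mimic the `score` attribute of GeoSpan objects.
sample = [
    SimpleNamespace(text="Munich", score=0.93),
    SimpleNamespace(text="Springfield", score=0.41),
    SimpleNamespace(text="Atlantis", score=None),
]

kept = confident_toponyms(sample, threshold=0.7)
print([t.text for t in kept])
```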

To retrieve location data for a whole document, use the GeoDoc.locations attribute, which returns location dictionaries aligned with GeoDoc.toponyms. Because the data is fetched in a batch, this needs fewer database queries than accessing each toponym's location individually:

  • To get a list of location dictionaries of all toponyms in a document:
all_locations = doc.locations
  • To retrieve specific attributes:
all_geonameids = doc.locations['geonameid']
  • To retrieve multiple attributes:
all_coordinates = doc.locations['latitude', 'longitude']
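
When post-processing such a list of location dictionaries, unresolved toponyms need to be handled. The sketch below assumes they appear as None entries, which is an assumption suggested by the truthiness check on toponym.location shown earlier; the helper resolved_coordinates is hypothetical and works on plain dictionaries, so it runs without Geoparser.

```python
def resolved_coordinates(locations):
    """Return (name, latitude, longitude) rows, skipping unresolved entries."""
    rows = []
    for loc in locations:
        if not loc:  # assumed: None marks a toponym that could not be resolved
            continue
        rows.append((loc["name"], loc["latitude"], loc["longitude"]))
    return rows


# Plain dictionaries standing in for the contents of doc.locations.
sample_locations = [
    {"name": "Munich", "latitude": 48.13743, "longitude": 11.57549},
    None,
]

print(resolved_coordinates(sample_locations))
```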

Geocoding Scope

You can limit the scope of geocoding by specifying one or more countries and GeoNames feature classes. Geoparser then only considers locations within the specified countries and of the specified feature types. To do so, pass the country_filter and feature_filter parameters to the parse method:

docs = geo.parse(texts, country_filter=['CH', 'DE', 'AT'], feature_filter=['A', 'P'])
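
Checking filter values before a long parsing run can catch typos early. The sketch below lists the nine GeoNames feature classes and assumes, based on the example above, that countries are given as two-letter uppercase codes. The helper check_filters is hypothetical and only checks the shape of the country codes, not whether they exist.

```python
import re

# The nine top-level GeoNames feature classes.
GEONAMES_FEATURE_CLASSES = {
    "A": "country, state, region",
    "H": "stream, lake",
    "L": "parks, area",
    "P": "city, village",
    "R": "road, railroad",
    "S": "spot, building, farm",
    "T": "mountain, hill, rock",
    "U": "undersea",
    "V": "forest, heath",
}

_COUNTRY_CODE = re.compile(r"[A-Z]{2}")


def check_filters(country_filter, feature_filter):
    """Raise ValueError if a filter value is malformed; return None otherwise."""
    bad_countries = [c for c in country_filter if not _COUNTRY_CODE.fullmatch(c)]
    bad_classes = [f for f in feature_filter if f not in GEONAMES_FEATURE_CLASSES]
    if bad_countries or bad_classes:
        raise ValueError(
            f"invalid country codes {bad_countries}, invalid feature classes {bad_classes}"
        )


check_filters(["CH", "DE", "AT"], ["A", "P"])  # passes silently
```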

Example

Here's an example illustrating how Geoparser might be used:

from geoparser import Geoparser

geo = Geoparser()

texts = [
    "Zurich is a city rich in history.",
    "Geneva is known for its role in international diplomacy.",
    "Munich is famous for its annual Oktoberfest celebration."
]

docs = geo.parse(texts)

for doc in docs:
    identifiers = doc.locations['name', 'admin1_name', 'country_name']
    for toponym, identifier in zip(doc.toponyms, identifiers):
        print(toponym, "->", identifier)

Training Custom Geoparser Models

The GeoparserTrainer is an extension of the Geoparser class designed for training and evaluating geoparsing models with custom datasets. This allows users to fine-tune transformer models specific to their texts or domains.

Fine-Tuning HuggingFace Models for Geoparsing

The GeoparserTrainer supports fine-tuning any transformer model from HuggingFace that is compatible with the SentenceTransformers framework. This allows users to leverage a wide range of pre-trained models to enhance geoparsing capabilities tailored to specific needs.

While it is possible to fine-tune virtually any HuggingFace model that works within the SentenceTransformers ecosystem, the employed geoparsing strategy benefits from models that are pre-trained on sentence similarity tasks. For an overview of pre-trained SentenceTransformer models that are optimised for tasks like sentence similarity, please refer to the official SentenceTransformers documentation.

Preparing Your Dataset

To train a custom geoparser model, you need to prepare a dataset formatted as a list of tuples, where each tuple contains a text string and an associated list of annotations. Annotations should be tuples of (start character, end character, location id) that mark the toponyms within the text:

train_corpus = [
    ("Zurich is a city in Switzerland.", [(0, 6, '2657896'), (20, 31, '2658434')]),
    ("Geneva is known for international diplomacy.", [(0, 6, '2660646')]),
    ("Munich hosts the annual Oktoberfest.", [(0, 6, '2867714')])
]
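
Counting character offsets by hand is error-prone, so a small helper can derive them from the toponym string. This is a sketch with hypothetical helper names; it assumes the end offset is exclusive, as in the example above where "Zurich" is annotated as (0, 6), and it only annotates the first occurrence of a toponym in the text.

```python
def annotate_first(text, toponym, location_id):
    """Return a (start, end, location_id) tuple for the first occurrence of `toponym`."""
    start = text.find(toponym)
    if start == -1:
        raise ValueError(f"{toponym!r} not found in text")
    return (start, start + len(toponym), location_id)


def check_annotations(text, annotations):
    """Raise ValueError if any annotation falls outside the text or is empty."""
    for start, end, _ in annotations:
        if not (0 <= start < end <= len(text)):
            raise ValueError(f"offsets ({start}, {end}) do not fit the text")


text = "Zurich is a city in Switzerland."
annotations = [
    annotate_first(text, "Zurich", "2657896"),
    annotate_first(text, "Switzerland", "2658434"),
]
check_annotations(text, annotations)
print(annotations)
```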

Annotating and Preparing Training Data

Once you have your dataset, use the annotate method to convert the text and annotations into gold GeoDoc objects suitable for training:

from geoparser import GeoparserTrainer

trainer = GeoparserTrainer(transformer_model="bert-base-uncased")

train_docs = trainer.annotate(train_corpus)

Training the Model

You can then train a model using the prepared documents:

output_path = "path_to_custom_model"

trainer.train(train_docs, output_path=output_path)

Evaluating the Model

After training, you can use the fine-tuned model to resolve toponyms in a test set and evaluate how well your model performed:

test_corpus = [
    ...
]

test_docs = trainer.annotate(test_corpus)

trainer.resolve(test_docs)

evaluation_results = trainer.evaluate(test_docs)

print(evaluation_results)

This compares the predicted location IDs against the annotated IDs and provides the following metrics:

  • Exact match accuracy
  • Accuracy within a 161 km radius
  • Mean error distance [km]
  • Area under the curve
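
To make the first three metrics concrete, the sketch below shows how they are conventionally computed from gold and predicted locations, using the haversine great-circle distance. It is not the library's implementation: the function names, the input format, and the Earth radius of 6371 km are assumptions, the coordinates for Zurich and Geneva are approximate values for illustration, and the area under the curve is left out.

```python
import math

EARTH_RADIUS_KM = 6371.0  # mean Earth radius, an assumption of this sketch


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance between two points given in decimal degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


def summarize_errors(pairs):
    """pairs: list of (gold_id, (lat, lon), predicted_id, (lat, lon)) tuples."""
    if not pairs:
        raise ValueError("need at least one gold/prediction pair")
    distances = [haversine_km(*gold, *pred) for _, gold, _, pred in pairs]
    n = len(pairs)
    return {
        "accuracy": sum(g == p for g, _, p, _ in pairs) / n,
        "accuracy_at_161km": sum(d <= 161 for d in distances) / n,
        "mean_error_distance_km": sum(distances) / n,
    }


sample_pairs = [
    # Munich predicted correctly.
    ("2867714", (48.13743, 11.57549), "2867714", (48.13743, 11.57549)),
    # Zurich annotated, Geneva predicted (approximate coordinates).
    ("2657896", (47.36667, 8.55), "2660646", (46.20222, 6.14569)),
]

print(summarize_errors(sample_pairs))
```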

Using the Trained Model

Once trained, you can use your custom model to parse new texts by specifying the trained transformer model's path when instantiating Geoparser:

from geoparser import Geoparser

geo = Geoparser(transformer_model="path_to_custom_model")

docs = geo.parse(["New text to parse"])

