A customizable geoparsing library for unstructured text

Irchel Geoparser

The Irchel Geoparser (hereafter simply Geoparser) is a Python library that provides a complete end-to-end geoparsing pipeline. It recognizes place names (toponyms) in unstructured text and resolves them to their corresponding geographical locations, using a spaCy pipeline for recognition and a fine-tuned SentenceTransformer model for resolution.

Overview

Geoparsing involves two main tasks:

  • Toponym Recognition: Identifying place names in text.
  • Toponym Resolution: Disambiguating these names to their correct geographical locations.

Geoparser addresses both tasks. It combines pretrained language models with an efficient candidate-matching step, with the aim of processing large volumes of text accurately and quickly.

How It Works

  1. Input Processing: Users supply texts as strings, which are preprocessed with a spaCy NLP pipeline. Preprocessing includes tokenization and named entity recognition, which identifies toponyms as names of geopolitical entities, locations, and facilities.

  2. Candidate Generation: For each toponym, the gazetteer database is queried to generate lists of potential candidate locations. This is done using a token-based greedy matching strategy designed to achieve high recall while keeping candidate lists concise.

  3. Textual Representation: Toponyms are represented using their surrounding context, which is extracted and truncated to meet model input length requirements. Candidate locations are also transformed into text by constructing descriptive sentences using attributes sourced from the gazetteer.

  4. Embedding Generation: A fine-tuned SentenceTransformer model is used to encode the textual representations of both the toponyms and their corresponding candidates into embeddings, mapping them into a shared vector space.

  5. Similarity Comparison: Embeddings of toponyms and their corresponding candidates are compared using cosine similarity. The candidates with the highest similarity scores are then selected as the most likely locations referenced by the toponyms.
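The resolution steps above (2 to 5) can be sketched in a few lines of plain Python. This is only a minimal illustration under loud assumptions and not Geoparser's implementation or API: the three-entry gazetteer, its field names, and every function name below are hypothetical. The candidate matcher is a simplified token-overlap filter rather than the library's greedy matching strategy, and a bag-of-words count vector stands in for the fine-tuned SentenceTransformer embeddings. Only the cosine similarity ranking follows the description directly.

```python
import math
import re
from collections import Counter

# Hypothetical mini-gazetteer; field names are illustrative, not Geoparser's schema.
GAZETTEER = [
    {"id": 1, "name": "Paris", "type": "capital city",
     "admin1": "Île-de-France", "country": "France"},
    {"id": 2, "name": "Paris", "type": "city",
     "admin1": "Texas", "country": "United States"},
    {"id": 3, "name": "Zurich", "type": "city",
     "admin1": "Zurich", "country": "Switzerland"},
]


def tokenize(text):
    """Lowercase the text and split it into word tokens."""
    return re.findall(r"\w+", text.lower())


def generate_candidates(toponym, gazetteer=GAZETTEER):
    """Step 2 (simplified): keep entries whose name shares the most tokens."""
    tokens = set(tokenize(toponym))
    overlaps = [(len(tokens & set(tokenize(e["name"]))), e) for e in gazetteer]
    best = max((n for n, _ in overlaps), default=0)
    return [e for n, e in overlaps if n == best and n > 0]


def context_window(text, toponym, max_tokens=12):
    """Step 3a: truncate the text to a token window around the toponym."""
    tokens = tokenize(text)
    toponym_tokens = tokenize(toponym)
    first = toponym_tokens[0] if toponym_tokens else None
    center = tokens.index(first) if first in tokens else 0
    start = max(0, center - max_tokens // 2)
    return " ".join(tokens[start:start + max_tokens])


def describe(entry):
    """Step 3b: turn a gazetteer entry into a descriptive sentence."""
    return (f'{entry["name"]} ({entry["type"]}) in '
            f'{entry["admin1"]}, {entry["country"]}')


def embed(text):
    """Step 4 stand-in: a bag-of-words count vector, not a learned embedding."""
    return Counter(tokenize(text))


def cosine(a, b):
    """Step 5: cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0


def resolve(text, toponym):
    """Return the best-scoring candidate entry, or None if nothing matches."""
    candidates = generate_candidates(toponym)
    if not candidates:
        return None
    query = embed(context_window(text, toponym))
    return max(candidates, key=lambda e: cosine(query, embed(describe(e))))


if __name__ == "__main__":
    match = resolve("The rodeo in Paris drew visitors from across Texas.", "Paris")
    print(describe(match))
```

In this toy setup the word "Texas" in the context overlaps with the description of the second entry, so the toponym resolves to Paris, Texas rather than the French capital. A learned embedding model plays the same role but can also exploit context that shares no literal words with the candidate description.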

Getting Started

To begin using Geoparser, refer to the installation and usage sections of the documentation.

Contributing

Geoparser is an open-source project, and contributions are welcome. If you encounter any issues or have suggestions for improvements, please open an issue or submit a pull request on the GitHub repository.

Acknowledgments

Geoparser originated as part of my Master's thesis and was further developed with support from the Department of Geography at the University of Zurich. I thank my supervisor, Prof. Dr. Ross Purves, for his insightful feedback, encouragement, and the opportunity to continue this work as part of a research project.

License

Geoparser is released under the MIT License. It also uses several third-party libraries, each with its own license. For a complete list of these licenses, see the full license details in the repository.
