A customizable geoparsing library for unstructured text

Irchel Geoparser

The Irchel Geoparser (hereafter referred to simply as Geoparser) is a Python library designed as a complete end-to-end geoparsing pipeline. It integrates advanced natural language processing techniques to recognize and resolve place names (toponyms) in unstructured text, linking them to their corresponding geographical locations.

Overview

Geoparsing involves two main tasks:

  • Toponym Recognition: Identifying place names in text.
  • Toponym Resolution: Disambiguating these names to their correct geographical locations.

Geoparser addresses both tasks in a single pipeline: a spaCy model recognizes toponyms, and a fine-tuned SentenceTransformer model ranks gazetteer candidates to resolve them. The design aims to process large volumes of text accurately and quickly.
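To make the two tasks concrete, here is a minimal, self-contained sketch of what each one takes in and produces. It is not Geoparser's API or method: the toy gazetteer, the `Toponym` class, and the `recognize` and `resolve` functions are hypothetical names for illustration, recognition is a plain dictionary lookup, and resolution uses a naive "most populous candidate" baseline.

```python
import re
from dataclasses import dataclass

# Hypothetical toy gazetteer: name -> list of (label, latitude, longitude, population).
# Coordinates and populations are rough illustrative values, not authoritative data.
TOY_GAZETTEER = {
    "Paris": [
        ("Paris, France", 48.85, 2.35, 2_100_000),
        ("Paris, Texas, USA", 33.66, -95.56, 25_000),
    ],
    "Zurich": [("Zurich, Switzerland", 47.37, 8.54, 400_000)],
}


@dataclass
class Toponym:
    text: str
    start: int  # character offset where the name begins
    end: int    # character offset just past the name


def recognize(text):
    """Toponym recognition: find character spans of known place names."""
    pattern = "|".join(re.escape(name) for name in TOY_GAZETTEER)
    return [
        Toponym(m.group(), m.start(), m.end())
        for m in re.finditer(rf"\b(?:{pattern})\b", text)
    ]


def resolve(toponym):
    """Toponym resolution (naive baseline): pick the most populous candidate."""
    candidates = TOY_GAZETTEER[toponym.text]
    return max(candidates, key=lambda c: c[3])


text = "She moved from Zurich to Paris last year."
for t in recognize(text):
    label, lat, lon, _ = resolve(t)
    print(f"{t.text!r} at [{t.start}:{t.end}] -> {label} ({lat}, {lon})")
```

The population baseline ignores context, so it would pick Paris, France even in a text about Texas. That limitation is what context-aware resolution is meant to overcome.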

How It Works

  1. Input Processing: Users input texts as strings, which are preprocessed using a spaCy NLP pipeline. This includes tokenization and named entity recognition to identify toponyms in the form of names of geopolitical entities, locations, and facilities.

  2. Candidate Generation: For each toponym, the gazetteer database is queried to generate lists of potential candidate locations. This is done using a token-based greedy matching strategy designed to achieve high recall while keeping candidate lists concise.

  3. Textual Representation: Toponyms are represented using their surrounding context, which is extracted and truncated to meet model input length requirements. Candidate locations are also transformed into text by constructing descriptive sentences using attributes sourced from the gazetteer.

  4. Embedding Generation: A fine-tuned SentenceTransformer model is used to encode the textual representations of both the toponyms and their corresponding candidates into embeddings, mapping them into a shared vector space.

  5. Similarity Comparison: Embeddings of toponyms and their corresponding candidates are compared using cosine similarity. The candidates with the highest similarity scores are then selected as the most likely locations referenced by the toponyms.
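Steps 2 to 5 above can be sketched in a few lines of plain Python. This is a simplified illustration under stated assumptions, not Geoparser's implementation: the gazetteer rows and all function names are hypothetical, candidate generation is reduced to a token-subset check rather than the library's greedy matching, and a bag-of-words count vector stands in for the fine-tuned SentenceTransformer embeddings.

```python
import math
import re
from collections import Counter

# Hypothetical gazetteer rows; attribute names and values are illustrative only.
GAZETTEER = [
    {"id": 1, "name": "Paris", "type": "capital", "admin1": "Ile-de-France", "country": "France"},
    {"id": 2, "name": "Paris", "type": "city", "admin1": "Texas", "country": "United States"},
    {"id": 3, "name": "Paris Township", "type": "township", "admin1": "Ohio", "country": "United States"},
]


def tokens(text):
    """Lowercase word tokens."""
    return re.findall(r"\w+", text.lower())


def generate_candidates(toponym):
    """Step 2 (simplified): keep entries whose name contains every toponym token."""
    wanted = set(tokens(toponym))
    return [row for row in GAZETTEER if wanted <= set(tokens(row["name"]))]


def describe(row):
    """Step 3: turn a candidate's attributes into a descriptive sentence."""
    return f'{row["name"]} ({row["type"]}) in {row["admin1"]}, {row["country"]}'


def context_window(text, toponym, max_tokens=12):
    """Step 3: truncate the context around the toponym to a token budget."""
    words = text.split()
    idx = next(i for i, w in enumerate(words) if toponym in w)
    half = max_tokens // 2
    return " ".join(words[max(0, idx - half): idx + half + 1])


def embed(text):
    """Step 4 stand-in: bag-of-words counts instead of a neural encoder."""
    return Counter(tokens(text))


def cosine(a, b):
    """Step 5: cosine similarity between two sparse count vectors."""
    dot = sum(a[k] * b[k] for k in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def resolve(text, toponym):
    """Rank candidates by similarity to the toponym's context; return the best."""
    query = embed(context_window(text, toponym))
    candidates = generate_candidates(toponym)
    return max(candidates, key=lambda row: cosine(query, embed(describe(row))))


text = "The rodeo in Paris drew visitors from all over Texas and Oklahoma."
best = resolve(text, "Paris")
print(best["id"], describe(best))
```

In this toy setup the word "Texas" in the context is what separates the candidates. A learned sentence encoder can pick up on less literal cues that word overlap would miss.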

Getting Started

To begin using Geoparser, refer to the installation and usage sections of the documentation.

Contributing

Geoparser is an open-source project, and contributions are welcome. If you encounter any issues or have suggestions for improvements, please open an issue or submit a pull request on the GitHub repository.

Acknowledgments

Geoparser originated as part of my Master's thesis and was further developed with support from the Department of Geography at the University of Zurich. I thank my supervisor, Prof. Dr. Ross Purves, for his insightful feedback, encouragement, and the opportunity to continue this work as part of a research project.

License

Geoparser is released under the MIT License. It also uses several third-party libraries, each with its own license. For a complete list of these licenses, see the full license details in the repository.

Download files

Source Distribution

geoparser-0.2.3.tar.gz (278.9 kB)

Uploaded Source

Built Distribution

geoparser-0.2.3-py3-none-any.whl (63.0 kB)

Uploaded Python 3
