A customizable geoparsing library for unstructured text
Irchel Geoparser
The Irchel Geoparser (hereafter referred to simply as Geoparser) is a Python library designed as a complete end-to-end geoparsing pipeline. It integrates advanced natural language processing techniques to recognize and resolve place names (toponyms) in unstructured text, linking them to their corresponding geographical locations.
Overview
Geoparsing involves two main tasks:
- Toponym Recognition: Identifying place names in text.
- Toponym Resolution: Disambiguating these names to their correct geographical locations.
Geoparser addresses both tasks by combining state-of-the-art language models and efficient algorithms, enabling it to process large volumes of text with high accuracy and speed.
How It Works
- Input Processing: Users input texts as strings, which are preprocessed using a spaCy NLP pipeline. This includes tokenization and named entity recognition to identify toponyms in the form of names of geopolitical entities, locations, and facilities.
- Candidate Generation: For each toponym, the gazetteer database is queried to generate lists of potential candidate locations. This is done using a token-based greedy matching strategy designed to achieve high recall while keeping candidate lists concise.
- Textual Representation: Toponyms are represented using their surrounding context, which is extracted and truncated to meet model input length requirements. Candidate locations are also transformed into text by constructing descriptive sentences using attributes sourced from the gazetteer.
- Embedding Generation: A fine-tuned SentenceTransformer model is used to encode the textual representations of both the toponyms and their corresponding candidates into embeddings, mapping them into a shared vector space.
- Similarity Comparison: Embeddings of toponyms and their corresponding candidates are compared using cosine similarity. The candidates with the highest similarity scores are then selected as the most likely locations referenced by the toponyms.
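The last three steps (textual representation, embedding, similarity comparison) can be illustrated with a small standalone sketch. This is not Geoparser's implementation or API: the functions `describe`, `embed`, and `rank_candidates`, the candidate attributes, and the sentence template are all hypothetical, and a simple bag-of-words count vector stands in for the fine-tuned SentenceTransformer embeddings. Only the cosine similarity ranking mirrors the approach described above.

```python
import math
from collections import Counter


def embed(text):
    """Toy stand-in for a sentence encoder: lowercase bag-of-words counts.

    Assumption: the real pipeline uses a fine-tuned SentenceTransformer here.
    """
    tokens = [t.strip(".,;:()").lower() for t in text.split()]
    return Counter(t for t in tokens if t)


def cosine_similarity(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[k] * b[k] for k in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def describe(candidate):
    """Turn hypothetical gazetteer attributes into a descriptive sentence."""
    return (
        f"{candidate['name']} ({candidate['type']}) "
        f"in {candidate['admin1']}, {candidate['country']}"
    )


def rank_candidates(context, candidates):
    """Score each candidate against the toponym context, best match first."""
    query = embed(context)
    scored = [
        (cosine_similarity(query, embed(describe(c))), c) for c in candidates
    ]
    return sorted(scored, key=lambda pair: pair[0], reverse=True)


# Illustrative data only; the ids and attributes are made up for this sketch.
context = "The conference took place in Paris, a short train ride from Lyon in France."
candidates = [
    {"id": "paris-tx", "name": "Paris", "type": "city",
     "admin1": "Texas", "country": "United States"},
    {"id": "paris-fr", "name": "Paris", "type": "capital city",
     "admin1": "Ile-de-France", "country": "France"},
]

ranked = rank_candidates(context, candidates)
best_score, best_candidate = ranked[0]
print(best_candidate["id"], round(best_score, 3))
```

With this toy data, the candidate in France ranks first because its description shares more terms with the context. A learned sentence embedding plays the same role in the shared vector space, but captures semantic similarity rather than literal word overlap.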
Getting Started
To begin using Geoparser, refer to the installation and usage sections of the documentation.
Contributing
Geoparser is an open-source project, and contributions are welcome. If you encounter any issues or have suggestions for improvements, please open an issue or submit a pull request on the GitHub repository.
Acknowledgments
Geoparser originated as part of my Master's thesis and was further developed with support from the Department of Geography at the University of Zurich. I thank my supervisor, Prof. Dr. Ross Purves, for his insightful feedback, encouragement, and the opportunity to continue this work as part of a research project.
License
Geoparser is released under the MIT License. It also uses several third-party libraries, each with its own license. For a complete list of these licenses, see the full license details in the repository.