A geoparsing library for English texts
Geoparser
Geoparser is a Python library for geoparsing English texts. It leverages spaCy for toponym recognition and fine-tuned SentenceTransformer models for toponym resolution.
Installation
Install Geoparser using pip:
pip install geoparser
Download Required Data
After installation, you need to download the necessary data files for Geoparser to function properly:
python -m geoparser download
This command will download the following resources:
- spaCy Models: Two models are downloaded:
  - en_core_web_sm: A less accurate but faster model.
  - en_core_web_trf: A more accurate but slower model.
- GeoNames Data: The following files are downloaded from GeoNames:
These files are stored in the user-specific data directory:
- Windows: C:\Users\<Username>\AppData\Local\geoparser\
- macOS: ~/Library/Application Support/geoparser/
- Linux: ~/.local/share/geoparser/
Please ensure you have adequate disk space available, as the total size of these files is approximately 2.3 GB.
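Before running the download, it can help to check where the files are expected to land and whether enough space is free. The following is a minimal sketch using only the standard library. The helper names `expected_data_dir` and `has_enough_space` are hypothetical and not part of Geoparser, and the paths simply mirror the list above rather than the library's actual lookup logic.

```python
import platform
import shutil
from pathlib import Path

# Approximate total download size stated above (2.3 GB).
REQUIRED_BYTES = int(2.3 * 1024**3)


def expected_data_dir(system=None, home=None):
    """Return the data directory listed above for a given platform name.

    Hypothetical helper: it mirrors the documented paths only.
    """
    system = system or platform.system()
    home = Path(home) if home is not None else Path.home()
    if system == "Windows":
        return home / "AppData" / "Local" / "geoparser"
    if system == "Darwin":  # macOS
        return home / "Library" / "Application Support" / "geoparser"
    return home / ".local" / "share" / "geoparser"  # Linux and others


def has_enough_space(path, required=REQUIRED_BYTES):
    """Check free space on the volume holding the nearest existing ancestor of path."""
    probe = Path(path)
    while not probe.exists():
        probe = probe.parent
    return shutil.disk_usage(probe).free >= required


target = expected_data_dir()
print(target, "enough space:", has_enough_space(target))
```

If the check reports too little space, free up room on that volume before running the download command.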
Usage
Instantiating the Geoparser
To use Geoparser, instantiate an object of the Geoparser class. You can specify which spaCy and transformer models to use, optimizing for either accuracy or speed. By default, the library uses the accuracy-optimized models:
from geoparser import Geoparser
geo = Geoparser()
For faster performance, you can opt for the smaller models:
geo = Geoparser(spacy_model='en_core_web_sm', transformer_model='dguzh/geo-all-MiniLM-L6-v2')
You can mix and match these models depending on your specific needs. Note that the transformer models dguzh/geo-all-distilroberta-v1 and dguzh/geo-all-MiniLM-L6-v2 are preliminary versions. Future updates aim to refine these models to improve the accuracy of toponym disambiguation.
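One way to make the mix-and-match choice explicit is a small lookup that builds the constructor arguments. This is only a sketch: `model_kwargs` is a hypothetical helper, and labelling dguzh/geo-all-distilroberta-v1 as the "accurate" transformer is an assumption based on MiniLM being described as the smaller model. The model identifiers and the `spacy_model` and `transformer_model` parameter names come from the examples above.

```python
# Model identifiers as named in this README.
SPACY_MODELS = {"accurate": "en_core_web_trf", "fast": "en_core_web_sm"}
TRANSFORMER_MODELS = {
    # Assumption: the larger model is the accuracy-optimized one.
    "accurate": "dguzh/geo-all-distilroberta-v1",
    "fast": "dguzh/geo-all-MiniLM-L6-v2",
}


def model_kwargs(recognition="accurate", resolution="accurate"):
    """Build keyword arguments for Geoparser from two speed/accuracy choices."""
    return {
        "spacy_model": SPACY_MODELS[recognition],
        "transformer_model": TRANSFORMER_MODELS[resolution],
    }


# Accurate toponym recognition combined with faster toponym resolution:
# from geoparser import Geoparser
# geo = Geoparser(**model_kwargs(recognition="accurate", resolution="fast"))
print(model_kwargs(recognition="accurate", resolution="fast"))
```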
Parsing Texts
Geoparser is optimized for parsing large collections of texts at once. Pass a list of strings to the parse method:
docs = geo.parse(["Sample text 1", "Sample text 2", "Sample text 3"])
The parse method returns a list of Document objects, where each Document contains a list of Toponym objects. Each Toponym that is successfully geocoded will have a corresponding Location object with detailed geographical data.
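Because not every Toponym is necessarily geocoded, it may be useful to separate resolved from unresolved toponyms after parsing. The sketch below assumes only the `toponyms` and `location` attributes used in this README. `split_by_resolution` is a hypothetical helper, and the stand-in objects exist solely so the snippet runs without the library or its data.

```python
from types import SimpleNamespace


def split_by_resolution(docs):
    """Split all toponyms into (toponym, location) pairs and unresolved toponyms."""
    resolved, unresolved = [], []
    for doc in docs:
        for toponym in doc.toponyms:
            if toponym.location:
                resolved.append((toponym, toponym.location))
            else:
                unresolved.append(toponym)
    return resolved, unresolved


# Stand-in objects, not real library output: one geocoded toponym, one not.
fake_docs = [
    SimpleNamespace(
        toponyms=[
            SimpleNamespace(location="placeholder location"),
            SimpleNamespace(location=None),
        ]
    )
]

resolved, unresolved = split_by_resolution(fake_docs)
print(len(resolved), len(unresolved))  # prints: 1 1
```

With real output, `fake_docs` would be replaced by the list returned from `geo.parse(...)`.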
Location Attributes
Each Location object has the following attributes:
- geonameid: The unique identifier for the place in the GeoNames database.
- name: The name of the geographical location.
- admin2_geonameid: The GeoNames identifier for the second-level administrative division.
- admin2_name: The name of the second-level administrative division.
- admin1_geonameid: The GeoNames identifier for the first-level administrative division.
- admin1_name: The name of the first-level administrative division.
- country_geonameid: The GeoNames identifier for the country.
- country_name: The name of the country.
- feature_name: The type of geographical feature (e.g., mountain, lake).
- latitude: The latitude of the location.
- longitude: The longitude of the location.
- elevation: The elevation of the location in meters.
- population: The population of the location.
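These attributes map naturally onto a GeoJSON Feature for use in mapping tools. The following is a sketch under assumptions: `location_to_feature` is a hypothetical helper, and because the exact type of Location is not specified here, it accepts either attribute-style or dictionary-style access. The sample values are approximate and illustrative, not real library output.

```python
from types import SimpleNamespace


def location_to_feature(location):
    """Convert a Location-like object into a GeoJSON Feature dictionary."""

    def field(name):
        # Support mapping access and attribute access; missing fields become None.
        if isinstance(location, dict):
            return location.get(name)
        return getattr(location, name, None)

    return {
        "type": "Feature",
        "geometry": {
            "type": "Point",
            # GeoJSON orders coordinates as [longitude, latitude].
            "coordinates": [field("longitude"), field("latitude")],
        },
        "properties": {
            key: field(key)
            for key in (
                "geonameid",
                "name",
                "admin1_name",
                "country_name",
                "feature_name",
                "elevation",
                "population",
            )
        },
    }


# Illustrative stand-in with approximate coordinates for Zurich.
sample = SimpleNamespace(
    name="Zurich", country_name="Switzerland", latitude=47.37, longitude=8.55
)
feature = location_to_feature(sample)
print(feature["geometry"]["coordinates"])
```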
Example
Here's an example showing how the library might be used:
text = "Zurich is a city rich in history."
docs = geo.parse([text])
for doc in docs:
    for toponym in doc.toponyms:
        if toponym.location:
            print(toponym, "->", toponym.location)