A geoparsing library for English texts
Geoparser
Geoparser is a Python library for geoparsing English texts. It leverages spaCy for toponym recognition and fine-tuned SentenceTransformer models for toponym resolution.
Installation
Install Geoparser using pip:
pip install geoparser
Dependencies
Geoparser depends on the following Python libraries:
These dependencies are installed automatically when you install Geoparser with pip.
Download Required Data
After installation, you need to execute the following command to download the necessary files for Geoparser to function:
python -m geoparser download
This command downloads the following resources and sets up a SQLite database for the GeoNames data:
- spaCy Models: Two models are downloaded:
  - en_core_web_sm: A less accurate but faster model.
  - en_core_web_trf: A more accurate but slower model.
- GeoNames Data: The following files are downloaded from GeoNames:
These files are temporarily stored in your system's user-specific data directory during the database setup. Once the database has been populated with the data, the original files are automatically deleted to free up space. The database is then stored in this location:
- Windows:
C:\Users\<Username>\AppData\Local\geoparser\geonames.db
- macOS:
~/Library/Application Support/geoparser/geonames.db
- Linux:
~/.local/share/geoparser/geonames.db
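To check whether the database exists on the current machine, the three locations listed above can be reproduced with the standard library. This is a minimal sketch: `expected_db_path` is a hypothetical helper, not part of Geoparser, and it only mirrors the documented defaults (it ignores environment overrides such as `XDG_DATA_HOME` or `LOCALAPPDATA`).

```python
import platform
from pathlib import Path, PurePosixPath, PureWindowsPath


def expected_db_path(system: str, home: str):
    """Return the default geonames.db location listed above for a given OS.

    Hypothetical helper: it mirrors the documented defaults only.
    """
    if system == "Windows":
        return PureWindowsPath(home) / "AppData" / "Local" / "geoparser" / "geonames.db"
    if system == "Darwin":  # platform.system() reports macOS as "Darwin"
        return (PurePosixPath(home) / "Library" / "Application Support"
                / "geoparser" / "geonames.db")
    # Treat everything else as Linux-like.
    return PurePosixPath(home) / ".local" / "share" / "geoparser" / "geonames.db"


if __name__ == "__main__":
    path = expected_db_path(platform.system(), str(Path.home()))
    print(path, "exists" if Path(path).exists() else "not found")
```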
Please ensure you have enough disk space available. The final size of the downloaded GeoNames data will be approximately 3.2 GB, increasing temporarily to 5.5 GB during the download and setup process.
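The disk space requirement can be checked before running the download. The sketch below uses only the standard library; `has_enough_space` is a hypothetical helper, and it assumes the data directory sits on the same volume as your home directory and that 1 GB means 10**9 bytes.

```python
import shutil
from pathlib import Path

REQUIRED_GB = 5.5  # peak usage during download and setup, as stated above


def has_enough_space(free_bytes: int, required_gb: float = REQUIRED_GB) -> bool:
    """Return True if free_bytes covers required_gb (1 GB = 10**9 bytes)."""
    return free_bytes >= required_gb * 10**9


if __name__ == "__main__":
    # Checks the volume holding the home directory; adjust the path if your
    # user data directory lives on a different drive.
    free = shutil.disk_usage(Path.home()).free
    print(f"{free / 10**9:.1f} GB free, enough: {has_enough_space(free)}")
```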
Usage
Instantiating the Geoparser
To use Geoparser, you need to instantiate an object of the Geoparser class. You can specify which spaCy and transformer models to use, optimising for either accuracy or speed. By default, the library uses accuracy-optimised models:
from geoparser import Geoparser
geo = Geoparser()
Default configuration:
geo = Geoparser(spacy_model='en_core_web_trf', transformer_model='dguzh/geo-all-distilroberta-v1')
For faster performance, you can opt for the smaller models:
geo = Geoparser(spacy_model='en_core_web_sm', transformer_model='dguzh/geo-all-MiniLM-L6-v2')
You can mix and match these models depending on your specific needs. Note that the SentenceTransformer models dguzh/geo-all-distilroberta-v1 and dguzh/geo-all-MiniLM-L6-v2 are preliminary versions. Future updates aim to refine these models to improve the accuracy of toponym disambiguation.
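One way to keep mixed configurations readable is to hold the two documented combinations in a small lookup table and override individual entries. This is only a sketch: the preset labels "accurate" and "fast" and the `model_kwargs` helper are hypothetical names, while the model names and the two parameter names come from the text above.

```python
# Model names are taken from the text above; the preset labels
# "accurate" and "fast" are hypothetical names used only in this sketch.
MODEL_PRESETS = {
    "accurate": {
        "spacy_model": "en_core_web_trf",
        "transformer_model": "dguzh/geo-all-distilroberta-v1",
    },
    "fast": {
        "spacy_model": "en_core_web_sm",
        "transformer_model": "dguzh/geo-all-MiniLM-L6-v2",
    },
}


def model_kwargs(preset: str = "accurate", **overrides: str) -> dict:
    """Return keyword arguments for Geoparser(...), allowing mixed choices."""
    if preset not in MODEL_PRESETS:
        raise ValueError(f"unknown preset: {preset!r}")
    kwargs = dict(MODEL_PRESETS[preset])  # copy so the table stays unchanged
    kwargs.update(overrides)
    return kwargs


# Example: fast toponym recognition combined with the larger resolution model.
mixed = model_kwargs("fast", transformer_model="dguzh/geo-all-distilroberta-v1")
print(mixed)
```

The resulting dictionary can then be unpacked into the constructor, for example `Geoparser(**mixed)`.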
Parsing Texts
Geoparser is optimised for parsing large collections of texts at once. Pass a list of strings to the parse method:
docs = geo.parse(["Sample text 1", "Sample text 2", "Sample text 3"])
The parse method returns a list of Document objects, where each Document contains a list of Toponym objects. Each Toponym that is successfully geocoded will have a corresponding Location object with the following attributes:
- geonameid: The unique identifier for the place in the GeoNames database.
- name: The name of the geographical location.
- admin2_geonameid: The GeoNames identifier for the second-level administrative division.
- admin2_name: The name of the second-level administrative division.
- admin1_geonameid: The GeoNames identifier for the first-level administrative division.
- admin1_name: The name of the first-level administrative division.
- country_geonameid: The GeoNames identifier for the country.
- country_name: The name of the country.
- feature_name: The type of geographical feature (e.g., mountain, lake).
- latitude: The latitude of the location.
- longitude: The longitude of the location.
- elevation: The elevation of the location in meters.
- population: The population of the location.
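The attributes above lend themselves to flattening parse results into plain rows, for instance for a CSV export. The following is a minimal sketch, assuming that Location exposes these fields through attribute access and that str() on a Toponym yields its text; `location_rows` and `FakeToponym` are hypothetical, and the stand-in objects only imitate the shape of the real results so the sketch runs without Geoparser installed.

```python
from types import SimpleNamespace

# A subset of the Location attributes listed above.
FIELDS = ["geonameid", "name", "admin1_name", "country_name",
          "feature_name", "latitude", "longitude", "population"]


def location_rows(docs):
    """Yield one dict per geocoded toponym; skip toponyms without a location."""
    for doc_index, doc in enumerate(docs):
        for toponym in doc.toponyms:
            location = toponym.location
            if not location:  # toponym was not geocoded
                continue
            row = {"doc": doc_index, "toponym": str(toponym)}
            for field in FIELDS:
                # Assumes attribute access; missing attributes become None.
                row[field] = getattr(location, field, None)
            yield row


class FakeToponym:
    """Stand-in for a Toponym (hypothetical, for illustration only)."""

    def __init__(self, text, location=None):
        self.text = text
        self.location = location

    def __str__(self):
        return self.text


# Stand-in data with approximate coordinates; not real Geoparser output.
zurich = SimpleNamespace(name="Zurich", country_name="Switzerland",
                         latitude=47.37, longitude=8.54)
docs = [SimpleNamespace(toponyms=[FakeToponym("Zurich", zurich),
                                  FakeToponym("Atlantis")])]

for row in location_rows(docs):
    print(row)
```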
Geocoding Scope
You can limit the scope of geocoding by specifying one or more countries and GeoNames feature classes. Geoparser then considers only locations within the specified countries and of the specified feature types. To do so, pass the country_filter and feature_filter parameters to the parse method:
docs = geo.parse(texts, country_filter=['CH', 'DE', 'AT'], feature_filter=['A', 'P'])
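Filter values can be sanity-checked before a long parsing run. The sketch below is not part of Geoparser: `check_filters` is a hypothetical helper, it assumes country codes are two-letter uppercase codes as in the call above (it checks the shape only, not whether the code exists), and it relies on the nine single-letter feature classes defined by GeoNames.

```python
import re

# GeoNames defines nine feature classes, each identified by a single letter.
FEATURE_CLASSES = {
    "A": "country, state, region",
    "H": "stream, lake",
    "L": "parks, area",
    "P": "city, village",
    "R": "road, railroad",
    "S": "spot, building, farm",
    "T": "mountain, hill, rock",
    "U": "undersea",
    "V": "forest, heath",
}


def check_filters(country_filter, feature_filter):
    """Raise ValueError for values that cannot be valid filter entries."""
    for code in country_filter:
        # Shape check only: two uppercase ASCII letters.
        if not re.fullmatch(r"[A-Z]{2}", code):
            raise ValueError(f"not a two-letter country code: {code!r}")
    for feature_class in feature_filter:
        if feature_class not in FEATURE_CLASSES:
            raise ValueError(f"unknown GeoNames feature class: {feature_class!r}")
    return list(country_filter), list(feature_filter)


countries, features = check_filters(["CH", "DE", "AT"], ["A", "P"])
print(countries, [FEATURE_CLASSES[f] for f in features])
```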
Example
Here's an example illustrating how the library might be used:
from geoparser import Geoparser

geo = Geoparser()

texts = [
    "Zurich is a city rich in history.",
    "Geneva is known for its role in international diplomacy.",
    "Munich is famous for its annual Oktoberfest celebration."
]

# texts is already a list of strings, so pass it directly
docs = geo.parse(texts)

for doc in docs:
    for toponym in doc.toponyms:
        # location is only set for toponyms that were successfully geocoded
        if toponym.location:
            print(toponym, "->", toponym.location)