A geoparser for Finnish texts
This project has had significant updates and rewrites recently.
Updated documentation will be added in May 2024.
Finger: Finnish geoparser
Geoparsing is the process of finding location mentions (toponyms, a.k.a. place names) in texts (toponym recognition or geotagging) and defining geographical representations, such as coordinate points, for them (toponym resolution or geocoding). Finger is a geoparser for Finnish texts. The program consists of three classes: the toponym recognizer, the toponym resolver, and the geoparser, which wraps the two previous modules.
Toponym recognizer (geotagger)
The geotagger is built with the spaCy NLP library and uses a BERT-based language model for a more accurate representation of the language and thus better results. The pipeline runs a complete linguistic analysis (part-of-speech tagging, morphological analysis, dependency parsing, token and sentence segmentation, and lemmatization), but named entity recognition (NER) is the key step: named locations in the input texts, such as countries, lakes and important sights, are recognized and then returned to their base form with the lemmatizer. These results are passed on to the geocoder.
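To illustrate the tagging step, here is a minimal sketch of running the pipeline with spaCy directly, assuming the fi_geoparser pipeline installed in the Preparations section below; the entity label names checked here are an assumption and may differ from what the model actually emits.

import spacy

nlp = spacy.load("fi_geoparser")  # pipeline wheel installed in the Preparations section
doc = nlp("Matti Järvi vietti tänään hienon päivän Lahdessa")

# Keep only location-type entities and lemmatize them back to their base form
for ent in doc.ents:
    if ent.label_ in ("LOC", "GPE"):
        print(ent.text, ent.lemma_, (ent.start_char, ent.end_char))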
Toponym resolver (geocoder)
The geocoder currently queries the GeoNames gazetteer through the Geocoder Python library and outputs coordinate points for any matches it finds. I plan to expand this functionality in the future.
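In rough terms, the query boils down to something like the sketch below, where 'my_username' is a placeholder for your own GeoNames account name.

import geocoder

# Query the GeoNames gazetteer for a lemmatized toponym; the username acts as the API key
result = geocoder.geonames("Lahti", key="my_username")
if result.ok:
    print(result.address, (float(result.lng), float(result.lat)))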
Data model
Currently, the program accepts strings or lists of strings as input. The input is assumed to be in Finnish and segmented into short-ish pieces (so that the input isn't, for example, a whole book chapter as a single string).
Most users will want to use the geoparser module, as it wraps the geoparsing pipeline and works on a simple principle: text in, results out. See below for an example. The output of the process is a Pandas dataframe with the following columns:
Column header | Description | Data type | Example |
---|---|---|---|
input_text | The input sentence | string | "Matti Järvi vietti tänään hienon päivän Lahdessa" |
input_order | The index of the input text, i.e. the first text is 0, the second 1, etc. | int | 0 |
doc | spaCy Doc object of the analyzed sentence | Doc | Matti Järvi vietti tänään hienon päivän Lahdessa |
locations_found | Whether locations were found in the input sentence | boolean | True |
locations | Location tokens in their original word form, if found | (list of) string(s) or None | "Lahdessa" |
loc_lemmas | Lemmatized versions of the locations | (list of) string(s) or None | "Lahti" |
loc_spans | Start and end character indices of the identified locations in the input text string | tuple | (40, 48) |
names | Versions of the locations returned by querying GeoNames | (list of) string(s) or None | "Lahti" |
coord_points | Lon/lat coordinate points in WGS84 | (list of) tuple(s), Shapely Point(s) or None | (25.66151, 60.98267) |
id | An optional identifying element, such as a tweet id, tied to each input text | string, int or float | "first_sentence" |
NOTE: there is currently some redundancy in the output. This is mostly because I want to cover every base at this point. The data model is still subject to change as the work progresses.
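As an example of post-processing the output, the sketch below turns the hits into a GeoPandas GeoDataFrame with one toponym per row. It assumes the results dataframe produced in the Usage example further down, pandas >= 1.3, and that coord_points holds (lon, lat) tuples.

import geopandas as gpd
from shapely.geometry import Point

hits = results[results["locations_found"]].copy()
# One toponym per row; the list columns are assumed to be of equal length per row
hits = hits.explode(["loc_lemmas", "coord_points"]).dropna(subset=["coord_points"])
hits["geometry"] = hits["coord_points"].apply(Point)
gdf = gpd.GeoDataFrame(hits, geometry="geometry", crs="EPSG:4326")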
Usage
There are a number of preparation steps involved in using this geoparser.
Preparations
- I highly recommend creating a virtual environment to prevent clashes with other packages.
- Install Finger from PyPI with
pip install fingerGeoparser
- This should install all the dependencies and the geoparser.
- Next, you'll need the spaCy pipeline, which includes, for example, the fine-tuned BERT model. The pipeline wheel is released on the finger-NLP-resources GitHub releases page; simply install it like this:
pip install https://github.com/Tadusko/finger-NLP-resources/releases/download/v0.1.0/fi_geoparser-0.1.0-py3-none-any.whl
- This adds the pipeline (fi_geoparser) to your pip environment.
- Voikko is used for lemmatizing the input texts (e.g. Turussa -> Turku).
- Using Voikko may require downloading a dictionary file and, on Windows, a DLL file. Follow the instructions in the Voikko documentation if you get Voikko-related errors; a quick sanity check is sketched below, after these steps.
- NOTE: getting the DLL to work on Windows can be a hassle. I had to add the path to the folder containing the DLL as a system path.
- Create a GeoNames account (or a few). The account's username is used as an API key when querying GeoNames and is passed to Finger when geoparsing.
These steps need only be done once.
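A minimal sketch for checking that Voikko works, assuming the libvoikko Python bindings and a Finnish dictionary are installed:

import libvoikko

# Analyze an inflected place name; the BASEFORM field holds the lemma
v = libvoikko.Voikko("fi")
analyses = v.analyze("Turussa")
print(analyses[0]["BASEFORM"] if analyses else "no analysis - check the dictionary/DLL setup")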
Usage example
The Python interpreter is started in the Finnish geoparser folder, in an environment with the required installations.
>>> from geoparser import geoparser
>>> parser = geoparser(gn_username='my_username')
>>> input = ["Matti Järvi vietti tänään hienon päivän Lahden Messilässä", "Olympialaisten avajaiset tekstitettiin suomen kielelle"]
>>> results = parser.geoparse(input)
Starting geotagging...
Finished geotagging. 1 location hits found.
Starting geocoding...
Total elapsed time: 0.33 s
>>> print(results[['loc_lemmas','coord_points']])
loc_lemmas coord_points
0 [Lahti, Messilä] [(25.66151, 60.98267), (25.56667, 61.01667)]
1 None None
If you want to find out more about the geoparser and the input parameters, call
help(geoparser)
License and credits
The source code is licensed under the MIT license.
Other resources used in either the pipeline or this code:
- FinBERT language model by TurkuNLP, CC BY 4.0. See Virtanen, Kanerva, Ilo, Luoma, Luotolahti, Salakoski, Ginter and Pyysalo; 2019
- Turku NER Corpus by TurkuNLP, CC BY 4.0. See Luoma, Oinonen, Pyykönen, Laippala and Pyysalo; 2020
- Spacy-fi pipeline by Antti Ajanki, MIT License.
TODO
- Alter the output so that each successfully geoparsed toponym is in a row of its own. Toponyms from the same input can be connected with an id from another column.
- Add the toponym's location in the input, e.g. the character span from start to end in the input string, as a column.
- Package this project and make it pip-installable. Overall, make installation and usage more straightforward.
- Learn more about and implement tests.
- Some sort of config file or argument parser instead of passing parameters to the geoparse method?
- Test out the lemmatizer more. I think it might have problems with rarer place names: Vesijärvellä is fine, but Joutjärvellä doesn't get lemmatized. Extend the dictionary or implement another type of lemmatizer?
- Implement a Voikko-based typo checker and fixer.
- Implement gazetteers/API's other than GeoNames. Nimisampo has potential in the Finnish context.
- Implement text-preprocessing steps. Removing hashtags, for instance? Rudimentary filtering of impossibly short toponyms added; seems to work well.
- Implement a geocoding / toponym resolution step other than a simple query. The literature should provide hints.
- Use the linguistic pipeline results (stored in the doc object) in some way. Useful in toponym resolution?
- Add an identifier keyword argument. If present, it'll be added to the output dataframe and can be used to identify the individual inputs (e.g. tweets by id, texts by their writer). Maybe require a list that's as long as the input list, so that each id is assumed to be at the same index as its input.
- Allow the user to limit the spatial extent of the geoparsing results by passing a bbox when calling geoparse.
- Rewrite geocoding: for now, move it to use geopandas/geopy instead of Geocoder