A Python module for geotagging Japanese texts.
pygeonlp, a Python module for geotagging Japanese texts
pygeonlp is open source software for geotagging/geoparsing Japanese natural language text to extract place names.
How To Use
Import pygeonlp.api and initialize it by specifying the directory where the place-name database is located.
import pygeonlp.api as geonlp_api
geonlp_api.init(dict_dir='mydic')
Then, run geoparse("text to parse").
result = geonlp_api.geoparse("国立情報学研究所は千代田区にあります。")
The result is a list of dict objects, one per word, each carrying part-of-speech (POS) attributes and, for place names, spatial attributes.
A GeoJSON representation is obtained by JSON-encoding each dict object.
import json
print(json.dumps(result, indent=2, ensure_ascii=False))
[
{
"type": "Feature",
"geometry": null,
"properties": {
"surface": "国立",
"node_type": "NORMAL",
"morphemes": {
"conjugated_form": "名詞-固有名詞-地名語",
"conjugation_type": "*",
"original_form": "国立",
"pos": "名詞",
"prononciation": "コクリツ",
"subclass1": "固有名詞",
"subclass2": "地名修飾語",
"subclass3": "*",
"surface": "国立",
"yomi": "コクリツ"
}
}
}, ...
{
"type": "Feature",
"geometry": {
"type": "Point",
"coordinates": [
139.753634,
35.694003
]
},
"properties": {
"surface": "千代田区",
"node_type": "GEOWORD",
"morphemes": {
"conjugated_form": "*",
"conjugation_type": "*",
"original_form": "千代田区",
"pos": "名詞",
"prononciation": "",
"subclass1": "固有名詞",
"subclass2": "地名語",
"subclass3": "WWIY7G:千代田区",
"surface": "千代田区",
"yomi": ""
},
"geoword_properties": {
"address": "東京都千代田区",
"body": "千代田",
"body_variants": "千代田",
"code": {},
"countyname": "",
"countyname_variants": "",
"dictionary_id": 1,
"entry_id": "13101A1968",
"geolod_id": "WWIY7G",
"hypernym": [
"東京都"
],
"latitude": "35.69400300",
"longitude": "139.75363400",
"ne_class": "市区町村",
"prefname": "東京都",
"prefname_variants": "東京都",
"source": "1/千代田区役所/千代田区九段南1-2-1/P34-14_13.xml",
"suffix": [
"区"
],
"valid_from": "",
"valid_to": "",
"dictionary_identifier": "geonlp:geoshape-city"
}
}
},
{
"type": "Feature",
"geometry": null,
"properties": {
"surface": "に",
"node_type": "NORMAL",
"morphemes": {
"conjugated_form": "*",
"conjugation_type": "*",
"original_form": "に",
"pos": "助詞",
"prononciation": "ニ",
"subclass1": "格助詞",
"subclass2": "一般",
"subclass3": "*",
"surface": "に",
"yomi": "ニ"
}
}
},...
]
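Since each element is a GeoJSON Feature, the place names can be separated from ordinary words by checking node_type. The following is a minimal sketch, not part of pygeonlp: sample_result is a hand-written, abbreviated copy of the output above, and the helper names extract_geowords and to_feature_collection are hypothetical.

```python
import json

# Hand-written sample mirroring the structure of the output shown above
# (most "morphemes" and "geoword_properties" fields omitted for brevity).
sample_result = [
    {
        "type": "Feature",
        "geometry": None,
        "properties": {"surface": "国立", "node_type": "NORMAL"},
    },
    {
        "type": "Feature",
        "geometry": {"type": "Point", "coordinates": [139.753634, 35.694003]},
        "properties": {
            "surface": "千代田区",
            "node_type": "GEOWORD",
            "geoword_properties": {"address": "東京都千代田区"},
        },
    },
    {
        "type": "Feature",
        "geometry": None,
        "properties": {"surface": "に", "node_type": "NORMAL"},
    },
]


def extract_geowords(features):
    """Return only the features whose node_type is GEOWORD (hypothetical helper)."""
    return [
        f for f in features
        if f.get("properties", {}).get("node_type") == "GEOWORD"
    ]


def to_feature_collection(features):
    """Wrap a sequence of Features in a GeoJSON FeatureCollection (hypothetical helper)."""
    return {"type": "FeatureCollection", "features": list(features)}


geowords = extract_geowords(sample_result)
for f in geowords:
    # GeoJSON Point coordinates are ordered [longitude, latitude].
    lon, lat = f["geometry"]["coordinates"]
    print(f["properties"]["surface"], lon, lat)

collection = to_feature_collection(geowords)
print(json.dumps(collection, ensure_ascii=False, indent=2))
```

With a real analysis, the list returned by geoparse() could be passed to extract_geowords in place of sample_result, assuming it has the structure shown above.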
Prerequisites
pygeonlp requires the MeCab C++ library and a UTF-8 dictionary for Japanese morphological analysis.
The C++ part of the implementation also depends on Boost C++.
$ sudo apt install libmecab-dev mecab-ipadic-utf8 libboost-all-dev
Install
The pygeonlp package can be installed with the pip command.
It is recommended that you upgrade pip and setuptools to the latest versions before running it.
$ pip install --upgrade pip setuptools
$ pip install pygeonlp
Install GDAL library (Optional)
If the GDAL library is installed, pygeonlp can use "spatial distance" to disambiguate between multiple places that share the same name, which improves accuracy.
You can also use spatial filters.
$ sudo apt install libgdal-dev
$ pip install gdal
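The idea behind distance-based disambiguation can be sketched in plain Python. This is only an illustration of the principle, not pygeonlp's actual algorithm, and it does not use GDAL: the haversine_km and pick_nearest helpers are hypothetical, and the two 府中 candidates use approximate coordinates chosen for the example.

```python
import math


def haversine_km(lon1, lat1, lon2, lat2):
    """Great-circle distance in kilometres between two lon/lat points."""
    radius_km = 6371.0  # mean Earth radius
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2)
    return 2 * radius_km * math.asin(math.sqrt(a))


def pick_nearest(candidates, anchor):
    """Choose the candidate closest to an already resolved place.

    candidates: list of dicts with "lon" and "lat" keys.
    anchor: (lon, lat) tuple of a place name that is not ambiguous.
    """
    return min(
        candidates,
        key=lambda c: haversine_km(c["lon"], c["lat"], anchor[0], anchor[1]),
    )


# Two places named 府中; coordinates are approximate and for illustration only.
candidates = [
    {"address": "東京都府中市", "lon": 139.4777, "lat": 35.6689},
    {"address": "広島県府中市", "lon": 133.2364, "lat": 34.5683},
]

# 千代田区, taken from the sample output earlier in this document.
anchor = (139.753634, 35.694003)

best = pick_nearest(candidates, anchor)
print(best["address"])
```

In a text that also mentions 千代田区, the candidate in Tokyo is much closer than the one in Hiroshima, so it is the more plausible reading.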
Install jageocoder (Optional)
pygeonlp can use address geocoding if jageocoder is installed.
$ pip install jageocoder
$ mkdir db/
$ wget https://www.info-proto.com/static/jusho.zip
$ unzip jusho.zip -d db/
$ python
>>> import jageocoder
>>> jageocoder.init(dsn='sqlite:///db/address.db', trie_path='db/address.trie')
>>> jageocoder.create_trie_index()
Run Tests (Optional)
Run the API tests with the python setup.py test command.
Uninstall
Use the pip command to uninstall.
$ pip uninstall pygeonlp
Registering a place-name word analysis dictionary
Run the following script to register the basic place-name word analysis dictionaries (*.json, *.csv) in base_data/ into the database under mydic/.
$ python scripts/setup_dictionaries.py
This script registers three dictionaries:
"Prefectures of Japan" (geonlp:geoshape-pref),
"Historical Administrative Area Data Set Beta Dictionary of Place Names" (geonlp:geoshape-city), and
"Railroad Stations in Japan (2019)" (geonlp:ksj-station-N02-2019).
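In the geoparse output shown earlier, each place name carries a dictionary_identifier inside geoword_properties, so results can be grouped by the dictionary they came from. The sketch below assumes that structure. The group_by_dictionary helper is hypothetical, and the 東京都 entry is invented sample data for illustration.

```python
from collections import defaultdict

# Abbreviated sample features; the 東京都 entry is invented for illustration.
sample_features = [
    {
        "properties": {
            "surface": "千代田区",
            "node_type": "GEOWORD",
            "geoword_properties": {
                "dictionary_identifier": "geonlp:geoshape-city",
            },
        }
    },
    {
        "properties": {
            "surface": "東京都",
            "node_type": "GEOWORD",
            "geoword_properties": {
                "dictionary_identifier": "geonlp:geoshape-pref",
            },
        }
    },
    {"properties": {"surface": "に", "node_type": "NORMAL"}},
]


def group_by_dictionary(features):
    """Map each dictionary identifier to the surfaces it matched (hypothetical helper)."""
    groups = defaultdict(list)
    for feature in features:
        props = feature.get("properties", {})
        geo = props.get("geoword_properties")
        if geo is None:
            continue  # ordinary word, not a place name
        identifier = geo.get("dictionary_identifier", "unknown")
        groups[identifier].append(props["surface"])
    return dict(groups)


grouped = group_by_dictionary(sample_features)
for identifier, surfaces in grouped.items():
    print(identifier, surfaces)
```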
Delete the place-name database
When you register a place-name word analysis dictionary to the database, it will create a sqlite3 database and some other files in the specified directory.
If you want to delete them, just delete the whole directory.
$ rm -r mydic/
License
Acknowledgements
This software is partially supported by the PRESTO program of the Japan Science and Technology Agency (JST).
This software is partially supported by the Research Organization of Information and Systems (ROIS).