pygeonlp, a Python module for geotagging Japanese texts

pygeonlp is open source software for geotagging/geoparsing Japanese natural-language text to extract place names.

More detailed Japanese documentation and API references are available in the /doc directory. You can also find the latest online documentation at GeoNLP Documentation.

How To Use

Import pygeonlp.api and initialize it by specifying the directory where the place-name database is stored.

>>> import pygeonlp.api as api
>>> api.init(db_dir='mydic')

Then, run api.geoparse("text to parse").

>>> result = api.geoparse("国立情報学研究所は千代田区にあります。")

The result is a list of dict objects, one per word, each carrying part-of-speech attributes and, for place names, spatial attributes.

A GeoJSON representation is obtained by JSON-encoding each dict object.

>>> import json
>>> print(json.dumps(result, indent=2, ensure_ascii=False))
[
  {
    "type": "Feature",
    "geometry": null,
    "properties": {
      "surface": "国立",
      "node_type": "NORMAL",
      "morphemes": {
        "conjugated_form": "名詞-固有名詞-地名語",
        "conjugation_type": "*",
        "original_form": "国立",
        "pos": "名詞",
        "prononciation": "コクリツ",
        "subclass1": "固有名詞",
        "subclass2": "地名修飾語",
        "subclass3": "*",
        "surface": "国立",
        "yomi": "コクリツ"
      }
    }
  }, ... 
  {
    "type": "Feature",
    "geometry": {
      "type": "Point",
      "coordinates": [
        139.753634,
        35.694003
      ]
    },
    "properties": {
      "surface": "千代田区",
      "node_type": "GEOWORD",
      "morphemes": {
        "conjugated_form": "*",
        "conjugation_type": "*",
        "original_form": "千代田区",
        "pos": "名詞",
        "prononciation": "",
        "subclass1": "固有名詞",
        "subclass2": "地名語",
        "subclass3": "WWIY7G:千代田区",
        "surface": "千代田区",
        "yomi": ""
      },
      "geoword_properties": {
        "address": "東京都千代田区",
        "body": "千代田",
        "body_variants": "千代田",
        "code": {},
        "countyname": "",
        "countyname_variants": "",
        "dictionary_id": 1,
        "entry_id": "13101A1968",
        "geolod_id": "WWIY7G",
        "hypernym": [
          "東京都"
        ],
        "latitude": "35.69400300",
        "longitude": "139.75363400",
        "ne_class": "市区町村",
        "prefname": "東京都",
        "prefname_variants": "東京都",
        "source": "1/千代田区役所/千代田区九段南1-2-1/P34-14_13.xml",
        "suffix": [
          "区"
        ],
        "valid_from": "",
        "valid_to": "",
        "dictionary_identifier": "geonlp:geoshape-city"
      }
    }
  },
  {
    "type": "Feature",
    "geometry": null,
    "properties": {
      "surface": "に",
      "node_type": "NORMAL",
      "morphemes": {
        "conjugated_form": "*",
        "conjugation_type": "*",
        "original_form": "に",
        "pos": "助詞",
        "prononciation": "ニ",
        "subclass1": "格助詞",
        "subclass2": "一般",
        "subclass3": "*",
        "surface": "に",
        "yomi": "ニ"
      }
    }
  },...
]
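Since each element is a GeoJSON Feature, the place names can be picked out by checking node_type. The following is a minimal sketch, not part of pygeonlp itself: the helper names extract_geowords and to_feature_collection are hypothetical, and the sample list is an abbreviated copy of the output above so that the snippet runs without a database.

```python
import json

# Abbreviated sample mirroring the structure of the geoparse() output above.
# In real use, `result` would be the return value of api.geoparse(...).
result = [
    {
        "type": "Feature",
        "geometry": None,
        "properties": {"surface": "国立", "node_type": "NORMAL"},
    },
    {
        "type": "Feature",
        "geometry": {"type": "Point", "coordinates": [139.753634, 35.694003]},
        "properties": {
            "surface": "千代田区",
            "node_type": "GEOWORD",
            "geoword_properties": {"address": "東京都千代田区"},
        },
    },
]


def extract_geowords(features):
    """Return only the features whose node_type is GEOWORD (hypothetical helper)."""
    return [f for f in features if f["properties"].get("node_type") == "GEOWORD"]


def to_feature_collection(features):
    """Wrap a list of GeoJSON Feature dicts in a FeatureCollection (hypothetical helper)."""
    return {"type": "FeatureCollection", "features": list(features)}


geowords = extract_geowords(result)
collection = to_feature_collection(geowords)
print(json.dumps(collection, indent=2, ensure_ascii=False))
```

Wrapping the filtered features in a FeatureCollection gives a single object that most GeoJSON viewers accept directly.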

Prerequisites

pygeonlp requires the MeCab C++ library and a UTF-8 dictionary for Japanese morphological analysis.

The C++ part of the implementation also depends on the Boost C++ libraries.

$ sudo apt install libmecab-dev mecab-ipadic-utf8 libboost-all-dev

Install

The pygeonlp package can be installed with the pip command. It is recommended that you upgrade pip and setuptools to the latest versions before running it.

$ pip install --upgrade pip setuptools
$ pip install pygeonlp

The database must be prepared before the first use.

Prepare the database

Run the following to register the basic place-name word analysis dictionaries (*.json, *.csv) bundled with this package in a database under mydic/.

>>> import pygeonlp.api as api
>>> api.setup_basic_database(db_dir='mydic/')

This command registers three dictionaries:

  • "Prefectures of Japan" (geonlp:geoshape-pref)

  • "Historical Administrative Area Data Set Beta Dictionary of Place Names" (geonlp:geoshape-city)

  • "Railroad Stations in Japan (2019)" (geonlp:ksj-station-N02-2019)
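Because the setup only has to happen once, a script may want to skip it when the database directory is already populated. The sketch below is one possible guard, under the assumption that a non-empty directory means the database was already prepared; ensure_database and its setup parameter are hypothetical names, and only api.setup_basic_database comes from pygeonlp.

```python
from pathlib import Path


def ensure_database(db_dir, setup=None):
    """Run the one-time setup only if db_dir is missing or empty.

    Returns True if the setup function was called, False otherwise.
    Assumption: a non-empty directory means the database already exists.
    """
    path = Path(db_dir)
    if path.is_dir() and any(path.iterdir()):
        return False
    if setup is None:
        # Imported lazily so this helper can be defined without pygeonlp installed.
        import pygeonlp.api as api
        setup = api.setup_basic_database
    setup(db_dir=str(path))
    return True
```

After ensure_database('mydic') returns, api.init(db_dir='mydic') can be called as in the usage section.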

Install GDAL library (Optional)

If the GDAL library is installed, pygeonlp can use "spatial distance" for disambiguation when there are multiple place names with the same name, thus improving accuracy. You can also use spatial filters.

$ sudo apt install libgdal-dev
$ pip install gdal
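To illustrate the idea behind disambiguation by spatial distance, the sketch below picks, among several candidates sharing a name, the one closest to a reference point. It is not pygeonlp's implementation, which relies on GDAL; it uses the haversine formula from the standard library only, the function names are hypothetical, and the candidate coordinates are made-up points.

```python
import math

EARTH_RADIUS_KM = 6371.0  # mean Earth radius


def haversine_km(lon1, lat1, lon2, lat2):
    """Great-circle distance in kilometres between two lon/lat points."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlambda = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlambda / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


def pick_nearest(reference, candidates):
    """Return the name of the candidate closest to reference.

    reference is a (lon, lat) tuple; candidates maps a name to a (lon, lat) tuple.
    """
    return min(candidates,
               key=lambda name: haversine_km(*reference, *candidates[name]))


# Reference point: the coordinates of 千代田区 from the example output.
reference = (139.753634, 35.694003)
candidates = {
    "candidate_a": (139.70, 35.69),  # made-up point near the reference
    "candidate_b": (135.50, 34.70),  # made-up point far from the reference
}
nearest = pick_nearest(reference, candidates)
print(nearest)  # prints candidate_a
```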

Install jageocoder (Optional)

pygeonlp can also geocode addresses if an address dictionary for jageocoder is installed.

See the jageocoder documentation for installation instructions.

Run tests (Optional)

Run the unit tests with the python setup.py test command.

Uninstall

Use the pip command to uninstall.

$ pip uninstall pygeonlp

Delete the database

Registering a place-name word analysis dictionary creates a sqlite3 database and some other files in the specified directory.

To delete them, delete the whole directory.

$ rm -r mydic/
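The same cleanup can be done from Python, for example in a test teardown. This is a minimal sketch using only the standard library; remove_database is a hypothetical helper, not a pygeonlp function.

```python
import shutil
from pathlib import Path


def remove_database(db_dir):
    """Delete db_dir and everything in it.

    Returns True if the directory existed and was removed, False otherwise.
    """
    path = Path(db_dir)
    if not path.is_dir():
        return False
    shutil.rmtree(path)
    return True


# Example (commented out so that nothing is deleted by accident):
# remove_database("mydic")
```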

License

The 2-Clause BSD License

Acknowledgements

This software is supported by DIAS (Data Integration and Analysis System) and ROIS-DS CODH (Center for Open Data in the Humanities).

It was also supported by the JST (Japan Science and Technology Agency) PRESTO program.

