Address Normalization

Geocoding

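  • This module normalizes irregular (or run-together) text addresses as far as possible, and computes the similarity between two addresses.
  • It is a Python wrapper around the bitlap/geocoding project, which is written in Kotlin.
  • To make bitlap/geocoding callable from Python, it is wrapped with the Python jpype module, so this module needs a Java environment (environment variables such as JAVA_HOME must be set).
  • Reloading GeocodingCHN may fail on Windows; see the Jpype Changelog entry for 1.2.0 - 2020-11-29.
  • Install with: pip install GeocodingCHN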
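Because the wrapper depends on a Java runtime, it can help to check for one before importing the package. The following is a minimal sketch using only the Python standard library; find_java is a hypothetical helper, not part of GeocodingCHN, and it assumes the conventional layout in which the launcher sits in the bin directory under JAVA_HOME.

```python
import os
import shutil


def find_java(environ=None):
    """Return the path of a Java launcher, or None if none can be located."""
    environ = os.environ if environ is None else environ
    java_home = environ.get("JAVA_HOME")
    if java_home:
        # JAVA_HOME points at the JDK/JRE root; the launcher lives in bin/.
        for name in ("java", "java.exe"):
            candidate = os.path.join(java_home, "bin", name)
            if os.path.isfile(candidate):
                return candidate
    # Fall back to whatever `java` is reachable through PATH, if any.
    return shutil.which("java", path=environ.get("PATH"))
```

If find_java() returns None, setting JAVA_HOME is the first thing to try before creating a Geocoding instance.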

Changelog:

1.4.5

  1. Fixed MatchedResult failing to parse an empty result

1.4.4

  1. Fixed Address instances failing to be created

1.4.3

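  1. Added the save method, which generates a custom dat dictionary file
  2. Added the match method, which runs a depth-first search for address entries matching the input
  3. Added the analyze method for splitting an address into parts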

1.4.2

Fixed custom addresses failing to be added, and updated the jar to 1.3.1

1.4.1

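The upstream project updated its jar, and this release adapts to the new features. New features:

  • GeocodingCHN.Geocoding gains new constructor parameters (to match the org.bitlap.geocoding.GeocodingX class)
    • New data_class_path parameter: path to a custom address file. It must be an absolute file path. The built-in address file core/region.dat is used by default
    • New strict parameter, default False. When the province and city are missing and exactly one parent entry matches, the match succeeds.
      • True: strict mode. When the province and city are missing and more than one parent entry matches, None is returned
      • False: non-strict mode. When the province and city are missing and more than one parent entry matches, one of the matching province/city entries is chosen at random
    • New jvm_path parameter: sets the JVM path. It must be an absolute file path
  • The addRegionEntry method gains a replace parameter that controls whether an existing address is replaced; the default is True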
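The strict rule described above can be modelled in a few lines of plain Python. This is only an illustration of the documented behaviour, not the code used by bitlap/geocoding: pick_parent and its arguments are hypothetical names, and returning None for an empty candidate list is an assumption the source does not cover.

```python
import random


def pick_parent(candidates, strict=False, rng=random):
    """Model the documented rule for an address that lacks province and city.

    `candidates` is the list of parent regions that matched.
    """
    if not candidates:
        # Assumption: nothing matched, so there is nothing to return.
        return None
    if len(candidates) == 1:
        # Exactly one parent matched, so the match succeeds.
        return candidates[0]
    if strict:
        # Strict mode refuses to guess between several parents.
        return None
    # Non-strict mode settles on one of the matching parents at random.
    return rng.choice(candidates)
```

The practical consequence is that non-strict results for ambiguous input may differ between runs, while strict mode trades that for a None the caller has to handle.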

Other changes:

  • similarityWithResult and similarity are now separate methods: similarityWithResult returns a MatchedResult, and similarity returns a float
  • Wrapped the word-segmentation method segment

GeocodingCHN.Geocoding

from GeocodingCHN import Geocoding
geocoding = Geocoding(data_class_path="core/region.dat",
                      strict=False,
                      jvm_path=None)

  • data_class_path: path to the custom address file
  • strict: matching mode (strict or non-strict)
  • jvm_path: path to the JVM
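Since custom values for data_class_path and jvm_path are required to be absolute file paths, a small guard can catch a relative path before the JVM is started. The sketch below is an assumption-laden convenience, not part of GeocodingCHN: geocoding_kwargs is a hypothetical helper that only builds the keyword arguments and omits any path left as None so the library defaults apply.

```python
import os


def geocoding_kwargs(data_class_path=None, strict=False, jvm_path=None):
    """Build keyword arguments for Geocoding(...), rejecting relative custom paths."""
    kwargs = {"strict": bool(strict)}
    for key, value in (("data_class_path", data_class_path), ("jvm_path", jvm_path)):
        if value is None:
            continue  # leave the library default in place
        if not os.path.isabs(value):
            raise ValueError(f"{key} must be an absolute path, got {value!r}")
        kwargs[key] = value
    return kwargs
```

The result can then be unpacked into the constructor, for example Geocoding(**geocoding_kwargs(strict=True)).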

GeocodingCHN.Geocoding.normalizing

Normalizes an address.

normalizing(address) -> Address

  • address: the address text

from GeocodingCHN import Geocoding
geocoding = Geocoding()
text = '山东青岛李沧区延川路116号绿城城园东区7号楼2单元802户'
address_nor = geocoding.normalizing(text)
print(address_nor)
Address(
	provinceId=370000000000, province=山东省, 
	cityId=370200000000, city=青岛市, 
	districtId=370213000000, district=李沧区, 
	streetId=0, street=, 
	townId=0, town=, 
	villageId=0, village=, 
	road=延川路, 
	roadNum=116号, 
	buildingNum=7号楼2单元802户, 
	text=绿城城园东区
)

GeocodingCHN.Geocoding.similarityWithResult

Computes the similarity of two addresses and returns the detailed result.

similarityWithResult(Address1:Address, Address2:Address) -> MatchedResult

  • Address1: first address, an Address object returned by normalizing
  • Address2: second address, an Address object returned by normalizing

from GeocodingCHN import Geocoding
geocoding = Geocoding()
text1 = '山东青岛李沧区延川路116号绿城城园东区7号楼2单元802户'
text2 = '山东青岛李沧区延川路绿城城园东区7-2-802'
Address_1 = geocoding.normalizing(text1)
Address_2 = geocoding.normalizing(text2)
print(geocoding.similarityWithResult(Address_1, Address_2))
MatchedResult(
	doc1=Document(terms=[Term(延川路), Term(116号), Term(7), Term(2), Term(802), Term(绿城), Term(城), Term(园), Term(东区)], town=None, village=None, road=Term(延川路), roadNum=Term(116号), roadNumValue=116), 
	doc2=Document(terms=[Term(延川路), Term(7), Term(2), Term(802), Term(绿城), Term(城), Term(园), Term(东区)], town=None, village=None, road=Term(延川路), roadNum=None, roadNumValue=0), 
	terms=['MatchedTerm(Term(延川路), coord=-1.0, density=-1.0, boost=2.0, tfidf=8.0)', 'MatchedTerm(Term(7), coord=-1.0, density=-1.0, boost=1.0, tfidf=2.0)', 'MatchedTerm(Term(2), coord=-1.0, density=-1.0, boost=1.0, tfidf=2.0)', 'MatchedTerm(Term(802), coord=-1.0, density=-1.0, boost=1.0, tfidf=2.0)', 'MatchedTerm(Term(绿城), coord=1.0, density=1.0, boost=1.0, tfidf=4.0)', 'MatchedTerm(Term(城), coord=1.0, density=1.0, boost=1.0, tfidf=4.0)', 'MatchedTerm(Term(园), coord=1.0, density=1.0, boost=1.0, tfidf=4.0)', 'MatchedTerm(Term(东区), coord=1.0, density=1.0, boost=1.0, tfidf=4.0)'], 
	similarity=0.9473309334313418
)

GeocodingCHN.Geocoding.similarity

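Computes the similarity of two addresses and returns it as a float.

similarity(Address1:[Address, str], Address2:[Address, str]) -> float

  • Address1: first address, either an Address object or text
  • Address2: second address, either an Address object or text

from GeocodingCHN import Geocoding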
geocoding = Geocoding()
text1 = '山东青岛李沧区延川路116号绿城城园东区7号楼2单元802户'
text2 = '山东青岛李沧区延川路绿城城园东区7-2-802'
Address_1 = geocoding.normalizing(text1)
Address_2 = geocoding.normalizing(text2)
print(geocoding.similarity(Address_1, Address_2))
0.9473309334313418
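A common use of a pairwise score like this is flagging likely duplicates in a list of addresses. The sketch below shows one way to do that; similar_pairs is a hypothetical helper, the 0.9 threshold is an arbitrary assumption rather than a recommended value, and the scoring function is passed in as a parameter so the helper does not depend on a running JVM.

```python
from itertools import combinations


def similar_pairs(addresses, similarity, threshold=0.9):
    """Return (a, b, score) for every pair scoring at least `threshold`."""
    pairs = []
    for a, b in combinations(addresses, 2):
        score = similarity(a, b)
        if score >= threshold:
            pairs.append((a, b, score))
    return pairs
```

With a Geocoding instance at hand, similar_pairs(texts, geocoding.similarity) would score every pair of address strings in texts. The number of comparisons grows quadratically with the list length, so this suits small batches.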

GeocodingCHN.Geocoding.addRegionEntry

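Adds a custom address.

addRegionEntry(Id, parentId, name, RegionType, alias='', replace=True) -> bool

  • Id: ID of the address
  • parentId: ID of the parent address; it must already exist
  • name: name of the address
  • RegionType: the address type, a RegionType value
  • alias: alias of the address, default ''
  • replace: whether to replace an existing address, default True

from GeocodingCHN import Geocoding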
geocoding = Geocoding()
geocoding.addRegionEntry(1, 321200000000, "A街道", geocoding.RegionType.Street)
address_nor = geocoding.normalizing("江苏泰州A街道")
print(address_nor)
Address(
	provinceId=320000000000, province=江苏省, 
	cityId=321200000000, city=泰州市, 
	districtId=321200000000, district=泰州市, 
	streetId=1, street=A街道, 
	townId=0, town=, 
	villageId=0, village=, 
	road=, 
	roadNum=, 
	buildingNum=, 
	text=
)
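When several custom regions need registering, the calls can be wrapped in a loop that collects the entries that were not accepted. This is a sketch under two assumptions the source does not state: that the boolean returned by addRegionEntry signals success, and that replace can be passed by keyword. add_regions is a hypothetical helper that takes the method as a parameter.

```python
def add_regions(add_entry, entries, replace=True):
    """Call add_entry for each (id, parent_id, name, region_type) tuple.

    Returns the entries for which add_entry returned a falsy value.
    """
    rejected = []
    for region_id, parent_id, name, region_type in entries:
        ok = add_entry(region_id, parent_id, name, region_type, replace=replace)
        if not ok:
            rejected.append((region_id, parent_id, name, region_type))
    return rejected
```

A call would look like add_regions(geocoding.addRegionEntry, [(1, 321200000000, "A街道", geocoding.RegionType.Street)]). Because a parent must exist before its children, entries should be listed parent first.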

GeocodingCHN.Geocoding.segment

Word segmentation.

segment(text: str, seg_type: str = 'ik') -> list

  • text: the input text
  • seg_type: one of ['ik', 'simple', 'smart', 'word'], default 'ik'

from GeocodingCHN import Geocoding
geocoding = Geocoding()
text = '山东青岛李沧区延川路绿城城园东区7-2-802'
print(geocoding.segment(text))
['山东', '青岛', '李沧区', '延川路', '绿城', '城', '园', '东区', '7-2-802']
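To choose between the four seg_type values, it can be useful to run the same text through each of them and compare the token lists side by side. The helper below is a hypothetical sketch: segment_all is not part of GeocodingCHN, and it takes the segment method as a parameter, assuming the segment(text, seg_type) call shape shown in the signature above.

```python
SEG_TYPES = ("ik", "simple", "smart", "word")


def segment_all(segment, text, seg_types=SEG_TYPES):
    """Map each seg_type to the token list returned by segment(text, seg_type)."""
    return {seg_type: segment(text, seg_type) for seg_type in seg_types}
```

For example, segment_all(geocoding.segment, text) would return a dictionary keyed by 'ik', 'simple', 'smart' and 'word'.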

Acknowledgements
