地址标准化
Project description
Geocoding
-
该模块用于将不规范(或者连续)的文本地址进行尽可能的标准化, 以及对两个地址进行相似度的计算
-
该模块为 bitlap/geocoding 项目的Python封装,原项目为Kotlin开发
-
为方便使用Python方法调用,这里使用Python的
jpype
模块将 bitlap/geocoding 进行封装,因此该模块需要Java环境的支持(需要添加JAVA_HOME等环境变量) -
GeocodingCHN
重新加载功能在Windows平台上可能会遇到错误,参考Jpype Changelog 1.2.0 - 2020-11-29 更新信息。 -
安装命令
pip install GeocodingCHN
更新信息:
1.4.3
-
添加
save
方法用于生成自定义的dat字典文件 -
添加
match
方法用于深度优先匹配符合输入的地址信息 -
添加
analyze
方法用于地址切分
1.4.2
修复 无法添加自定义地址问题,并更新jar包至1.3.1
1.4.1
-
GeocodingCHN.Geocoding
新增参数设定(为适配org.bitlap.geocoding.GeocodingX
类)-
新增
data_class_path
参数,支持自定义地址文件路径,要求该路径为文件绝对路径,默认使用内置地址core/region.dat
-
新增
strict
参数,默认False
。当发现没有省和市,且匹配的父项数量等于1时,能成功匹配。-
True
: 严格模式,当发现没有省和市,且匹配的父项数量大于1时,返回None
-
False
: 非严格模式,当发现没有省和市,且匹配的父项数量大于1时,匹配随机一项省和市
-
-
新增
jvm_path
,允许设置JVM路径,但要求该路径为文件绝对路径
-
-
addRegionEntry
方法新增replace
参数,表示是否替换旧地址,默认为True
其他更新:
-
区分
similarityWithResult
与similarity
方法,similarityWithResult
返回MatchedResult类型结果,similarity
返回float类型结果 -
封装分词方法
segment
GeocodingCHN.Geocoding
from GeocodingCHN import Geocoding
geocoding = Geocoding(data_class_path="core/region.dat",
strict= False,
jvm_path= None)
-
data_class_path : 自定义地址文件路径
-
strict : 模式设置
-
jvm_path : JVM路径
GeocodingCHN.Geocoding.normalizing
提供地址标准化
normalizing(address) -> Address
- address: 文本地址
from GeocodingCHN import Geocoding
geocoding = Geocoding()
text = '山东青岛李沧区延川路116号绿城城园东区7号楼2单元802户'
address_nor = geocoding.normalizing(text)
print(address_nor)
Address(
provinceId=370000000000, province=山东省,
cityId=370200000000, city=青岛市,
districtId=370213000000, district=李沧区,
streetId=0, street=,
townId=0, town=,
villageId=0, village=,
road=延川路,
roadNum=116号,
buildingNum=7号楼2单元802户,
text=绿城城园东区
)
GeocodingCHN.Geocoding.similarityWithResult
地址相似度计算,返回详细结果
similarityWithResult(Address1:Address, Address2:Address) -> MatchedResult
-
Address1: 地址1, 由 normalizing 方法返回的 Address 类
-
Address2: 地址2, 由 normalizing 方法返回的 Address 类
from GeocodingCHN import Geocoding
geocoding = Geocoding()
text1 = '山东青岛李沧区延川路116号绿城城园东区7号楼2单元802户'
text2 = '山东青岛李沧区延川路绿城城园东区7-2-802'
Address_1 = geocoding.normalizing(text1)
Address_2 = geocoding.normalizing(text2)
print(geocoding.similarityWithResult(Address_1, Address_2))
MatchedResult(
doc1=Document(terms=[Term(延川路), Term(116号), Term(7), Term(2), Term(802), Term(绿城), Term(城), Term(园), Term(东区)], town=None, village=None, road=Term(延川路), roadNum=Term(116号), roadNumValue=116),
doc2=Document(terms=[Term(延川路), Term(7), Term(2), Term(802), Term(绿城), Term(城), Term(园), Term(东区)], town=None, village=None, road=Term(延川路), roadNum=None, roadNumValue=0),
terms=['MatchedTerm(Term(延川路), coord=-1.0, density=-1.0, boost=2.0, tfidf=8.0)', 'MatchedTerm(Term(7), coord=-1.0, density=-1.0, boost=1.0, tfidf=2.0)', 'MatchedTerm(Term(2), coord=-1.0, density=-1.0, boost=1.0, tfidf=2.0)', 'MatchedTerm(Term(802), coord=-1.0, density=-1.0, boost=1.0, tfidf=2.0)', 'MatchedTerm(Term(绿城), coord=1.0, density=1.0, boost=1.0, tfidf=4.0)', 'MatchedTerm(Term(城), coord=1.0, density=1.0, boost=1.0, tfidf=4.0)', 'MatchedTerm(Term(园), coord=1.0, density=1.0, boost=1.0, tfidf=4.0)', 'MatchedTerm(Term(东区), coord=1.0, density=1.0, boost=1.0, tfidf=4.0)'],
similarity=0.9473309334313418
)
GeocodingCHN.Geocoding.similarity
地址相似度计算
similarityWithResult(Address1:[Address, str], Address2:[Address, str])
-
Address1: 地址1, Address类 或 文本
-
Address2: 地址2, Address类 或 文本
from GeocodingCHN import Geocoding
geocoding = Geocoding()
text1 = '山东青岛李沧区延川路116号绿城城园东区7号楼2单元802户'
text2 = '山东青岛李沧区延川路绿城城园东区7-2-802'
Address_1 = geocoding.normalizing(text1)
Address_2 = geocoding.normalizing(text2)
print(geocoding.similarity(Address_1, Address_2))
0.9473309334313418
GeocodingCHN.Geocoding.addRegionEntry
添加自定义地址
addRegionEntry(Id, parentId, name, RegionType, alias='', replace=True) -> bool
-
Id: 地址的ID
-
parentId: 地址的父ID, 必须存在
-
name: 地址的名称
-
RegionType: RegionType,地址类型
-
alias: 地址的别名, default=''
-
replace: 是否替换旧地址, default=True
from GeocodingCHN import Geocoding
geocoding = Geocoding()
geocoding.addRegionEntry(1, 321200000000, "A街道", geocoding.RegionType.Street)
address_nor = geocoding.normalizing("江苏泰州A街道")
print(address_nor)
Address(
provinceId=320000000000, province=江苏省,
cityId=321200000000, city=泰州市,
districtId=321200000000, district=泰州市,
streetId=1, street=A街道,
townId=0, town=,
villageId=0, village=,
road=,
roadNum=,
buildingNum=,
text=
)
GeocodingCHN.Geocoding.segment
分词
segment(text: str, seg_type: str = 'ik') -> list
-
text: 输入
-
seg_type: 支持 ['ik', 'simple', 'smart', 'word'],default = 'ik'
from GeocodingCHN import Geocoding
geocoding = Geocoding()
text = '山东青岛李沧区延川路绿城城园东区7-2-802'
print(geocoding.segment(text))
['山东', '青岛', '李沧区', '延川路', '绿城', '城', '园', '东区', '7-2-802']
感谢
Project details
Download files
Download the file for your platform. If you're not sure which to choose, learn more about installing packages.
Source Distribution
Built Distribution
File details
Details for the file GeocodingCHN-1.4.3.tar.gz
.
File metadata
- Download URL: GeocodingCHN-1.4.3.tar.gz
- Upload date:
- Size: 9.1 MB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/4.0.2 CPython/3.10.2
File hashes
Algorithm | Hash digest | |
---|---|---|
SHA256 | 9eda97883db27db6b4f0ce89470ddf578acee5c94509d772aefc6ebc98764dd1 |
|
MD5 | 1f2e1b476e69488033c929c996477b1e |
|
BLAKE2b-256 | c239bea497109bdb1f2af279f300f66ff1e6ec1606c8bdd86f6ee8f80dd7b99e |
File details
Details for the file GeocodingCHN-1.4.3-py3-none-any.whl
.
File metadata
- Download URL: GeocodingCHN-1.4.3-py3-none-any.whl
- Upload date:
- Size: 9.1 MB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/4.0.2 CPython/3.10.2
File hashes
Algorithm | Hash digest | |
---|---|---|
SHA256 | 3fdbae9506ef54f7bf8da03bf709d97d395da20afd5ed2d3e522d69fbe8ebfb2 |
|
MD5 | 90e021ecc82ac72b11c007bb78c21c8a |
|
BLAKE2b-256 | b1cad1556815c29f7a073520ff0e68772aad6eafd876073434316b557edbd249 |