Skip to main content

MMseg中文分词 Chinese Segment On MMSeg Algorithm

Project description

MMseg中文分词 Chinese Segment On MMSeg Algorithm
-------------------------------
original edition

pymmseg-cpp
by pluskid
http://code.google.com/p/pymmseg-cpp/

This package is Chinese Segment , I think only chinese need it, so the description is chinese .

If you have interesting , have a look a the original edition

-------------------------------

全文索引用,配合 xapian ( http://xapian.org/ ) 可以很方便的做全文索引

~:python -m mmseg.search
----------
哈尔罗杰历险记(套)
哈尔
罗杰
历险
历险记
----------
卡拉马佐夫兄弟
卡拉

佐夫
兄弟
----------
银河英雄传说
银河
英雄
传说
银河英雄传说
----------
张无忌在光明顶
无忌
张无忌
光明
光明顶
----------
韦帅望的江湖(Ⅲ众望所归)
韦帅
帅望
韦帅望
江湖
众望
望所
所归
众望所归
----------
少年韦帅望之童年结束了
少年
韦帅
帅望
望之
韦帅望之
童年
结束
----------
    晋江文学网站驻站作家,已出版多部作品。
晋江
文学
网站
文学网站
驻站
作家
出版
多部
作品
-------------------------------
分词用,适用于聚类等等

from mmseg import seg_txt
for i in seg_txt("最主要的更动是:张无忌最后没有选定自己的配偶。"):
print i

-------------------------------
配合xapian做索引

#coding:utf-8
#!/usr/bin/env python

import xapian
import sys
import string
from collections import defaultdict

from mmseg.search import seg_txt_search,seg_txt_2_dict

import xapian
SEARCH_DB = xapian.WritableDatabase(DBPATH, xapian.DB_CREATE_OR_OPEN)
SEARCH_ENQUIRE = xapian.Enquire(SEARCH_DB)

def index_txt(id, txt):
doc = xapian.Document()
for word, value in seg_txt_2_dict(txt).iteritems():
doc.add_term(word, value)
key = ":%s"%id
doc.add_term(key)
SEARCH_DB.replace_document(key, doc)


def flush_db():
SEARCH_DB.flush()

if __name__ == "__main__":
txt = """
治安署地最高长官站在街头,皱眉看着一队近卫军飞快地走过,他心中满是疑惑,立刻回到了治安署里地办公室,然后喊来了自己地一个部下,让他立刻去军方统帅部请示一下.
"""

index_txt(1, txt)
flush_db()

-------------------------------
配合xapian做搜索

#coding:utf-8
from mmseg.search import seg_txt_search,seg_txt_2_dict

import xapian
SEARCH_DB = xapian.WritableDatabase(DBPATH, xapian.DB_CREATE_OR_OPEN)
SEARCH_ENQUIRE = xapian.Enquire(SEARCH_DB)

def search(keywords, offset=0, limit=35, enquire=SEARCH_ENQUIRE):
query_list = []
for word, value in seg_txt_2_dict(keywords).iteritems():
query = xapian.Query(word, value)
query_list.append(query)
if len(query_list) != 1:
query = xapian.Query(xapian.Query.OP_AND, query_list)
else:
query = query_list[0]

enquire.set_query(query)
matches = enquire.get_mset(offset, limit, None)
return matches

if __name__ == "__main__":
matches = search( "治安")

# Display the results.
print "%i results found." % matches.get_matches_estimated()
print "Results 1-%i:" % matches.size()

for m in matches:
print "%i: %i%% docid=%i [%s]" % (m.rank + 1, m.percent, m.docid, m.document.get_data())
-------------------------------
张沈鹏(zsp007@gmail.com) 修改版 rmmseg-cpp

Project details


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

mmseg-1.3.0.tar.gz (817.4 kB view details)

Uploaded Source

File details

Details for the file mmseg-1.3.0.tar.gz.

File metadata

  • Download URL: mmseg-1.3.0.tar.gz
  • Upload date:
  • Size: 817.4 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No

File hashes

Hashes for mmseg-1.3.0.tar.gz
Algorithm Hash digest
SHA256 f8878cddde0e96b7c70ff457edf662e6741716e71723a46a08b9efdcf9e3542d
MD5 ebf97c3d1cc541d0a2241f87174734d0
BLAKE2b-256 f8313bc9205f39cc8ab37193a6fbb24693993b2f305aba9f35b09fad882107ee

See more details on using hashes here.

Supported by

AWS AWS Cloud computing and Security Sponsor Datadog Datadog Monitoring Fastly Fastly CDN Google Google Download Analytics Microsoft Microsoft PSF Sponsor Pingdom Pingdom Monitoring Sentry Sentry Error logging StatusPage StatusPage Status page