A wrapper for jieba segmentation

Project description

A wrapper around the Jieba Chinese word segmentation library.
Install with: pip install segjb (dependency: jieba)
  • Lazy initialization.
  • Initialization with a user-defined dictionary.
  • Built-in stop-word and punctuation dictionaries.
  • Output control: stop words, punctuation, minimum word length, output delimiter, etc.
  • N-gram support.

API

init(stopwords_file, puncs_file, user_dict, silent, main_dict, thread)
– Initialize the segmentation utility instance.
  • return: void.
  • stopwords_file: path to the stop-word dictionary. Use "" to disable. [SegJb.DEFAULT_STPW]
  • puncs_file: path to the punctuation dictionary. Use "" to disable. [SegJb.DEFAULT_PUNC]
  • user_dict: user-customized dictionary to load. Use "" to disable. [SegJb.DEFAULT_DICT]
  • silent: controls whether the initialization log is printed. [True]
  • thread: number of parts to split the corpus into for parallel processing. [1]
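
As a rough illustration of preparing a file for the user_dict argument, the sketch below writes a dictionary in the plain-text format that jieba itself documents for user dictionaries: UTF-8, one entry per line, with the word followed by an optional frequency and an optional part-of-speech tag, separated by spaces. That segjb hands user_dict to jieba unchanged is an assumption, and write_user_dict, the entries, and the file name are hypothetical.

```python
import os
import tempfile

# Hypothetical entries: (word, optional frequency, optional part-of-speech tag).
entries = [("云计算", 5, "n"), ("精彩比赛", None, None)]


def write_user_dict(path, entries):
    """Write entries in jieba's user-dictionary format, one entry per line."""
    with open(path, "w", encoding="utf-8") as fh:  # Python 3 style open()
        for word, freq, tag in entries:
            parts = [word]
            if freq is not None:
                parts.append(str(freq))
            if tag is not None:
                parts.append(tag)
            fh.write(" ".join(parts) + "\n")


dict_path = os.path.join(tempfile.mkdtemp(), "user_dict.txt")
write_user_dict(dict_path, entries)

# Assumed usage (not run here): hdl_seg.init(user_dict=dict_path)
```

Keeping the dictionary in a separate file means it can be versioned and swapped without touching the code that calls init.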

set_param(delim, min_word_len, ngram, keep_stopwords, keep_puncs)
– Set one or more parameters of the segmentation utility instance. See the Parameters section for a description of each.
  • return: void

cut2list(corp)
– Cut a sentence into a list of words according to the current configuration.
  • return: list
  • corp: the sentence to segment, as a unicode string or UTF-8 encoded bytes.

cut2str(corp)
– Cut a sentence into a single string of words joined by the delimiter (which can be set with set_param).
  • return: unicode string.
  • corp: the sentence to segment, as a unicode string or UTF-8 encoded bytes.
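
Both cut functions accept either unicode text or a UTF-8 encoded sentence. If the calling code wants to normalize its input before passing it in, a small helper along these lines may be enough. to_text is a hypothetical name, and this sketch says nothing about how segjb handles the two input types internally.

```python
def to_text(corp, encoding="utf-8"):
    """Return corp as a text string, decoding encoded bytes if needed."""
    if isinstance(corp, bytes):
        return corp.decode(encoding)
    return corp


sentence_bytes = "这是一场精彩的比赛".encode("utf-8")
sentence_text = to_text(sentence_bytes)
```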

Parameters

  • delim [' ']
    the delimiter used to join the segmentation result into a string.
  • min_word_len [1]
    words shorter than min_word_len are excluded from the segmentation result.
  • ngram [1]
    n-gram size of the result; values greater than 1 produce n-grams.
  • keep_stopwords [True]
    whether to keep stop words in the result.
  • keep_puncs [True]
    whether to keep punctuation in the result.
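
To make the interaction between these parameters concrete, here is a minimal sketch that applies them to a list of words that has already been segmented. It is not segjb's implementation: apply_params, the sample token list, and the stop-word and punctuation sets are hypothetical, and it assumes that an n-gram is n consecutive kept words concatenated together and that filtering happens before n-grams are built.

```python
def apply_params(tokens, stopwords=frozenset(), puncs=frozenset(),
                 delim=" ", min_word_len=1, ngram=1,
                 keep_stopwords=True, keep_puncs=True):
    """Filter tokens, optionally build n-grams, and join with delim."""
    kept = []
    for tok in tokens:
        if not keep_stopwords and tok in stopwords:
            continue  # drop stop words when requested
        if not keep_puncs and tok in puncs:
            continue  # drop punctuation when requested
        if len(tok) < min_word_len:
            continue  # drop words that are too short
        kept.append(tok)
    if ngram > 1:
        # Assumption: slide a window of `ngram` tokens and concatenate them.
        kept = ["".join(kept[i:i + ngram])
                for i in range(len(kept) - ngram + 1)]
    return delim.join(kept)


# Hypothetical pre-segmented input, not actual jieba output.
tokens = ["这", "是", "一场", "精彩", "的", "比赛", "。"]

unigrams = apply_params(tokens, stopwords={"的"}, puncs={"。"},
                        min_word_len=2,
                        keep_stopwords=False, keep_puncs=False)
bigrams = apply_params(tokens, stopwords={"的"}, puncs={"。"},
                       min_word_len=2, ngram=2,
                       keep_stopwords=False, keep_puncs=False)
```

With the defaults (min_word_len=1, ngram=1, both keep flags True), the function simply joins the input words with the delimiter.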

Example:

from segjb import SegJb
hdl_seg = SegJb()
hdl_seg.init()
hdl_seg.set_param(delim=' ', ngram=2, keep_stopwords=True, keep_puncs=False)
print(hdl_seg.cut2str('这是一场精彩的比赛'))

Download files

Source Distribution

segjb-1.0.tar.gz (5.9 MB)

Uploaded Source

Built Distribution

segjb-1.0-py2.py3-none-any.whl (6.0 MB)

Uploaded Python 2, Python 3

File details

Details for the file segjb-1.0.tar.gz.

File metadata

  • Download URL: segjb-1.0.tar.gz
  • Upload date:
  • Size: 5.9 MB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: Python-urllib/3.6

File hashes

Hashes for segjb-1.0.tar.gz
Algorithm Hash digest
SHA256 9547cf4bca2e9c6023c9ffd365e3a47988a2b1be1d1911231af8d87e6790e4b0
MD5 6c6d0c2cfda9709e76da56a65ae95331
BLAKE2b-256 1b348b3cfeeda727afdfd99d95343a7fe40bab802f4852398b9ca1be751c9cee

File details

Details for the file segjb-1.0-py2.py3-none-any.whl.

File metadata

  • Download URL: segjb-1.0-py2.py3-none-any.whl
  • Upload date:
  • Size: 6.0 MB
  • Tags: Python 2, Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: Python-urllib/3.6

File hashes

Hashes for segjb-1.0-py2.py3-none-any.whl
Algorithm Hash digest
SHA256 fe3d89891aef3723f7edfdb214475c5dbbb2aeb9738a6bf9109261fc77962244
MD5 b9c798e485e8e3ff723cb52899922a9a
BLAKE2b-256 024a99025719efafe7cff864f24be592a529593b279cbdec029a5bab4f1b4aef
