Chinese words extraction and new words discovery

# xinci 新词 & 抽词
xinci is a Python interface for Chinese word extraction and new-word discovery.

## Requirements
Python >= 2.7

## Installation
### 1. using pip
pip install xinci
### 2. using
``` shell
git clone
cd xinci
pip install
```

## Usage
This package has two main use cases: extracting words and finding new words.

### 1. command line
cd xinci

### 2. python package
import xinci

# if you want to see logging events.
import logging
logging.basicConfig(level=logging.INFO, format='%(asctime)s : %(levelname)s : %(message)s')

# init the default dictionary or a user dic
dic = xinci.Dictionary()
# load the vocab; vocab is a Python set
vocab = dic.load() # or dic.dictionary

# add words to dic
dic.add(['神马']) # or dic.add_from_file('user.dic')
# remove words from dic
dic.remove(['神马']) # or dic.remove_from_file('user.dic')

# extract new words, xc is a set
xc = xinci.extract('corpus.txt')
for w in xc:
    print(w)
# extract all words, c is a set
c = xinci.extract('corpus.txt', all_words=True)
for w in c:
    print(w)
Example result (@新词 = new word, @词频 = word frequency):
@新词 @词频
祛斑 13
后再 7
今日头条 9
洗净切 7
蛋液 9
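If the result is written to a file (for example via `save_file`), it can be handy to load it back for sorting or filtering. The sketch below assumes the file looks like the sample above: a header line starting with `@`, then one `word count` pair per line. That layout is an assumption based on the sample, and `parse_result_lines` and `load_result_file` are hypothetical helpers, not part of xinci.

```python
import io


def parse_result_lines(lines):
    """Turn 'word count' lines into a {word: count} dict.

    Assumption: one entry per line, the count is the last
    whitespace-separated field, and header lines start with '@'.
    """
    counts = {}
    for line in lines:
        line = line.strip()
        if not line or line.startswith("@"):
            continue  # skip blank lines and the header row
        parts = line.rsplit(None, 1)
        if len(parts) != 2 or not parts[1].isdigit():
            continue  # skip anything that is not 'word count'
        counts[parts[0]] = int(parts[1])
    return counts


def load_result_file(path):
    """Read a UTF-8 result file (hypothetical path) and parse it."""
    with io.open(path, encoding="utf-8") as handle:
        return parse_result_lines(handle)


# The sample rows shown above, used here instead of a real file.
sample = ["@新词 @词频", "祛斑 13", "后再 7", "今日头条 9", "洗净切 7", "蛋液 9"]
counts = parse_result_lines(sample)

# Most frequent candidates first.
ranked = sorted(counts.items(), key=lambda item: -item[1])
for word, count in ranked:
    print(word, count)
```

Sorting by frequency makes it easier to review the strongest candidates first.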
### Notes: Iteratively adding result entries that do not look like new words to the common dic will improve the results a lot.
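The note above can be sketched as a small review loop. This is only an illustration: `refine_common_words` is a hypothetical helper, the review step `is_false_positive` stands in for a manual check, and whether words added through `dic.add` are picked up by the next `xinci.extract` call is an assumption that should be verified against the installed version.

```python
def refine_common_words(extract_fn, add_to_common_fn, is_false_positive, max_rounds=3):
    """Repeat: extract candidates, move rejected ones into the common dic.

    extract_fn        -- callable returning an iterable of candidate words
    add_to_common_fn  -- callable taking a list of words to add to the common dic
    is_false_positive -- callable returning True for words that are not new
    """
    candidates = set(extract_fn())
    for _ in range(max_rounds):
        rejected = [w for w in candidates if is_false_positive(w)]
        if not rejected:
            break  # nothing left to filter out
        add_to_common_fn(rejected)
        candidates = set(extract_fn())  # re-run with the larger common dic
    return candidates


# Hypothetical wiring with xinci (not executed here):
#   import xinci
#   dic = xinci.Dictionary()
#   new_words = refine_common_words(
#       lambda: xinci.extract('corpus.txt'),
#       dic.add,
#       lambda w: w in {'后再'},  # words a reviewer marked as not new
#   )
```

Passing the extraction and dictionary steps in as callables keeps the loop independent of how the common dic is stored.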

## API documentation
xc = xinci.extract(params)

List of available `params` and their default value:
corpus_file: string, input corpus file (required)
common_words_file: string, common words dic file [common.dic]
min_candidate_len: int, min candidate word length [2]
max_candidate_len: int, max candidate word length [5]
least_cnt_threshold: int, least word count to extract [5]
solid_rate_threshold: float, solid rate threshold [0.018]
entropy_threshold: float, entropy threshold [1.92]
all_words: bool, set True to extract all words mode [False]
save_file: string, output file [None]
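As a usage sketch for the parameters listed above, the snippet below validates keyword names against that list before calling `xinci.extract`. It assumes the parameters can be passed as keyword arguments (as `all_words=True` is in the usage section); `checked_overrides` and `extract_with` are hypothetical helpers, and the override values are arbitrary examples rather than recommended settings.

```python
# Parameter names copied from the list above; corpus_file is passed separately.
DOCUMENTED_PARAMS = {
    "common_words_file", "min_candidate_len", "max_candidate_len",
    "least_cnt_threshold", "solid_rate_threshold", "entropy_threshold",
    "all_words", "save_file",
}


def checked_overrides(**overrides):
    """Reject keyword names that the parameter list does not mention."""
    unknown = sorted(set(overrides) - DOCUMENTED_PARAMS)
    if unknown:
        raise ValueError("undocumented parameter(s): " + ", ".join(unknown))
    return overrides


def extract_with(corpus_file, **overrides):
    """Call xinci.extract with validated overrides (assumes xinci is installed)."""
    import xinci  # imported lazily so the check above works without the package
    return xinci.extract(corpus_file, **checked_overrides(**overrides))


# Example overrides: allow longer candidates and require more occurrences.
stricter = checked_overrides(max_candidate_len=6, least_cnt_threshold=10)

# Hypothetical call (not executed here):
#   xc = extract_with('corpus.txt', **stricter)
```

Only the overrides are passed on, so every other parameter keeps the default shown in the list.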

## References
The code is based on this Java version
