Chinese word extraction and new word discovery

# xinci: new word discovery (新词) & word extraction (抽词)
xinci is a Python interface for Chinese word extraction and new word discovery.

## Requirements
Python >= 2.7

## Installation
### 1. using pip
pip install xinci
### 2. using
```shell
git clone
cd xinci
pip install
```

## Usage
This package has two main use cases: extracting words and
discovering new words.

### 1. command line
cd xinci

### 2. python package
import xinci

# if you want to see logging events.
import logging
logging.basicConfig(level=logging.INFO, format='%(asctime)s : %(levelname)s : %(message)s')

# initialize the default dictionary or a user dictionary
dic = xinci.Dictionary()
# load vocab, vocab is a python set.
vocab = dic.load() # or dic.dictionary

# add words to dic
dic.add(['神马']) # or dic.add_from_file('user.dic')
# remove words from dic
dic.remove(['神马']) # or dic.remove_from_file('user.dic')

# extract new words, xc is a set
xc = xinci.extract('corpus.txt')
for w in xc:
    print(w)
# extract all words, c is a set
c = xinci.extract('corpus.txt', all_words=True)
for w in c:
    print(w)

Example output (columns: 新词 = new word, 词频 = frequency):
@新词 @词频
祛斑 13
后再 7
今日头条 9
洗净切 7
蛋液 9
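
The listing above can be read back for further processing. This is a minimal sketch, assuming the file written via `save_file` uses the same layout as the listing (an `@`-prefixed header line, then one whitespace-separated `word count` pair per line). `parse_result_lines` is a hypothetical helper, not part of xinci, and the code targets Python 3.

```python
def parse_result_lines(lines):
    """Parse 'word count' lines into a dict, skipping '@'-prefixed header lines."""
    counts = {}
    for line in lines:
        line = line.strip()
        if not line or line.startswith('@'):
            continue  # skip blank lines and the header
        # split on the last run of whitespace: left part is the word, right part the count
        word, count = line.rsplit(None, 1)
        counts[word] = int(count)
    return counts


# Sample copied from the listing above; with a real file, pass an open file object instead.
sample = ["@新词 @词频", "祛斑 13", "后再 7", "今日头条 9", "洗净切 7", "蛋液 9"]
counts = parse_result_lines(sample)

# Rank by frequency, most frequent first (ties broken by the word itself).
ranked = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
```

Ranking by frequency makes it easier to review the most common candidates first.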
### Note: Iteratively adding results that do not look like new words to the common dictionary improves the results considerably.
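
The iteration described in this note can be sketched as a small loop. This is an illustration only: `refine` and its three callables are hypothetical names, and the stubs below stand in for xinci so that the example runs on its own (Python 3).

```python
def refine(extract, looks_like_word, add_to_dictionary, max_rounds=3):
    """Repeat extraction, moving rejected candidates into the common dictionary."""
    accepted = set()
    for _ in range(max_rounds):
        candidates = set(extract())
        rejected = {w for w in candidates if not looks_like_word(w)}
        accepted = candidates - rejected
        if not rejected:
            break  # nothing left to filter out
        add_to_dictionary(sorted(rejected))
    return accepted


# Stubs (assumptions): a fake extractor that skips words already in `common`,
# and a manual review step that rejects two fragments.
common = set()
corpus_candidates = ["今日头条", "后再", "洗净切", "蛋液"]


def fake_extract():
    return [w for w in corpus_candidates if w not in common]


def reviewer(word):
    return word not in {"后再", "洗净切"}


result = refine(fake_extract, reviewer, common.update)
```

With xinci, `extract` might wrap `xinci.extract('corpus.txt')` and `add_to_dictionary` might be `dic.add`, provided that words added this way are picked up by later extractions, which this README does not confirm.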

## API documentation
xc = xinci.extract(params)

List of available `params` and their default values (in brackets):
corpus_file: string, input corpus file (required)
common_words_file: string, common words dictionary file [common.dic]
min_candidate_len: int, minimum candidate word length [2]
max_candidate_len: int, maximum candidate word length [5]
least_cnt_threshold: int, minimum occurrence count for a word to be extracted [5]
solid_rate_threshold: float, solid rate threshold [0.018]
entropy_threshold: float, entropy threshold [1.92]
all_words: bool, set to True to extract all words instead of only new words [False]
save_file: string, output file [None]
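
To give a feel for what the length, count, and entropy parameters control, here is a minimal sketch of candidate counting and neighbor-character entropy, a measure commonly used in new word discovery. It is not xinci's implementation: the function names are hypothetical, the exact formulas and logarithm base xinci uses are not documented here, and the solid rate is not covered. The code targets Python 3.

```python
import math
from collections import Counter


def ngram_counts(text, min_len=2, max_len=5):
    """Count every substring whose length is in [min_len, max_len]."""
    counts = Counter()
    for n in range(min_len, max_len + 1):
        for i in range(len(text) - n + 1):
            counts[text[i:i + n]] += 1
    return counts


def entropy(counter):
    """Shannon entropy (natural log) of a Counter; 0 for an empty Counter."""
    total = sum(counter.values())
    return -sum(c / total * math.log(c / total) for c in counter.values())


def boundary_entropy(text, word):
    """Return the smaller of the left- and right-neighbor entropies of `word`."""
    left, right = Counter(), Counter()
    start = text.find(word)
    while start != -1:
        if start > 0:
            left[text[start - 1]] += 1  # character just before this occurrence
        end = start + len(word)
        if end < len(text):
            right[text[end]] += 1  # character just after this occurrence
        start = text.find(word, start + 1)
    return min(entropy(left), entropy(right))


text = "打蛋液入锅加蛋液拌匀倒蛋液后"
counts = ngram_counts(text)              # candidate substrings of length 2 to 5
score = boundary_entropy(text, "蛋液")   # varied neighbors give a higher score
```

A candidate that appears next to many different characters has high boundary entropy, which suggests it stands on its own as a word. A fragment that always appears inside the same longer string scores close to zero.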

## References
The code is based on this Java version

Files for xinci, version 1.2.0
| Filename | Size | File type | Python version |
| --- | --- | --- | --- |
| xinci-1.2.0-py2-none-any.whl | 1.0 MB | Wheel | py2 |
| xinci-1.2.0.tar.gz | 1.0 MB | Source | None |
