Chinese words extraction and new words discovery

# xinci: new words (新词) & word extraction (抽词)
xinci is a Python interface for Chinese word extraction and new word discovery.
[https://pypi.org/project/xinci/]

## Requirements
Python >= 2.7

## Installation
### 1. using pip
```shell
pip install xinci
```
### 2. using setup.py
```shell
git clone git@github.com:Lapis-Hong/xinci.git
cd xinci
python setup.py install
```

## Usage
This package has two main use cases: extracting words and
finding new words.

### 1. command line
```shell
cd xinci
python word_extraction.py
```
or
```shell
./run.sh
```

### 2. python package
```python
import xinci

# if you want to see logging events.
import logging
logging.basicConfig(level=logging.INFO, format='%(asctime)s : %(levelname)s : %(message)s')

# init the default dictionary or a user dictionary
dic = xinci.Dictionary()
# load vocab, vocab is a python set.
vocab = dic.load() # or dic.dictionary
print(vocab)

# add words to dic
dic.add(['神马']) # or dic.add_from_file('user.dic')
# remove words from dic
dic.remove(['神马']) # or dic.remove_from_file('user.dic')

# extract new words, xc is a set
xc = xinci.extract('corpus.txt')
for w in xc:
    print(w)
# extract all words, c is a set
c = xinci.extract('corpus.txt', all_words=True)
for w in c:
    print(w)
```
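Result (the first line of the output reads "Found 5 new words:", and the column headers are "new word" and "word frequency"):
```text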
发现5个新词如下:
@新词 @词频
祛斑 13
后再 7
今日头条 9
洗净切 7
蛋液 9
```
### Notes: Iteratively adding the extracted words that do not look like real new words to the common dictionary improves the results considerably.
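
The refinement loop described in the note can be sketched as a small helper that records the rejected words in a dictionary file. This is only an illustration: `append_rejected_words` is a hypothetical helper, not part of xinci, and the one-word-per-line file format is an assumption about what a dictionary file looks like.

```python
import io


def append_rejected_words(extracted, rejected, dic_path):
    """Append to dic_path the extracted words judged not to be new words.

    Hypothetical helper, not part of xinci. Writes one word per line,
    which is an assumed dictionary file format.
    """
    # Only keep rejected words that were actually extracted.
    to_add = sorted(set(extracted) & set(rejected))
    with io.open(dic_path, 'a', encoding='utf-8') as f:
        for word in to_add:
            f.write(word + u'\n')
    return to_add
```

For the sample result above, one might call `append_rejected_words(xc, ['后再', '洗净切'], 'user.dic')`, then load the file with `dic.add_from_file('user.dic')` and run the extraction again.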


## API documentation
```python
xc = xinci.extract(params)

```
List of available `params` and their default values (in brackets):
```text
corpus_file: string, input corpus file (required)
common_words_file: string, common words dic file [common.dic]
min_candidate_len: int, min candidate word length [2]
max_candidate_len: int, max candidate word length [5]
least_cnt_threshold: int, least word count to extract [5]
solid_rate_threshold: float, solid rate threshold [0.018]
entropy_threshold: float, entropy threshold [1.92]
all_words: bool, set True to extract all words mode [False]
save_file: string, output file [None]
```
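
The parameter list does not define what `entropy_threshold` measures. In new-word discovery, an entropy score usually means the entropy of the characters that appear directly to the left and right of a candidate: a real word tends to occur next to many different neighbors. The sketch below illustrates that general idea. It is not xinci's implementation; `neighbor_entropy` is a hypothetical name, and the log base and exact formula xinci uses are not stated here, so the values need not be comparable with the `1.92` default.

```python
import math
from collections import Counter


def _entropy(counts):
    """Shannon entropy (natural log) of a Counter of occurrences."""
    total = sum(counts.values())
    if total == 0:
        return 0.0
    return -sum((n / total) * math.log(n / total) for n in counts.values())


def neighbor_entropy(text, word):
    """Return (left_entropy, right_entropy) of the characters adjacent to word.

    Illustrative only; an assumed reading of what an entropy threshold
    measures, not xinci's actual code.
    """
    left, right = Counter(), Counter()
    start = text.find(word)
    while start != -1:
        end = start + len(word)
        if start > 0:
            left[text[start - 1]] += 1   # character just before the match
        if end < len(text):
            right[text[end]] += 1        # character just after the match
        start = text.find(word, start + 1)
    return _entropy(left), _entropy(right)
```

A candidate that always appears next to the same character gets an entropy of 0 on that side, which suggests it is a fragment of a longer word rather than a word of its own.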

## References
The code is based on this Java version:
[https://github.com/GeorgeBourne/grid]


