Skip to main content

Chinese words extraction and new words discovery

Project description

# xinci 新词 & 抽词
xinci is a Python interface for chinese words extraction & new words extraction.
[https://pypi.org/project/xinci/]

## Requirements
Python >= 2.7

## Installation
### 1. using pip
```shell
pip install xinci
```
### 2. using setup.py
``` shell
git clone git@github.com:Lapis-Hong/xinci.git
cd xinci
pip setup.py install
```

## Usage
This package has two main use cases: words extraction and
find new words.

### 1. command line
```shell
cd xinci
python word_extraction.py
```
or
```
./run.sh
```

### 2. python package
```python
import xinci

# if you want to see logging events.
import logging
logging.basicConfig(level=logging.INFO, format='%(asctime)s : %(levelname)s : %(message)s')

# init default dictionary or user dic,
dic = xinci.Dictionary()
# load vocab, vocab is a python set.
vocab = dic.load() # or dic.dictionary
print(vocab)

# add words to dic
dic.add(['神马']) # or dic.add_from_file('user.dic')
# remove words from dic
dic.remove(['神马']) # or dic.remove_from_file('user.dic')

# extract new words, xc is a set
xc = xinci.extract('corpus.txt')
for w in xc:
print(w)
# extract all words, c is a set
c = xinci.extract('corpus.txt', all_words=True)
for w in xc:
print(w)
```
result
```angular2html
发现5个新词如下:
@新词 @词频
祛斑 13
后再 7
今日头条 9
洗净切 7
蛋液 9
```
### Notes: Iteratively add "not seems to new words" in result to common dic will improve a lot.


## API documentation
```python
xc = xinci.extract(params)

```
List of available `params` and their default value:
```angular2html
corpus_file: string, input corpus file (required)
common_words_file: string, common words dic file [common.dic]
min_candidate_len: int, min candidate word length [2]
max_candidate_len: int, max candidate word length [5]
least_cnt_threshold: int, least word count to extract [5]
solid_rate_threshold: float, solid rate threshold [0.018]
entropy_threshold: float, entropy threshold [1.92]
all_words: bool, set True to extract all words mode [False]
save_file: string, output file [None]
```

## References
The code is based on this java version
[https://github.com/GeorgeBourne/grid]



Project details


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

xinci-1.2.0.tar.gz (1.0 MB view details)

Uploaded Source

Built Distribution

xinci-1.2.0-py2-none-any.whl (1.0 MB view details)

Uploaded Python 2

File details

Details for the file xinci-1.2.0.tar.gz.

File metadata

  • Download URL: xinci-1.2.0.tar.gz
  • Upload date:
  • Size: 1.0 MB
  • Tags: Source
  • Uploaded using Trusted Publishing? No

File hashes

Hashes for xinci-1.2.0.tar.gz
Algorithm Hash digest
SHA256 05076fa33ef32dcd9dd8bc502603df07fddef1ec42c45fdb17e964c5a64b14de
MD5 7e198e411ae5993bafd3e8fc312f7dff
BLAKE2b-256 089180ecc85e199caaeeb90e13c1d5cd79c36cbcb822c40f6d8c9be48e3deee1

See more details on using hashes here.

File details

Details for the file xinci-1.2.0-py2-none-any.whl.

File metadata

File hashes

Hashes for xinci-1.2.0-py2-none-any.whl
Algorithm Hash digest
SHA256 7540bad5163572058a64b857449c829f7849a02c144e834cb29992336cb734e5
MD5 f7ce84bc128009502d8a1a38e7488f45
BLAKE2b-256 b8d9c88909d9f3d891f2ec368f6abc789ac6e10015b3bd5273ae481fb72e5b6f

See more details on using hashes here.

Supported by

AWS AWS Cloud computing and Security Sponsor Datadog Datadog Monitoring Fastly Fastly CDN Google Google Download Analytics Microsoft Microsoft PSF Sponsor Pingdom Pingdom Monitoring Sentry Sentry Error logging StatusPage StatusPage Status page