Word-frequency counting based on the Nagao algorithm
Nagao
An implementation of the paper "A new method of N-gram statistics for large number of n and automatic extraction of words and phrases from large text data of Japanese".
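The core idea of the paper can be illustrated with a short sketch: sort pointers to every suffix of the text, then compare neighbouring suffixes in sorted order. Two neighbours that share at least n leading characters contain the same n-gram, so frequencies fall out of a single pass. This is only an illustration of the technique, not the code used by this package; the function names `common_prefix_len` and `ngram_counts` are hypothetical.

```python
def common_prefix_len(text, a, b):
    """Length of the common prefix of the suffixes starting at offsets a and b."""
    k = 0
    while a + k < len(text) and b + k < len(text) and text[a + k] == text[b + k]:
        k += 1
    return k


def ngram_counts(text, n):
    """Count character n-grams using sorted suffix pointers (hypothetical helper)."""
    # Pointer table: start offsets of all suffixes that are at least n long,
    # sorted by the suffix they point to.
    pointers = sorted(range(len(text) - n + 1), key=lambda i: text[i:])

    counts = {}
    prev = None
    for pos in pointers:
        gram = text[pos:pos + n]
        # Adjacent sorted suffixes sharing >= n characters hold the same n-gram.
        if prev is not None and common_prefix_len(text, prev, pos) >= n:
            counts[gram] += 1
        else:
            counts[gram] = 1
        prev = pos
    return counts


if __name__ == "__main__":
    print(ngram_counts("abababa", 2))
```

Because the pointer table is sorted once, the same table can be reused for any value of n, which is what makes the approach attractive for a range such as min_ngram to max_ngram.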
Install
Two ways to install Nagao:
Install Nagao from PyPI:
pip install nagao
Install Nagao from the GitHub source:
git clone https://github.com/Chiang97912/nagao.git
cd nagao
python setup.py install
Usage
You can use Nagao from a Python file:
import time

from nagao import Nagao

nagao = Nagao(lang='en', min_ngram=2, max_ngram=6, min_freq=5, min_lrc=2, min_lre=0.5, min_pmi=0, min_eta=0, threshold=0,
              use_disk=True, use_db=True, lower=True, clean=True, verbose=True)
ts = time.time()  # start the timer
nagao.process('path/to/corpus/file')  # process the corpus file
nagao.save('path/to/output/file')  # write the results to the output file
print('total time spent:', time.time() - ts)
From the command line, you can run:
nagao -c "path/to/corpus/file" -o "path/to/output/file" -l zh --clean --verbose
Run nagao --help to see the usage of the nagao CLI:
Options:
  -c, --corpus TEXT           Corpus file path.
  -o, --output TEXT           Output file path.
  -l, --lang TEXT             Corpus language.
  -minn, --min_ngram INTEGER  Minimum n-gram size.
  -maxn, --max_ngram INTEGER  Max n-gram size.
  --min_freq INTEGER          Minimum frequency of word.
  --min_lrc INTEGER           Minimum count between left and right neighbor.
  --min_lre FLOAT             Minimum entropy between left and right neighbor.
  --min_pmi FLOAT             Minimum pmi(pointwise mutual information).
  --min_eta FLOAT             Minimum balanced value for left and right
                              neighbor count.
  --threshold FLOAT           Minimum word probability.
  --stopwords TEXT            Stopword file path.
  --punctuations TEXT         Punctuation file path.
  --lower                     If use lower option, keep lowered dictionary.
  --clean                     If use clean option, the cache file will be
                              cleaned at the end of the program.
  --verbose                   If use verbose option, logs will be displayed on
                              the terminal.
  --help                      Show this message and exit.
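To give a feel for what the min_lre and min_pmi thresholds filter on, the sketch below computes neighbour entropy and pointwise mutual information using their textbook definitions. It is an assumption that the package uses comparable formulas; the exact computation, log base, and smoothing in Nagao may differ. The helpers `neighbor_entropy` and `pmi` are hypothetical and not part of the nagao API.

```python
import math
from collections import Counter


def neighbor_entropy(tokens, phrase, side="left"):
    """Shannon entropy (natural log) of the tokens adjacent to `phrase`."""
    n = len(phrase)
    neighbors = Counter()
    for i in range(len(tokens) - n + 1):
        if tokens[i:i + n] == phrase:
            j = i - 1 if side == "left" else i + n
            if 0 <= j < len(tokens):
                neighbors[tokens[j]] += 1
    total = sum(neighbors.values())
    if total == 0:
        return 0.0
    return -sum(c / total * math.log(c / total) for c in neighbors.values())


def pmi(tokens, x, y):
    """Pointwise mutual information of the adjacent pair (x, y)."""
    unigrams = Counter(tokens)
    bigrams = Counter(zip(tokens, tokens[1:]))
    if bigrams[(x, y)] == 0:
        return float("-inf")  # the pair never occurs together
    p_xy = bigrams[(x, y)] / (len(tokens) - 1)
    p_x = unigrams[x] / len(tokens)
    p_y = unigrams[y] / len(tokens)
    return math.log(p_xy / (p_x * p_y))


if __name__ == "__main__":
    tokens = "a new york b new york c new york".split()
    print(neighbor_entropy(tokens, ["new", "york"], side="left"))
    print(neighbor_entropy(tokens, ["new", "york"], side="right"))
    print(pmi(tokens, "new", "york"))
```

A candidate that appears next to many different neighbours (high entropy) and whose parts co-occur more often than chance (high PMI) is more likely to be a genuine word or phrase, which is the intuition behind raising these thresholds.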
Dependencies
- Python version 3.6
- nltk version 3.5