Counting word frequency based on the Nagao algorithm
Nagao
An implementation of the paper "A New Method of N-gram Statistics for Large Number of n and Automatic Extraction of Words and Phrases from Large Text Data of Japanese" (Nagao and Mori, 1994).
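The core idea of the paper can be sketched in a few lines: sort pointers to every suffix of the text, record how many leading characters each sorted suffix shares with its predecessor (the "coincidence" table), and read n-gram frequencies off the runs. The sketch below is an illustration only, not this package's implementation; the names `common_prefix_len` and `nagao_ngram_counts` are hypothetical, and it works on characters in memory rather than on disk or a database.

```python
def common_prefix_len(a, b):
    """Number of leading characters shared by strings a and b."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n


def nagao_ngram_counts(text, max_n=3):
    """Count all character n-grams (1 <= n <= max_n) via sorted suffix pointers.

    Hypothetical helper for illustration; not part of the nagao package API.
    """
    # Pointer table: suffix start offsets, sorted by each suffix's first max_n characters.
    pointers = sorted(range(len(text)), key=lambda i: text[i:i + max_n])

    # Coincidence table: shared-prefix length between each sorted suffix and the previous one.
    coincidence = [0] * len(pointers)
    for k in range(1, len(pointers)):
        prev = text[pointers[k - 1]:pointers[k - 1] + max_n]
        curr = text[pointers[k]:pointers[k] + max_n]
        coincidence[k] = common_prefix_len(prev, curr)

    counts = {}
    for n in range(1, max_n + 1):
        for k, p in enumerate(pointers):
            gram = text[p:p + n]
            if len(gram) < n:
                continue  # suffix is too short to contain an n-gram
            if coincidence[k] >= n:
                counts[gram] += 1  # same n-gram as the previous sorted suffix
            else:
                counts[gram] = 1   # first suffix of a new run
    return counts


if __name__ == "__main__":
    for gram, freq in sorted(nagao_ngram_counts("abab", max_n=2).items()):
        print(repr(gram), freq)
```

Because equal prefixes are adjacent after sorting, one pass over the coincidence table per n is enough, which is what makes the method practical for large n.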
Install
Two ways to install Nagao:
Install Nagao from PyPI:
pip install nagao
Install Nagao from the GitHub source:
git clone https://github.com/Chiang97912/nagao.git
cd nagao
python setup.py install
Usage
You can use Nagao in a Python file:
import time

from nagao import Nagao

# Thresholds are the values from the original example; adjust them for your corpus.
nagao = Nagao(lang='en', min_ngram=2, max_ngram=6, min_freq=5, min_lrc=2, min_lre=0.5,
              min_pmi=0, min_eta=0, threshold=0, use_disk=True, use_db=True,
              lower=True, clean=True, verbose=True)
ts = time.time()  # start timing the run
nagao.process('path/to/corpus/file')
nagao.save('path/to/output/file')
print('total time spent:', time.time() - ts)
From the command line, you can run:
nagao -c "path/to/corpus/file" -o "path/to/output/file" -l zh --clean --verbose
Run nagao --help to see the CLI options:
Options:
  -c, --corpus TEXT            Corpus file path.
  -o, --output TEXT            Output file path.
  -l, --lang TEXT              Corpus language.
  -minn, --min_ngram INTEGER   Minimum n-gram size.
  -maxn, --max_ngram INTEGER   Max n-gram size.
  --min_freq INTEGER           Minimum frequency of word.
  --min_lrc INTEGER            Minimum count between left and right neighbor.
  --min_lre FLOAT              Minimum entropy between left and right neighbor.
  --min_pmi FLOAT              Minimum pmi(pointwise mutual information).
  --min_eta FLOAT              Minimum balanced value for left and right neighbor count.
  --threshold FLOAT            Minimum word probability.
  --stopwords TEXT             Stopword file path.
  --punctuations TEXT          Punctuation file path.
  --lower                      If use lower option, keep lowered dictionary.
  --clean                      If use clean option, the cache file will be cleaned at the end of the program.
  --verbose                    If use verbose option, logs will be displayed on the terminal.
  --help                       Show this message and exit.
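The neighbor-count, neighbor-entropy and PMI thresholds above filter candidate phrases by how freely they combine with their context and how strongly their parts stick together. The README does not give the exact formulas, so the sketch below uses the standard textbook definitions on a toy token list; it assumes `--min_lrc`, `--min_lre` and `--min_pmi` refer to quantities of this kind, and the function names `entropy`, `neighbor_stats` and `pmi` are hypothetical, not part of the nagao API.

```python
import math
from collections import Counter


def entropy(counter):
    """Shannon entropy (natural log) of a Counter of neighbor frequencies."""
    total = sum(counter.values())
    if total == 0:
        return 0.0
    return -sum((c / total) * math.log(c / total) for c in counter.values())


def neighbor_stats(tokens, phrase):
    """Distinct left/right neighbor counts and neighbor entropies for `phrase`."""
    phrase = tuple(phrase)
    n = len(phrase)
    left, right = Counter(), Counter()
    for i in range(len(tokens) - n + 1):
        if tuple(tokens[i:i + n]) == phrase:
            if i > 0:
                left[tokens[i - 1]] += 1      # token just before the phrase
            if i + n < len(tokens):
                right[tokens[i + n]] += 1     # token just after the phrase
    return len(left), len(right), entropy(left), entropy(right)


def pmi(tokens, x, y):
    """Pointwise mutual information of the adjacent pair (x, y)."""
    unigrams = Counter(tokens)
    bigrams = Counter(zip(tokens, tokens[1:]))
    if bigrams[(x, y)] == 0:
        return float("-inf")  # the pair never occurs
    p_xy = bigrams[(x, y)] / (len(tokens) - 1)
    p_x = unigrams[x] / len(tokens)
    p_y = unigrams[y] / len(tokens)
    return math.log(p_xy / (p_x * p_y))


if __name__ == "__main__":
    toy = "a new york b new york c new deal".split()
    print(neighbor_stats(toy, ("new", "york")))
    print(pmi(toy, "new", "york"))
```

A phrase that appears with many different neighbors (high count and entropy) and whose parts co-occur more often than chance (high PMI) is a stronger candidate for being a real word or phrase.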
Dependencies
- Python version 3.6
- nltk version 3.5