
Counting word frequency based on the Nagao algorithm


Nagao

An implementation of the paper "A new method of N-gram statistics for large number of n and automatic extraction of words and phrases from large text data of Japanese" (Nagao and Mori, 1994).
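As a rough illustration of the idea behind the paper, the sketch below counts character n-grams by sorting suffix start positions, so that equal n-grams end up next to each other and can be counted in one pass. It is a simplified sketch, not this package's implementation; the function name `ngram_counts` and its parameters are hypothetical.

```python
from itertools import groupby


def ngram_counts(text, min_n, max_n):
    """Count character n-grams for min_n <= n <= max_n using sorted suffixes.

    Hypothetical helper for illustration only; not part of the nagao package.
    """
    # Pointer table: one start offset per suffix, ordered by the suffix text.
    # Comparing only the first max_n characters is enough for counting.
    pointers = sorted(range(len(text)), key=lambda i: text[i:i + max_n])

    counts = {}
    for n in range(min_n, max_n + 1):
        # Drop suffixes shorter than n; the rest stay in sorted order,
        # so identical n-grams form one contiguous run.
        prefixes = (text[i:i + n] for i in pointers if i + n <= len(text))
        for gram, run in groupby(prefixes):
            counts[gram] = sum(1 for _ in run)
    return counts


if __name__ == "__main__":
    print(ngram_counts("abababc", 2, 3))
```

Sorting once and reusing the order for every n is what makes the approach attractive when n ranges over many values.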

Install

Two ways to install Nagao:

Install Nagao from PyPI:

pip install nagao

Install Nagao from the GitHub source:

git clone https://github.com/Chiang97912/nagao.git
cd nagao
python setup.py install

Usage

You can use Nagao from a Python script:

import time

from nagao import Nagao

# The threshold arguments share their names with the CLI options listed below.
nagao = Nagao(lang='en', min_ngram=2, max_ngram=6, min_freq=5, min_lrc=2, min_lre=0.5,
              min_pmi=0, min_eta=0, threshold=0,
              use_disk=True, use_db=True, lower=True, clean=True, verbose=True)
ts = time.time()  # start the timer
nagao.process('path/to/corpus/file')
nagao.save('path/to/output/file')
print('total time spent (s):', time.time() - ts)

From the command line, you can run:

nagao -c "path/to/corpus/file" -o "path/to/output/file" -l zh --clean --verbose

Run nagao --help to see the available command-line options:

Options:
  -c, --corpus TEXT           Corpus file path.
  -o, --output TEXT           Output file path.
  -l, --lang TEXT             Corpus language.
  -minn, --min_ngram INTEGER  Minimum n-gram size.
  -maxn, --max_ngram INTEGER  Max n-gram size.
  --min_freq INTEGER          Minimum frequency of word.
  --min_lrc INTEGER           Minimum count between left and right neighbor.
  --min_lre FLOAT             Minimum entropy between left and right neighbor.
  --min_pmi FLOAT             Minimum pmi(pointwise mutual information).
  --min_eta FLOAT             Minimum balanced value for left and right
                              neighbor count.

  --threshold FLOAT           Minimum word probability.
  --stopwords TEXT            Stopword file path.
  --punctuations TEXT         Punctuation file path.
  --lower                     If use lower option, keep lowered dictionary.
  --clean                     If use clean option, the cache file will be
                              cleaned at the end of the program.
  --verbose                   If use verbose option, logs will be displayed on
                              the terminal.

  --help                      Show this message and exit.
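The --min_lre and --min_pmi options refer to two common measures for deciding whether a candidate string behaves like a word: the entropy of its neighboring tokens and pointwise mutual information. The sketch below uses the textbook definitions with natural logarithms. The exact formulas this package applies are not documented here, so treat the function names and the choice of logarithm as assumptions.

```python
import math
from collections import Counter


def neighbor_entropy(neighbors):
    """Shannon entropy (natural log) of the tokens seen next to a candidate.

    Hypothetical helper; a higher value means more varied neighbors.
    """
    counts = Counter(neighbors)
    total = sum(counts.values())
    return -sum(c / total * math.log(c / total) for c in counts.values())


def pmi(p_xy, p_x, p_y):
    """Pointwise mutual information of two parts x and y of a candidate.

    p_xy is the probability of seeing x followed by y; p_x and p_y are the
    probabilities of each part on its own.
    """
    return math.log(p_xy / (p_x * p_y))


if __name__ == "__main__":
    left = ["the", "a", "the", "this"]   # tokens seen to the left of a candidate
    right = ["is", "is", "is", "is"]     # tokens seen to the right
    # Assumption: a candidate is judged by the smaller of its two entropies.
    print(min(neighbor_entropy(left), neighbor_entropy(right)))
    print(pmi(0.01, 0.05, 0.05))
```

A candidate whose neighbors are always the same token has entropy 0 on that side, which suggests it is a fragment of a longer unit rather than a word in its own right.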

Dependencies

  • Python version 3.6

  • nltk version 3.5

