Skip to main content

Make Word2Vec from aozorabunko/aozorabunko

Project description

aovec

Model release Release Package PyPI version

pre-commit.ci status

model

Requirements

How to use

  • Make *.model file
# Install from pypi
pip install aovec

# Clone aozorabunko/aozorabunko (>20GB)
aovec clone

# Parse html files and write to results to novels/
aovec parse

# Make word2vec and write to aozora_model.model
aovec mkvec
from gensim.models import Word2Vec, KeyedVectors

# *.model+*.model.syn1neg.npy+*.model.wv.vectors.npy
model = Word2Vec.load('aozora_model.model')

# or...
# *.kv
model = KeyedVectors.load_word2vec_format('aozora_model.kv')

# or...(fastest way to load)
# *.kv.bin
model = KeyedVectors.load_word2vec_format('aozora_model.kv.bin',
                                          binary=True,
                                          unicode_errors='ignore')

(Optional) Set up mecab-ipadic-neologd on Ubuntu

Download and install

sudo apt install build-essential
git clone --depth 1 https://github.com/neologd/mecab-ipadic-neologd neologd && cd $_
sudo bin/install-mecab-ipadic-neologd -y
sudo mv /usr/lib/*/mecab/dic/mecab-ipadic-neologd /var/lib/mecab/dic

Update /etc/mecabrc

sudo cp /etc/mecabrc /etc/mecabrc.bak
sudo sed -i 's_^dicdir.*_; &\'$'\ndicdir = /var/lib/mecab/dic/mecab-ipadic-neologd_' /etc/mecabrc
--- /etc/mecabrc.bak
+++ /etc/mecabrc
@@ -3,7 +3,8 @@
 ;
 ; $Id: mecabrc.in,v 1.3 2006/05/29 15:36:08 taku-ku Exp $;
 ;
-dicdir = /var/lib/mecab/dic/debian
+; dicdir = /var/lib/mecab/dic/debian
+dicdir = /var/lib/mecab/dic/mecab-ipadic-neologd

 ; userdic = /home/foo/bar/user.dic

Help

$ aovec -h
usage: aovec [-h] [-V] {clone,c,parse,p,mkvec,m} ...

Make Word2Vec from aozorabunko/aozorabunko

positional arguments:
  {clone,c,parse,p,mkvec,m}
    clone (c)           clone aozorabunko/aozorabunko (>20GB)
    parse (p)           parse html files and write to results
    mkvec (m)           make word2vec and write to *.model

optional arguments:
  -h, --help            show this help message and exit
  -V, --version         show program's version number and exit
$ aovec clone -h
usage: aovec clone [-h]

optional arguments:
  -h, --help  show this help message and exit
$ aovec parse -h
usage: aovec parse [-h] [-d DIR]

optional arguments:
  -h, --help            show this help message and exit
  -d DIR, --savedir DIR
                        directory name of saving results (default: novels)
$ aovec mkvec -h
usage: aovec mkvec [-h] [-d DIR] [-o NAME] [-e INT] [-v INT] [-m INT] [-w INT]
                   [-p INT] [-b] [--both]

optional arguments:
  -h, --help            show this help message and exit
  -d DIR, --parsedir DIR
                        directory name of saved parsing results (default:
                        novels)
  -o NAME, --model NAME
                        name of word2vec model (default: aozora_model)
  -e INT, --epochs INT  number of word2vec epochs (default: 5)
  -v INT, --vector_size INT
                        dimensionality of the word vectors (default: 1000)
  -m INT, --min_count INT
                        ignore words total frequency lower than this (default:
                        5)
  -w INT, --window INT  window size of words before and for learning (default:
                        5)
  -p INT, --workers INT
                        worker threads (default: 3)
  -b, --binary          save model files as one binary (default: False)
  --both                save model files as both row data and binary (default:
                        False)

License

MIT

Project details


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

aovec-1.2.1.tar.gz (9.0 kB view hashes)

Uploaded Source

Built Distribution

aovec-1.2.1-py3-none-any.whl (9.1 kB view hashes)

Uploaded Python 3

Supported by

AWS AWS Cloud computing and Security Sponsor Datadog Datadog Monitoring Fastly Fastly CDN Google Google Download Analytics Microsoft Microsoft PSF Sponsor Pingdom Pingdom Monitoring Sentry Sentry Error logging StatusPage StatusPage Status page