Make Word2Vec from aozorabunko/aozorabunko

aovec

model

Requirements

How to use

  • Make *.model file
# Install from pypi
pip install aovec

# Clone aozorabunko/aozorabunko (>20GB)
aovec clone

# Parse HTML files and write the results to novels/
aovec parse

# Make word2vec and write to aozora_model.model
aovec mkvec

  • Load *.model file
from gensim.models import Word2Vec, KeyedVectors

# *.model+*.model.syn1neg.npy+*.model.wv.vectors.npy
model = Word2Vec.load('aozora_model.model')

# or...
# *.kv
model = KeyedVectors.load_word2vec_format('aozora_model.kv')

# or...(fastest way to load)
# *.kv.bin
model = KeyedVectors.load_word2vec_format('aozora_model.kv.bin',
                                          binary=True,
                                          unicode_errors='ignore')
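
The three loading options above differ only in file suffix and loader call. As a minimal sketch, assuming the output files keep the suffixes shown above, a small dispatcher can pick the right call. `loader_kind` and `load_vectors` are hypothetical helper names for illustration, not part of aovec or gensim:

```python
from pathlib import Path


def loader_kind(path):
    """Classify a model file by suffix: 'kv_bin', 'kv', or 'model'."""
    name = Path(path).name
    # Check the longest suffix first, since '*.kv.bin' also contains '.kv'.
    if name.endswith(".kv.bin"):
        return "kv_bin"
    if name.endswith(".kv"):
        return "kv"
    if name.endswith(".model"):
        return "model"
    raise ValueError(f"unrecognized model file suffix: {name}")


def load_vectors(path):
    """Return KeyedVectors for any of the three file layouts."""
    # Imported lazily so loader_kind() works without gensim installed.
    from gensim.models import KeyedVectors, Word2Vec

    kind = loader_kind(path)
    if kind == "kv_bin":
        return KeyedVectors.load_word2vec_format(
            path, binary=True, unicode_errors="ignore"
        )
    if kind == "kv":
        return KeyedVectors.load_word2vec_format(path)
    # A full Word2Vec model keeps its word vectors in the .wv attribute.
    return Word2Vec.load(path).wv
```

With the vectors loaded, a call such as `load_vectors('aozora_model.kv.bin').most_similar('猫', topn=5)` returns the nearest neighbours, provided the query word is in the vocabulary.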

(Optional) Set up mecab-ipadic-neologd on Ubuntu

Download and install

sudo apt install build-essential
git clone --depth 1 https://github.com/neologd/mecab-ipadic-neologd neologd && cd $_
sudo bin/install-mecab-ipadic-neologd -y
sudo mv /usr/lib/*/mecab/dic/mecab-ipadic-neologd /var/lib/mecab/dic

Update /etc/mecabrc

sudo cp /etc/mecabrc /etc/mecabrc.bak
sudo sed -i 's_^dicdir.*_; &\'$'\ndicdir = /var/lib/mecab/dic/mecab-ipadic-neologd_' /etc/mecabrc
--- /etc/mecabrc.bak
+++ /etc/mecabrc
@@ -3,7 +3,8 @@
 ;
 ; $Id: mecabrc.in,v 1.3 2006/05/29 15:36:08 taku-ku Exp $;
 ;
-dicdir = /var/lib/mecab/dic/debian
+; dicdir = /var/lib/mecab/dic/debian
+dicdir = /var/lib/mecab/dic/mecab-ipadic-neologd

 ; userdic = /home/foo/bar/user.dic
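
To confirm that the edit took effect, the sketch below lists the `dicdir` entries that are not commented out. It assumes only what the diff above shows: settings are written as `key = value`, and lines starting with `;` are comments. `active_dicdirs` is a hypothetical helper name, not part of MeCab or aovec:

```python
def active_dicdirs(mecabrc_text):
    """Return every uncommented dicdir value found in mecabrc text."""
    values = []
    for line in mecabrc_text.splitlines():
        stripped = line.strip()
        # Skip blank lines and ';' comment lines.
        if not stripped or stripped.startswith(";"):
            continue
        key, sep, rest = stripped.partition("=")
        if sep and key.strip() == "dicdir":
            values.append(rest.strip())
    return values
```

After the update, `active_dicdirs(open('/etc/mecabrc').read())` should contain exactly one entry, the mecab-ipadic-neologd path.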

Help

$ aovec -h
usage: aovec [-h] [-V] {clone,c,parse,p,mkvec,m} ...

Make Word2Vec from aozorabunko/aozorabunko

positional arguments:
  {clone,c,parse,p,mkvec,m}
    clone (c)           clone aozorabunko/aozorabunko (>20GB)
    parse (p)           parse html files and write to results
    mkvec (m)           make word2vec and write to *.model

optional arguments:
  -h, --help            show this help message and exit
  -V, --version         show program's version number and exit

$ aovec clone -h
usage: aovec clone [-h]

optional arguments:
  -h, --help  show this help message and exit

$ aovec parse -h
usage: aovec parse [-h] [-d DIR]

optional arguments:
  -h, --help            show this help message and exit
  -d DIR, --savedir DIR
                        directory name of saving results (default: novels)

$ aovec mkvec -h
usage: aovec mkvec [-h] [-d DIR] [-o NAME] [-e INT] [-v INT] [-m INT] [-w INT]
                   [-p INT] [-b] [--both]

optional arguments:
  -h, --help            show this help message and exit
  -d DIR, --parsedir DIR
                        directory name of saved parsing results (default:
                        novels)
  -o NAME, --model NAME
                        name of word2vec model (default: aozora_model)
  -e INT, --epochs INT  number of word2vec epochs (default: 5)
  -v INT, --vector_size INT
                        dimensionality of the word vectors (default: 1000)
  -m INT, --min_count INT
                        ignore words total frequency lower than this (default:
                        5)
  -w INT, --window INT  window size of words before and for learning (default:
                        5)
  -p INT, --workers INT
                        worker threads (default: 3)
  -b, --binary          save model files as one binary (default: False)
  --both                save model files as both row data and binary (default:
                        False)
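
When training is driven from a script rather than the shell, the options above can be assembled programmatically. The following is a minimal sketch that uses only the long flags and defaults listed in the `aovec mkvec -h` output. `build_mkvec_command` and `run_mkvec` are hypothetical helper names, and the sketch assumes the `aovec` executable is on PATH:

```python
import subprocess


def build_mkvec_command(parsedir="novels", model="aozora_model", epochs=5,
                        vector_size=1000, min_count=5, window=5, workers=3,
                        binary=False, both=False):
    """Build the argv list for `aovec mkvec` from the documented options."""
    cmd = [
        "aovec", "mkvec",
        "--parsedir", parsedir,
        "--model", model,
        "--epochs", str(epochs),
        "--vector_size", str(vector_size),
        "--min_count", str(min_count),
        "--window", str(window),
        "--workers", str(workers),
    ]
    # The two output-format switches are plain flags without values.
    if binary:
        cmd.append("--binary")
    if both:
        cmd.append("--both")
    return cmd


def run_mkvec(**options):
    """Run `aovec mkvec`; raises CalledProcessError on a non-zero exit."""
    subprocess.run(build_mkvec_command(**options), check=True)
```

For example, `run_mkvec(vector_size=300, binary=True)` would train 300-dimensional vectors and save them as a single binary file, leaving every other option at its default.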

License

MIT

Download files

Source Distribution

aovec-1.2.1.tar.gz (9.0 kB)

  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/4.0.1 CPython/3.9.13
  • SHA256: c42da1f4ca88ef3bf9e100f8c4ededd53cd776cb6633dee05d742cd524c1d5ec
  • MD5: f64f8e14dec2f2452d4dd9875efd7d3b
  • BLAKE2b-256: b63e45b65cbe86c6793ca900c2b753ec49ae27986a141c9c7517f50651eb2564

Built Distribution

aovec-1.2.1-py3-none-any.whl (9.1 kB)

  • Tags: Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/4.0.1 CPython/3.9.13
  • SHA256: effd3f7a5e49adc7f74cd923ca0d04a8b681cd21f2e720fa80865ec0e87a6e11
  • MD5: a12fbee9e141bd95a1e3ae729b44764b
  • BLAKE2b-256: 8c0c461ddcabf2872a95f991396a429a8c0225e26c73f92d3564913cfb195765
