aovec
Make Word2Vec from aozorabunko/aozorabunko
Pre-built models are available from week* Releases.
Requirements
- Git
- MeCab
- MeCab Checker: src/check_mecab.py
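Before installing, it can help to confirm that both required tools are on `PATH`. The snippet below is a minimal sketch, not the project's `src/check_mecab.py` (whose contents are not shown here). The helper name `missing_tools` is hypothetical, and the command names `git` and `mecab` are assumed to be the usual ones.

```python
import shutil

# Executables listed under Requirements. The command names are an
# assumption: they are the usual names for Git and MeCab on PATH.
REQUIRED_TOOLS = ("git", "mecab")


def missing_tools(tools=REQUIRED_TOOLS):
    """Return the entries of `tools` that cannot be found on PATH."""
    return [tool for tool in tools if shutil.which(tool) is None]
```

An empty list from `missing_tools()` means both commands were found. It does not verify that MeCab has a working dictionary.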
How to use
- Make a *.model file
# Install from pypi
pip install aovec
# Clone aozorabunko/aozorabunko (>20GB)
aovec clone
# Parse html files and write results to novels/
aovec parse
# Make word2vec and write to aozora_model.model
aovec mkvec
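The three steps above can also be driven from a script. This is a rough sketch that assumes `aovec` is already installed and on `PATH`. The function names `pipeline_commands` and `run_pipeline` are hypothetical. The `-d` and `-o` flags are taken from the help output in this README.

```python
import subprocess


def pipeline_commands(savedir="novels", model="aozora_model"):
    """Build the clone -> parse -> mkvec commands as argument lists."""
    return [
        ["aovec", "clone"],
        ["aovec", "parse", "-d", savedir],
        ["aovec", "mkvec", "-d", savedir, "-o", model],
    ]


def run_pipeline(savedir="novels", model="aozora_model"):
    """Run each step in order, stopping at the first failure."""
    for command in pipeline_commands(savedir, model):
        # check=True raises CalledProcessError on a non-zero exit code.
        subprocess.run(command, check=True)
```

Calling `run_pipeline()` with no arguments should match the defaults used above. Note that the clone step downloads more than 20GB.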
- Use from Python (See: official document)
from gensim.models import Word2Vec, KeyedVectors
# *.model+*.model.syn1neg.npy+*.model.wv.vectors.npy
model = Word2Vec.load('aozora_model.model')
# or...
# *.kv
model = KeyedVectors.load_word2vec_format('aozora_model.kv')
# or...(fastest way to load)
# *.kv.bin
model = KeyedVectors.load_word2vec_format('aozora_model.kv.bin',
                                          binary=True,
                                          unicode_errors='ignore')
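The three loading variants differ only by file suffix, so they can be wrapped in one helper. The following is a sketch under that assumption. `vector_format` and `load_vectors` are hypothetical names, not part of aovec or gensim. The gensim calls are the same ones shown above.

```python
from pathlib import Path


def vector_format(path):
    """Classify a model file by suffix: 'model', 'kv' or 'kv.bin'."""
    name = Path(path).name
    if name.endswith(".kv.bin"):
        return "kv.bin"
    if name.endswith(".kv"):
        return "kv"
    if name.endswith(".model"):
        return "model"
    raise ValueError(f"unrecognised model file: {path}")


def load_vectors(path):
    """Load word vectors (KeyedVectors) from any of the three file kinds."""
    # gensim is imported lazily so vector_format() works without it.
    from gensim.models import KeyedVectors, Word2Vec

    kind = vector_format(path)
    if kind == "model":
        # A full Word2Vec model keeps its vectors in the .wv attribute.
        return Word2Vec.load(str(path)).wv
    return KeyedVectors.load_word2vec_format(
        str(path), binary=(kind == "kv.bin"), unicode_errors="ignore"
    )
```

With the vectors loaded, a call such as `load_vectors('aozora_model.kv.bin').most_similar('猫', topn=5)` returns the nearest neighbours of a word, provided that word is in the vocabulary.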
(Optional) Set up mecab-ipadic-neologd on Ubuntu
Download and install
sudo apt install build-essential
git clone --depth 1 https://github.com/neologd/mecab-ipadic-neologd neologd && cd $_
sudo bin/install-mecab-ipadic-neologd -y
sudo mv /usr/lib/*/mecab/dic/mecab-ipadic-neologd /var/lib/mecab/dic
Update /etc/mecabrc
sudo cp /etc/mecabrc /etc/mecabrc.bak
sudo sed -i 's_^dicdir.*_; &\'$'\ndicdir = /var/lib/mecab/dic/mecab-ipadic-neologd_' /etc/mecabrc
--- /etc/mecabrc.bak
+++ /etc/mecabrc
@@ -3,7 +3,8 @@
;
; $Id: mecabrc.in,v 1.3 2006/05/29 15:36:08 taku-ku Exp $;
;
-dicdir = /var/lib/mecab/dic/debian
+; dicdir = /var/lib/mecab/dic/debian
+dicdir = /var/lib/mecab/dic/mecab-ipadic-neologd
; userdic = /home/foo/bar/user.dic
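For readers who find the `sed` one-liner hard to follow, the same rewrite can be expressed in Python. This is an illustrative sketch that works on the file contents as a string. `switch_dicdir` is a hypothetical name, and the code does not touch `/etc/mecabrc` itself.

```python
import re

NEOLOGD_DIR = "/var/lib/mecab/dic/mecab-ipadic-neologd"


def switch_dicdir(mecabrc_text, new_dicdir=NEOLOGD_DIR):
    """Comment out each active dicdir line and add the new one below it."""

    def replace(match):
        # Keep the old setting as a ';' comment, then append the new line.
        return f"; {match.group(0)}\ndicdir = {new_dicdir}"

    # MULTILINE makes ^ and $ match at every line boundary, so only lines
    # that start with "dicdir" (not already commented out) are rewritten.
    return re.sub(r"^dicdir.*$", replace, mecabrc_text, flags=re.MULTILINE)
```

Like the `sed` command, this is not idempotent: applying it to an already updated file comments out the neologd line and adds it again.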
Help
$ aovec -h
usage: aovec [-h] [-V] {clone,c,parse,p,mkvec,m} ...
Make Word2Vec from aozorabunko/aozorabunko
positional arguments:
{clone,c,parse,p,mkvec,m}
clone (c) clone aozorabunko/aozorabunko (>20GB)
parse (p) parse html files and write to results
mkvec (m) make word2vec and write to *.model
optional arguments:
-h, --help show this help message and exit
-V, --version show program's version number and exit
$ aovec clone -h
usage: aovec clone [-h]
optional arguments:
-h, --help show this help message and exit
$ aovec parse -h
usage: aovec parse [-h] [-d DIR]
optional arguments:
-h, --help show this help message and exit
-d DIR, --savedir DIR
directory name of saving results (default: novels)
$ aovec mkvec -h
usage: aovec mkvec [-h] [-d DIR] [-o NAME] [-e INT] [-v INT] [-m INT] [-w INT]
[-p INT] [-b] [--both]
optional arguments:
-h, --help show this help message and exit
-d DIR, --parsedir DIR
directory name of saved parsing results (default:
novels)
-o NAME, --model NAME
name of word2vec model (default: aozora_model)
-e INT, --epochs INT number of word2vec epochs (default: 5)
-v INT, --vector_size INT
dimensionality of the word vectors (default: 1000)
-m INT, --min_count INT
ignore words total frequency lower than this (default:
5)
-w INT, --window INT window size of words before and for learning (default:
5)
-p INT, --workers INT
worker threads (default: 3)
-b, --binary save model files as one binary (default: False)
--both save model files as both row data and binary (default:
False)
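The `mkvec` option names match keyword arguments of gensim 4.x `Word2Vec` (`epochs`, `vector_size`, `min_count`, `window`, `workers`). The sketch below collects the defaults listed in the help output into such a keyword dictionary. Whether aovec passes them to gensim in exactly this way is an assumption, and `word2vec_kwargs` is a hypothetical helper.

```python
# Defaults copied from the `aovec mkvec -h` output above.
MKVEC_DEFAULTS = {
    "epochs": 5,
    "vector_size": 1000,
    "min_count": 5,
    "window": 5,
    "workers": 3,
}


def word2vec_kwargs(**overrides):
    """Return the mkvec defaults, updated with any given overrides."""
    unknown = set(overrides) - set(MKVEC_DEFAULTS)
    if unknown:
        raise TypeError(f"unknown options: {sorted(unknown)}")
    return {**MKVEC_DEFAULTS, **overrides}
```

With gensim 4.x installed, `Word2Vec(sentences, **word2vec_kwargs(vector_size=300))` would train a smaller model, where `sentences` is an iterable of token lists.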
License
MIT