A generic language-stemming utility for gensim word embeddings.
Project description
Word embedding: generic iterative stemmer
A generic helper for training gensim and fasttext word embedding models.
Specifically, this repository was created to implement stemming on a Wikipedia-based corpus in Hebrew, but it will probably work for other corpus sources and languages as well.
Note that while there are sophisticated and efficient approaches to the stemming task, this repository implements a naive approach with no strict time or memory considerations (more about that in the explanation section).
Based on https://github.com/liorshk/wordembedding-hebrew.
Setup
- Create a `python3` virtual environment.
- Install dependencies using `make install` (this will run tests too).
Usage
This section shows the basic flow this repository was designed to perform. It supports more complicated flows as well.
The output of the training process is a `StemmedKeyedVectors` object (in the form of a `.kv` file), which inherits from the standard `gensim.models.KeyedVectors`.
- Under the `./data` folder, create a directory for your corpus (for example, `wiki-he`).
- Download a Hebrew (or any other language) dataset from Wikipedia:
  - Go to wikimedia dumps.
  - Download `hewiki-latest-pages-articles.xml.bz2`, and save it under `./data/wiki-he`.
- Create your initial text corpus: TODO: create a notebook for that.
- Train the model: TODO: create a notebook for that.
- Play with your trained model using `playground.ipynb`.
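The directory and download steps above could be scripted roughly as follows. This is only a sketch and not part of this repository: the helper names (`dump_file_name`, `dump_url`, `prepare_corpus_dir`, `download_dump`) are hypothetical, and the URL layout is an assumption based on how Wikimedia dumps are usually published.

```python
from pathlib import Path
from urllib.request import urlretrieve

# Assumption: the usual public location of Wikimedia dumps.
DUMPS_BASE_URL = "https://dumps.wikimedia.org"


def dump_file_name(lang: str) -> str:
    """File name of the latest articles dump for a Wikipedia language code."""
    return f"{lang}wiki-latest-pages-articles.xml.bz2"


def dump_url(lang: str) -> str:
    """URL of the latest articles dump (assumed layout, see lead-in)."""
    return f"{DUMPS_BASE_URL}/{lang}wiki/latest/{dump_file_name(lang)}"


def prepare_corpus_dir(data_root: Path, lang: str) -> Path:
    """Create <data_root>/wiki-<lang> if needed and return the dump's target path."""
    corpus_dir = data_root / f"wiki-{lang}"
    corpus_dir.mkdir(parents=True, exist_ok=True)
    return corpus_dir / dump_file_name(lang)


def download_dump(data_root: Path, lang: str) -> Path:
    """Download the dump into the corpus directory unless it is already there."""
    target = prepare_corpus_dir(data_root, lang)
    if not target.exists():
        urlretrieve(dump_url(lang), target)
    return target
```

For the Hebrew example, `download_dump(Path("./data"), "he")` would save the file as `./data/wiki-he/hewiki-latest-pages-articles.xml.bz2`. Skipping the download when the file already exists keeps the helper safe to re-run.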
Generic iterative stemming
TODO: Explain the algorithm.
Source Distribution
File details
Details for the file generic-iterative-stemmer-0.3.1.tar.gz.
File metadata
- Download URL: generic-iterative-stemmer-0.3.1.tar.gz
- Upload date:
- Size: 15.0 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/3.7.1 importlib_metadata/4.10.1 pkginfo/1.8.2 requests/2.27.1 requests-toolbelt/0.9.1 tqdm/4.62.3 CPython/3.8.10
File hashes
Algorithm | Hash digest
---|---
SHA256 | 181b25b62ec5fe4a6ea065aaae5073fe833bba81361634674b8d7a59611e60af
MD5 | 84d216e1efec2e239ea3eda435984eb3
BLAKE2b-256 | 91220afc61d05a5c571733f5513e110f093968f5a1925c8522b9538bfaea3fa7