A generic language stemming utility, designed for gensim word-embedding models.
Word embedding: generic iterative stemmer
A generic helper for training gensim and fasttext word embedding models.
Specifically, this repository was created in order to implement stemming on a Wikipedia-based corpus in Hebrew, but it will likely work for other corpus sources and languages as well.
It is important to note that while there are sophisticated and efficient approaches to the stemming task, this repository implements a naive approach, with no strict time or memory considerations (more about that in the explanation section).
Based on https://github.com/liorshk/wordembedding-hebrew.
Setup
- Create a `python3` virtual environment.
- Install dependencies using `make install` (this will run tests too). See the example below.
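For example (a minimal sketch, assuming a Unix-like shell; the environment name `.venv` is arbitrary):

```shell
# Create and activate a Python 3 virtual environment
python3 -m venv .venv
source .venv/bin/activate

# Install dependencies; this also runs the test suite
make install
```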
Usage
The general flow is as follows:
- Get a text corpus (for example, from Wikipedia).
- Create a training program.
- Run a `StemmingTrainer`.
The output of the training process is a `generic_iterative_stemmer.models.StemmedKeyedVectors` object (in the form of a `.kv` file). It has the same interface as the standard `gensim.models.KeyedVectors`, so the two can be used interchangeably.
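For example, loading a trained model and querying it (a minimal sketch; the file path and query word are placeholders, and `load()` is assumed to mirror the standard `KeyedVectors` interface, as stated above):

```python
from generic_iterative_stemmer.models import StemmedKeyedVectors

# Path is a placeholder; point it at the .kv file produced by training.
kv = StemmedKeyedVectors.load("language_data/wiki-he/model.kv")

# Query it exactly like a gensim KeyedVectors object.
print(kv.most_similar("שלום", topn=5))
```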
0. (Optional) Set up a language data cache
`generic_iterative_stemmer` uses a language data cache to store its output and intermediate results.
The language data directory is useful if you want to train multiple models on the same corpus, or if you want to train a model on a corpus that you've already trained on in the past, with different parameters.
To set up the language data cache, run `mkdir -p ~/.cache/language_data`.
Tip: soft-link the language data cache to your project's root directory, e.g. `ln -s ~/.cache/language_data language_data`.
1. Get a text corpus
If you don't have a specific corpus in mind, you can use Wikipedia. Here's how:
- Under the `~/.cache/language_data` folder, create a directory for your corpus (for example, `wiki-he`).
- Download a Hebrew (or any other language) dataset from Wikipedia:
  - Go to wikimedia dumps (in the URL, replace `he` with your language code).
  - Download the matching `wiki-latest-pages-articles.xml.bz2` file, and place it in your corpus directory.
- Create the initial text corpus: run the script inside `notebooks/create_corpus.py` (change parameters as needed), as sketched below. This will create a `corpus.txt` file in your corpus directory. It takes roughly 15 minutes to run (depending on the corpus size and your computer).
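The actual script may differ, but here is a minimal sketch of this step, assuming it uses gensim's `WikiCorpus` to extract plain text from the dump (the file paths are placeholders):

```python
import os

from gensim.corpora import WikiCorpus

# Placeholder paths; adjust them to your corpus directory.
dump_path = os.path.expanduser("~/.cache/language_data/wiki-he/wiki-latest-pages-articles.xml.bz2")
corpus_path = os.path.expanduser("~/.cache/language_data/wiki-he/corpus.txt")

# Passing an empty dictionary skips gensim's vocabulary-building pass over the dump.
wiki = WikiCorpus(dump_path, dictionary={})

with open(corpus_path, "w", encoding="utf-8") as corpus_file:
    for tokens in wiki.get_texts():  # each article is yielded as a list of tokens
        corpus_file.write(" ".join(tokens) + "\n")
```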
2. Create a training program
TODO
3. Run a StemmingTrainer
TODO
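While this step is still marked TODO, here is a purely hypothetical sketch (the import path, constructor argument, and `train()` method are assumptions, not the documented API; only the `StemmingTrainer` name comes from the flow above):

```python
from generic_iterative_stemmer.training import StemmingTrainer  # import path is an assumption

# The parameter below is an illustrative guess, not the documented API.
trainer = StemmingTrainer(corpus_folder="~/.cache/language_data/wiki-he")
trainer.train()  # assumed entry point for the iterative stem-and-retrain loop
```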
4. Play with your trained model
Play with your trained model using `playground.ipynb`.
Generic iterative stemming
TODO: Explain the algorithm.