
Word embedding: generic iterative stemmer

[Badges: PyPI version, pipeline status, codecov, Ruff, black, isort, mypy, pylint]

A generic helper for training gensim and fasttext word embedding models.
Specifically, this repository was created to implement stemming on a Wikipedia-based corpus in Hebrew, but it should work for other corpus sources and languages as well.

Note that while there are sophisticated and efficient approaches to the stemming task, this repository implements a naive approach with no strict time or memory constraints (more on that in the explanation section below).

Based on https://github.com/liorshk/wordembedding-hebrew.

Setup

  1. Create a python3 virtual environment.
  2. Install dependencies using make install (this will run tests too).

Usage

The general flow is as follows:

  1. Get a text corpus (for example, from Wikipedia).
  2. Create a training program.
  3. Run a StemmingTrainer.

The output of the training process is a generic_iterative_stemmer.models.StemmedKeyedVectors object (saved as a .kv file). It has the same interface as the standard gensim.models.KeyedVectors, so the two can be used interchangeably.
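
For example, loading a saved model and querying it might look like the following. This is a minimal sketch: the file path is illustrative, and it assumes StemmedKeyedVectors inherits gensim's standard load method, which the package docs do not confirm.

    # Minimal sketch. The path is illustrative, and loading via
    # StemmedKeyedVectors.load assumes it inherits gensim's KeyedVectors API.
    from generic_iterative_stemmer.models import StemmedKeyedVectors

    kv = StemmedKeyedVectors.load("language_data/wiki-he/model.kv")

    # Standard gensim KeyedVectors queries work unchanged:
    print(kv.most_similar("some_word", topn=5))  # nearest neighbors by cosine similarity
    vector = kv["some_word"]                     # raw embedding vector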

0. (Optional) Set up a language data cache

generic_iterative_stemmer uses a language data cache to store its output and intermediate results.
The language data directory is useful if you want to train multiple models on the same corpus, or to retrain with different parameters on a corpus you've already processed in the past.

To set up the language data cache, run mkdir -p ~/.cache/language_data.

Tip: soft-link the language data cache to your project's root directory, e.g. ln -s ~/.cache/language_data language_data.

1. Get a text corpus

If you don't have a specific corpus in mind, you can use Wikipedia. Here's how:

  1. Under the ~/.cache/language_data folder, create a directory for your corpus (for example, wiki-he).
  2. Download the Hebrew (or any other language) dataset from Wikipedia:
    1. Go to the Wikimedia dumps page (in the URL, replace he with your language code).
    2. Download the matching wiki-latest-pages-articles.xml.bz2 file, and place it in your corpus directory.
  3. Create the initial text corpus: run the script inside notebooks/create_corpus.py (change parameters as needed); a sketch of this extraction step appears after this list.
    This will create a corpus.txt file in your corpus directory. It takes roughly 15 minutes to run, depending on the corpus size and your machine.
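
If you would rather not use the notebook, the same extraction can be done directly with gensim's WikiCorpus. A minimal sketch, assuming a Hebrew dump placed as described above (file names are illustrative, and this is not necessarily what notebooks/create_corpus.py does):

    from gensim.corpora.wikicorpus import WikiCorpus

    # dictionary={} skips building a gensim Dictionary, which is not needed
    # when we only want the plain article text.
    wiki = WikiCorpus("wiki-he/hewiki-latest-pages-articles.xml.bz2", dictionary={})

    with open("wiki-he/corpus.txt", "w", encoding="utf-8") as out:
        for tokens in wiki.get_texts():  # yields each article as a list of tokens
            out.write(" ".join(tokens) + "\n")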

2. Create a training program

TODO
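
While this step is not yet documented, the package trains gensim models under the hood, so a training program presumably boils down to choosing standard gensim Word2Vec parameters. A rough, unofficial sketch (values are illustrative):

    # Illustrative only: the package's actual training-program format is not
    # documented. These are standard gensim Word2Vec parameters.
    from gensim.models import Word2Vec
    from gensim.models.word2vec import LineSentence

    sentences = LineSentence("wiki-he/corpus.txt")  # one tokenized article per line
    model = Word2Vec(
        sentences=sentences,
        vector_size=100,  # embedding dimensionality
        window=5,         # context window size
        min_count=5,      # drop words rarer than this
        workers=4,        # parallel training threads
    )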

3. Run a StemmingTrainer

TODO
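
Also not yet documented. Purely as a hypothetical sketch (the import path, constructor arguments, and method names below are guesses, not the package's confirmed API):

    # Hypothetical sketch: the import path, constructor arguments, and method
    # names are assumptions, not the package's documented API.
    from generic_iterative_stemmer.training import StemmingTrainer  # hypothetical path

    trainer = StemmingTrainer(corpus_folder="wiki-he")  # hypothetical argument
    trainer.train()  # hypothetical; should produce the .kv model described above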

4. Play with your trained model

Once training completes, explore the model interactively using playground.ipynb.

Generic iterative stemming

TODO: Explain the algorithm.
