A generic language stemming utility for gensim word embeddings.

Word embedding: generic iterative stemmer


A generic helper for training gensim and fasttext word embedding models.
Specifically, this repository was created to implement stemming on a Wikipedia-based Hebrew corpus, but it should work for other corpus sources and languages as well.

Note that while there are sophisticated and efficient approaches to the stemming task, this repository implements a naive approach with no strict time or memory constraints (more on that in the explanation section).

Based on https://github.com/liorshk/wordembedding-hebrew.

Setup

  1. Create a python3 virtual environment.
  2. Install dependencies using make install (this will run tests too).

Usage

The general flow is as follows:

  1. Get a text corpus (for example, from Wikipedia).
  2. Create a training program.
  3. Run a StemmingTrainer.

The output of the training process is a generic_iterative_stemmer.models.StemmedKeyedVectors object (in the form of a .kv file). It has the same interface as the standard gensim.models.KeyedVectors, so the two can be used interchangeably.
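Because StemmedKeyedVectors mirrors the gensim.models.KeyedVectors interface, code written against that interface accepts either object. A minimal sketch of this interchangeability (the helper below is hypothetical, not part of the package):

```python
def nearest_words(kv, word, topn=3):
    """Return the topn most similar words to `word`.

    `kv` can be a gensim KeyedVectors or a StemmedKeyedVectors -- both
    expose the same `most_similar(word, topn=...)` method, so this helper
    doesn't need to know which one it received.
    """
    return [w for w, _score in kv.most_similar(word, topn=topn)]
```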

0. (Optional) Set up a language data cache

generic_iterative_stemmer uses a language data cache to store its output and intermediate results.
The language data directory is useful if you want to train multiple models on the same corpus, or to retrain with different parameters on a corpus you've already processed.

To set up the language data cache, run mkdir -p ~/.cache/language_data.

Tip: soft-link the language data cache to your project's root directory, e.g. ln -s ~/.cache/language_data language_data.
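The two setup commands above can be run together (paths are the README's defaults; -sfn makes the link step safe to re-run by replacing an existing link):

```shell
# Create the language data cache directory.
mkdir -p ~/.cache/language_data

# Optional: link the cache into your project root.
ln -sfn ~/.cache/language_data language_data
```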

1. Get a text corpus

If you don't have a specific corpus in mind, you can use Wikipedia. Here's how:

  1. Under the ~/.cache/language_data folder, create a directory for your corpus (for example, wiki-he).
  2. Download a Hebrew (or any other language) dataset from Wikipedia:
    1. Go to the Wikimedia dumps page (in the URL, replace he with your language code).
    2. Download the matching wiki-latest-pages-articles.xml.bz2 file and place it in your corpus directory.
  3. Create the initial text corpus: run the script notebooks/create_corpus.py (changing parameters as needed).
    This creates a corpus.txt file in your corpus directory. It takes roughly 15 minutes to run, depending on the corpus size and your machine.
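The end result of this step is a plain-text file with one cleaned, space-separated article per line. As a rough illustration of that shape (this is a toy sketch, not the actual create_corpus.py logic, which handles wiki markup and more):

```python
import re


def clean_line(raw: str) -> str:
    """Toy normalization: lowercase, keep word characters, collapse whitespace.

    Illustrative only -- the real corpus-creation script is more involved.
    """
    tokens = re.findall(r"\w+", raw.lower())
    return " ".join(tokens)


def build_corpus(articles, out_path):
    """Write one cleaned line per non-empty article: the corpus.txt shape."""
    with open(out_path, "w", encoding="utf-8") as f:
        for article in articles:
            line = clean_line(article)
            if line:
                f.write(line + "\n")
```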

2. Create a training program

TODO

3. Run a StemmingTrainer

TODO

4. Play with your trained model

Play with your trained model using playground.ipynb.

Generic iterative stemming

TODO: Explain the algorithm.
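Until the algorithm is documented here, a rough, non-authoritative sketch of what a naive iterative stemmer can look like: map each word to a shorter, highly similar word (a candidate stem), rewrite the corpus through that map, and repeat until no new merges appear. The toy below plugs in an arbitrary `similar` predicate where the real implementation would consult trained embeddings:

```python
def stem_pass(vocab, similar):
    """One stemming pass: map each word to a shorter prefix word, if any.

    `similar(word, cand)` is a stand-in for embedding similarity; the real
    project presumably derives this from a trained gensim model.
    """
    stem_map = {}
    for word in vocab:
        for cand in vocab:
            if len(cand) < len(word) and word.startswith(cand) and similar(word, cand):
                stem_map[word] = cand
                break
    return stem_map


def iterative_stem(corpus, similar, max_iters=10):
    """Repeatedly rewrite the corpus through stem maps until it stabilizes."""
    for _ in range(max_iters):
        vocab = sorted(set(corpus), key=len)  # try shortest candidate stems first
        stem_map = stem_pass(vocab, similar)
        if not stem_map:
            break
        corpus = [stem_map.get(w, w) for w in corpus]
    return corpus
```

With a permissive similarity predicate, inflected forms collapse onto their shortest prefix: `iterative_stem(["walking", "walked", "walk", "talk", "talking"], lambda a, b: True)` yields `["walk", "walk", "walk", "talk", "talk"]`.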
