
Word embedding: generic iterative stemmer


A generic helper for training gensim and fasttext word embedding models.
Specifically, this repository was created to implement stemming on a Hebrew Wikipedia-based corpus, but it should work for other corpus sources and languages as well.

Note that while there are sophisticated and efficient approaches to the stemming task, this repository implements a naive approach with no strict time or memory constraints (more on that in the explanation section).

Based on https://github.com/liorshk/wordembedding-hebrew.

Setup

  1. Create a python3 virtual environment.
  2. Install dependencies using make install (this will run tests too).

Usage

The general flow is as follows:

  1. Get a text corpus (for example, from Wikipedia).
  2. Create a training program.
  3. Run a StemmingTrainer.

The output of the training process is a generic_iterative_stemmer.models.StemmedKeyedVectors object (saved as a .kv file). It has the same interface as the standard gensim.models.KeyedVectors, so the two can be used interchangeably.
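
For example, loading and querying a trained model might look like the sketch below. The path is a placeholder, and it assumes StemmedKeyedVectors exposes the same load classmethod as gensim's KeyedVectors, as the shared interface implies:

```python
from generic_iterative_stemmer.models import StemmedKeyedVectors

# Placeholder path: wherever your training run saved its .kv output.
kv = StemmedKeyedVectors.load("language_data/wiki-he/model.kv")

# From here on, the standard gensim KeyedVectors API applies:
print(kv.most_similar("שלום", topn=5))  # nearest neighbors of a word
print(kv.similarity("מלך", "מלכה"))  # cosine similarity between two words
```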

0. (Optional) Set up a language data cache

generic_iterative_stemmer uses a language data cache to store its output and intermediate results.
The language data directory is useful if you want to train multiple models on the same corpus, or to retrain with different parameters on a corpus you've already processed in the past.

To set up the language data cache, run mkdir -p ~/.cache/language_data.

Tip: soft-link the language data cache to your project's root directory, e.g. ln -s ~/.cache/language_data language_data.

1. Get a text corpus

If you don't have a specific corpus in mind, you can use Wikipedia. Here's how:

  1. Under the ~/.cache/language_data folder, create a directory for your corpus (for example, wiki-he).
  2. Download a Hebrew (or any other language) dataset from Wikipedia:
    1. Go to the Wikimedia dumps page at https://dumps.wikimedia.org/hewiki/latest/ (in the URL, replace he with your language code).
    2. Download the matching <language>wiki-latest-pages-articles.xml.bz2 file (e.g. hewiki-latest-pages-articles.xml.bz2 for Hebrew), and place it in your corpus directory.
  3. Create the initial text corpus: run the script inside notebooks/create_corpus.py (change parameters as needed; a sketch of this step follows the list).
    This will create a corpus.txt file in your corpus directory. It takes roughly 15 minutes to run, depending on the corpus size and your machine.
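
The internals of create_corpus.py may differ, but a minimal sketch of this step, assuming gensim's WikiCorpus is used to extract plain text from the dump (the paths below are hypothetical), could look like:

```python
from pathlib import Path

from gensim.corpora import WikiCorpus

# Hypothetical paths, following the language data layout described above.
corpus_dir = Path.home() / ".cache" / "language_data" / "wiki-he"
dump_path = corpus_dir / "hewiki-latest-pages-articles.xml.bz2"

# An empty dictionary skips gensim's expensive vocabulary-building pass;
# only the tokenized article texts are needed here.
wiki = WikiCorpus(str(dump_path), dictionary={})

with open(corpus_dir / "corpus.txt", "w", encoding="utf-8") as out:
    for tokens in wiki.get_texts():  # each article as a list of tokens
        out.write(" ".join(tokens) + "\n")
```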

2. Create a training program

TODO

3. Run a StemmingTrainer

TODO

4. Play with your trained model

Play with your trained model using playground.ipynb.
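
If you prefer a quick script over the notebook, the same kind of exploration works anywhere the model is loaded (the path and words below are placeholders; use vocabulary from your own corpus):

```python
from generic_iterative_stemmer.models import StemmedKeyedVectors

kv = StemmedKeyedVectors.load("language_data/wiki-he/model.kv")  # placeholder path

# Analogy-style query: king - man + woman should land near queen.
print(kv.most_similar(positive=["מלך", "אישה"], negative=["גבר"], topn=3))

# Odd-one-out query.
print(kv.doesnt_match(["בוקר", "ערב", "לילה", "שולחן"]))
```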

Generic iterative stemming

TODO: Explain the algorithm.
