A generic language stemming utility, designed for gensim word embeddings.

Word embedding: generic iterative stemmer

A generic helper for training gensim and fasttext word embedding models.
Specifically, this repository was created to implement stemming on a Wikipedia-based corpus in Hebrew, but it will probably work for other corpus sources and languages as well.

Note that while there are sophisticated and efficient approaches to the stemming task, this repository implements a naive approach with no strict time or memory considerations (more about that in the explanation section).

Based on https://github.com/liorshk/wordembedding-hebrew.

Setup

  1. Create a Python 3 virtual environment.
  2. Install dependencies using make install (this will run tests too).

Usage

This section shows the basic flow this repository was designed to perform. It supports more complicated flows as well.

The output of the training process is a StemmedKeyedVectors object (in the form of a .kv file), which inherits from the standard gensim.models.KeyedVectors.

  1. Under the ./data folder, create a directory for your corpus (for example, wiki-he).
  2. Download a Hebrew (or any other language) dataset from Wikipedia:
    1. Go to wikimedia dumps.
    2. Download hewiki-latest-pages-articles.xml.bz2, and save it under ./data/wiki-he.
  3. Create your initial text corpus: TODO: create a notebook for that.
  4. Train the model: TODO: create a notebook for that.
  5. Play with your trained model using playground.ipynb.
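
Steps 1 and 2 above can also be scripted. The following is a minimal sketch, not part of this package: the helper names (corpus_dir, dump_file_name, dump_url, download_dump) are hypothetical, and the URL assumes the usual Wikimedia "latest" dump layout, so check the dumps site if the download fails.

```python
from pathlib import Path
from urllib.request import urlretrieve

DATA_DIR = Path("data")  # the ./data folder from step 1


def corpus_dir(name: str, data_dir: Path = DATA_DIR) -> Path:
    """Return the corpus directory under the data folder (hypothetical helper)."""
    return data_dir / name


def dump_file_name(lang: str) -> str:
    """File name of the latest articles dump for a Wikipedia language code."""
    return f"{lang}wiki-latest-pages-articles.xml.bz2"


def dump_url(lang: str) -> str:
    """Build the dump URL. Assumption: the standard Wikimedia 'latest' layout."""
    return f"https://dumps.wikimedia.org/{lang}wiki/latest/{dump_file_name(lang)}"


def download_dump(lang: str, corpus_name: str) -> Path:
    """Create the corpus directory and download the dump into it."""
    target_dir = corpus_dir(corpus_name)
    target_dir.mkdir(parents=True, exist_ok=True)
    target = target_dir / dump_file_name(lang)
    if not target.exists():  # the dump is large, so skip it if already present
        urlretrieve(dump_url(lang), target)
    return target
```

Calling download_dump("he", "wiki-he") would save the Hebrew dump under ./data/wiki-he. For another language, pass its Wikipedia language code and a matching corpus directory name.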

Generic iterative stemming

TODO: Explain the algorithm.

