A generic language stemming utility, dedicated to gensim word embeddings.

Word embedding: generic iterative stemmer

A generic helper for training gensim and fasttext word embedding models.
Specifically, this repository was created to implement stemming on a Wikipedia-based corpus in Hebrew, but it will probably work for other corpus sources and languages as well.

Note that while there are sophisticated and efficient approaches to the stemming task, this repository implements a naive approach with no strict time or memory considerations (more about that in the explanation section).

Based on https://github.com/liorshk/wordembedding-hebrew.

Setup

  1. Create a Python 3 virtual environment.
  2. Install dependencies using make install (this will run tests too).

Usage

This section shows the basic flow this repository was designed to perform. It supports more complicated flows as well.

The output of the training process is a StemmedKeyedVectors object (saved as a .kv file), which inherits from the standard gensim.models.KeyedVectors.

  1. Under the ./data folder, create a directory for your corpus (for example, wiki-he).

  2. Download a Hebrew (or any other language) dataset from Wikipedia:

    1. Go to Wikimedia dumps.
    2. Download hewiki-latest-pages-articles.xml.bz2, and save it under ./data/wiki-he.
  3. Create your initial text corpus:

    TODO: create a notebook for that.

  4. Train the model:

    TODO: create a notebook for that.

  5. Play with your trained model using playground.ipynb.
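
The notebooks for steps 3 and 4 are not available yet, so the following is only a minimal sketch of steps 1 to 4 using plain gensim. It is not this package's API: every function name below is hypothetical, the hyperparameters are illustrative, and the training step produces ordinary (unstemmed) KeyedVectors rather than a StemmedKeyedVectors object. It assumes gensim 4.x is installed and that the dump is served from the standard dumps.wikimedia.org layout.

```python
import urllib.request
from pathlib import Path

DUMPS_BASE_URL = "https://dumps.wikimedia.org"


def dump_file_name(language_code: str) -> str:
    """File name of the latest articles dump for a Wikipedia language edition."""
    return f"{language_code}wiki-latest-pages-articles.xml.bz2"


def dump_url(language_code: str) -> str:
    """URL of the latest articles dump (assumes the standard dumps layout)."""
    return f"{DUMPS_BASE_URL}/{language_code}wiki/latest/{dump_file_name(language_code)}"


def corpus_dir(language_code: str, data_root: str = "./data") -> Path:
    """Corpus directory, following the wiki-he naming used in step 1."""
    return Path(data_root) / f"wiki-{language_code}"


def download_dump(language_code: str, data_root: str = "./data") -> Path:
    """Steps 1-2: create the corpus directory and download the dump into it."""
    target_dir = corpus_dir(language_code, data_root)
    target_dir.mkdir(parents=True, exist_ok=True)
    target = target_dir / dump_file_name(language_code)
    if not target.exists():  # dumps are large, so skip files already on disk
        urllib.request.urlretrieve(dump_url(language_code), str(target))
    return target


def build_text_corpus(dump_path: Path, corpus_path: Path) -> None:
    """Step 3 (sketch): write one article per line, tokens separated by spaces."""
    from gensim.corpora import WikiCorpus  # lazy import: requires gensim

    # An empty dictionary skips the (slow) vocabulary-building pass.
    wiki = WikiCorpus(str(dump_path), dictionary={})
    with open(corpus_path, "w", encoding="utf-8") as out:
        for tokens in wiki.get_texts():
            out.write(" ".join(tokens) + "\n")


def train_plain_word2vec(corpus_path: Path, output_path: Path) -> None:
    """Step 4 (sketch, no stemming): train Word2Vec and save its vectors as .kv."""
    from gensim.models import Word2Vec
    from gensim.models.word2vec import LineSentence

    model = Word2Vec(
        sentences=LineSentence(str(corpus_path)),
        vector_size=100,  # illustrative values, not the project's settings
        window=5,
        min_count=5,
        workers=4,
    )
    model.wv.save(str(output_path))
```

Calling download_dump("he") would fetch the Hebrew dump into ./data/wiki-he, after which build_text_corpus and train_plain_word2vec can be run on the resulting paths. Downloading and parsing a full dump takes a long time and a lot of disk space.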

Generic iterative stemming

TODO: Explain the algorithm.
