
A Genetic Algorithm Tokenizer

Project description

GeneTok: Genetic Algorithm-based Tokenizer

GeneTok is a Python library that uses a genetic algorithm to build a tokenizer. It applies the principles of genetic evolution (individuals, which here are index ranges into the text, plus mutation and crossover) to generate and refine a token set from the input text(s). The approach is aimed at natural language processing (NLP) tasks where training a traditional tokenizer might be slow.

Features

  • Genetic Algorithm Foundation: Built on the Finch library, GeneTok excels in speed and efficiency, utilizing genetic algorithms for token evolution.
  • Customizable Tokenization: Users can define token size ranges and control the token evolution process, allowing for tailored tokenization strategies.
  • Fitness Function Optimization: Utilizes a fitness function to assess and select the most effective tokens, considering their frequency and relevance in the source text.
  • Serialization Support: Enables saving and loading the tokenizer's state, facilitating easy reuse and distribution.
  • Resumable Training: Training sessions can be paused and resumed with entirely different texts, offering flexibility in model development.
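
As an illustration of the fitness idea in the list above, here is a minimal sketch of one way to score a candidate token by its frequency in the source text. It is not GeneTok's actual fitness function; the name `token_fitness` and the "characters saved" formula are assumptions made for this example only.

```python
def token_fitness(token: str, text: str) -> int:
    """Hypothetical score: characters saved if each occurrence became one symbol."""
    if len(token) < 2:
        return 0  # a single character saves nothing
    occurrences = text.count(token)  # counts non-overlapping occurrences
    return occurrences * (len(token) - 1)


text = "the cat sat on the mat with the hat"
candidates = ["the ", "at", "xyz", "t"]

# Score every candidate and pick the highest-scoring one.
scores = {tok: token_fitness(tok, text) for tok in candidates}
best = max(scores, key=scores.get)
print(scores, "best:", repr(best))
```

A score of this shape rewards tokens that are both frequent and long, which is one plausible reading of "frequency and relevance"; other weightings are equally possible.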

Colab notebooks:

  • Simple example: genetok. A quick overview of the library that trains a tokenizer on a few GB of text.

Installation

GeneTok requires Python 3.6 or later. You can install GeneTok directly from the source code:

git clone https://github.com/yourusername/genetok.git
cd genetok
pip install .

Quick Start

Here's a quick example to get you started with GeneTok:

from genetok.tokenizer import GeneticTokenizer

# Initialize the GeneticTokenizer
tokenizer = GeneticTokenizer(step_epochs=4)

# Sample text
text = "This is a sample text for the GeneticTokenizer."

# Evolve the tokenizer based on the sample text.
# Pass a list of texts; keep each under roughly 10,000 characters for best speed.
tokenizer.evolve([text])

# Tokenize the text
tokens = tokenizer.tokenize(text)
print("Tokens:", tokens)

# Detokenize the tokens back to text
original_text = tokenizer.detokenize(tokens)
print("Original Text:", original_text)
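
Since the comment above suggests keeping each text under roughly 10,000 characters, a long document may need to be split before training. The helper below is a minimal sketch of one way to do that; `chunk_text` is a hypothetical name and is not part of GeneTok.

```python
from typing import List


def chunk_text(text: str, max_chars: int = 10_000) -> List[str]:
    """Split text into pieces of at most max_chars, preferring whitespace breaks."""
    if max_chars < 1:
        raise ValueError("max_chars must be positive")
    chunks = []
    start = 0
    while start < len(text):
        end = min(start + max_chars, len(text))
        if end < len(text):
            # Back up to the last space or newline inside the window, if any.
            cut = max(text.rfind(" ", start, end), text.rfind("\n", start, end))
            if cut > start:
                end = cut + 1  # keep the whitespace with the left-hand chunk
        chunks.append(text[start:end])
        start = end
    return chunks


long_text = "lorem ipsum dolor sit amet " * 2000
chunks = chunk_text(long_text)
print(len(chunks), "chunks, longest:", max(len(c) for c in chunks))
```

The resulting list can then be passed to `tokenizer.evolve(chunks)` in place of `[text]`. Joining the chunks reproduces the original text exactly, so nothing is lost by splitting.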

How It Works

GeneTok uses a genetic algorithm to evolve a set of tokens that are most effective for tokenizing a given text. It starts with a random set of tokens and iteratively applies genetic operations such as mutation and crossover to evolve them. Each token is represented simply by its start and end index in a source text, and mutation changes these ranges. Every time a good token is found, it is added to the list. The fitness of each token is determined by its frequency and utility in the source text, which guides selection towards more effective tokenization strategies.
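
The loop described above can be sketched in plain Python. This is an illustrative toy, not GeneTok's implementation (which is built on Finch): every function name, the fitness formula, the mutation step size, and the selection scheme are assumptions chosen to keep the example short.

```python
import random
from typing import List, Set, Tuple

Range = Tuple[int, int]  # (start, end) indices into the source text


def fitness(text: str, span: Range) -> int:
    """Assumed fitness: occurrences times characters saved per occurrence."""
    token = text[span[0]:span[1]]
    return text.count(token) * (len(token) - 1)


def repair(text: str, span: Range, min_len: int, max_len: int) -> Range:
    """Clamp a range so it stays inside the text and within the size limits."""
    start, end = span
    start = max(0, min(start, len(text) - min_len))
    end = max(start + min_len, min(end, start + max_len, len(text)))
    return (start, end)


def mutate(text: str, span: Range, rng: random.Random,
           min_len: int, max_len: int) -> Range:
    """Nudge both boundaries by a small random amount."""
    moved = (span[0] + rng.randint(-2, 2), span[1] + rng.randint(-2, 2))
    return repair(text, moved, min_len, max_len)


def crossover(text: str, a: Range, b: Range, min_len: int, max_len: int) -> Range:
    """Child takes its start from one parent and its length from the other."""
    return repair(text, (a[0], a[0] + (b[1] - b[0])), min_len, max_len)


def evolve_tokens(text: str, generations: int = 30, pop_size: int = 40,
                  min_len: int = 2, max_len: int = 6, seed: int = 0) -> Set[str]:
    if len(text) < max_len:
        raise ValueError("text is shorter than max_len")
    rng = random.Random(seed)
    # Start from random index ranges.
    population: List[Range] = []
    for _ in range(pop_size):
        length = rng.randint(min_len, max_len)
        start = rng.randint(0, len(text) - length)
        population.append((start, start + length))

    vocab: Set[str] = set()
    for _ in range(generations):
        # Selection: keep the fitter half of the population.
        population.sort(key=lambda s: fitness(text, s), reverse=True)
        survivors = population[: pop_size // 2]
        # "Good" tokens (here: any survivor that repeats) join the vocabulary.
        for start, end in survivors:
            token = text[start:end]
            if text.count(token) > 1:
                vocab.add(token)
        # Refill the population with mutated crossovers of survivors.
        children = []
        while len(survivors) + len(children) < pop_size:
            a, b = rng.sample(survivors, 2)
            child = crossover(text, a, b, min_len, max_len)
            children.append(mutate(text, child, rng, min_len, max_len))
        population = survivors + children
    return vocab


sample = "the cat sat on the mat with the hat " * 3
vocab = evolve_tokens(sample)
print(sorted(vocab, key=len, reverse=True)[:10])
```

Representing individuals as index ranges means a candidate token is always a real substring of the source text, so mutation and crossover can never produce a token that does not occur in it.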

Drawbacks:

  • Speed has its costs: the tokens may not be the global best, but training is much faster than with typical tokenizers.
  • The library is far from complete: there are more features to add and bugs to weed out.

Example Implementation

For a detailed example of how to use GeneTok on a larger dataset, refer to implimentation.py in the repository. This example demonstrates loading a dataset, processing the text, evolving the tokenizer, and then using it to tokenize new texts.

