A word tokenizer that converts each word into a token and decodes tokens back into text

Project description

LangToken

Overview

LangToken is a Python library designed for text preprocessing and tokenization. It converts text into token IDs and provides functionality to encode and decode text, ensuring consistent mapping between words and their token representations.

This project includes a robust implementation, detailed documentation, and unit tests to validate its functionality.

Features

  • Vocabulary Creation: Automatically generates a token-to-ID mapping and its inverse from text.
  • Encoding: Converts text into a list of token IDs.
  • Decoding: Converts a list of token IDs back into readable text.
  • File Integration: Reads text files and processes them for tokenization.
  • Handling Unknown Tokens: Maps unknown words to a special <|unk|> token.
  • Special Tokens: Includes <|unk|> for unknown words and <|endoftext|> for file-ending markers.
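To make the behavior above concrete, here is a minimal, self-contained sketch of a word-level tokenizer of this kind. It is an illustration only, not LangToken's actual implementation: the splitting regex, the sorted vocabulary ordering, and the punctuation re-attachment in `decode` are all assumptions.

```python
import re

# Assumed splitting rule: separate words from punctuation, keep both as tokens.
SPLIT = r'([,.:;?_!"()\']|--|\s)'

def build_vocab(text):
    tokens = [t for t in re.split(SPLIT, text) if t.strip()]
    vocab = {tok: i for i, tok in enumerate(sorted(set(tokens)))}
    # Special tokens: <|unk|> for unknown words, <|endoftext|> as an end marker.
    for special in ("<|unk|>", "<|endoftext|>"):
        vocab[special] = len(vocab)
    return vocab

def encode(text, vocab):
    tokens = [t for t in re.split(SPLIT, text) if t.strip()]
    # Unknown words fall back to the <|unk|> token ID.
    return [vocab.get(t, vocab["<|unk|>"]) for t in tokens]

def decode(ids, vocab):
    inv = {i: t for t, i in vocab.items()}
    text = " ".join(inv[i] for i in ids)
    # Re-attach punctuation to the preceding word.
    return re.sub(r'\s+([,.?!"()\'])', r'\1', text)

vocab = build_vocab("Hello, world! How are you?")
ids = encode("Hello, how are you?", vocab)
print(ids)                 # lowercase "how" is not in the vocabulary
print(decode(ids, vocab))  # so it round-trips as <|unk|>
```

Note how the unknown word surfaces in the decoded text: with this sketch, `"Hello, how are you?"` decodes to `"Hello, <|unk|> are you?"`, because the vocabulary was built from a text containing only the capitalized `How`.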

Methods

| Method | Description | Parameters | Returns |
|---|---|---|---|
| `__init__()` | Initializes the Tokenizer instance; sets up the vocabulary dictionaries (`str_to_int`, `int_to_str`) and the text container (`text`). | None | None |
| `pass_file()` | Reads a text file into the `text` attribute. | `file_path` (str): path to the text file. `enc` (str): file encoding (e.g., `"utf-8"`). | None |
| `fit()` | Creates a vocabulary from the `text` attribute by preprocessing it and assigning a unique integer to each word or symbol. Adds the special tokens. | None | None |
| `get_token()` | Retrieves the vocabulary as a dictionary mapping tokens to integers. | None | dict: the vocabulary dictionary. |
| `get_token_decoder()` | Retrieves the inverse vocabulary dictionary (mapping integers to tokens). | None | dict: the inverse vocabulary. |
| `encode()` | Converts a given text into a list of integers based on the created vocabulary. Unknown words are replaced by `<\|unk\|>`. | `text` (str): the input text to encode. | list: list of integer tokens. |
| `decode()` | Converts a list of integers back into the original text. | `ids` (list): list of integer tokens. | str: the decoded text. |

Use

from LangToken.tokenizer import Tokenizer

# Initialize the tokenizer
tokenizer = Tokenizer()

# Pass a file to read text (example: 'sample.txt')
tokenizer.pass_file("sample.txt", "utf-8")

# Fit the tokenizer to create a vocabulary
tokenizer.fit()

# Get the vocabulary as a dictionary
vocabulary = tokenizer.get_token()
print("Vocabulary:", vocabulary)

# Encode a new text using the vocabulary
text_to_encode = "Hello, how are you?"
encoded_text = tokenizer.encode(text_to_encode)
print("Encoded Text:", encoded_text)

# Decode the encoded text back to its original form
decoded_text = tokenizer.decode(encoded_text)
print("Decoded Text:", decoded_text)

sample.txt

Hello, world! How are you?

Output

Vocabulary: {'!': 0, ',': 1, 'Hello': 2, 'How': 3, 'are': 4, 'world': 5, 'you': 6, '<|unk|>': 7, '<|endoftext|>': 8}
Encoded Text: [2, 1, 3, 4, 6, 0]
Decoded Text: Hello, how are you?

Class Diagram

(class diagram image not reproduced here)

Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

LangToken-0.1.3.tar.gz (498.0 kB view details)

Uploaded Source

Built Distribution

If you're not sure about the file name format, learn more about wheel file names.

LangToken-0.1.3-py3-none-any.whl (5.8 kB view details)

Uploaded Python 3

File details

Details for the file LangToken-0.1.3.tar.gz.

File metadata

  • Download URL: LangToken-0.1.3.tar.gz
  • Upload date:
  • Size: 498.0 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.0.1 CPython/3.11.5

File hashes

Hashes for LangToken-0.1.3.tar.gz
Algorithm Hash digest
SHA256 2442afe98984a3cbfbd374c1fd5cfee9d71db6dcafc83a3407ebc1ac1dce99bb
MD5 75adf5c0761205ac1aaa1121b77bb741
BLAKE2b-256 e78557e404056a52e7c04e106ba254007eb4053230e8cc665bfed04c6c6fe94a

See more details on using hashes here.
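As a sketch of how a downloaded archive can be checked against a published digest, the file is hashed in chunks and compared with the expected hex string (the file path and digest below are placeholders, not the real values):

```python
import hashlib

def sha256_of(path):
    # Hash the file in chunks so large archives don't need to fit in memory.
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(8192), b""):
            h.update(chunk)
    return h.hexdigest()

# Compare against the digest published on this page, e.g.:
# sha256_of("LangToken-0.1.3.tar.gz") == "2442afe9..."  # hypothetical check
```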

File details

Details for the file LangToken-0.1.3-py3-none-any.whl.

File metadata

  • Download URL: LangToken-0.1.3-py3-none-any.whl
  • Upload date:
  • Size: 5.8 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.0.1 CPython/3.11.5

File hashes

Hashes for LangToken-0.1.3-py3-none-any.whl
Algorithm Hash digest
SHA256 3c0d1c832c633ebe3f22ff260129b9bfa7f42b427806ade483bf3404e1391512
MD5 8d84327a54e9a80657a9f598182528c6
BLAKE2b-256 493162051aa48121403f69bd13719ef964e055376e5fcf3230bf7f092e583367

See more details on using hashes here.
