# LangToken

A word tokenizer that converts each word into a token ID and decodes it back.

## Overview
LangToken is a Python library designed for text preprocessing and tokenization. It converts text into token IDs and provides functionality to encode and decode text, ensuring consistent mapping between words and their token representations.
This project includes a robust implementation, detailed documentation, and unit tests to validate its functionality.
## Features
- Vocabulary Creation: Automatically generates a token-to-ID mapping and its inverse from text.
- Encoding: Converts text into a list of token IDs.
- Decoding: Converts a list of token IDs back into readable text.
- File Integration: Reads text files and processes them for tokenization.
- Handling Unknown Tokens: Maps unknown words to a special `<|unk|>` token.
- Special Tokens: Includes `<|unk|>` for unknown words and `<|endoftext|>` as an end-of-text marker.
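The vocabulary-creation feature can be illustrated with a minimal sketch. This is not the library's actual code; the splitting regex, sort order, and special-token placement are assumptions for illustration:

```python
import re

def build_vocab(text):
    # Split into words and punctuation marks (illustrative preprocessing;
    # LangToken's exact rules may differ).
    tokens = re.findall(r"\w+|[^\w\s]", text)
    # Assign a unique integer to each distinct token, in sorted order.
    vocab = {tok: i for i, tok in enumerate(sorted(set(tokens)))}
    # Append the special tokens described above.
    vocab["<|unk|>"] = len(vocab)
    vocab["<|endoftext|>"] = len(vocab)
    return vocab

vocab = build_vocab("Hello, world! How are you?")
```

Every distinct word and punctuation symbol receives its own ID, and the two special tokens are appended at the end of the vocabulary.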
## Methods

| Method | Description | Parameters | Returns |
|---|---|---|---|
| `__init__()` | Initializes the Tokenizer instance. Sets up the vocabulary dictionaries (`str_to_int`, `int_to_str`) and the text container (`text`). | None | None |
| `pass_file()` | Reads a text file into the `text` attribute. | `file_path` (str): path to the text file. `enc` (str): file encoding (e.g., `"utf-8"`). | None |
| `fit()` | Creates a vocabulary from the `text` attribute by preprocessing it and assigning a unique integer to each word or symbol. Adds the special tokens. | None | None |
| `get_token()` | Retrieves the vocabulary as a dictionary mapping tokens to integers. | None | `dict`: the vocabulary dictionary. |
| `get_token_decoder()` | Retrieves the inverse vocabulary dictionary (mapping integers to tokens). | None | `dict`: the inverse vocabulary. |
| `encode()` | Converts text into a list of integers based on the vocabulary. Unknown words are replaced by `<|unk|>`. | `text` (str): the input text to encode. | `list`: list of integer token IDs. |
| `decode()` | Converts a list of integers back into text. | `ids` (list): list of integer token IDs. | `str`: the decoded text. |
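The API in the table above could be implemented roughly as follows. This is a hedged sketch, not the published source: the preprocessing regex and the whitespace handling in `decode()` are assumptions:

```python
import re

class Tokenizer:
    """Minimal sketch of the API described above; the published
    implementation may differ in preprocessing details."""

    def __init__(self):
        # Vocabulary dictionaries and the raw text container.
        self.str_to_int = {}
        self.int_to_str = {}
        self.text = ""

    def pass_file(self, file_path, enc):
        # Read the whole file into the text attribute.
        with open(file_path, encoding=enc) as f:
            self.text = f.read()

    def fit(self):
        # Split into words and punctuation, then assign unique integers.
        tokens = re.findall(r"\w+|[^\w\s]", self.text)
        uniq = sorted(set(tokens)) + ["<|unk|>", "<|endoftext|>"]
        self.str_to_int = {tok: i for i, tok in enumerate(uniq)}
        self.int_to_str = {i: tok for tok, i in self.str_to_int.items()}

    def get_token(self):
        return self.str_to_int

    def get_token_decoder(self):
        return self.int_to_str

    def encode(self, text):
        # Unknown words fall back to the <|unk|> ID.
        unk = self.str_to_int["<|unk|>"]
        return [self.str_to_int.get(tok, unk)
                for tok in re.findall(r"\w+|[^\w\s]", text)]

    def decode(self, ids):
        # Join tokens with spaces, then tighten spaces before punctuation.
        out = " ".join(self.int_to_str[i] for i in ids)
        return re.sub(r"\s+([^\w\s])", r"\1", out)
```

Note that `decode()` cannot always reproduce the original text byte-for-byte: a word replaced by `<|unk|>` during encoding decodes back as the literal `<|unk|>` token.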
## Usage

```python
from LangToken.tokenizer import Tokenizer

# Initialize the tokenizer
tokenizer = Tokenizer()

# Pass a file to read text (example: 'sample.txt')
tokenizer.pass_file("sample.txt", "utf-8")

# Fit the tokenizer to create a vocabulary
tokenizer.fit()

# Get the vocabulary as a dictionary
vocabulary = tokenizer.get_token()
print("Vocabulary:", vocabulary)

# Encode a new text using the vocabulary
text_to_encode = "Hello, how are you?"
encoded_text = tokenizer.encode(text_to_encode)
print("Encoded Text:", encoded_text)

# Decode the encoded text back to its original form
decoded_text = tokenizer.decode(encoded_text)
print("Decoded Text:", decoded_text)
```
Contents of `sample.txt`:

```text
Hello, world! How are you?
```
Output:

```text
Vocabulary: {'!': 0, ',': 1, 'Hello': 2, 'How': 3, 'are': 4, 'world': 5, 'you': 6, '<|unk|>': 7, '<|endoftext|>': 8}
Encoded Text: [2, 1, 3, 4, 6, 0]
Decoded Text: Hello, how are you?
```
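Words that never appeared in `sample.txt` map to the `<|unk|>` ID. A small sketch using the vocabulary printed above (the naive whitespace split and the `encode` helper are illustrative, not the library's code):

```python
# Vocabulary as printed in the output above.
vocab = {'!': 0, ',': 1, 'Hello': 2, 'How': 3, 'are': 4,
         'world': 5, 'you': 6, '<|unk|>': 7, '<|endoftext|>': 8}

def encode(text, vocab):
    # Naive whitespace split, for illustration only.
    return [vocab.get(tok, vocab['<|unk|>']) for tok in text.split()]

print(encode("Hello friend", vocab))  # → [2, 7]: 'friend' is unknown
```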
## Download files

Download the file for your platform.
## File details: LangToken-0.1.3.tar.gz

File metadata:
- Download URL: LangToken-0.1.3.tar.gz
- Upload date:
- Size: 498.0 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.0.1 CPython/3.11.5
File hashes:

| Algorithm | Hash digest |
|---|---|
| SHA256 | `2442afe98984a3cbfbd374c1fd5cfee9d71db6dcafc83a3407ebc1ac1dce99bb` |
| MD5 | `75adf5c0761205ac1aaa1121b77bb741` |
| BLAKE2b-256 | `e78557e404056a52e7c04e106ba254007eb4053230e8cc665bfed04c6c6fe94a` |
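The published digests can be checked against a downloaded archive. A minimal sketch, assuming the file sits in the current directory (`sha256_of` is a hypothetical helper, not part of LangToken):

```python
import hashlib

def sha256_of(path):
    # Stream the file in chunks so large archives aren't loaded whole.
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(8192), b""):
            h.update(chunk)
    return h.hexdigest()

expected = "2442afe98984a3cbfbd374c1fd5cfee9d71db6dcafc83a3407ebc1ac1dce99bb"
# Uncomment after downloading the sdist:
# assert sha256_of("LangToken-0.1.3.tar.gz") == expected
```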
## File details: LangToken-0.1.3-py3-none-any.whl

File metadata:
- Download URL: LangToken-0.1.3-py3-none-any.whl
- Upload date:
- Size: 5.8 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.0.1 CPython/3.11.5
File hashes:

| Algorithm | Hash digest |
|---|---|
| SHA256 | `3c0d1c832c633ebe3f22ff260129b9bfa7f42b427806ade483bf3404e1391512` |
| MD5 | `8d84327a54e9a80657a9f598182528c6` |
| BLAKE2b-256 | `493162051aa48121403f69bd13719ef964e055376e5fcf3230bf7f092e583367` |