A word tokenizer that converts each word into a token ID and decodes token IDs back into text
LangToken
Overview
LangToken is a Python library designed for text preprocessing and tokenization. It converts text into token IDs and provides functionality to encode and decode text, ensuring consistent mapping between words and their token representations.
This project includes a robust implementation, detailed documentation, and unit tests to validate its functionality.
Features
- Vocabulary Creation: Automatically generates a token-to-ID mapping and its inverse from text.
- Encoding: Converts text into a list of token IDs.
- Decoding: Converts a list of token IDs back into readable text.
- File Integration: Reads text files and processes them for tokenization.
- Handling Unknown Tokens: Maps unknown words to a special <|unk|> token.
- Special Tokens: Includes <|unk|> for unknown words and <|endoftext|> for file-ending markers.
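The features above can be illustrated with a minimal standalone sketch. This is not LangToken's actual implementation: the function names (`split_tokens`, `build_vocab`, `encode`, `decode`), the regular expression used for splitting, and the choice to place the special tokens at the end of the vocabulary are all assumptions made for illustration.

```python
import re

UNK = "<|unk|>"        # stands in for any word missing from the vocabulary
EOT = "<|endoftext|>"  # marks the end of a file

def split_tokens(text):
    # Hypothetical splitting rule: special tokens, words, and single
    # punctuation marks each become one token.
    return re.findall(r"<\|[a-z]+\|>|\w+|[^\w\s]", text)

def build_vocab(text):
    # Sort the unique tokens so the IDs are reproducible, then append
    # the special tokens (assumed placement, not confirmed by the library).
    tokens = sorted(set(split_tokens(text)) - {EOT, UNK})
    tokens += [EOT, UNK]
    token_to_id = {tok: i for i, tok in enumerate(tokens)}
    id_to_token = {i: tok for tok, i in token_to_id.items()}
    return token_to_id, id_to_token

def encode(text, token_to_id):
    # Unknown tokens fall back to the ID of <|unk|>.
    unk_id = token_to_id[UNK]
    return [token_to_id.get(tok, unk_id) for tok in split_tokens(text)]

def decode(ids, id_to_token):
    # Join tokens with spaces, then drop the space before punctuation.
    text = " ".join(id_to_token[i] for i in ids)
    return re.sub(r"\s+([,.!?;:])", r"\1", text)

token_to_id, id_to_token = build_vocab("Hello, world!")
known = encode("Hello, world!", token_to_id)
unknown = encode("Hello, moon!", token_to_id)
print(known, decode(known, id_to_token))
print(unknown, decode(unknown, id_to_token))
```

In this sketch a word that was never seen during vocabulary creation, such as "moon", decodes to <|unk|>, so a round trip is only lossless for text covered by the vocabulary.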
Usage
from word_token.tokenizer import Tokenizer
# Initialize the tokenizer
tokenizer = Tokenizer()
# Pass a file to the tokenizer
tokenizer.pass_file("example.txt", "utf-8")
# Generate the vocabulary
tokenizer.fit()
# Encode a text string
encoded = tokenizer.encode("Hello, world!")
print("Encoded:", encoded)
# Decode the encoded IDs
decoded = tokenizer.decode(encoded)
print("Decoded:", decoded)
Output
Encoded: [0, 1, 2, 3]
Decoded: Hello, world!
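The file-reading step in the usage above can be pictured with a small standalone sketch. It assumes, without confirmation from the library, that each file is read with the given encoding and followed by an <|endoftext|> marker; `read_corpus` is a hypothetical helper, not part of LangToken.

```python
from pathlib import Path

EOT = "<|endoftext|>"  # file-ending marker described in the feature list

def read_corpus(paths, encoding="utf-8"):
    """Read each file and append the end-of-text marker after its contents."""
    parts = []
    for path in paths:
        text = Path(path).read_text(encoding=encoding)
        # Assumed convention: one marker after every file's text.
        parts.append(text.strip() + " " + EOT)
    return " ".join(parts)
```

Calling `read_corpus(["example.txt"])` would return the file's text followed by the marker, ready to be split into tokens for vocabulary creation.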
File details
Details for the file LangToken-0.1.0.tar.gz.
File metadata
- Download URL: LangToken-0.1.0.tar.gz
- Upload date:
- Size: 497.1 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.0.1 CPython/3.11.5
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 26eaba8ac4a27f4b977f15291856eaf395f9a653c8855bb495b97e218cb8fdaf |
| MD5 | c261add34483ec1db5c6e44135a6b8b5 |
| BLAKE2b-256 | 54c5af707b52fb2f2747559b86f002e1ebe5295a4c3b9b03d2c147fc5bceab9a |
File details
Details for the file LangToken-0.1.0-py3-none-any.whl.
File metadata
- Download URL: LangToken-0.1.0-py3-none-any.whl
- Upload date:
- Size: 5.1 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.0.1 CPython/3.11.5
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 762b23534bbe3772ef37254066802c0664e8e859f1e26ed93a4cef6495953752 |
| MD5 | d54e2d7f312c4bacc3be407c8c1ab6d5 |
| BLAKE2b-256 | 449bd8637ead57561751e1b7ead8ce64761847dfeacadbf2832434fd41ca6265 |