A word tokenizer that maps each word to a token ID and decodes token IDs back into text

Project description

LangToken

Overview

LangToken is a Python library designed for text preprocessing and tokenization. It converts text into token IDs and provides functionality to encode and decode text, ensuring consistent mapping between words and their token representations.

The project ships with documentation and unit tests that validate its functionality.

Features

  • Vocabulary Creation: Automatically generates a token-to-ID mapping and its inverse from text.
  • Encoding: Converts text into a list of token IDs.
  • Decoding: Converts a list of token IDs back into readable text.
  • File Integration: Reads text files and processes them for tokenization.
  • Handling Unknown Tokens: Maps unknown words to a special <|unk|> token.
  • Special Tokens: Includes <|unk|> for unknown words and <|endoftext|> for file-ending markers.
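The features above can be illustrated with a minimal word-level tokenizer. This is a sketch under stated assumptions, not LangToken's actual implementation: the function names (`split_words`, `build_vocab`, `encode`, `decode`), the splitting regex, and the choice to give the special tokens the last IDs are all hypothetical.

```python
import re

UNK = "<|unk|>"        # stands in for any word missing from the vocabulary
EOT = "<|endoftext|>"  # marks the end of a file

# Assumed splitting rule: break on whitespace and keep punctuation as tokens.
_SPLIT = re.compile(r'([,.:;?_!"()\']|--|\s)')


def split_words(text):
    """Return the non-empty, non-whitespace pieces of the text."""
    return [piece.strip() for piece in _SPLIT.split(text) if piece.strip()]


def build_vocab(text):
    """Build the token-to-ID mapping and its inverse from the text."""
    tokens = sorted(set(split_words(text)))
    tokens.extend([EOT, UNK])  # assumption: special tokens take the last IDs
    token_to_id = {tok: i for i, tok in enumerate(tokens)}
    id_to_token = {i: tok for tok, i in token_to_id.items()}
    return token_to_id, id_to_token


def encode(text, token_to_id):
    """Convert text to token IDs, mapping unknown words to the UNK ID."""
    unk_id = token_to_id[UNK]
    return [token_to_id.get(tok, unk_id) for tok in split_words(text)]


def decode(ids, id_to_token):
    """Convert token IDs back to text and reattach punctuation."""
    text = " ".join(id_to_token[i] for i in ids)
    return re.sub(r"\s+([,.:;?!])", r"\1", text)


token_to_id, id_to_token = build_vocab("Hello, world!")
ids = encode("Hello, world!", token_to_id)
print(ids)                       # [2, 1, 3, 0]
print(decode(ids, id_to_token))  # Hello, world!
print(decode(encode("Hello, moon!", token_to_id), id_to_token))  # Hello, <|unk|>!
```

In this sketch, IDs follow the sorted order of the distinct tokens, so they change whenever the source text changes. A word that was not seen while building the vocabulary decodes to `<|unk|>`, so the round trip is lossless only for known words.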

Usage

from LangToken.tokenizer import Tokenizer

# Initialize the tokenizer
tokenizer = Tokenizer()

# Pass a file to the tokenizer
tokenizer.pass_file("example.txt", "utf-8")

# Generate the vocabulary
tokenizer.fit()

# Encode a text string
encoded = tokenizer.encode("Hello, world!")
print("Encoded:", encoded)

# Decode the encoded IDs
decoded = tokenizer.decode(encoded)
print("Decoded:", decoded)

Example output (the actual IDs depend on the vocabulary built from example.txt)

Encoded: [0, 1, 2, 3]
Decoded: Hello, world!

Download files

Source Distribution

LangToken-0.1.2.tar.gz (497.0 kB)

Uploaded Source

Built Distribution

LangToken-0.1.2-py3-none-any.whl (5.0 kB)

Uploaded Python 3

File details

Details for the file LangToken-0.1.2.tar.gz.

File metadata

  • Download URL: LangToken-0.1.2.tar.gz
  • Upload date:
  • Size: 497.0 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.0.1 CPython/3.11.5

File hashes

Hashes for LangToken-0.1.2.tar.gz
Algorithm     Hash digest
SHA256        ef1be5ab4086f3db979b76f3c45dee8694882021f4c8d534d6fc1f2232ec9051
MD5           cbc894bf030a29436b76aa2e71d4011d
BLAKE2b-256   8a9ce58c1c81120341a24240716bc4f795e90c292a86b05b7a7c15034f05183b
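
A downloaded archive can be checked against the SHA256 digest listed above. The sketch below uses only Python's standard `hashlib` module. The helper names `file_digest` and `matches_published` are illustrative, and the archive is assumed to sit in the current directory.

```python
import hashlib
import os

# SHA256 digest published for LangToken-0.1.2.tar.gz
EXPECTED_SHA256 = "ef1be5ab4086f3db979b76f3c45dee8694882021f4c8d534d6fc1f2232ec9051"


def file_digest(path, algorithm="sha256", chunk_size=65536):
    """Hash a file in chunks so the whole archive need not fit in memory."""
    hasher = hashlib.new(algorithm)
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            hasher.update(chunk)
    return hasher.hexdigest()


def matches_published(path, expected_hex, algorithm="sha256"):
    """Return True when the file's digest equals the published one."""
    return file_digest(path, algorithm) == expected_hex.strip().lower()


if __name__ == "__main__":
    archive = "LangToken-0.1.2.tar.gz"  # assumed download location
    if os.path.exists(archive):
        print("digest matches:", matches_published(archive, EXPECTED_SHA256))
```

A mismatch means the file differs from the one that was uploaded, so it should not be installed.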

File details

Details for the file LangToken-0.1.2-py3-none-any.whl.

File metadata

  • Download URL: LangToken-0.1.2-py3-none-any.whl
  • Upload date:
  • Size: 5.0 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.0.1 CPython/3.11.5

File hashes

Hashes for LangToken-0.1.2-py3-none-any.whl
Algorithm     Hash digest
SHA256        46cb994ff68fcccd2a379a776f73bf664b420172170f0be783a7486a8c2cb0c3
MD5           0e88fa82b9b3a44aa7f47560a1bfd7ab
BLAKE2b-256   7c6ef55cbc9f6e90dfafa439df918cdca9292411a4629277a2ec23e403b678f0
