A word tokenizer that converts each word into a token ID and decodes token IDs back into text

Project description

LangToken

Overview

LangToken is a Python library for text preprocessing and word-level tokenization. It builds a vocabulary from text, encodes text into token IDs, and decodes token IDs back into text, keeping the mapping between words and their token IDs consistent in both directions.

The project includes the implementation, documentation, and unit tests that validate its functionality.

Features

  • Vocabulary Creation: Automatically generates a token-to-ID mapping and its inverse from text.
  • Encoding: Converts text into a list of token IDs.
  • Decoding: Converts a list of token IDs back into readable text.
  • File Integration: Reads text files and processes them for tokenization.
  • Handling Unknown Tokens: Maps unknown words to a special <|unk|> token.
  • Special Tokens: Includes <|unk|> for unknown words and <|endoftext|> for file-ending markers.
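
The features above can be illustrated with a minimal, self-contained sketch. This is not LangToken's actual implementation: the helper names (`split_words`, `build_vocab`, `encode`, `decode`), the splitting rule, and the choice to sort the vocabulary and place the special tokens last are all assumptions made for illustration.

```python
import re

UNK = "<|unk|>"        # stands in for any word missing from the vocabulary
EOT = "<|endoftext|>"  # end-of-file marker

# Assumed splitting rule: punctuation becomes its own token, whitespace is dropped.
_SPLIT = re.compile(r"([,.:;?!()]|\s)")


def split_words(text):
    """Split text into word and punctuation tokens."""
    return [piece.strip() for piece in _SPLIT.split(text) if piece.strip()]


def build_vocab(text):
    """Return (token_to_id, id_to_token); special tokens get the last IDs."""
    tokens = sorted(set(split_words(text)))
    tokens.extend([EOT, UNK])
    token_to_id = {tok: i for i, tok in enumerate(tokens)}
    id_to_token = {i: tok for tok, i in token_to_id.items()}
    return token_to_id, id_to_token


def encode(text, token_to_id):
    """Convert text to token IDs, mapping unknown words to the UNK ID."""
    unk_id = token_to_id[UNK]
    return [token_to_id.get(tok, unk_id) for tok in split_words(text)]


def decode(ids, id_to_token):
    """Join tokens with spaces, then remove the space before punctuation."""
    text = " ".join(id_to_token[i] for i in ids)
    return re.sub(r"\s+([,.;:?!])", r"\1", text)


token_to_id, id_to_token = build_vocab("Hello, world!")
known = encode("Hello, world!", token_to_id)
unknown = encode("Hello, there!", token_to_id)
print(known, decode(known, id_to_token))      # [2, 1, 3, 0] Hello, world!
print(unknown, decode(unknown, id_to_token))  # [2, 1, 5, 0] Hello, <|unk|>!
```

In this sketch the IDs follow the sorted order of the tokens, so a different corpus or a different ordering rule gives different IDs for the same words.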

Class Diagram

[Class diagram: image not included]

Use

from word_token.tokenizer import Tokenizer

# Initialize the tokenizer
tokenizer = Tokenizer()

# Pass a file to the tokenizer
tokenizer.pass_file("example.txt", "utf-8")

# Generate the vocabulary
tokenizer.fit()

# Encode a text string
encoded = tokenizer.encode("Hello, world!")
print("Encoded:", encoded)

# Decode the encoded IDs
decoded = tokenizer.decode(encoded)
print("Decoded:", decoded)

Example output (the token IDs depend on the vocabulary built from example.txt)

Encoded: [0, 1, 2, 3]
Decoded: Hello, world!
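
The file-reading step in the usage example takes a path and an encoding. As a rough sketch of what reading files and marking their ends with <|endoftext|> could look like, the snippet below uses a hypothetical helper, `read_corpus`, which is not part of LangToken's API. Placing the marker after each file's content is also an assumption.

```python
import tempfile
from pathlib import Path

EOT = "<|endoftext|>"  # end-of-file marker named in the feature list


def read_corpus(paths, encoding="utf-8"):
    """Read each file and append the end-of-text marker after its content."""
    parts = []
    for path in paths:
        text = Path(path).read_text(encoding=encoding).strip()
        parts.append(f"{text} {EOT}")
    return " ".join(parts)


# Demo with two throwaway files instead of a real corpus.
with tempfile.TemporaryDirectory() as tmp:
    first = Path(tmp) / "a.txt"
    second = Path(tmp) / "b.txt"
    first.write_text("Hello, world!", encoding="utf-8")
    second.write_text("Goodbye.", encoding="utf-8")
    corpus = read_corpus([first, second])

print(corpus)  # Hello, world! <|endoftext|> Goodbye. <|endoftext|>
```

Passing the encoding explicitly, as the usage example does with "utf-8", avoids depending on the platform's default encoding.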

Download files

Source Distribution

LangToken-0.1.1.tar.gz (497.1 kB view details)

Uploaded Source

Built Distribution

LangToken-0.1.1-py3-none-any.whl (5.0 kB view details)

Uploaded Python 3

File details

Details for the file LangToken-0.1.1.tar.gz.

File metadata

  • Download URL: LangToken-0.1.1.tar.gz
  • Upload date:
  • Size: 497.1 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.0.1 CPython/3.11.5

File hashes

Hashes for LangToken-0.1.1.tar.gz

  • SHA256: c018190beb07379be38cf62c2a7b71d6f864a42c4ed9a5f96e0404e591315a53
  • MD5: 380f7d3b67b4b315a8123497556e75c3
  • BLAKE2b-256: f2324827ba9385a16ffbe8427e10679b197a5a58edb761a2a85a6fa1fcb2efb0

File details

Details for the file LangToken-0.1.1-py3-none-any.whl.

File metadata

  • Download URL: LangToken-0.1.1-py3-none-any.whl
  • Upload date:
  • Size: 5.0 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.0.1 CPython/3.11.5

File hashes

Hashes for LangToken-0.1.1-py3-none-any.whl

  • SHA256: 78be9370c0a8767859e0c3b239d794375422332773b16f0e6ed6879cbc91df1b
  • MD5: 88dbe0ff7ba8d5b4e27edbcde1ba70a5
  • BLAKE2b-256: 11dee18135fe310e1bccc2476e9736a1d5b13e5b35af835c5c219c146577c6f8
