Fast BPE tokenizer used for experimental models

Shredword

A fast and efficient tokenizer library for natural language processing tasks, built with Python and an optimized C backend.

Features

  • High Performance: Fast tokenization powered by optimized C libraries
  • Multiple Encodings: Support for various tokenization models and vocabularies
  • Flexible API: Easy-to-use Python interface with comprehensive functionality
  • Special Tokens: Built-in support for special tokens and custom vocabularies
  • Fallback Mechanisms: Robust error handling with fallback tokenization
  • BPE Support: Byte Pair Encoding implementation for subword tokenization
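
For readers new to Byte Pair Encoding, the idea behind the last feature can be sketched in a few lines of pure Python. This is a toy, byte-level illustration of the general technique, not Shredword's C implementation; the function names `train_bpe`, `apply_merge`, and `bpe_encode` are made up for this example.

```python
from collections import Counter


def apply_merge(tokens: list[bytes], pair: tuple[bytes, bytes]) -> list[bytes]:
    """Replace every adjacent occurrence of `pair` with one merged token."""
    merged, i = [], 0
    while i < len(tokens):
        if i + 1 < len(tokens) and (tokens[i], tokens[i + 1]) == pair:
            merged.append(tokens[i] + tokens[i + 1])
            i += 2
        else:
            merged.append(tokens[i])
            i += 1
    return merged


def train_bpe(text: str, num_merges: int) -> list[tuple[bytes, bytes]]:
    """Learn up to `num_merges` merge rules from `text` (toy trainer)."""
    # Start from single bytes so any input can be represented.
    tokens = [bytes([b]) for b in text.encode("utf-8")]
    merges = []
    for _ in range(num_merges):
        pairs = Counter(zip(tokens, tokens[1:]))
        if not pairs:
            break
        best, count = pairs.most_common(1)[0]
        if count < 2:  # nothing repeats, so merging would not compress
            break
        merges.append(best)
        tokens = apply_merge(tokens, best)
    return merges


def bpe_encode(text: str, merges: list[tuple[bytes, bytes]]) -> list[bytes]:
    """Split `text` into bytes, then replay the learned merges in order."""
    tokens = [bytes([b]) for b in text.encode("utf-8")]
    for pair in merges:
        tokens = apply_merge(tokens, pair)
    return tokens


merges = train_bpe("low lower lowest", num_merges=2)
pieces = bpe_encode("lowest", merges)
print(pieces)  # [b'low', b'e', b's', b't']
```

A real tokenizer maps each merged piece to an integer ID and usually pre-splits text before merging; both steps are left out here to keep the sketch short.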

Installation

pip install shredword

Quick Start

from shred import load_encoding

# Load a tokenizer
tokenizer = load_encoding("pre_16k")

# Encode text to tokens
tokens = tokenizer.encode("Hello, world!")
print(tokens)  # e.g. [10478, 10408, 10416, 10416, ...]

# Decode tokens back to text
text = tokenizer.decode(tokens)
print(text)  # "Hello, world!"

# Get vocabulary information
print(f"Vocabulary size: {tokenizer.vocab_size}")
print(f"Special tokens: {tokenizer.special_tokens}")
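
A quick way to sanity-check an encoding is to confirm that decoding the encoded tokens returns the original text. The sketch below relies only on the `encode` and `decode` methods shown in the Quick Start. `ByteTokenizer` is a hypothetical stand-in so the example runs without Shredword installed, and `roundtrip_failures` is an illustrative helper, not part of the library. Whether a given Shredword encoding round-trips every input is not stated here, so treat this as a check rather than a guarantee.

```python
class ByteTokenizer:
    """Hypothetical stand-in with the same encode/decode shape as the Quick Start."""

    def encode(self, text: str) -> list[int]:
        return list(text.encode("utf-8"))

    def decode(self, tokens: list[int]) -> str:
        return bytes(tokens).decode("utf-8")


def roundtrip_failures(tokenizer, samples: list[str]) -> list[str]:
    """Return the samples for which decode(encode(s)) differs from s."""
    return [s for s in samples if tokenizer.decode(tokenizer.encode(s)) != s]


samples = ["Hello, world!", "naïve café", "", "tabs\tand\nnewlines"]
failures = roundtrip_failures(ByteTokenizer(), samples)
print(failures)  # [] for the byte-level stand-in

# With Shredword installed, the same helper should accept the Quick Start object:
#   from shred import load_encoding
#   print(roundtrip_failures(load_encoding("pre_16k"), samples))
```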

Documentation

For detailed usage instructions, API reference, and examples, please see our User Documentation.

Supported Encodings

Shredword supports various pre-trained tokenization models. The library automatically downloads vocabulary files from the official repository when needed.
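
"Downloads when needed" usually means a cache-then-fetch pattern: look for a local copy first and go to the network only on a miss. The sketch below illustrates that pattern in general terms. It is not Shredword's actual loader, and the names `ensure_vocab` and `fake_fetch`, the `.vocab` suffix, and the cache layout are all assumptions made for the example.

```python
import tempfile
from pathlib import Path
from typing import Callable


def ensure_vocab(name: str, cache_dir: Path, fetch: Callable[[str], bytes]) -> Path:
    """Return a local path for `name`, calling `fetch` only on a cache miss."""
    cache_dir.mkdir(parents=True, exist_ok=True)
    path = cache_dir / f"{name}.vocab"  # hypothetical file naming
    if not path.exists():
        path.write_bytes(fetch(name))
    return path


calls = []


def fake_fetch(name: str) -> bytes:
    """Stand-in for a network download; records each call."""
    calls.append(name)
    return b"placeholder vocabulary bytes"


with tempfile.TemporaryDirectory() as tmp:
    first = ensure_vocab("pre_16k", Path(tmp), fake_fetch)
    second = ensure_vocab("pre_16k", Path(tmp), fake_fetch)
    fetch_count = len(calls)

print(fetch_count)  # 1: the second call is served from the cache
```

Passing the fetcher in as a parameter keeps the caching logic testable without network access.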

Contributing

We welcome contributions! Please feel free to submit issues, feature requests, or pull requests.

Development Setup

  1. Clone the repository
  2. Install development dependencies: pip install -r requirements.txt
  3. Run tests: python -m pytest

Guidelines

  • Follow PEP 8 style guidelines
  • Add tests for new features
  • Update documentation as needed
  • Ensure all tests pass before submitting PRs

License

This project is licensed under the Apache License 2.0. See the LICENSE file for details.

Acknowledgments

Built with performance and simplicity in mind for the NLP community.


Note: This library requires a C/C++ compiler for optimal performance. Fallback Python implementations are used when the C/C++ extensions are not available.
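
The fallback described in the note is commonly implemented by trying to import the compiled module and switching to pure Python if that fails. The sketch below shows the general pattern only. The module name `_hypothetical_fast_tokenizer` and the functions `load_backend` and `_py_tokenize` are invented for illustration and do not reflect Shredword's internal layout.

```python
import importlib


def _py_tokenize(text: str) -> list[int]:
    """Pure-Python fallback: plain UTF-8 byte values."""
    return list(text.encode("utf-8"))


def load_backend(module_name: str):
    """Return (tokenize_fn, backend_label), preferring a compiled module."""
    try:
        ext = importlib.import_module(module_name)
        return ext.tokenize, "c"
    except (ImportError, AttributeError):
        # Extension missing or incomplete: use the slower Python path.
        return _py_tokenize, "python"


tokenize, backend = load_backend("_hypothetical_fast_tokenizer")
print(backend)         # "python", since the hypothetical module does not exist
print(tokenize("hi"))  # [104, 105]
```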

Download files

Source Distribution

shredword-0.1.1.tar.gz (615.5 kB)

Uploaded Source

Built Distribution

shredword-0.1.1-cp313-cp313-win_amd64.whl (106.0 kB)

Uploaded CPython 3.13, Windows x86-64

File details

Details for the file shredword-0.1.1.tar.gz.

File metadata

  • Download URL: shredword-0.1.1.tar.gz
  • Upload date:
  • Size: 615.5 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.1.0 CPython/3.13.3

File hashes

Hashes for shredword-0.1.1.tar.gz

Algorithm    Hash digest
SHA256       e1ba91752d53b9864eacbc9f6b9eba6c0902ad60f7595aac4ccfca042f829b6b
MD5          0c3cc2b68c5caa7c957c07cd0ef3bf4f
BLAKE2b-256  90b765ea2c3f471cee8471d7c99a00302725ada99a0676248beb261b598ed36b
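
To confirm that a downloaded file matches a published digest, you can hash it locally with Python's standard `hashlib` module. The helpers below (`sha256_of` and `matches_published`) are illustrative names, and the demo runs on a temporary file so that it works without the real archive.

```python
import hashlib
import tempfile
from pathlib import Path


def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Hash a file in chunks so large archives do not need to fit in memory."""
    digest = hashlib.sha256()
    with path.open("rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_published(path: Path, expected_hex: str) -> bool:
    """Compare a file's SHA256 with a published hex digest."""
    return sha256_of(path) == expected_hex.strip().lower()


payload = b"abc"
with tempfile.TemporaryDirectory() as tmp:
    sample = Path(tmp) / "sample.bin"
    sample.write_bytes(payload)
    demo_ok = matches_published(sample, hashlib.sha256(payload).hexdigest())
    demo_bad = matches_published(sample, "0" * 64)

print(demo_ok, demo_bad)  # True False

# For the real source distribution, after downloading it:
#   matches_published(
#       Path("shredword-0.1.1.tar.gz"),
#       "e1ba91752d53b9864eacbc9f6b9eba6c0902ad60f7595aac4ccfca042f829b6b",
#   )
```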

File details

Details for the file shredword-0.1.1-cp313-cp313-win_amd64.whl.

File metadata

  • Download URL: shredword-0.1.1-cp313-cp313-win_amd64.whl
  • Upload date:
  • Size: 106.0 kB
  • Tags: CPython 3.13, Windows x86-64
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.1.0 CPython/3.13.3

File hashes

Hashes for shredword-0.1.1-cp313-cp313-win_amd64.whl

Algorithm    Hash digest
SHA256       c421b981024a27d49c6e06abcc6597c3fa9b8ae66174919600d4c19d1c43d0af
MD5          ce1962835212e50bb00e857105a3c5ca
BLAKE2b-256  4d688c987617c523b9cdb1bbe75854ac1f5a2e8d3d9b963bf5de85de0e0d0806
