Fast BPE, unigram & word tokenizer for experimental models

Shredword

A fast and efficient tokenizer library for natural language processing tasks, built with Python and an optimized C backend.

Features

  • High Performance: Fast tokenization powered by optimized C libraries
  • Multiple Encodings: Support for various tokenization models and vocabularies
  • Flexible API: Easy-to-use Python interface with comprehensive functionality
  • Special Tokens: Built-in support for special tokens and custom vocabularies
  • Fallback Mechanisms: Robust error handling with fallback tokenization
  • BPE Support: Byte Pair Encoding implementation for subword tokenization
  • Word Tokenization: Fast word-level tokenization with contraction handling
  • TF-IDF Embeddings: Built-in TF-IDF vectorization with dense and sparse representations

Installation

pip install shredword

Quick Start

BPE Tokenization

from shred import load_encoding

# Load a pre-trained encoding by name
tokenizer = load_encoding("pre_16k")

# Encode text into tokens
tokens = tokenizer.encode("Hello, world!")
print(tokens)

# Decode the tokens back into text
text = tokenizer.decode(tokens)
print(text)

# Inspect the loaded vocabulary
print(f"Vocabulary size: {tokenizer.vocab_size}")
print(f"Special tokens: {tokenizer.special_tokens}")
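For readers new to BPE, the following toy sketch shows the general idea behind byte pair encoding: start from raw bytes and repeatedly merge the most frequent adjacent pair into a new token. It is a pure-Python illustration only, not Shredword's C implementation; the names `train_bpe`, `merge_pair`, `bpe_encode`, and `bpe_decode` are hypothetical and are not part of the shred API.

```python
from collections import Counter


def merge_pair(ids, pair, new_id):
    """Replace every non-overlapping occurrence of `pair` in `ids` with `new_id`."""
    out = []
    i = 0
    while i < len(ids):
        if i + 1 < len(ids) and (ids[i], ids[i + 1]) == pair:
            out.append(new_id)
            i += 2
        else:
            out.append(ids[i])
            i += 1
    return out


def train_bpe(text, num_merges):
    """Learn up to `num_merges` merges from `text`, starting from UTF-8 bytes."""
    ids = list(text.encode("utf-8"))
    merges = []
    for _ in range(num_merges):
        pairs = Counter(zip(ids, ids[1:]))
        if not pairs:
            break
        pair, count = pairs.most_common(1)[0]
        if count < 2:
            break  # nothing left that repeats
        # New token IDs start right after the 256 byte values.
        ids = merge_pair(ids, pair, 256 + len(merges))
        merges.append(pair)
    return merges


def bpe_encode(text, merges):
    """Apply the learned merges, in training order, to new text."""
    ids = list(text.encode("utf-8"))
    for rank, pair in enumerate(merges):
        ids = merge_pair(ids, pair, 256 + rank)
    return ids


def bpe_decode(ids, merges):
    """Expand merged tokens back into bytes and decode as UTF-8."""
    table = {i: bytes([i]) for i in range(256)}
    for rank, (a, b) in enumerate(merges):
        table[256 + rank] = table[a] + table[b]
    return b"".join(table[i] for i in ids).decode("utf-8")


merges = train_bpe("low lower lowest", 3)
ids = bpe_encode("lower", merges)
print(ids, bpe_decode(ids, merges))
```

Because merging shortens frequent byte sequences, "lower" encodes to fewer tokens than it has bytes, and decoding restores the original text.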

Word Tokenization & TF-IDF Embeddings

from shred import WordTokenizer, TfidfEmbedding

# Split text into word-level tokens
tokenizer = WordTokenizer()
tokens = tokenizer.tokenize("Hello, world! This is a test.")
print(tokens)

# Build a TF-IDF model from a small corpus
embedding = TfidfEmbedding()
embedding.add_documents([
    "The quick brown fox jumps over the lazy dog",
    "Python programming is fun and exciting"
])

# Encode a query as vocabulary IDs, a dense vector, or a sparse (indices, values) pair
ids = embedding.encode_ids("The lazy fox")
dense_vec = embedding.encode_tfidf_dense("The lazy fox")
indices, values = embedding.encode_tfidf_sparse("The lazy fox")

# Save to disk and load it back
embedding.save("vocab.txt")
loaded = TfidfEmbedding.load("vocab.txt")
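To make the dense versus sparse distinction concrete, here is a minimal pure-Python sketch of word tokenization followed by TF-IDF weighting. Everything in it is an assumption for illustration: the regular expression (which keeps contractions such as "don't" as one token), the smoothed IDF formula, and the function names `tokenize`, `build_vocab_and_idf`, `tfidf_sparse`, and `tfidf_dense` are hypothetical and may differ from what Shredword actually does.

```python
import math
import re
from collections import Counter

# Assumed pattern: a word with an optional contraction suffix, or one punctuation mark.
TOKEN_RE = re.compile(r"[A-Za-z0-9]+(?:'[A-Za-z]+)?|[^\sA-Za-z0-9]")


def tokenize(text):
    """Lowercase the text and split it into word and punctuation tokens."""
    return TOKEN_RE.findall(text.lower())


def build_vocab_and_idf(docs):
    """Assign each token an index and compute a smoothed inverse document frequency."""
    vocab = {}
    doc_freq = Counter()
    for doc in docs:
        tokens = tokenize(doc)
        for tok in tokens:
            vocab.setdefault(tok, len(vocab))
        doc_freq.update(set(tokens))  # count each token once per document
    n = len(docs)
    idf = {t: math.log((1 + n) / (1 + doc_freq[t])) + 1.0 for t in vocab}
    return vocab, idf


def tfidf_sparse(text, vocab, idf):
    """Return (indices, values) for the in-vocabulary tokens of `text` only."""
    counts = Counter(t for t in tokenize(text) if t in vocab)
    total = sum(counts.values())
    pairs = sorted((vocab[t], (c / total) * idf[t]) for t, c in counts.items())
    return [i for i, _ in pairs], [v for _, v in pairs]


def tfidf_dense(text, vocab, idf):
    """Return a vector with one slot per vocabulary entry, zero where absent."""
    vec = [0.0] * len(vocab)
    for i, v in zip(*tfidf_sparse(text, vocab, idf)):
        vec[i] = v
    return vec


docs = [
    "The quick brown fox jumps over the lazy dog",
    "Python programming is fun and exciting",
]
vocab, idf = build_vocab_and_idf(docs)
print(tfidf_sparse("The lazy fox", vocab, idf))
print(tfidf_dense("The lazy fox", vocab, idf))
```

The sparse form stores only the non-zero entries, which is usually preferable for large vocabularies; the dense form has one slot per vocabulary entry and is convenient when a fixed-length vector is required.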

Documentation

For detailed usage instructions, API reference, and examples, please see our User Documentation.

Supported Encodings

Shredword supports various pre-trained tokenization models. The library automatically downloads vocabulary files from the official repository when needed.
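The download-when-needed behavior described above follows a common cache-on-first-use pattern, sketched below. This is not Shredword's actual code: the function `load_vocab_cached`, the `.model` file naming, the cache directory, and the injected `fetch` callable are all hypothetical, chosen so the example runs offline.

```python
import tempfile
from pathlib import Path
from typing import Callable


def load_vocab_cached(name: str, cache_dir: Path, fetch: Callable[[str], bytes]) -> bytes:
    """Return vocabulary bytes for `name`, calling `fetch` only on a cache miss."""
    cache_dir.mkdir(parents=True, exist_ok=True)
    path = cache_dir / f"{name}.model"  # hypothetical file naming
    if path.exists():
        return path.read_bytes()  # cache hit: no download
    data = fetch(name)
    # Write to a temporary file first so a failed write never leaves a partial cache entry.
    tmp = path.with_suffix(".tmp")
    tmp.write_bytes(data)
    tmp.replace(path)
    return data


def demo() -> int:
    """Load the same encoding twice and report how many downloads happened."""
    calls = []

    def fake_fetch(name: str) -> bytes:
        # Stand-in for a real HTTP download.
        calls.append(name)
        return b"placeholder vocabulary bytes"

    with tempfile.TemporaryDirectory() as d:
        cache = Path(d) / "cache"
        load_vocab_cached("pre_16k", cache, fake_fetch)
        load_vocab_cached("pre_16k", cache, fake_fetch)
    return len(calls)


print(demo())
```

In a real setting, `fetch` could wrap an HTTP request to wherever the vocabulary files are hosted; passing it in as a parameter keeps the caching logic testable without network access.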

Contributing

We welcome contributions! Please feel free to submit issues, feature requests, or pull requests.

Development Setup

  1. Clone the repository
  2. Install development dependencies: pip install -r requirements.txt (there are none!)
  3. Run tests: python -m pytest

Guidelines

  • Follow PEP 8 style guidelines
  • Add tests for new features
  • Update documentation as needed
  • Ensure all tests pass before submitting PRs

License

This project is licensed under the Apache License 2.0; see the LICENSE file for details.

Download files

Source Distribution

shredword-0.1.2.tar.gz (1.2 MB)

Built Distribution

shredword-0.1.2-cp313-cp313-win_amd64.whl (115.6 kB), CPython 3.13, Windows x86-64

File details

Details for the file shredword-0.1.2.tar.gz.

File metadata

  • Download URL: shredword-0.1.2.tar.gz
  • Size: 1.2 MB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.2.0 CPython/3.13.3

File hashes

Hashes for shredword-0.1.2.tar.gz
Algorithm Hash digest
SHA256 6502ecb064a76de10672b5fd86e589f1e87ed0f92cff3a8faaffc86c1a79e2e8
MD5 7d89364cd5d90f60fa3d938ab4dd71b6
BLAKE2b-256 e938c2cfd5ac39be559e1101978eaf74de3950800a1e2f4ca3f5ab40e0201901

File details

Details for the file shredword-0.1.2-cp313-cp313-win_amd64.whl.

File metadata

  • Download URL: shredword-0.1.2-cp313-cp313-win_amd64.whl
  • Size: 115.6 kB
  • Tags: CPython 3.13, Windows x86-64
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.2.0 CPython/3.13.3

File hashes

Hashes for shredword-0.1.2-cp313-cp313-win_amd64.whl
Algorithm Hash digest
SHA256 294c1e5f05bb1a55c4dec0e133b994706cf67818924a0d3287d6d651e073a4c7
MD5 a6f6f1e0338e6af1a1836af6d1726918
BLAKE2b-256 9dba1d4e06e3d4f03c78ecc38a2a3a396b3adea24160ef48fd087c02584d8974
