A tokenizer-free NLP library with T-FREE, CANINE, and byte-level approaches
Precious Package
Overview
Precious is a minimal model that demonstrates three tokenizer-free approaches to natural language processing. It implements T-FREE, CANINE, and byte-level embeddings, together with attention mechanisms, behind a single configuration switch (`mode`).
Installation
From PyPI (Recommended)
```shell
pip install precious-nlp
```
From Source (Development)
```shell
git clone https://github.com/bimri/precious.git
cd precious
pip install -e .
```
With Optional Dependencies
```shell
# For development tools
pip install precious-nlp[dev]

# For benchmarking
pip install precious-nlp[benchmarks]

# For documentation
pip install precious-nlp[docs]

# All optional dependencies
pip install precious-nlp[all]
```
Quick Start
Installation and Import
```shell
# Install the package
pip install precious-nlp
```

```python
# Import the package (note: install as 'precious-nlp', import as 'precious')
import precious
from precious import PreciousModel, PreciousConfig
```
Usage
Here is a basic example of how to use the PreciousModel:
```python
import precious
from precious import PreciousModel, PreciousConfig

# Initialize the model with the desired configuration
config = PreciousConfig(mode="byte", d_model=256)  # or "tfree", "canine"
model = PreciousModel(config)

# Prepare your input data
inputs = ["Hello, tokenizer-free world!"]
outputs = model(inputs)

# Access the logits
logits = outputs["logits"]
print(f"Output shape: {logits.shape}")  # [batch_size, seq_len, vocab_size]

# Training with targets
targets = ["Hello, tokenizer-free universe!"]
outputs = model(inputs, targets=targets)
loss = outputs["loss"]
print(f"Training loss: {loss.item()}")
```
Three Tokenizer-Free Approaches
1. Byte-Level Processing
```python
import precious

config = precious.PreciousConfig(mode="byte", d_model=256)
model = precious.PreciousModel(config)
# Processes text at byte level - universal and memory efficient
```
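To make "byte level" concrete, the sketch below turns a string into one integer ID per UTF-8 byte using only the Python standard library. It illustrates the general idea rather than the Precious internals; the helper names `text_to_byte_ids` and `byte_ids_to_text` are hypothetical and are not part of the package API.

```python
from typing import List


def text_to_byte_ids(text: str) -> List[int]:
    """Encode text as UTF-8 and return one integer ID (0-255) per byte."""
    return list(text.encode("utf-8"))


def byte_ids_to_text(ids: List[int]) -> str:
    """Invert the mapping; malformed byte sequences are replaced, not raised."""
    return bytes(ids).decode("utf-8", errors="replace")


# ASCII characters take one byte each.
print(text_to_byte_ids("Hi!"))     # [72, 105, 33]

# A non-ASCII character expands to several bytes (U+00E9 is two bytes in UTF-8).
print(text_to_byte_ids("\u00e9"))  # [195, 169]
```

Because every string has a UTF-8 encoding, the ID space never exceeds 256 values, at the cost of longer sequences for non-ASCII text.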
2. CANINE Approach
```python
import precious

config = precious.PreciousConfig(mode="canine", d_model=256)
model = precious.PreciousModel(config)
# Character-level processing with Unicode support
```
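As a rough illustration of character-level input in the CANINE style, the sketch below maps each character to its Unicode code point and then hashes that code point into several small bucket tables, so that no table needs a row for every possible character. The primes, the bucket count, and the function names are assumptions chosen for illustration; they are not taken from the Precious source.

```python
from typing import List, Tuple

# Hypothetical constants for illustration only.
HASH_PRIMES: Tuple[int, ...] = (31, 43, 59, 61)
NUM_BUCKETS = 16384


def text_to_codepoints(text: str) -> List[int]:
    """Return one Unicode code point per character."""
    return [ord(ch) for ch in text]


def hash_codepoint(codepoint: int) -> List[int]:
    """Map a code point to one bucket index per hash function."""
    return [((codepoint + 1) * prime) % NUM_BUCKETS for prime in HASH_PRIMES]


print(text_to_codepoints("A\u00e9"))  # [65, 233]
print(hash_codepoint(ord("A")))       # [2046, 2838, 3894, 4026]
```

In a model, each bucket index would select a slice of the character embedding, and the slices would be concatenated. Two characters that collide under one hash are unlikely to collide under all of them.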
3. T-FREE Method
```python
import precious

config = precious.PreciousConfig(mode="tfree", d_model=256, tfree_vocab_v=8192)
model = precious.PreciousModel(config)
# Vocabulary-aware with character-level fallback
```
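The general T-FREE idea can be sketched as follows: each word is split into overlapping character trigrams, and every trigram is hashed into a fixed number of slots, so a word is represented by a sparse set of active slots instead of a single token ID. This is a minimal sketch under stated assumptions: `VOCAB_SIZE` mirrors `tfree_vocab_v=8192` from the example above, while `HASHES_PER_TRIGRAM`, the use of SHA-256, the lowercasing, and all function names are hypothetical and may differ from what Precious does.

```python
import hashlib
from typing import List, Set

VOCAB_SIZE = 8192        # mirrors tfree_vocab_v in the example above
HASHES_PER_TRIGRAM = 2   # hypothetical; the README does not state this value


def word_trigrams(word: str) -> List[str]:
    """Pad the word with spaces and return its overlapping character trigrams."""
    padded = f" {word} "
    return [padded[i:i + 3] for i in range(len(padded) - 2)]


def trigram_slots(trigram: str) -> List[int]:
    """Hash one trigram into HASHES_PER_TRIGRAM slots in [0, VOCAB_SIZE)."""
    slots = []
    for seed in range(HASHES_PER_TRIGRAM):
        digest = hashlib.sha256(f"{seed}:{trigram}".encode("utf-8")).digest()
        slots.append(int.from_bytes(digest[:8], "big") % VOCAB_SIZE)
    return slots


def word_activation(word: str) -> Set[int]:
    """Return the set of active slots for a word (case-insensitive)."""
    active: Set[int] = set()
    for trigram in word_trigrams(word.lower()):
        active.update(trigram_slots(trigram))
    return active


print(word_trigrams("cat"))            # [' ca', 'cat', 'at ']
print(sorted(word_activation("cat")))  # a handful of slot indices below 8192
```

Words that share trigrams, such as "cat" and "cats", share some active slots, which is what lets similar spellings end up with related representations.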
Key Features
- 🚀 Three tokenizer-free approaches in one unified library
- 🎯 Production-ready with comprehensive testing and documentation
- 🌍 Universal text support - handles any Unicode text
- ⚡ Efficient processing with configurable model architectures
- 🧪 Research-friendly with benchmarking and comparison tools
- 📚 Well-documented with extensive examples and API reference
Quick Performance Comparison
| Mode | Memory | Speed | Best For |
|---|---|---|---|
| Byte | Lowest | Fastest | General purpose, production |
| CANINE | Medium | Medium | Multilingual, character-aware |
| T-FREE | Highest | Research | Vocabulary analysis, interpretability |
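One contributor to the memory column is the size of the input embedding table. The arithmetic below is a back-of-the-envelope sketch that assumes one `d_model`-wide row per input ID, with 256 rows for byte mode and `tfree_vocab_v=8192` rows for T-FREE mode. It covers only that one table, not the rest of the model, and the actual layout inside Precious may differ. CANINE is left out because its table size depends on a hash bucket count that this README does not state.

```python
def embedding_params(num_rows: int, d_model: int) -> int:
    """Parameter count of a dense embedding table with num_rows rows."""
    return num_rows * d_model


D_MODEL = 256

byte_params = embedding_params(256, D_MODEL)    # one row per possible byte value
tfree_params = embedding_params(8192, D_MODEL)  # one row per T-FREE slot

print(byte_params)                  # 65536
print(tfree_params)                 # 2097152
print(tfree_params // byte_params)  # 32
```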
Documentation
For complete documentation, visit the docs directory or browse individual guides:
- 📖 API Reference - Complete API documentation
- 📝 Examples - From basic to advanced usage
Requirements
- Python >= 3.8
- PyTorch >= 1.9.0
- NumPy >= 1.19.0
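The following optional sketch checks an environment against the minimum versions listed above, using only the standard library. The helper names are hypothetical and the version parser is deliberately simple: it keeps the leading numeric components and ignores local or pre-release suffixes.

```python
import sys
from importlib import metadata
from typing import Dict, Optional, Tuple

# Minimum versions copied from the requirements list above.
MINIMUMS = {"torch": (1, 9, 0), "numpy": (1, 19, 0)}


def parse_version(text: str) -> Tuple[int, ...]:
    """Keep leading numeric components, e.g. '2.1.0+cu121' -> (2, 1, 0)."""
    parts = []
    for piece in text.split("+")[0].split("."):
        digits = ""
        for ch in piece:
            if not ch.isdigit():
                break
            digits += ch
        if not digits:
            break
        parts.append(int(digits))
    while len(parts) < 3:  # pad '1.9' to (1, 9, 0) so comparisons behave
        parts.append(0)
    return tuple(parts)


def check_requirements() -> Dict[str, Optional[bool]]:
    """Return True/False per requirement, or None if a package is not installed."""
    results: Dict[str, Optional[bool]] = {"python": sys.version_info >= (3, 8)}
    for name, minimum in MINIMUMS.items():
        try:
            installed = parse_version(metadata.version(name))
        except metadata.PackageNotFoundError:
            results[name] = None
            continue
        results[name] = installed >= minimum
    return results


print(check_requirements())
```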
Contributing
Contributions are welcome! Please follow these steps to contribute:
- Fork the repository.
- Create a new branch for your feature or bug fix.
- Make your changes and commit them.
- Push your branch and create a pull request.
License
This project is licensed under the MIT License. See the LICENSE file for more details.