Fast token count estimation library
Project description
skimtoken (Early Beta)
⚠️ WARNING: This is an early beta version. The current implementation is not production-ready.
A lightweight, fast token count estimation library written in Rust with Python bindings.
Why skimtoken?
The Problem: tiktoken is great for precise tokenization, but requires ~59.6MB of memory just to count tokens - problematic for memory-constrained environments.
The Solution: skimtoken estimates token counts using statistical patterns instead of loading entire vocabularies, achieving:
- ✅ 65x less memory (0.92MB vs 59.6MB)
- ✅ 421x faster startup (2.389ms vs 1,005ms)
- ❌ 1.03x slowwer execute time (6.689s vs 6.912s) for Multilingual single method
- ❌ Trade-off: ~15.11% error rate vs exact counts
Installation
pip install skimtoken
Requirements: Python 3.9+
Quick Start
Simple method (Just char length x coefficient):
from skimtoken import estimate_tokens
# Basic usage
text = "Hello, world! How are you today?"
token_count = estimate_tokens(text)
print(f"Estimated tokens: {token_count}")
Multilingual simple method:
from skimtoken.multilingual_single import estimate_tokens
multilingual_text = """
For non-space separated languages, the number of tokens is difficult to predict.
スペースで区切られていない言語の場合トークン数を予測するのは難しいです。
स्पेसद्वारावियोजितनहींभाषाओंकेलिएटोकनकीसंख्याकाअनुमानलगानाकठिनहै।
بالنسبةللغاتالتيلاتفصلبمسافاتفإنالتنبؤبعددالرموزصعب
"""
token_count = estimate_tokens(multilingual_text)
print(f"Estimated tokens (multilingual): {token_count}")
When to Use skimtoken
✅ Perfect for:
| Use Case | Why It Works | Example |
|---|---|---|
| Rate Limiting | Overestimating is safe | Prevent API quota exceeded |
| Cost Estimation | Users prefer conservative estimates | "$0.13" (actual: $0.10) |
| Progress Bars | Approximate progress is fine | Processing documents |
| Serverless/Edge | Memory constraints (128MB limits) | Cloudflare Workers |
| Quick Filtering | Remove obviously too-long content | Pre-screening |
| Model Switching | Switch to smart model when context long | Auto-escalation |
❌ Not suitable for:
| Use Case | Why It Fails | Use Instead |
|---|---|---|
| Context Limits | Underestimating causes failures | tiktoken |
| Exact Billing | 15% error = unhappy customers | tiktoken |
| Token Splitting | Chunks might exceed limits | tiktoken |
| Embeddings | Need exact token boundaries | tiktoken |
Performance Comparison
Large-Scale Benchmark (100k samples)
Multilingual single method:
Results:
Total Samples: 100,726
Total Characters: 13,062,391
Mean RMSE: 21.3034 tokens
Mean Error Rate: 15.11%
┏━━━━━━━━━━━━━━┳━━━━━━━━━━━━┳━━━━━━━━━━━━┳━━━━━━━━┓
┃ Metric ┃ tiktoken ┃ skimtoken ┃ Ratio ┃
┡━━━━━━━━━━━━━━╇━━━━━━━━━━━━╇━━━━━━━━━━━━╇━━━━━━━━┩
│ Init Time │ 1.005490 s │ 0.002389 s │ 0.002x │
├──────────────┼────────────┼────────────┼────────┤
│ Init Memory │ 42.2310 MB │ 0.0265 MB │ 0.001x │
├──────────────┼────────────┼────────────┼────────┤
│ Exec Time │ 6.689203 s │ 6.911931 s │ 1.033x │
├──────────────┼────────────┼────────────┼────────┤
│ Exec Memory │ 17.3251 MB │ 0.8950 MB │ 0.052x │
├──────────────┼────────────┼────────────┼────────┤
│ Total Time │ 7.694694 s │ 6.914320 s │ 0.899x │
├──────────────┼────────────┼────────────┼────────┤
│ Total Memory │ 59.5561 MB │ 0.9215 MB │ 0.015x │
└──────────────┴────────────┴────────────┴────────┘
Automated Benchmarks
For up-to-date performance comparisons and detailed accuracy metrics across all methods, visit the skimtoken_benchmark repository. This automated benchmark suite:
- Uses the CC-100 multilingual dataset (100k+ samples)
- Provides language-specific accuracy breakdowns
Available Methods
| Method | Import | Memory | Error | Best For |
|---|---|---|---|---|
| Simple | from skimtoken.simple import estimate_tokens |
1.0MB | ~21.63% | English text, minimum memory |
| Basic | from skimtoken.basic import estimate_tokens |
0.9MB | ~27.05% | General use |
| Multilingual | from skimtoken.multilingual import estimate_tokens |
0.9MB | ~15.93% | Non-English, mixed languages |
| Multilingual Simple | from skimtoken.multilingual_simple import estimate_tokens |
0.9MB | ~15.11% | Fast multilingual estimation |
# Example: Choose method based on your needs
if memory_critical:
from skimtoken.simple import estimate_tokens
elif mixed_languages:
from skimtoken.multilingual import estimate_tokens
else:
from skimtoken import estimate_tokens # Default: simple
CLI Usage
# From command line
echo "Hello, world!" | skimtoken
# Output: 5
# From file
skimtoken -f document.txt
# Output: 236
# Multiple files
cat *.md | skimtoken
# Output: 4846
How It Works
Unlike tiktoken's vocabulary-based approach, skimtoken uses statistical patterns:
tiktoken:
Text → Tokenizer → ["Hello", ",", " world"] → Vocabulary Lookup → [1234, 11, 4567] → Count: 3
↑
Requires 60MB dictionary
skimtoken:
Text → Feature Extraction → {chars: 13, words: 2, lang: "en"} → Statistical Model → ~3 tokens
↑
Only 0.92MB of parameters
Advanced Usage
Optimize for Your Domain
Improve accuracy on domain-specific content:
# 1. Prepare labeled data
# Format: {"text": "your content", "actual_tokens": 123}
uv run scripts/prepare_dataset.py --input your_texts.txt
# 2. Optimize parameters
uv run scripts/optimize_all.py --dataset your_data.jsonl
# 3. Rebuild with custom parameters
uv run maturin build --release
Architecture
skimtoken/
├── src/
│ ├── lib.rs # Core Rust library with PyO3 bindings
│ └── methods/
│ ├── method_simple.rs # Character-based estimation
│ ├── method_basic.rs # Multi-feature regression
│ └── method_multilingual.rs # Language-aware estimation
├── skimtoken/ # Python package
│ ├── __init__.py # Main API
│ └── {method}.py # Method-specific imports
├── params/ # Learned parameters (TOML)
└── scripts/
├── benchmark.py # Performance testing
└── optimize/ # Parameter training
Development
# Setup
git clone https://github.com/masaishi/skimtoken
cd skimtoken
uv sync
# Development build
uv run maturin dev --features python
# Run tests
cargo test
uv run pytest
# Benchmark
uv run scripts/benchmark.py
FAQ
Q: Can I improve accuracy?
A: Yes! You can adjust the parameters using your own data to improve accuracy. See Advanced Usage for details.
Q: Is the API stable?
A: Beta = breaking changes possible.
Future Plans
We are actively working to improve skimtoken's accuracy and performance:
- Better estimation algorithms: Moving beyond simple character multiplication to more sophisticated statistical models
- Performance optimization: Further improving execution speed
- Improved language support: Better handling of non-English languages
- Higher accuracy: Targeting <10% error rate while maintaining low memory footprint
License
MIT License - see LICENSE for details.
Project details
Download files
Download the file for your platform. If you're not sure which to choose, learn more about installing packages.
Source Distribution
Built Distribution
Filter files by name, interpreter, ABI, and platform.
If you're not sure about the file name format, learn more about wheel file names.
Copy a direct link to the current filters
File details
Details for the file skimtoken-0.2.2.tar.gz.
File metadata
- Download URL: skimtoken-0.2.2.tar.gz
- Upload date:
- Size: 213.6 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: maturin/1.9.1
File hashes
| Algorithm | Hash digest | |
|---|---|---|
| SHA256 |
c1446237248d6b342cfe90ab7baab5f235cd28db9afdfe58d1e8ba774d5b6d0f
|
|
| MD5 |
6603f399f1acf1c48673cf521a7173af
|
|
| BLAKE2b-256 |
1dec090eea9b57582dc2c1f6a5dc83356915e3fc03af11c8e7f8cde3dc39d26e
|
File details
Details for the file skimtoken-0.2.2-cp39-abi3-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.
File metadata
- Download URL: skimtoken-0.2.2-cp39-abi3-manylinux_2_17_x86_64.manylinux2014_x86_64.whl
- Upload date:
- Size: 478.5 kB
- Tags: CPython 3.9+, manylinux: glibc 2.17+ x86-64
- Uploaded using Trusted Publishing? No
- Uploaded via: maturin/1.9.1
File hashes
| Algorithm | Hash digest | |
|---|---|---|
| SHA256 |
61968513b6d742b863a3cc1dacfad72e29f4ea6e833e208504db4fbe5a434db4
|
|
| MD5 |
a72c9b76220f6714af43459d0138b0b9
|
|
| BLAKE2b-256 |
7832dd328ffaf7e0f8608f2cf0ad78199f4fa7c8935d77ad9434a44a0694cefb
|