A tiny byte-level multi-head content classifier
license: apache-2.0
language:
- en
- multilingual
tags:
- byte-level
- content-classification
- onnx
- edge-ai
- matryoshka
- multi-head
- classifier
- clipboard
pipeline_tag: text-classification
library_name: pico-type
inference:
  parameters:
    provider: CPUExecutionProvider
pico-type 🔍
A tiny byte-level multi-head content classifier — ~1.5M params, ~209KB ONNX, <6ms inference.
Classifies any content into 7 categories from raw bytes in a single forward pass.
Built by eulogik — AI infrastructure for developers.
✨ Features
- No tokenizer — operates directly on raw UTF-8 bytes (supports all languages, zero pre-processing)
- 7 heads, one forward pass — coarse type, modality, subtype, code lang, text lang, file MIME, risk flags
- 4 Matryoshka tiers — tiny (16d) → small (64d) → base (192d) → pro (576d)
- ~200KB ONNX — deploy on edge devices, serverless functions, browser (WebAssembly)
- <6ms inference on CPU via ONNX Runtime (base tier, 1024 bytes)
- CLI, Gradio Space, MCP server — ready for any integration
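The tokenizer-free input path listed above can be sketched in a few lines. This is an illustration, not the package's own preprocessing: the helper name `to_byte_ids`, the fixed 1024-byte window (borrowed from the benchmark figure), and the padding value are all assumptions.

```python
from typing import List, Tuple, Union

MAX_LEN = 1024  # assumed window, matching the 1024-byte benchmark input
PAD_ID = 0      # assumed padding value; the mask marks which positions are real


def to_byte_ids(data: Union[str, bytes], max_len: int = MAX_LEN) -> Tuple[List[int], List[int]]:
    """Turn str or bytes into fixed-length byte IDs plus a 0/1 validity mask."""
    raw = data.encode("utf-8") if isinstance(data, str) else bytes(data)
    raw = raw[:max_len]          # truncate long inputs
    ids = list(raw)              # each byte is already an int in 0..255
    mask = [1] * len(ids)
    pad = max_len - len(ids)
    return ids + [PAD_ID] * pad, mask + [0] * pad


ids, mask = to_byte_ids("héllo")
print(ids[:6], sum(mask))  # [104, 195, 169, 108, 108, 111] 6
```

Because every input is reduced to values in 0..255, non-ASCII text such as the `é` above simply becomes two byte IDs, and no vocabulary file is needed.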
📊 Performance
| Head | Classes | Synthetic Accuracy | Real-World Accuracy |
|---|---|---|---|
| coarse | 12 | 100% | 86% |
| modality | 8 | 100% | 100% |
| subtype | 24 | 95% | — |
| code_lang | 62 | 39% | — |
| text_lang | 30 | 99% | 100% |
| file_mime | 90 | 100% | — |
| risk (mAP) | 6 | 100% | — |
Evaluated on 1000 synthetic samples + 21 hand-curated real-world inputs. Base tier, ~5ms inference.
Note: the low code_lang synthetic accuracy (39%) reflects the difficulty of 62-way classification with limited per-class support. Real-world accuracy across all heads is 52% (11 of 21 inputs correct), up from a 23% baseline measured before training on more diverse data.
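The per-head figures in the table and the combined real-world figure measure different things. The sketch below shows one way to compute both, assuming that "across all heads" means an input counts as correct only when every labeled head is right; that reading, the function names, and the toy records are assumptions, not the project's evaluation code.

```python
from typing import Dict, List

Record = Dict[str, str]


def per_head_accuracy(preds: List[Record], golds: List[Record], head: str) -> float:
    """Fraction of samples labeled for `head` where the prediction matches."""
    pairs = [(p.get(head), g[head]) for p, g in zip(preds, golds) if head in g]
    return sum(p == g for p, g in pairs) / len(pairs)


def all_heads_accuracy(preds: List[Record], golds: List[Record]) -> float:
    """Fraction of samples where every labeled head is predicted correctly."""
    hits = sum(all(p.get(h) == g[h] for h in g) for p, g in zip(preds, golds))
    return hits / len(golds)


# Hypothetical toy data, not the project's 21 real-world inputs.
golds = [
    {"coarse": "code", "modality": "textual"},
    {"coarse": "text", "modality": "textual"},
    {"coarse": "config", "modality": "textual"},
]
preds = [
    {"coarse": "code", "modality": "textual"},
    {"coarse": "code", "modality": "textual"},   # coarse head is wrong here
    {"coarse": "config", "modality": "textual"},
]

print(per_head_accuracy(preds, golds, "modality"))  # 1.0
print(all_heads_accuracy(preds, golds))             # 2 of 3 samples fully correct
print(round(11 / 21 * 100))                         # 52
```

Under this reading a single wrong head fails the whole input, which is why the combined number can sit well below every individual head's accuracy.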
🚀 Quick Start
CLI
pip install pico-type
printf 'def hello():\n    return 42\n' | picotype --pretty
picotype --file document.txt
picotype --clip
Python
from picotype import PicoType, PicoTypeConfig, decode_output
model = PicoType(PicoTypeConfig()).eval()
# ... load checkpoint ...
result = decode_output(model(b"input bytes"), tier="base")
MCP Server (Claude/Cursor)
PICOTYPE_MODEL_DIR=./checkpoints python -m model.pico_type.mcp_server
🏗 Architecture
Bytes → ByteEmbed(256→96d) → 3×Conv1D(k=3,5,7) → 2×BiAttention(RoPE) → Pool(mean‖max‖std) → 7×Matryoshka Heads
| Component | Description |
|---|---|
| ByteEmbed | nn.Embedding(256, 96): tokenizer-free byte embedding, one learned 96-d vector per byte value |
| Conv1D | 3 parallel kernels (width 3, 5, 7) with residual + LayerNorm + GELU |
| BiAttention | Bidirectional self-attention with Rotary Position Embeddings, 4 heads |
| Pool | Mean + Max + Std concatenation over masked positions |
| Matryoshka Heads | 4 tier slices of the pooled vector → 7 linear classifiers |
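The Pool row above can be illustrated with a small pure-Python sketch of masked mean, max, and std pooling. It is a sketch under stated assumptions rather than the model's code: the function name `masked_pool` is hypothetical, and the population standard deviation (dividing by n) is an assumption, since the source does not say which estimator is used.

```python
import math
from typing import List


def masked_pool(features: List[List[float]], mask: List[int]) -> List[float]:
    """Concatenate per-channel mean, max, and std over positions where mask == 1."""
    valid = [row for row, m in zip(features, mask) if m]
    if not valid:
        raise ValueError("mask selects no positions")
    n = len(valid)
    channels = list(zip(*valid))  # transpose: one tuple of values per channel
    mean = [sum(c) / n for c in channels]
    peak = [max(c) for c in channels]
    # Population std (divide by n); an assumption, see the lead-in.
    std = [math.sqrt(sum((x - mu) ** 2 for x in c) / n) for c, mu in zip(channels, mean)]
    return mean + peak + std


# Three positions with two channels each; the last position is padding.
features = [[1.0, 10.0], [3.0, 20.0], [99.0, 99.0]]
mask = [1, 1, 0]
print(masked_pool(features, mask))  # [2.0, 15.0, 3.0, 20.0, 1.0, 5.0]
```

The output is three times as wide as the per-position features, and the padded position contributes nothing to any of the three statistics.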
Total parameters: 1.43M (tiny) / 1.45M (small) / 1.48M (base) / 1.56M (pro)
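As a rough cross-check of these totals, the tier-dependent part of the parameter count can be estimated from the class counts and tier widths stated in this document. The sketch assumes each head is a single linear layer with bias reading a tier-width slice; the actual head layout may differ, and the function names are hypothetical.

```python
# Class counts per head and tier widths, both taken from this document.
HEAD_CLASSES = {
    "coarse": 12, "modality": 8, "subtype": 24, "code_lang": 62,
    "text_lang": 30, "file_mime": 90, "risk": 6,
}
TIER_DIMS = {"tiny": 16, "small": 64, "base": 192, "pro": 576}


def embedding_params(vocab: int = 256, dim: int = 96) -> int:
    """Weights in an embedding table with one row per byte value."""
    return vocab * dim


def head_params(tier_dim: int) -> int:
    """Weights plus biases for seven linear heads on a tier_dim-wide input (assumed layout)."""
    total_classes = sum(HEAD_CLASSES.values())
    return tier_dim * total_classes + total_classes


print(embedding_params())  # 24576
for tier, dim in TIER_DIMS.items():
    print(tier, head_params(dim))
# tiny 3944
# small 15080
# base 44776
# pro 133864
```

Under that assumption the tiny-to-pro gap is about 0.13M parameters, in line with the reported 1.43M and 1.56M; the shared trunk accounts for the rest.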
🔧 Model Tiers
| Tier | Dim | Params | ONNX Size | Speed |
|---|---|---|---|---|
| tiny | 16 | 1.43M | 207 KB | ~3ms |
| small | 64 | 1.45M | 207 KB | ~4ms |
| base | 192 | 1.48M | 209 KB | ~5ms |
| pro | 576 | 1.56M | 206 KB | ~12ms |
All tiers share the same trunk; only the final linear layers differ. Switch tiers at inference with zero overhead.
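One common way to realise tiers that share a trunk is to let each tier read a prefix of the same pooled vector, as in Matryoshka representation learning. The sketch below assumes that nested-prefix layout; the helper names and the toy weights are hypothetical, and the real heads may slice differently.

```python
from typing import List

TIER_DIMS = {"tiny": 16, "small": 64, "base": 192, "pro": 576}


def tier_slice(pooled: List[float], tier: str) -> List[float]:
    """Return the first TIER_DIMS[tier] entries of the pooled vector (assumed nesting)."""
    dim = TIER_DIMS[tier]
    if len(pooled) < dim:
        raise ValueError(f"pooled vector has {len(pooled)} dims, tier needs {dim}")
    return pooled[:dim]


def linear(x: List[float], weight: List[List[float]], bias: List[float]) -> List[float]:
    """Plain linear layer: one logit per weight row."""
    return [sum(w * v for w, v in zip(row, x)) + b for row, b in zip(weight, bias)]


pooled = [float(i) for i in range(576)]  # stand-in for a trunk output
tiny = tier_slice(pooled, "tiny")
base = tier_slice(pooled, "base")
print(len(tiny), len(base), base[:16] == tiny)  # 16 192 True

# Toy two-class head on the tiny slice (weights are made up).
logits = linear(tiny, [[1.0] * 16, [0.0] * 16], [0.0, 0.5])
print(logits)  # [120.0, 0.5]
```

With this layout the trunk runs once and a different tier only changes which slice and which head weights are used.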
🧪 Classification Heads
| Head | Classes | Gated By | Examples |
|---|---|---|---|
| coarse | 12 | — | text, code, link, image, file, config, markup, data, error, secret, archive, binary |
| modality | 8 | — | textual, binary_image, binary_archive, binary_executable, binary_document, binary_audio, binary_video, binary_other |
| subtype | 24 | config, markup, data | json, yaml, toml, csv, html, markdown, sql, log, dockerfile |
| code_lang | 62 | code | python, javascript, typescript, java, c, cpp, go, rust, kotlin, swift, bash, sql |
| text_lang | 30 | text | en, es, fr, de, it, pt, ru, zh, ja, ko, ar, hi |
| file_mime | 90 | image, file | text/html, application/json, application/pdf, image/png, video/mp4 |
| risk | 6 | — | api_key, jwt, password, email, phone, ssh_key (probabilities) |
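The "Gated By" column suggests that some heads are only meaningful for certain coarse types. A minimal sketch of applying that gating to an already decoded result follows; the dictionary shape, the function name `apply_gates`, and the 0.5 risk threshold are assumptions and may not match what `decode_output` returns.

```python
from typing import Any, Dict

# Which coarse classes enable which head, taken from the table above.
GATES = {
    "subtype": {"config", "markup", "data"},
    "code_lang": {"code"},
    "text_lang": {"text"},
    "file_mime": {"image", "file"},
}


def apply_gates(decoded: Dict[str, Any], risk_threshold: float = 0.5) -> Dict[str, Any]:
    """Blank out gated heads whose coarse class does not apply and threshold risk flags."""
    coarse = decoded["coarse"]
    result: Dict[str, Any] = {"coarse": coarse, "modality": decoded["modality"]}
    for head, allowed in GATES.items():
        result[head] = decoded.get(head) if coarse in allowed else None
    # Risk is multi-label: keep every flag whose probability clears the threshold.
    risk = decoded.get("risk", {})
    result["risk"] = sorted(name for name, p in risk.items() if p >= risk_threshold)
    return result


# Hypothetical decoded output for a Python snippet containing a key.
decoded = {
    "coarse": "code", "modality": "textual", "subtype": "json",
    "code_lang": "python", "text_lang": "en", "file_mime": "text/html",
    "risk": {"api_key": 0.91, "email": 0.12},
}
print(apply_gates(decoded))
```

Here only `code_lang` survives among the gated heads because the coarse class is `code`, and `api_key` is the only risk flag above the assumed threshold.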
🌐 Deployment
| Platform | URL |
|---|---|
| HuggingFace Space | eulogik/pico-type |
| HuggingFace Model | eulogik/pico-type |
| GitHub | eulogik/pico-type |
| PyPI | pip install pico-type |
| Zenodo | 10.5281/zenodo.20758542 |
📚 Documentation
- Model Card — detailed architecture, training, evaluation
- Architecture Plan — full design document
- Walkthrough — development log with all decisions
📄 License
Apache 2.0 — free for commercial and personal use.
File details
Details for the file pico_type-0.1.6.tar.gz.
File metadata
- Download URL: pico_type-0.1.6.tar.gz
- Upload date:
- Size: 67.2 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.11.14
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | ad296a4c736b6c839e4f25b2b74dbed72394486caf3f01b6b053e4bb77ced524 |
| MD5 | fa9181c8c1c4a6f881b57edd14f0a009 |
| BLAKE2b-256 | 8ae351d49b0966ec1eb2c2f11e0f3174f658822cb5c4fb9372ee264f3ee8734c |
File details
Details for the file pico_type-0.1.6-py3-none-any.whl.
File metadata
- Download URL: pico_type-0.1.6-py3-none-any.whl
- Upload date:
- Size: 82.0 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.11.14
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 9841a812ca2deff9edd2eca58fd59c63dab18df6492f3837cddde5caa02638fc |
| MD5 | 09bfa2b0a66630ad4cd5397b98c0549b |
| BLAKE2b-256 | 103f5ea645966d5f01ad4fc2cb12bbe1b6feb567af9c4dd4008fb716aa6d4a4c |