A tiny byte-level multi-head content classifier
pico-type 🔍
A tiny byte-level multi-head content classifier — ~1.5M params, ~200KB ONNX, <12ms inference.
Classifies arbitrary content from raw bytes along 7 heads: coarse type, modality, subtype, code language, text language, file MIME type, and risk flags.
✨ Features
- No tokenizer — operates directly on raw UTF-8 bytes (supports all languages)
- 7 heads, one forward pass — coarse type, modality, subtype, code lang, text lang, file MIME, risk
- 4 Matryoshka tiers — tiny (16d) → small (64d) → base (192d) → pro (576d)
- ~200KB ONNX — deploy on edge devices, serverless, browser (WebAssembly)
- <12ms inference on CPU via ONNX Runtime
- CLI, Gradio Space, MCP server — ready to use
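To illustrate what "no tokenizer" means in practice, here is a minimal sketch of turning text into byte IDs in the 0-255 range, which is the kind of input a 256-entry byte vocabulary expects. The helper name `to_byte_ids` and the `max_len` cap are hypothetical and not part of the picotype API; the source does not state an input length limit.

```python
def to_byte_ids(text: str, max_len: int = 1024) -> list[int]:
    """Encode text as UTF-8 and return its byte values (0-255).

    max_len is an illustrative cap, not a documented picotype setting.
    """
    return list(text.encode("utf-8"))[:max_len]


# ASCII source code maps to one byte per character.
ascii_ids = to_byte_ids("def f(): pass")

# Non-Latin scripts map to several bytes per character,
# but every value still falls inside the same 256-entry vocabulary.
cjk_ids = to_byte_ids("你好")

print(len(ascii_ids), ascii_ids[:3])
print(len(cjk_ids), cjk_ids)
```

Because every script reduces to the same 256 values, no language-specific vocabulary is needed.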
📊 Performance
| Head | Classes | Accuracy |
|---|---|---|
| coarse | 12 | 100% |
| modality | 8 | 100% |
| subtype | 24 | 93.8% |
| code_lang | 62 | 41.7% |
| text_lang | 30 | 94.3% |
| file_mime | 90 | 100% |
| risk (mAP) | 6 | 100% |
Evaluated on 500 samples with the base tier after 1,700 training steps; measured inference time was ~13 ms.
🚀 Quick Start
CLI
pip install pico-type
printf 'def hello():\n    return 42\n' | picotype --pretty
picotype --file document.txt
picotype --clip
Python
from model.pico_type.cli import load_onnx_model, run_onnx
session = load_onnx_model("base", "checkpoints")
result = run_onnx(session, "def hello(): pass")
print(result)
MCP Server (Claude/Cursor)
PICOTYPE_MODEL_DIR=./checkpoints python -m model.pico_type.mcp_server
🏗 Architecture
Bytes → ByteEmbed(256→96d) → 3×Conv1D(k=3,5,7) → 2×BiAttention(RoPE) → Pool(mean‖max‖std) → 7×Matryoshka Heads
- ByteEmbed: lookup-free byte embedding (256 vocab, 96 dim)
- Conv1D: 3 parallel kernels (width 3, 5, 7) with residual + layer norm
- BiAttention: bidirectional self-attention with RoPE, 4 heads, 96 dim
- Pool: mean + max + std concatenation
- Matryoshka Heads: 4 slices of the pooled vector (16/64/192/576 dim) → 7 linear classifiers
Total parameters: 1.43M (tiny) / 1.45M (small) / 1.48M (base) / 1.56M (pro)
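The Pool and Matryoshka steps above can be sketched in plain Python. This is an illustration under stated assumptions, not the project's implementation: the function names are hypothetical, the standard deviation is taken as the population form, and a tier is assumed to read a leading prefix of the pooled vector (the usual Matryoshka convention). The toy input uses 2 channels so the numbers are easy to follow.

```python
import statistics


def pool(features: list[list[float]]) -> list[float]:
    """Concatenate per-channel mean, max, and std over the sequence axis.

    features is laid out as [seq_len][channels]; the result has
    3 * channels entries (mean, then max, then std).
    """
    channels = list(zip(*features))  # transpose to [channels][seq_len]
    mean = [statistics.fmean(c) for c in channels]
    peak = [max(c) for c in channels]
    # Assumption: population standard deviation; the source does not say.
    std = [statistics.pstdev(c) for c in channels]
    return mean + peak + std


def matryoshka_slice(pooled: list[float], dim: int) -> list[float]:
    """Return the first `dim` entries, assuming tiers are nested prefixes."""
    if dim > len(pooled):
        raise ValueError("tier dimension exceeds pooled vector length")
    return pooled[:dim]


# Toy input: 2 sequence positions, 2 channels.
features = [[1.0, 2.0], [3.0, 6.0]]
pooled = pool(features)
print(pooled)
print(matryoshka_slice(pooled, 4))
```

With nested prefixes, a smaller tier reuses exactly the same trunk output as a larger one and only reads fewer of its entries.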
🔧 Model Tiers
| Tier | Dim | Params | ONNX Size |
|---|---|---|---|
| tiny | 16 | 1.43M | 203 KB |
| small | 64 | 1.45M | 203 KB |
| base | 192 | 1.48M | 206 KB |
| pro | 576 | 1.56M | 202 KB |
All tiers share the same trunk; only the final linear layers differ.
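As a rough check on how the tiers differ, the sketch below counts the parameters of the 7 classification heads for each tier dimension. The class counts come from the Performance table; treating each head as a single linear layer with a bias term is an assumption, and `head_params` is a hypothetical helper.

```python
# Class counts per head, as listed in the Performance table.
HEAD_CLASSES = {
    "coarse": 12,
    "modality": 8,
    "subtype": 24,
    "code_lang": 62,
    "text_lang": 30,
    "file_mime": 90,
    "risk": 6,
}

# Tier dimensions from the Model Tiers table.
TIER_DIMS = {"tiny": 16, "small": 64, "base": 192, "pro": 576}


def head_params(dim: int) -> int:
    """Weights (dim * classes) plus biases (classes), summed over all heads.

    Assumes one linear layer with bias per head.
    """
    return sum(dim * n + n for n in HEAD_CLASSES.values())


for tier, dim in TIER_DIMS.items():
    print(f"{tier:>5}: {head_params(dim):,} head parameters")
```

Under these assumptions the heads account for 3,944 parameters on tiny and 133,864 on pro, a gap of about 0.13M that is in line with the 1.43M and 1.56M totals listed above.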
🧪 Classification Heads
| Head | Classes | Examples |
|---|---|---|
| coarse | 12 | text, code, link, image, file, config, markup, data, error, secret, archive, binary |
| modality | 8 | textual, binary_image, binary_archive, binary_executable, etc. |
| subtype | 24 | json, yaml, toml, csv, html, markdown, sql, log, dockerfile, etc. |
| code_lang | 62 | python, javascript, typescript, java, c, cpp, go, rust, etc. |
| text_lang | 30 | en, es, fr, de, it, pt, ru, zh, ja, ko, ar, hi, etc. |
| file_mime | 90 | text/html, application/json, application/pdf, image/png, video/mp4, etc. |
| risk | 6 | api_key, jwt, password, email, phone, ssh_key |
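The table suggests two kinds of output: six heads that pick one class each, and a risk head whose flags can co-occur (it is the only head scored with mAP). A minimal sketch of decoding both kinds from raw logits follows. The function names, the 0.5 threshold, and the softmax/sigmoid split are assumptions for illustration; the source does not describe how picotype post-processes its outputs.

```python
import math

RISK_LABELS = ["api_key", "jwt", "password", "email", "phone", "ssh_key"]


def softmax(logits: list[float]) -> list[float]:
    """Numerically stable softmax over one head's logits."""
    top = max(logits)
    exps = [math.exp(x - top) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def decode_single(logits: list[float], labels: list[str]) -> tuple[str, float]:
    """Single-label head: return the most probable class and its probability."""
    probs = softmax(logits)
    best = max(range(len(probs)), key=probs.__getitem__)
    return labels[best], probs[best]


def decode_risk(logits: list[float], threshold: float = 0.5) -> list[str]:
    """Multi-label head: return every flag whose sigmoid score passes the threshold."""
    return [
        label
        for label, x in zip(RISK_LABELS, logits)
        if 1.0 / (1.0 + math.exp(-x)) >= threshold
    ]


# Made-up logits, for illustration only.
print(decode_single([0.1, 2.0, -1.0], ["text", "code", "link"]))
print(decode_risk([3.0, -2.0, -4.0, 1.0, -3.0, -5.0]))
```

Decoding the risk head independently per flag lets a single input be reported as containing, for example, both an API key and an email address.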
🌐 Deployment
| Platform | Location |
|---|---|
| HuggingFace Space | eulogik/pico-type |
| HuggingFace Model | eulogik/pico-type |
| GitHub | eulogik/pico-type |
| PyPI | pip install pico-type |
📚 Documentation
- Model Card — detailed architecture, training, and evaluation
- Architecture Plan — full design document
- Walkthrough — development log
📄 License
Apache 2.0
File details
Details for the file pico_type-0.1.3.tar.gz.
File metadata
- Download URL: pico_type-0.1.3.tar.gz
- Upload date:
- Size: 45.7 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.11.14
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 432a00b8ec49ea898f7b94913e34415788bb3e80d4056709c00cbba7090033b4 |
| MD5 | c5ca0f58fe1ef23c637be3c754f5f82f |
| BLAKE2b-256 | 9059f2df226d5c6fbdca1060ab58ebf167e9e5797f0db3a79f539eacd38db9d1 |
File details
Details for the file pico_type-0.1.3-py3-none-any.whl.
File metadata
- Download URL: pico_type-0.1.3-py3-none-any.whl
- Upload date:
- Size: 48.2 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.11.14
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 186aae931ab63b8f68539c81b60d1a149690d02fecfa6e306f38b063af0624a1 |
| MD5 | 537ede2959a7fc0c8953479bda83df12 |
| BLAKE2b-256 | 3091857b7cc4e7d16de6807c0c623dfff22dd038caf683a1d519b9e95b000d7e |
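A downloaded file can be checked against the published SHA256 digest with the standard library. The sketch below assumes the wheel was saved in the current directory under its original file name; `sha256_of` and `matches` are hypothetical helper names.

```python
import hashlib
from pathlib import Path


def sha256_of(path, chunk_size: int = 1 << 16) -> str:
    """Hash a file in chunks so large downloads need not fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches(path, expected_hex: str) -> bool:
    """Compare a file's SHA256 with a published hex digest, ignoring case and whitespace."""
    return sha256_of(path) == expected_hex.strip().lower()


# SHA256 listed for the wheel in the File hashes table.
EXPECTED = "186aae931ab63b8f68539c81b60d1a149690d02fecfa6e306f38b063af0624a1"
wheel = Path("pico_type-0.1.3-py3-none-any.whl")  # assumed local path

if wheel.exists():
    print("digest matches:", matches(wheel, EXPECTED))
else:
    print("wheel not found; download it first")
```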