pico-type 🔍

A tiny byte-level multi-head content classifier — ~1.5M params, ~200KB ONNX, <12ms inference.

Classifies any content from raw bytes with 7 output heads: coarse type, modality, subtype, code language, text language, file MIME type, and risk flags.


✨ Features

  • No tokenizer — operates directly on raw UTF-8 bytes (supports all languages)
  • 7 heads, one forward pass — coarse type, modality, subtype, code lang, text lang, file MIME, risk
  • 4 Matryoshka tiers — tiny (16d) → small (64d) → base (192d) → pro (576d)
  • ~200KB ONNX — deploy on edge devices, serverless, browser (WebAssembly)
  • <12ms inference on CPU via ONNX Runtime
  • CLI, Gradio Space, MCP server — ready to use
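As a rough illustration of what "no tokenizer" means, the sketch below turns text into the byte values (0 to 255) that a byte-level model would consume. The helper name `to_byte_ids` and the `max_len` cutoff are hypothetical and not part of the package.

```python
def to_byte_ids(text: str, max_len: int = 512) -> list[int]:
    """Encode text as UTF-8 and return its byte values (0-255).

    max_len is an assumed cutoff; truncation happens at the byte level,
    so it may split a multi-byte character.
    """
    return list(text.encode("utf-8"))[:max_len]


ascii_ids = to_byte_ids("hi")
accented_ids = to_byte_ids("é")
cjk_ids = to_byte_ids("日本")

print(ascii_ids)     # [104, 105]
print(accented_ids)  # [195, 169]
print(len(cjk_ids))  # 6
```

Every script maps into the same 256-value vocabulary, so there is no language-specific vocabulary file to ship.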

📊 Performance

Head        Classes  Accuracy
coarse      12       100%
modality    8        100%
subtype     24       98.4%
code_lang   62       54.2%
text_lang   30       88.6%
file_mime   90       100%
risk (mAP)  6        99.6%

Results are for the base tier after 800 training steps, evaluated on 500 samples, with 8 ms inference.
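The risk head is reported as mAP rather than accuracy. The README does not say how mAP was computed; the sketch below shows the common non-interpolated definition (the mean, over classes, of the average precision at each positive hit), with hypothetical function names and toy data.

```python
def average_precision(scores, labels):
    """Average precision for one class: mean precision at each positive, ranked by score."""
    ranked = sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)
    hits = 0
    precisions = []
    for rank, (_, label) in enumerate(ranked, start=1):
        if label:
            hits += 1
            precisions.append(hits / rank)
    # Convention chosen here: a class with no positives scores 0.0.
    return sum(precisions) / len(precisions) if precisions else 0.0


def mean_average_precision(scores_per_class, labels_per_class):
    """Mean of the per-class average precisions."""
    aps = [average_precision(s, l) for s, l in zip(scores_per_class, labels_per_class)]
    return sum(aps) / len(aps)


# Toy data: two classes, not taken from the project's evaluation set.
toy_scores = [[0.9, 0.8, 0.3, 0.1], [0.9, 0.2]]
toy_labels = [[1, 0, 1, 0], [1, 0]]
print(round(mean_average_precision(toy_scores, toy_labels), 4))  # 0.9167
```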

🚀 Quick Start

CLI

pip install pico-type

printf 'def hello():\n    return 42\n' | picotype --pretty
picotype --file document.txt
picotype --clip

Python

from model.pico_type.cli import load_onnx_model, run_onnx

session = load_onnx_model("base", "checkpoints")
result = run_onnx(session, "def hello(): pass")
print(result)

MCP Server (Claude/Cursor)

PICOTYPE_MODEL_DIR=./checkpoints python -m model.pico_type.mcp_server
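To register the server with a desktop client, the same command and environment variable usually go into the client's JSON configuration. The sketch below generates such an entry, assuming the client uses the `mcpServers` layout that Claude Desktop reads; the server label `pico-type` is an arbitrary choice, and the right config file location depends on the client.

```python
import json

# Assumed layout: an "mcpServers" map of label -> command, args, env.
config = {
    "mcpServers": {
        "pico-type": {  # hypothetical label, any name works
            "command": "python",
            "args": ["-m", "model.pico_type.mcp_server"],
            # A relative path only resolves if the client starts the server
            # from the project directory; an absolute path is safer.
            "env": {"PICOTYPE_MODEL_DIR": "./checkpoints"},
        }
    }
}

print(json.dumps(config, indent=2))
```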

🏗 Architecture

Bytes → ByteEmbed(256→96d) → 3×Conv1D(k=3,5,7) → 2×BiAttention(RoPE) → Pool(mean‖max‖std) → 7×Matryoshka Heads
  • ByteEmbed: lookup-free byte embedding (256 vocab, 96 dim)
  • Conv1D: 3 parallel kernels (width 3, 5, 7) with residual + layer norm
  • BiAttention: bidirectional self-attention with RoPE, 4 heads, 96 dim
  • Pool: mean + max + std concatenation
  • Matryoshka Heads: 4 slices of the pooled vector (16/64/192/576 dim) → 7 linear classifiers
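The pooling step can be sketched in plain Python as follows. This is an illustration rather than the project's code: it assumes the three statistics are taken per feature dimension over the sequence, concatenated in the order mean, max, std, and that std is the population standard deviation.

```python
import math


def pool(features):
    """Collapse T feature vectors of size D into one vector of size 3*D."""
    steps = len(features)
    columns = list(zip(*features))  # one tuple per feature dimension
    means = [sum(col) / steps for col in columns]
    maxes = [max(col) for col in columns]
    stds = [
        math.sqrt(sum((x - mean) ** 2 for x in col) / steps)  # population std (assumed)
        for col, mean in zip(columns, means)
    ]
    return means + maxes + stds


pooled = pool([[1.0, 0.0], [3.0, 4.0]])
print(pooled)  # [2.0, 2.0, 3.0, 4.0, 1.0, 2.0]
```

Because the output size depends only on D, inputs of any length produce a fixed-size vector for the heads.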

Total parameters: 1.43M (tiny) / 1.45M (small) / 1.48M (base) / 1.56M (pro)
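The gap between tiers can be roughly checked from the head sizes alone. The sketch below assumes each head is a single linear layer with a bias, using the class counts from the Classification Heads table; the trunk is not modelled.

```python
HEAD_CLASSES = {
    "coarse": 12, "modality": 8, "subtype": 24, "code_lang": 62,
    "text_lang": 30, "file_mime": 90, "risk": 6,
}
TIER_DIMS = {"tiny": 16, "small": 64, "base": 192, "pro": 576}


def head_params(dim):
    """Weights (dim x classes) plus one bias per class, summed over all heads."""
    return sum(dim * classes + classes for classes in HEAD_CLASSES.values())


for tier, dim in TIER_DIMS.items():
    print(tier, head_params(dim))
# tiny 3944
# small 15080
# base 44776
# pro 133864
```

Under these assumptions pro has about 0.13M more head parameters than tiny, in line with the 1.43M and 1.56M totals.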

🔧 Model Tiers

Tier   Dim  Params  ONNX Size
tiny   16   1.43M   203 KB
small  64   1.45M   203 KB
base   192  1.48M   206 KB
pro    576  1.56M   202 KB

All tiers share the same trunk; only the final linear layers differ.
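One way to picture the tiers: each head reads only a prefix of the shared feature vector. The sketch below assumes the slices are nested prefixes, as in Matryoshka-style representations; `tier_logits` and the toy weights are hypothetical.

```python
TIER_DIMS = {"tiny": 16, "small": 64, "base": 192, "pro": 576}


def tier_logits(pooled, weights, bias, tier):
    """Apply one linear head to the first TIER_DIMS[tier] features only."""
    dim = TIER_DIMS[tier]
    if len(pooled) < dim:
        raise ValueError(f"need at least {dim} features, got {len(pooled)}")
    prefix = pooled[:dim]  # nested prefix slice (assumed)
    return [
        sum(w * x for w, x in zip(row, prefix)) + b
        for row, b in zip(weights, bias)
    ]


# Toy head with 2 classes on the tiny tier; values chosen for easy arithmetic.
pooled = [1.0] * 576
weights = [[0.5] * 16, [-0.25] * 16]
bias = [0.0, 1.0]
print(tier_logits(pooled, weights, bias, "tiny"))  # [8.0, -3.0]
```

The trunk output is computed once; choosing a tier only changes how many leading features the final layer reads.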

🧪 Classification Heads

Head       Classes  Examples
coarse     12       text, code, link, image, file, config, markup, data, error, secret, archive, binary
modality   8        textual, binary_image, binary_archive, binary_executable, etc.
subtype    24       json, yaml, toml, csv, html, markdown, sql, log, dockerfile, etc.
code_lang  62       python, javascript, typescript, java, c, cpp, go, rust, etc.
text_lang  30       en, es, fr, de, it, pt, ru, zh, ja, ko, ar, hi, etc.
file_mime  90       text/html, application/json, application/pdf, image/png, video/mp4, etc.
risk       6        api_key, jwt, password, email, phone, ssh_key
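Head outputs are typically decoded in one of two ways. The sketch below assumes the six classification heads are single-label (softmax, pick the top class) and that the risk head is multi-label (independent sigmoids with a threshold), which the mAP metric suggests but the README does not state. The function names and the 0.5 threshold are hypothetical.

```python
import math


def softmax(logits):
    peak = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def sigmoid(x):
    # Branching on the sign avoids overflow for large negative inputs.
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    return math.exp(x) / (1.0 + math.exp(x))


def decode_single_label(logits, names):
    """Return the most probable class and its probability."""
    probs = softmax(logits)
    best = max(range(len(probs)), key=probs.__getitem__)
    return names[best], probs[best]


def decode_multi_label(logits, names, threshold=0.5):
    """Return every class whose sigmoid probability reaches the threshold."""
    return [name for x, name in zip(logits, names) if sigmoid(x) >= threshold]


RISK = ["api_key", "jwt", "password", "email", "phone", "ssh_key"]
label, prob = decode_single_label([2.0, 0.1, -1.0], ["text", "code", "link"])
flags = decode_multi_label([3.0, -2.0, -4.0, 1.5, -3.0, -5.0], RISK)
print(label)  # text
print(flags)  # ['api_key', 'email']
```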

🌐 Deployment

Platform           Location
HuggingFace Space  eulogik/pico-type
HuggingFace Model  eulogik/pico-type
GitHub             eulogik/pico-type
PyPI               pip install pico-type

📄 License

Apache 2.0

Source Distribution

pico_type-0.1.1.tar.gz (40.3 kB)

Uploaded Source

Built Distribution

pico_type-0.1.1-py3-none-any.whl (41.3 kB)

Uploaded Python 3

File details

Details for the file pico_type-0.1.1.tar.gz.

File metadata

  • Download URL: pico_type-0.1.1.tar.gz
  • Upload date:
  • Size: 40.3 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.2.0 CPython/3.11.15

File hashes

Hashes for pico_type-0.1.1.tar.gz
Algorithm Hash digest
SHA256 e1bda720ca3c5b22ab7c5f659474d482bbe4de7b329d38806e8bb06d046b79a5
MD5 2fc02d5179bb1fe7d9e28f083ba75386
BLAKE2b-256 44fd27de24c17d3abbe84bac36afebea57fe1098693e0fe4a89616919a4dfa59

File details

Details for the file pico_type-0.1.1-py3-none-any.whl.

File metadata

  • Download URL: pico_type-0.1.1-py3-none-any.whl
  • Upload date:
  • Size: 41.3 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.2.0 CPython/3.11.15

File hashes

Hashes for pico_type-0.1.1-py3-none-any.whl
Algorithm Hash digest
SHA256 123e5ec51fd81645652f164d106b66444b9a91bed1563453ea91bc4a1ab7e2f7
MD5 953bc5d1f9c44911458e16f386ecb823
BLAKE2b-256 1d91ef925e8dbe18f2a24e62512bc64213a63643195bbdb1abc23a3417f413ce
