A tiny byte-level multi-head content classifier

pico-type 🔍

A tiny byte-level multi-head content classifier — ~1.5M params, ~200KB ONNX, <12ms inference.

Classifies any content along 7 dimensions from raw bytes: coarse type, modality, subtype, code language, text language, file MIME type, and risk flags.


✨ Features

  • No tokenizer — operates directly on raw UTF-8 bytes (supports all languages)
  • 7 heads, one forward pass — coarse type, modality, subtype, code lang, text lang, file MIME, risk
  • 4 Matryoshka tiers — tiny (16d) → small (64d) → base (192d) → pro (576d)
  • ~200KB ONNX — deploy on edge devices, serverless, browser (WebAssembly)
  • <12ms inference on CPU via ONNX Runtime
  • CLI, Gradio Space, MCP server — ready to use
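To illustrate the "no tokenizer" point above, here is a minimal sketch of turning arbitrary text into byte IDs in the range 0-255. The helper name `text_to_byte_ids` and the `max_len` cap are assumptions made for illustration, not part of the package's API.

```python
def text_to_byte_ids(text: str, max_len: int = 512) -> list[int]:
    """Encode text as UTF-8 and return its byte values (0-255).

    max_len is a hypothetical truncation limit; the real model's
    input length is not stated in this document.
    """
    return list(text.encode("utf-8")[:max_len])


# ASCII source code maps to one byte per character.
print(text_to_byte_ids("def "))

# Non-Latin scripts need no special vocabulary, only more bytes per character.
print(len(text_to_byte_ids("こんにちは")))
```

Because every possible input is a sequence over the same 256 values, the vocabulary never has to grow to cover a new language or file format.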

📊 Performance

| Head | Classes | Accuracy |
|---|---|---|
| coarse | 12 | 100% |
| modality | 8 | 100% |
| subtype | 24 | 93.8% |
| code_lang | 62 | 41.7% |
| text_lang | 30 | 94.3% |
| file_mime | 90 | 100% |
| risk (mAP) | 6 | 100% |

Measured on 500 evaluation samples after 1,700 training steps, using the base tier (~13 ms inference).
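Latency figures like the one above depend on hardware and input length, so it can help to measure them locally. The following is a rough sketch of a timing harness; `median_latency_ms` and `dummy_classifier` are hypothetical names, and the dummy function only stands in for a real call into the classifier.

```python
import statistics
import time


def median_latency_ms(fn, payload, warmup: int = 5, runs: int = 50) -> float:
    """Return the median wall-clock latency of fn(payload) in milliseconds."""
    # Warm-up calls are excluded so one-time setup costs do not skew the result.
    for _ in range(warmup):
        fn(payload)
    timings = []
    for _ in range(runs):
        start = time.perf_counter()
        fn(payload)
        timings.append((time.perf_counter() - start) * 1000.0)
    return statistics.median(timings)


def dummy_classifier(data: bytes) -> int:
    """Stand-in workload (assumption); swap in the real inference call."""
    return sum(data) % 7


latency = median_latency_ms(dummy_classifier, b"def hello(): pass")
print(f"median latency: {latency:.4f} ms")
```

The median is used rather than the mean so that occasional scheduler hiccups do not dominate the reported number.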

🚀 Quick Start

CLI

pip install pico-type

printf 'def hello():\n    return 42\n' | picotype --pretty
picotype --file document.txt
picotype --clip

Python

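from model.pico_type.cli import load_onnx_model, run_onnx

# Load the ONNX session for the "base" tier from the local "checkpoints" directory
session = load_onnx_model("base", "checkpoints")
# Run one forward pass on the input string
result = run_onnx(session, "def hello(): pass")
print(result)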

MCP Server (Claude/Cursor)

PICOTYPE_MODEL_DIR=./checkpoints python -m model.pico_type.mcp_server

🏗 Architecture

Bytes → ByteEmbed(256→96d) → 3×Conv1D(k=3,5,7) → 2×BiAttention(RoPE) → Pool(mean‖max‖std) → 7×Matryoshka Heads
  • ByteEmbed: lookup-free byte embedding (256 vocab, 96 dim)
  • Conv1D: 3 parallel kernels (width 3, 5, 7) with residual + layer norm
  • BiAttention: bidirectional self-attention with RoPE, 4 heads, 96 dim
  • Pool: mean + max + std concatenation
  • Matryoshka Heads: 4 slices of the pooled vector (16/64/192/576 dim) → 7 linear classifiers
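The pooling and Matryoshka steps listed above can be sketched in NumPy as follows. This is an illustration under stated assumptions, not the project's implementation: the channel width of 192 is chosen here only so that the pooled vector has 576 entries (the largest slice), the weights are random, and prefix slicing is assumed as the way a tier selects its part of the pooled vector.

```python
import numpy as np

# Slice widths per tier, taken from the tier list in this document.
TIER_DIMS = {"tiny": 16, "small": 64, "base": 192, "pro": 576}


def pool_mean_max_std(features: np.ndarray) -> np.ndarray:
    """Collapse (seq_len, channels) features into one (3 * channels,) vector."""
    return np.concatenate(
        [features.mean(axis=0), features.max(axis=0), features.std(axis=0)]
    )


def matryoshka_slice(pooled: np.ndarray, tier: str) -> np.ndarray:
    """Keep only the leading dimensions that the given tier uses (assumed prefix slice)."""
    return pooled[: TIER_DIMS[tier]]


rng = np.random.default_rng(0)
features = rng.standard_normal((128, 192))  # hypothetical trunk output
pooled = pool_mean_max_std(features)

# Hypothetical weights for the 12-class coarse head at the base tier.
coarse_weights = rng.standard_normal((TIER_DIMS["base"], 12)) * 0.01
coarse_logits = matryoshka_slice(pooled, "base") @ coarse_weights

print(pooled.shape, coarse_logits.shape)
```

Since each tier reads a prefix of the same pooled vector, the trunk is computed once and only the small final matrix multiply changes between tiers.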

Total parameters: 1.43M (tiny) / 1.45M (small) / 1.48M (base) / 1.56M (pro)

🔧 Model Tiers

| Tier | Dim | Params | ONNX Size |
|---|---|---|---|
| tiny | 16 | 1.43M | 203 KB |
| small | 64 | 1.45M | 203 KB |
| base | 192 | 1.48M | 206 KB |
| pro | 576 | 1.56M | 202 KB |

All tiers share the same trunk; only the final linear layers differ.
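A quick back-of-the-envelope check shows why the parameter counts differ so little between tiers. The sketch below assumes each head is a single linear layer with a bias (an assumption; the document only says "linear classifiers") and uses the class counts listed under Classification Heads.

```python
# Class counts per head, as listed in this document.
HEAD_CLASSES = {
    "coarse": 12, "modality": 8, "subtype": 24, "code_lang": 62,
    "text_lang": 30, "file_mime": 90, "risk": 6,
}
TIER_DIMS = {"tiny": 16, "small": 64, "base": 192, "pro": 576}


def head_params(dim: int) -> int:
    """Weights plus biases for all 7 heads, assuming one linear layer per head."""
    total_classes = sum(HEAD_CLASSES.values())
    return dim * total_classes + total_classes


for tier, dim in TIER_DIMS.items():
    print(f"{tier:>5}: {head_params(dim):,} head parameters")
```

Under this assumption the heads account for about 4K parameters at the tiny tier and about 134K at the pro tier. The difference of roughly 0.13M matches the gap between 1.43M and 1.56M in the tier list.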

🧪 Classification Heads

| Head | Classes | Examples |
|---|---|---|
| coarse | 12 | text, code, link, image, file, config, markup, data, error, secret, archive, binary |
| modality | 8 | textual, binary_image, binary_archive, binary_executable, etc. |
| subtype | 24 | json, yaml, toml, csv, html, markdown, sql, log, dockerfile, etc. |
| code_lang | 62 | python, javascript, typescript, java, c, cpp, go, rust, etc. |
| text_lang | 30 | en, es, fr, de, it, pt, ru, zh, ja, ko, ar, hi, etc. |
| file_mime | 90 | text/html, application/json, application/pdf, image/png, video/mp4, etc. |
| risk | 6 | api_key, jwt, password, email, phone, ssh_key |
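Six of the heads pick one class each, while the risk head is scored with mAP, which suggests it is multi-label (a snippet can contain both an email and an API key). The sketch below shows one common way to decode the two kinds of output. The softmax/sigmoid split, the 0.5 threshold, and the label order are assumptions, not confirmed details of the model.

```python
import math

# Label order is assumed to follow the list in this document.
RISK_LABELS = ["api_key", "jwt", "password", "email", "phone", "ssh_key"]


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def decode_single_label(logits, labels):
    """Return the most probable label and its probability."""
    probs = softmax(logits)
    best = max(range(len(probs)), key=probs.__getitem__)
    return labels[best], probs[best]


def decode_multi_label(logits, labels, threshold=0.5):
    """Return every label whose sigmoid score reaches the threshold."""
    return [
        label
        for label, x in zip(labels, logits)
        if 1.0 / (1.0 + math.exp(-x)) >= threshold
    ]


# Made-up logits for illustration only.
label, prob = decode_single_label([0.2, 2.5, -1.0], ["text", "code", "link"])
flags = decode_multi_label([3.0, -2.0, -4.0, 1.2, -3.0, -5.0], RISK_LABELS)
print(label, flags)
```

In practice the risk threshold would be tuned per flag, since missing a leaked secret usually costs more than a false alarm.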

🌐 Deployment

| Platform | Location |
|---|---|
| HuggingFace Space | eulogik/pico-type |
| HuggingFace Model | eulogik/pico-type |
| GitHub | eulogik/pico-type |
| PyPI | pip install pico-type |

📄 License

Apache 2.0
