pico-type 🔍

A tiny byte-level multi-head content classifier — ~1.5M params, ~200KB ONNX, <12ms inference.

Labels any content along 7 axes from raw bytes, one per head: coarse type, modality, subtype, code language, text language, file MIME, and risk flags.


✨ Features

  • No tokenizer — operates directly on raw UTF-8 bytes (supports all languages)
  • 7 heads, one forward pass — coarse type, modality, subtype, code lang, text lang, file MIME, risk
  • 4 Matryoshka tiers — tiny (16d) → small (64d) → base (192d) → pro (576d)
  • ~200KB ONNX — deploy on edge devices, serverless, browser (WebAssembly)
  • <12ms inference on CPU via ONNX Runtime
  • CLI, Gradio Space, MCP server — ready to use
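
To make the "no tokenizer" point concrete, here is a minimal sketch of how text can be turned into model input when the vocabulary is just the 256 possible byte values. The function name `to_byte_ids` and the `max_len` cutoff are assumptions for illustration; the source does not state the model's input length or its preprocessing code.

```python
def to_byte_ids(text: str, max_len: int = 512) -> list[int]:
    """Encode text as UTF-8 and return at most max_len byte values (0-255).

    max_len is a hypothetical cutoff, not a documented pico-type setting.
    """
    raw = text.encode("utf-8")
    # Slicing bytes may cut a multi-byte character in half; a byte-level
    # model only ever sees individual bytes, so that is acceptable here.
    return list(raw[:max_len])


ascii_ids = to_byte_ids("def")   # ASCII: one byte per character
cjk_ids = to_byte_ids("日本")     # CJK: three bytes per character in UTF-8

print(ascii_ids)     # [100, 101, 102]
print(len(cjk_ids))  # 6
```

Because every script maps into the same 0-255 range, no language-specific vocabulary is needed; non-ASCII text simply produces longer sequences.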

📊 Performance

| Head | Classes | Accuracy |
|---|---|---|
| coarse | 12 | 100% |
| modality | 8 | 100% |
| subtype | 24 | 93.8% |
| code_lang | 62 | 41.7% |
| text_lang | 30 | 94.3% |
| file_mime | 90 | 100% |
| risk (mAP) | 6 | 100% |

Evaluated on 500 samples after 1,700 training steps with the base tier; inference took ~13 ms.

🚀 Quick Start

CLI

pip install pico-type

printf 'def hello():\n    return 42\n' | picotype --pretty
picotype --file document.txt
picotype --clip

Python

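from model.pico_type.cli import load_onnx_model, run_onnx

# Load the ONNX session for the "base" tier from the checkpoints directory
session = load_onnx_model("base", "checkpoints")
# Classify a snippet of text in a single forward pass
result = run_onnx(session, "def hello(): pass")
print(result)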

MCP Server (Claude/Cursor)

PICOTYPE_MODEL_DIR=./checkpoints python -m model.pico_type.mcp_server

🏗 Architecture

Bytes → ByteEmbed(256→96d) → 3×Conv1D(k=3,5,7) → 2×BiAttention(RoPE) → Pool(mean‖max‖std) → 7×Matryoshka Heads
  • ByteEmbed: lookup-free byte embedding (256 vocab, 96 dim)
  • Conv1D: 3 parallel kernels (width 3, 5, 7) with residual + layer norm
  • BiAttention: bidirectional self-attention with RoPE, 4 heads, 96 dim
  • Pool: mean + max + std concatenation
  • Matryoshka Heads: 4 slices of the pooled vector (16/64/192/576 dim) → 7 linear classifiers
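
The pooling step above can be sketched in plain Python. This is an illustration of mean‖max‖std concatenation on a toy input, not the project's implementation: the function name `pool_mean_max_std` is hypothetical, and the choice of population standard deviation is an assumption, since the source does not say which estimator is used.

```python
import math


def pool_mean_max_std(features: list[list[float]]) -> list[float]:
    """Pool a (seq_len x dim) feature matrix into one vector of length 3 * dim."""
    seq_len = len(features)
    columns = list(zip(*features))  # one tuple per feature dimension
    means = [sum(col) / seq_len for col in columns]
    maxes = [max(col) for col in columns]
    # Population standard deviation (assumption; the real model may differ).
    stds = [
        math.sqrt(sum((x - m) ** 2 for x in col) / seq_len)
        for col, m in zip(columns, means)
    ]
    # Concatenate the three statistics: mean ‖ max ‖ std
    return means + maxes + stds


# Toy sequence of 3 positions with 2 features each
toy = [[1.0, 0.0], [3.0, 0.0], [5.0, 6.0]]
pooled = pool_mean_max_std(toy)
print(len(pooled))  # 6
```

Pooling over the sequence axis turns a variable-length input into a fixed-size vector, which is what lets the linear heads that follow accept content of any length.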

Total parameters: 1.43M (tiny) / 1.45M (small) / 1.48M (base) / 1.56M (pro)

🔧 Model Tiers

| Tier | Dim | Params | ONNX Size |
|---|---|---|---|
| tiny | 16 | 1.43M | 203 KB |
| small | 64 | 1.45M | 203 KB |
| base | 192 | 1.48M | 206 KB |
| pro | 576 | 1.56M | 202 KB |

All tiers share the same trunk; only the final linear layers differ.
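
As a rough sanity check on the parameter counts, the arithmetic below estimates how many parameters the heads alone contribute per tier. It assumes each head is a single linear layer with a bias acting on the first `dim` values of the pooled vector; the source says only that the final linear layers differ, so the exact head structure is an assumption. Class counts are taken from the head tables in this document.

```python
# Classes per head, as listed in this README
HEAD_CLASSES = {
    "coarse": 12, "modality": 8, "subtype": 24, "code_lang": 62,
    "text_lang": 30, "file_mime": 90, "risk": 6,
}
TIER_DIMS = {"tiny": 16, "small": 64, "base": 192, "pro": 576}


def head_params(dim: int) -> int:
    """Weights (dim x classes) plus biases (classes), summed over all 7 heads."""
    total_classes = sum(HEAD_CLASSES.values())  # 232
    return dim * total_classes + total_classes


for tier, dim in TIER_DIMS.items():
    extra = head_params(dim) - head_params(TIER_DIMS["tiny"])
    print(f"{tier}: {head_params(dim):,} head params (+{extra:,} vs tiny)")
```

Under that assumption the pro tier carries 129,920 more head parameters than tiny, about 0.13M, which lines up with the 1.43M to 1.56M spread in the table and is consistent with a shared trunk.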

🧪 Classification Heads

| Head | Classes | Examples |
|---|---|---|
| coarse | 12 | text, code, link, image, file, config, markup, data, error, secret, archive, binary |
| modality | 8 | textual, binary_image, binary_archive, binary_executable, etc. |
| subtype | 24 | json, yaml, toml, csv, html, markdown, sql, log, dockerfile, etc. |
| code_lang | 62 | python, javascript, typescript, java, c, cpp, go, rust, etc. |
| text_lang | 30 | en, es, fr, de, it, pt, ru, zh, ja, ko, ar, hi, etc. |
| file_mime | 90 | text/html, application/json, application/pdf, image/png, video/mp4, etc. |
| risk | 6 | api_key, jwt, password, email, phone, ssh_key |
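
The sketch below shows one plausible way to turn head outputs into labels. It assumes the single-label heads are decoded with an argmax and that the risk head is multi-label with independent sigmoid scores (suggested by its mAP metric, but not confirmed by the source). The helper names, the label ordering, the 0.5 threshold, and the example logits are all hypothetical.

```python
import math

# Label names come from this README; their index order is an assumption.
COARSE_LABELS = ["text", "code", "link", "image", "file", "config",
                 "markup", "data", "error", "secret", "archive", "binary"]
RISK_LABELS = ["api_key", "jwt", "password", "email", "phone", "ssh_key"]


def sigmoid(x: float) -> float:
    """Numerically stable logistic function."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)


def argmax_label(logits: list[float], labels: list[str]) -> str:
    """Single-label head: pick the class with the highest logit."""
    best = max(range(len(logits)), key=lambda i: logits[i])
    return labels[best]


def risk_flags(logits: list[float], labels: list[str],
               threshold: float = 0.5) -> list[str]:
    """Multi-label head: keep every flag whose sigmoid score passes the threshold."""
    return [lab for lg, lab in zip(logits, labels) if sigmoid(lg) >= threshold]


# Made-up logits for illustration only
coarse_logits = [0.1, 2.3, -1.0, -0.5, 0.0, 0.4, -2.0, 0.2, -1.5, 0.9, -3.0, -0.7]
risk_logits = [3.0, -2.0, -4.0, 0.2, -3.0, -5.0]

print(argmax_label(coarse_logits, COARSE_LABELS))  # code
print(risk_flags(risk_logits, RISK_LABELS))        # ['api_key', 'email']
```

Treating risk as multi-label lets a single input raise several flags at once, for example a config file that contains both an API key and an email address.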

🌐 Deployment

| Platform | Location |
|---|---|
| HuggingFace Space | eulogik/pico-type |
| HuggingFace Model | eulogik/pico-type |
| GitHub | eulogik/pico-type |
| PyPI | pip install pico-type |

📄 License

Apache 2.0
