A tiny byte-level multi-head content classifier


license: apache-2.0
language:
  • en
  • multilingual
tags:
  • byte-level
  • content-classification
  • onnx
  • edge-ai
  • matryoshka
pipeline_tag: text-classification
library_name: generic
inference:
  parameters:
    provider: CPUExecutionProvider

pico-type 🔍

A tiny byte-level multi-head content classifier — ~1.5M params, ~209KB ONNX, <6ms inference.

Classifies any content from raw bytes along 7 dimensions, one per head: coarse type, modality, subtype, code language, text language, file MIME, and risk flags.


✨ Features

  • No tokenizer — operates directly on raw UTF-8 bytes (supports all languages)
  • 7 heads, one forward pass — coarse type, modality, subtype, code lang, text lang, file MIME, risk
  • 4 Matryoshka tiers — tiny (16d) → small (64d) → base (192d) → pro (576d)
  • ~200KB ONNX — deploy on edge devices, serverless, browser (WebAssembly)
  • <6ms inference on CPU via ONNX Runtime (~5.6ms measured on the base tier)
  • CLI, Gradio Space, MCP server — ready to use
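
Because the model reads raw UTF-8 bytes, input preparation needs no vocabulary. The following is a minimal sketch of that idea, not the package's actual preprocessing: the function name `to_byte_ids`, the default `max_len` of 512, and right-padding with zeros are all assumptions made for illustration.

```python
def to_byte_ids(content, max_len=512, pad_id=0):
    """Turn a str or bytes object into a fixed-length list of byte values (0-255).

    max_len and pad_id are illustrative assumptions, not documented pico-type settings.
    """
    # Text is encoded as UTF-8; binary content is used as-is.
    raw = content.encode("utf-8") if isinstance(content, str) else bytes(content)
    ids = list(raw[:max_len])                 # truncate long inputs
    ids += [pad_id] * (max_len - len(ids))    # right-pad short inputs
    return ids


# "é" takes two bytes in UTF-8 (195, 169), so "héllo" becomes six byte IDs.
print(to_byte_ids("héllo", max_len=8))  # [104, 195, 169, 108, 108, 111, 0, 0]
```

Any script or binary format maps to the same 256 possible values, which is why no language-specific tokenizer is needed. A real implementation would likely carry a separate padding mask, since a pad value of 0 is indistinguishable from a genuine NUL byte.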

📊 Performance

| Head | Classes | Accuracy |
|---|---|---|
| coarse | 12 | 100% |
| modality | 8 | 100% |
| subtype | 24 | 98.4% |
| code_lang | 62 | 53.9% |
| text_lang | 30 | 100% |
| file_mime | 90 | 100% |
| risk (mAP) | 6 | 100% |

Evaluated on 1,000 samples with the base tier after 9,000 training steps (5,000 synthetic, then 4,000 of real-code fine-tuning); inference takes ~5.6ms.
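
The risk head is reported as mAP (mean average precision) rather than accuracy, presumably because several risk flags can apply to the same input. The sketch below shows how that metric is commonly computed; it is not the project's evaluation script, and the function names are hypothetical.

```python
def average_precision(scores, labels):
    """Average precision for one class: mean of precision@k at each true positive."""
    # Rank samples from highest to lowest score.
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    hits = 0
    precision_sum = 0.0
    for rank, i in enumerate(order, start=1):
        if labels[i]:
            hits += 1
            precision_sum += hits / rank
    return precision_sum / hits if hits else 0.0


def mean_average_precision(score_rows, label_rows):
    """Mean of per-class AP; rows are samples, columns are classes."""
    n_classes = len(score_rows[0])
    aps = []
    for c in range(n_classes):
        labels = [row[c] for row in label_rows]
        if any(labels):  # skip classes with no positive sample
            aps.append(average_precision([row[c] for row in score_rows], labels))
    return sum(aps) / len(aps) if aps else 0.0


scores = [[0.9, 0.2], [0.8, 0.7], [0.1, 0.6]]
labels = [[1, 0], [0, 1], [1, 1]]
print(round(mean_average_precision(scores, labels), 4))  # 0.9167
```

A value of 100% means every positive example of every risk class was ranked above all negatives for that class.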

🚀 Quick Start

CLI

pip install picotype

printf 'def hello():\n    return 42\n' | picotype --pretty
picotype --file document.txt
picotype --clip

Python

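from model.pico_type.cli import load_onnx_model, run_onnx

# Load the "base" tier ONNX model from the local "checkpoints" directory
session = load_onnx_model("base", "checkpoints")
# Classify a snippet in a single forward pass
result = run_onnx(session, "def hello(): pass")
print(result)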

MCP Server (Claude/Cursor)

PICOTYPE_MODEL_DIR=./checkpoints python -m model.pico_type.mcp_server

🏗 Architecture

Bytes → ByteEmbed(256→96d) → 3×Conv1D(k=3,5,7) → 2×BiAttention(RoPE) → Pool(mean‖max‖std) → 7×Matryoshka Heads
  • ByteEmbed: lookup-free byte embedding (256 vocab, 96 dim)
  • Conv1D: 3 parallel kernels (width 3, 5, 7) with residual + layer norm
  • BiAttention: bidirectional self-attention with RoPE, 4 heads, 96 dim
  • Pool: mean + max + std concatenation
  • Matryoshka Heads: 4 slices of the pooled vector (16/64/192/576 dim) → 7 linear classifiers
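
The pooling step collapses a variable-length sequence of feature vectors into one fixed-size vector by concatenating per-dimension statistics. A minimal pure-Python sketch follows; the function name is hypothetical, and the use of the population (rather than sample) standard deviation is an assumption.

```python
import math


def pool_mean_max_std(seq):
    """Pool T feature vectors of width d into one vector of width 3*d.

    Layout of the result: [mean..., max..., std...].
    """
    t = len(seq)
    d = len(seq[0])
    mean = [sum(v[j] for v in seq) / t for j in range(d)]
    peak = [max(v[j] for v in seq) for j in range(d)]
    # Population standard deviation per feature dimension (assumed variant).
    std = [math.sqrt(sum((v[j] - mean[j]) ** 2 for v in seq) / t) for j in range(d)]
    return mean + peak + std


# Two time steps, two features each.
print(pool_mean_max_std([[1.0, 2.0], [3.0, 6.0]]))  # [2.0, 4.0, 3.0, 6.0, 1.0, 2.0]
```

Mean captures the overall character of the input, max keeps strong local signals such as a single telltale token, and std indicates how uniform the content is.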

Total parameters: 1.43M (tiny) / 1.45M (small) / 1.48M (base) / 1.56M (pro)
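
The spread between tiers is roughly what seven linear heads of different input widths would cost. The following back-of-the-envelope check assumes each head is a single dense layer with a bias, sized by the class counts listed for each head; the real head layout may differ, and `head_params` is a hypothetical helper.

```python
# Class counts per head, as listed in this document.
CLASSES = {
    "coarse": 12, "modality": 8, "subtype": 24, "code_lang": 62,
    "text_lang": 30, "file_mime": 90, "risk": 6,
}


def head_params(dim, classes=CLASSES):
    """Weights plus biases for one dense layer per head (assumed layout)."""
    total_classes = sum(classes.values())  # 232 outputs across all heads
    return dim * total_classes + total_classes


for tier, dim in [("tiny", 16), ("small", 64), ("base", 192), ("pro", 576)]:
    print(tier, head_params(dim))
# tiny 3944
# small 15080
# base 44776
# pro 133864
```

Under this assumption the pro heads cost about 0.13M more parameters than the tiny heads, which is in line with the 1.43M to 1.56M range quoted above.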

🔧 Model Tiers

| Tier | Dim | Params | ONNX Size |
|---|---|---|---|
| tiny | 16 | 1.43M | 207 KB |
| small | 64 | 1.45M | 207 KB |
| base | 192 | 1.48M | 209 KB |
| pro | 576 | 1.56M | 206 KB |

All tiers share the same trunk; only the final linear layers differ.
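
A shared trunk with tier-specific heads can be sketched as follows. This assumes each tier reads the leading `dim` entries of the pooled feature vector, which is the usual Matryoshka convention but is not confirmed by the source; `TIER_DIMS`, `linear`, and `head_logits` are hypothetical names.

```python
TIER_DIMS = {"tiny": 16, "small": 64, "base": 192, "pro": 576}


def linear(x, weight, bias):
    """Dense layer: one output per weight row."""
    return [sum(w * xi for w, xi in zip(row, x)) + b for row, b in zip(weight, bias)]


def head_logits(features, tier, weight, bias):
    """Apply a tier-specific classifier to a prefix of the shared features."""
    dim = TIER_DIMS[tier]
    if len(features) < dim:
        raise ValueError(f"need at least {dim} features, got {len(features)}")
    prefix = features[:dim]  # assumed: each tier uses the leading slice
    return linear(prefix, weight, bias)


# Toy example: a 576-wide feature vector and a 2-class "tiny" head.
features = [float(i) for i in range(576)]
weight = [[1.0] * 16, [0.0] * 16]
bias = [0.5, -1.0]
print(head_logits(features, "tiny", weight, bias))  # [120.5, -1.0]
```

Since only the width of the final weight matrices changes, switching tiers trades a small amount of head capacity for size without re-running a different trunk.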

🧪 Classification Heads

| Head | Classes | Examples |
|---|---|---|
| coarse | 12 | text, code, link, image, file, config, markup, data, error, secret, archive, binary |
| modality | 8 | textual, binary_image, binary_archive, binary_executable, etc. |
| subtype | 24 | json, yaml, toml, csv, html, markdown, sql, log, dockerfile, etc. |
| code_lang | 62 | python, javascript, typescript, java, c, cpp, go, rust, etc. |
| text_lang | 30 | en, es, fr, de, it, pt, ru, zh, ja, ko, ar, hi, etc. |
| file_mime | 90 | text/html, application/json, application/pdf, image/png, video/mp4, etc. |
| risk | 6 | api_key, jwt, password, email, phone, ssh_key |
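
Turning raw head outputs into labels might look like the sketch below. It assumes the six single-choice heads are decoded with softmax and argmax, while the risk head is multi-label and decoded with a per-class sigmoid threshold (suggested by its mAP metric, but not stated in the source). All function names and the 0.5 threshold are hypothetical.

```python
import math


def softmax(logits):
    """Numerically stable softmax."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def decode_single_label(logits, labels):
    """Pick the most probable class and its probability."""
    probs = softmax(logits)
    best = max(range(len(probs)), key=probs.__getitem__)
    return labels[best], probs[best]


def decode_multi_label(logits, labels, threshold=0.5):
    """Return every label whose sigmoid probability reaches the threshold."""
    return [lab for lab, z in zip(labels, logits) if sigmoid(z) >= threshold]


RISK = ["api_key", "jwt", "password", "email", "phone", "ssh_key"]
label, prob = decode_single_label([0.1, 2.3, -1.0], ["text", "code", "link"])
print(label)  # code
print(decode_multi_label([3.0, -2.0, -4.0, 1.2, -3.0, -5.0], RISK))  # ['api_key', 'email']
```

Treating risk as multi-label lets one input carry several flags at once, for example a config file that contains both an API key and an email address.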

🌐 Deployment

| Platform | Location |
|---|---|
| HuggingFace Space | eulogik/pico-type |
| HuggingFace Model | eulogik/pico-type |
| GitHub | eulogik/pico-type |
| PyPI | pip install picotype |

📄 License

Apache 2.0

Download files

Version 0.1.4 is published as a source distribution and a built distribution (wheel). Both were uploaded via twine/6.2.0 (CPython/3.11.14) without Trusted Publishing.

| File | Type | Size |
|---|---|---|
| pico_type-0.1.4.tar.gz | Source | 46.2 kB |
| pico_type-0.1.4-py3-none-any.whl | Built distribution (Python 3) | 48.5 kB |

File hashes

| File | Algorithm | Hash digest |
|---|---|---|
| pico_type-0.1.4.tar.gz | SHA256 | 83a04946e80fb06ba5d349fae4a6cb80297f0da633d5f529da5d8bd932451581 |
| pico_type-0.1.4.tar.gz | MD5 | 2d7f45f9b075d3d7172364fc58f10958 |
| pico_type-0.1.4.tar.gz | BLAKE2b-256 | 1796bd7733c57a18714746e0be204c8439e3c9e712709cd9230eb720806178dd |
| pico_type-0.1.4-py3-none-any.whl | SHA256 | 6b91cd37f79136c33a189a2ef8686e3ae1521ab169c9d34470fc555ef15f92e4 |
| pico_type-0.1.4-py3-none-any.whl | MD5 | 515a7c40a8c483c2390738530123aa69 |
| pico_type-0.1.4-py3-none-any.whl | BLAKE2b-256 | 56b7316fe4a0534aee18c91c7dcd3ac907f12f90c832add8d235953fb21ba8b7 |
