A tiny byte-level multi-head content classifier

Project description


---
license: apache-2.0
language:
  - en
  - multilingual
tags:
  - byte-level
  - content-classification
  - onnx
  - edge-ai
  - matryoshka
pipeline_tag: text-classification
library_name: generic
inference:
  parameters:
    provider: CPUExecutionProvider
---

pico-type 🔍

A tiny byte-level multi-head content classifier — ~1.5M params, ~209KB ONNX, <6ms inference.

Classifies any content from raw bytes along 7 dimensions, one per head: coarse type, modality, subtype, code language, text language, file MIME type, and risk flags.


✨ Features

  • No tokenizer — operates directly on raw UTF-8 bytes (supports all languages)
  • 7 heads, one forward pass — coarse type, modality, subtype, code lang, text lang, file MIME, risk
  • 4 Matryoshka tiers — tiny (16d) → small (64d) → base (192d) → pro (576d)
  • ~206–209 KB ONNX per tier: deploys on edge devices, serverless functions, and in the browser (WebAssembly)
  • <6 ms inference on CPU via ONNX Runtime (~5.6 ms measured on the base tier)
  • CLI, Gradio Space, MCP server — ready to use
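
To make the "no tokenizer" point concrete, here is a minimal sketch of how text can be turned into byte-level model input. The function name, `max_len`, and `pad_id` are illustrative assumptions and are not taken from pico-type's code; the package may truncate or pad differently.

```python
def text_to_byte_ids(text: str, max_len: int = 512, pad_id: int = 0) -> list[int]:
    """Encode text as UTF-8 and return a fixed-length list of byte values (0-255).

    max_len and pad_id are hypothetical defaults chosen for illustration.
    """
    raw = text.encode("utf-8")[:max_len]      # truncate to at most max_len bytes
    ids = list(raw)                           # each byte becomes an int in 0..255
    ids += [pad_id] * (max_len - len(ids))    # right-pad to a fixed length
    return ids


ids_en = text_to_byte_ids("def hello(): pass", max_len=32)
ids_hi = text_to_byte_ids("नमस्ते", max_len=32)  # non-Latin text needs no special handling

print(ids_en[:8])
print(ids_hi[:8])
```

Because every string maps to values in 0..255, the vocabulary is fixed at 256 entries regardless of language. One caveat of this sketch: padding with 0 is indistinguishable from a real NUL byte, so a real implementation might use a separate mask.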

📊 Performance

| Head | Classes | Accuracy |
|---|---|---|
| coarse | 12 | 100% |
| modality | 8 | 100% |
| subtype | 24 | 98.8% |
| code_lang | 62 | 61.3% |
| text_lang | 30 | 100% |
| file_mime | 90 | 100% |
| risk (mAP) | 6 | 100% |

Evaluated on 1,000 samples with the base tier at ~5.6 ms per inference. Hindi (hi) has been added as a text language.
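
The risk head is reported as mAP rather than accuracy. The project does not state how its mAP is computed, so the following is only a sketch of one common definition: per-class average precision over score-ranked samples, averaged across classes that have at least one positive.

```python
def average_precision(scores, labels):
    """AP for one class: mean of precision@k measured at each positive, ranked by score."""
    ranked = sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)
    hits, precisions = 0, []
    for rank, (_, label) in enumerate(ranked, start=1):
        if label:
            hits += 1
            precisions.append(hits / rank)
    return sum(precisions) / len(precisions) if precisions else None


def mean_average_precision(score_rows, label_rows):
    """mAP over classes; rows are samples, columns are classes.

    Classes with no positive sample are skipped (one convention among several).
    """
    n_classes = len(score_rows[0])
    aps = []
    for c in range(n_classes):
        ap = average_precision([row[c] for row in score_rows],
                               [row[c] for row in label_rows])
        if ap is not None:
            aps.append(ap)
    return sum(aps) / len(aps) if aps else 0.0


# Toy data: 3 samples, 2 risk classes, every positive ranked above every negative.
scores = [[0.9, 0.1], [0.2, 0.8], [0.6, 0.3]]
labels = [[1, 0], [0, 1], [1, 0]]
print(mean_average_precision(scores, labels))
```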

🚀 Quick Start

CLI

pip install pico-type

printf 'def hello():\n    return 42\n' | picotype --pretty
picotype --file document.txt
picotype --clip

Python

from model.pico_type.cli import load_onnx_model, run_onnx

session = load_onnx_model("base", "checkpoints")
result = run_onnx(session, "def hello(): pass")
print(result)

MCP Server (Claude/Cursor)

PICOTYPE_MODEL_DIR=./checkpoints python -m model.pico_type.mcp_server

🏗 Architecture

Bytes → ByteEmbed(256→96d) → 3×Conv1D(k=3,5,7) → 2×BiAttention(RoPE) → Pool(mean‖max‖std) → 7×Matryoshka Heads
  • ByteEmbed: lookup-free byte embedding (256 vocab, 96 dim)
  • Conv1D: 3 parallel kernels (width 3, 5, 7) with residual + layer norm
  • BiAttention: bidirectional self-attention with RoPE, 4 heads, 96 dim
  • Pool: mean + max + std concatenation
  • Matryoshka Heads: 4 slices of the pooled vector (16/64/192/576 dim) → 7 linear classifiers
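
The pooling step can be sketched in plain Python as follows. This is an illustration of mean ‖ max ‖ std concatenation in general, not pico-type's implementation; in particular, using the population standard deviation is an assumption.

```python
import math


def pool_mean_max_std(sequence):
    """Collapse a [length][dim] sequence into one [3 * dim] vector: mean, then max, then std."""
    length, dim = len(sequence), len(sequence[0])
    columns = [[step[d] for step in sequence] for d in range(dim)]
    means = [sum(col) / length for col in columns]
    maxes = [max(col) for col in columns]
    # Population standard deviation (divide by length); an assumption of this sketch.
    stds = [math.sqrt(sum((x - m) ** 2 for x in col) / length)
            for col, m in zip(columns, means)]
    return means + maxes + stds


features = [[1.0, 0.0], [3.0, 0.0], [5.0, 6.0]]   # 3 positions, 2 feature channels
pooled = pool_mean_max_std(features)
print(len(pooled), pooled)
```

The pooled vector is three times as wide as the per-position features and has the same size for any input length, which is what lets fixed-size linear heads sit on top of variable-length byte sequences.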

Total parameters: 1.43M (tiny) / 1.45M (small) / 1.48M (base) / 1.56M (pro)

🔧 Model Tiers

| Tier | Dim | Params | ONNX Size |
|---|---|---|---|
| tiny | 16 | 1.43M | 207 KB |
| small | 64 | 1.45M | 207 KB |
| base | 192 | 1.48M | 209 KB |
| pro | 576 | 1.56M | 206 KB |

All tiers share the same trunk; only the final linear layers differ.
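
As a rough check on that statement, the sketch below counts the parameters of seven linear heads at each tier width, using the class counts listed for the heads. It assumes plain linear layers with a bias and that each tier reads the leading entries of the pooled vector (the usual Matryoshka arrangement); neither detail is confirmed by the project.

```python
TIER_DIMS = {"tiny": 16, "small": 64, "base": 192, "pro": 576}
HEAD_CLASSES = {"coarse": 12, "modality": 8, "subtype": 24, "code_lang": 62,
                "text_lang": 30, "file_mime": 90, "risk": 6}


def head_param_count(dim):
    """Weights plus biases for seven linear heads that each read a dim-wide slice."""
    return sum(dim * n_classes + n_classes for n_classes in HEAD_CLASSES.values())


def tier_slice(pooled, tier):
    """Assumed Matryoshka behaviour: a tier reads the first TIER_DIMS[tier] entries."""
    return pooled[:TIER_DIMS[tier]]


pooled = [float(i) for i in range(576)]   # stand-in for the shared trunk's output
for tier, dim in TIER_DIMS.items():
    print(tier, len(tier_slice(pooled, tier)), head_param_count(dim))

extra = head_param_count(TIER_DIMS["pro"]) - head_param_count(TIER_DIMS["tiny"])
print("pro minus tiny:", extra)
```

Under these assumptions the pro heads hold 129,920 more weights than the tiny heads, which is in line with the 0.13M gap between the 1.43M and 1.56M totals.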

🧪 Classification Heads

| Head | Classes | Examples |
|---|---|---|
| coarse | 12 | text, code, link, image, file, config, markup, data, error, secret, archive, binary |
| modality | 8 | textual, binary_image, binary_archive, binary_executable, etc. |
| subtype | 24 | json, yaml, toml, csv, html, markdown, sql, log, dockerfile, etc. |
| code_lang | 62 | python, javascript, typescript, java, c, cpp, go, rust, etc. |
| text_lang | 30 | en, es, fr, de, it, pt, ru, zh, ja, ko, ar, hi, etc. |
| file_mime | 90 | text/html, application/json, application/pdf, image/png, video/mp4, etc. |
| risk | 6 | api_key, jwt, password, email, phone, ssh_key |
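
One way to turn raw head outputs into labels is sketched below. It assumes the six single-answer heads are decoded with softmax plus argmax, and that risk is a multi-label head decoded with a per-class sigmoid and threshold (suggested by the "risk flags" wording and the mAP metric, but not confirmed). The label order and the 0.5 threshold are hypothetical.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def decode_single_label(logits, labels):
    """Pick the most probable class for a single-answer head."""
    probs = softmax(logits)
    best = max(range(len(probs)), key=probs.__getitem__)
    return labels[best], probs[best]


def decode_multi_label(logits, labels, threshold=0.5):
    """Return every flag whose sigmoid probability reaches the threshold."""
    probs = [1.0 / (1.0 + math.exp(-x)) for x in logits]
    return [(label, p) for label, p in zip(labels, probs) if p >= threshold]


RISK_LABELS = ["api_key", "jwt", "password", "email", "phone", "ssh_key"]
COARSE_SAMPLE = ["text", "code", "link"]   # first three coarse labels; order assumed

label, confidence = decode_single_label([0.2, 2.5, -1.0], COARSE_SAMPLE)
flags = decode_multi_label([3.0, -2.0, -4.0, 1.5, -3.0, -5.0], RISK_LABELS)
print(label, round(confidence, 3), [name for name, _ in flags])
```

A multi-label decode allows one input to carry several flags at once, for example a config file that contains both an API key and an email address.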

🌐 Deployment

| Platform | Location |
|---|---|
| HuggingFace Space | eulogik/pico-type |
| HuggingFace Model | eulogik/pico-type |
| GitHub | eulogik/pico-type |
| PyPI | pip install pico-type |

📄 License

Apache 2.0

Download files

Source Distribution

pico_type-0.1.5.tar.gz (46.2 kB)

Uploaded Source

Built Distribution

pico_type-0.1.5-py3-none-any.whl (48.5 kB)

Uploaded Python 3

File details

Details for the file pico_type-0.1.5.tar.gz.

File metadata

  • Download URL: pico_type-0.1.5.tar.gz
  • Size: 46.2 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.2.0 CPython/3.11.14

File hashes

Hashes for pico_type-0.1.5.tar.gz
Algorithm Hash digest
SHA256 ab052fb6aba68aa22b637df4132c6e1c6ba0a981cd72a8e25dd0dbeba86e0763
MD5 8674870cd5f363f541b208f094c7724a
BLAKE2b-256 43ac5f8c3add79f4116902f0a489eaec3b8822972fbe4516026a1b8d306015da
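
A downloaded archive can be checked against the SHA256 digest listed above with the standard library. This is a minimal sketch; the function names are illustrative and the file path is whatever location the archive was saved to.

```python
import hashlib

# SHA256 digest published above for pico_type-0.1.5.tar.gz
EXPECTED_SDIST_SHA256 = "ab052fb6aba68aa22b637df4132c6e1c6ba0a981cd72a8e25dd0dbeba86e0763"


def sha256_of_file(path, chunk_size=1 << 20):
    """Stream a file through SHA256 so large archives need not fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_published_hash(path, expected=EXPECTED_SDIST_SHA256):
    """True when the file's SHA256 digest equals the expected hex string."""
    return sha256_of_file(path) == expected.lower()
```

Calling `matches_published_hash("pico_type-0.1.5.tar.gz")` on the downloaded source distribution should return `True` if the file is intact.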

File details

Details for the file pico_type-0.1.5-py3-none-any.whl.

File metadata

  • Download URL: pico_type-0.1.5-py3-none-any.whl
  • Size: 48.5 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.2.0 CPython/3.11.14

File hashes

Hashes for pico_type-0.1.5-py3-none-any.whl
Algorithm Hash digest
SHA256 4390b811a9f8150eaf6948c2824b10d708dc48fff57ed97932c0442bc6a61bb0
MD5 7b07de71cf58fe570ea2bd6dd96574b4
BLAKE2b-256 0e0f118d24d4416b48c5918c0706df455aa3bca50ce64635e680d6faf88a1985
