
A tiny byte-level multi-head content classifier



license: apache-2.0
language:
  - en
  - multilingual
tags:
  - byte-level
  - content-classification
  - onnx
  - edge-ai
  - matryoshka
  - multi-head
  - classifier
  - clipboard
pipeline_tag: text-classification
library_name: pico-type
inference:
  parameters:
    provider: CPUExecutionProvider

pico-type 🔍

A tiny byte-level multi-head content classifier: ~1.5M params, ~209KB ONNX, <6ms CPU inference (base tier).

Classifies any content along 7 axes (one classification head per axis) from raw bytes in a single forward pass.


Built by eulogik — AI infrastructure for developers.


✨ Features

  • No tokenizer — operates directly on raw UTF-8 bytes (supports all languages, zero pre-processing)
  • 7 heads, one forward pass — coarse type, modality, subtype, code lang, text lang, file MIME, risk flags
  • 4 Matryoshka tiers — tiny (16d) → small (64d) → base (192d) → pro (576d)
  • ~200KB ONNX — deploy on edge devices, serverless functions, browser (WebAssembly)
  • <6ms inference on CPU via ONNX Runtime (base tier, 1024 bytes)
  • CLI, Gradio Space, MCP server — ready for any integration
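
To illustrate what "no tokenizer" means in practice, here is a minimal sketch of turning text into model-ready byte IDs. The helper name `bytes_to_ids`, the zero padding value, and the explicit mask are assumptions for illustration, not the package's actual API; the 1024-byte window matches the input size quoted in the features list.

```python
def bytes_to_ids(data, max_len=1024, pad_id=0):
    """Turn text or raw bytes into a fixed-length list of byte IDs plus a mask.

    Hypothetical helper: the real package may pad or truncate differently.
    """
    if isinstance(data, str):
        data = data.encode("utf-8")  # any language, no vocabulary needed
    raw = list(data[:max_len])       # each byte is already an int in 0..255
    padding = max_len - len(raw)
    ids = raw + [pad_id] * padding
    # The mask separates real bytes from padding, since 0 is also a valid byte.
    mask = [1] * len(raw) + [0] * padding
    return ids, mask


ids, mask = bytes_to_ids("héllo", max_len=8)
print(ids)   # [104, 195, 169, 108, 108, 111, 0, 0]
print(mask)  # [1, 1, 1, 1, 1, 1, 0, 0]
```

Note that "é" occupies two bytes (195, 169), so the sequence length is counted in bytes rather than characters.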

📊 Performance

| Head | Classes | Synthetic Accuracy | Real-World Accuracy |
|---|---|---|---|
| coarse | 12 | 100% | 86% |
| modality | 8 | 100% | 100% |
| subtype | 24 | 95% | |
| code_lang | 62 | 39% | |
| text_lang | 30 | 99% | 100% |
| file_mime | 90 | 100% | |
| risk (mAP) | 6 | 100% | |

Evaluated on 1000 synthetic samples + 21 hand-curated real-world inputs. Base tier, ~5ms inference.

Note: the low code_lang synthetic accuracy reflects the difficulty of 62-way classification with limited per-class support. Overall real-world accuracy across all heads is 52% (11 of 21 inputs correct), up from a 23% baseline before training on more diverse data.

🚀 Quick Start

CLI

pip install picotype

printf 'def hello():\n    return 42\n' | picotype --pretty
picotype --file document.txt
picotype --clip

Python

from picotype import PicoType, PicoTypeConfig, decode_output

model = PicoType(PicoTypeConfig()).eval()
# ... load checkpoint ...
result = decode_output(model(b"input bytes"), tier="base")

MCP Server (Claude/Cursor)

PICOTYPE_MODEL_DIR=./checkpoints python -m model.pico_type.mcp_server

🏗 Architecture

Bytes → ByteEmbed(256→96d) → 3×Conv1D(k=3,5,7) → 2×BiAttention(RoPE) → Pool(mean‖max‖std) → 7×Matryoshka Heads

| Component | Description |
|---|---|
| ByteEmbed | nn.Embedding(256, 96): tokenizer-free byte embedding, one learned 96-d vector per byte value |
| Conv1D | 3 parallel kernels (width 3, 5, 7) with residual + LayerNorm + GELU |
| BiAttention | Bidirectional self-attention with Rotary Position Embeddings, 4 heads |
| Pool | Mean + Max + Std concatenation over masked positions |
| Matryoshka Heads | 4 tier slices of the pooled vector → 7 linear classifiers |

Total parameters: 1.43M (tiny) / 1.45M (small) / 1.48M (base) / 1.56M (pro)
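
The pooling step (mean, max, and std concatenated over the non-padding positions) can be sketched in plain Python as follows. This is an illustration of the idea only: the function name `masked_pool` is hypothetical, and the use of the population standard deviation is an assumption, since the source does not say which estimator the model uses.

```python
import math


def masked_pool(features, mask):
    """Concatenate mean, max, and std over the unmasked positions.

    features: list of per-position vectors (seq_len x dim)
    mask:     list of 0/1 flags, 1 = real byte, 0 = padding
    """
    kept = [vec for vec, keep in zip(features, mask) if keep]
    if not kept:
        raise ValueError("mask selects no positions")
    n = len(kept)
    columns = list(zip(*kept))  # transpose to dim x n
    mean = [sum(col) / n for col in columns]
    peak = [max(col) for col in columns]
    # Population standard deviation (assumed; the model may use another estimator).
    std = [
        math.sqrt(sum((x - mu) ** 2 for x in col) / n)
        for col, mu in zip(columns, mean)
    ]
    return mean + peak + std  # output has 3 * dim entries


features = [[1.0, 2.0], [3.0, 6.0], [100.0, 100.0]]
mask = [1, 1, 0]  # the last position is padding and is ignored
print(masked_pool(features, mask))  # [2.0, 4.0, 3.0, 6.0, 1.0, 2.0]
```

Pooling this way yields a fixed-size vector regardless of input length, which is what lets the heads be plain linear classifiers.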

🔧 Model Tiers

| Tier | Dim | Params | ONNX Size | Speed |
|---|---|---|---|---|
| tiny | 16 | 1.43M | 207 KB | ~3ms |
| small | 64 | 1.45M | 207 KB | ~4ms |
| base | 192 | 1.48M | 209 KB | ~5ms |
| pro | 576 | 1.56M | 206 KB | ~12ms |

All tiers share the same trunk; only the final linear layers differ. Switch tiers at inference with zero overhead.
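
As a rough sketch of how Matryoshka tiers can share one trunk, the snippet below scores classes from only the first `d` features of a pooled vector, where `d` is the tier dimension from the table. The tier dimensions come from this README; the function name `tier_logits`, the prefix-slicing convention, and the toy weights are assumptions for illustration, not the package's implementation.

```python
TIER_DIMS = {"tiny": 16, "small": 64, "base": 192, "pro": 576}


def tier_logits(pooled, weights, bias, tier):
    """Score classes using only the first TIER_DIMS[tier] features of `pooled`.

    weights: one row of length TIER_DIMS[tier] per class (hypothetical layout)
    bias:    one value per class
    """
    d = TIER_DIMS[tier]
    if len(pooled) < d:
        raise ValueError(f"pooled vector has {len(pooled)} dims, tier needs {d}")
    prefix = pooled[:d]  # smaller tiers read a prefix of the same vector
    return [
        sum(w * x for w, x in zip(row, prefix)) + b
        for row, b in zip(weights, bias)
    ]


pooled = [1.0] * 576                        # stand-in for the trunk output
tiny_weights = [[0.5] * 16, [-0.5] * 16]    # toy 2-class head for the tiny tier
print(tier_logits(pooled, tiny_weights, [0.0, 1.0], "tiny"))  # [8.0, -7.0]
```

Because every tier reads from the same pooled vector, choosing a tier only changes which slice and which small linear layer are used.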

🧪 Classification Heads

| Head | Classes | Gated By | Examples |
|---|---|---|---|
| coarse | 12 | | text, code, link, image, file, config, markup, data, error, secret, archive, binary |
| modality | 8 | | textual, binary_image, binary_archive, binary_executable, binary_document, binary_audio, binary_video, binary_other |
| subtype | 24 | config, markup, data | json, yaml, toml, csv, html, markdown, sql, log, dockerfile |
| code_lang | 62 | code | python, javascript, typescript, java, c, cpp, go, rust, kotlin, swift, bash, sql |
| text_lang | 30 | text | en, es, fr, de, it, pt, ru, zh, ja, ko, ar, hi |
| file_mime | 90 | image, file | text/html, application/json, application/pdf, image/png, video/mp4 |
| risk | 6 | | api_key, jwt, password, email, phone, ssh_key (probabilities) |
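
One way to read the "Gated By" column is that a head's output is only meaningful when the coarse prediction falls in the listed set. The sketch below encodes that reading; it is an assumption about the intended semantics, and the names `GATES` and `active_heads` are hypothetical rather than part of the package.

```python
# Gate sets copied from the table above; None means the head is always reported.
GATES = {
    "coarse": None,
    "modality": None,
    "subtype": {"config", "markup", "data"},
    "code_lang": {"code"},
    "text_lang": {"text"},
    "file_mime": {"image", "file"},
    "risk": None,
}


def active_heads(coarse_label):
    """Return the heads whose gate admits the given coarse label."""
    return [
        head
        for head, gate in GATES.items()
        if gate is None or coarse_label in gate
    ]


print(active_heads("code"))   # ['coarse', 'modality', 'code_lang', 'risk']
print(active_heads("image"))  # ['coarse', 'modality', 'file_mime', 'risk']
```

Under this reading, a caller would ignore `code_lang` for an input classified as `text`, even though all seven heads are computed in the same forward pass.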

🌐 Deployment

| Platform | URL |
|---|---|
| HuggingFace Space | eulogik/pico-type |
| HuggingFace Model | eulogik/pico-type |
| GitHub | eulogik/pico-type |
| PyPI | pip install picotype |
| Zenodo | 10.5281/zenodo.20758542 |

📄 License

Apache 2.0 — free for commercial and personal use.


