A tiny byte-level multi-head content classifier

license: apache-2.0
language:
  • en
  • multilingual
tags:
  • byte-level
  • content-classification
  • onnx
  • edge-ai
  • matryoshka
  • multi-head
  • classifier
  • clipboard
pipeline_tag: text-classification
library_name: pico-type
inference:
  parameters:
    provider: CPUExecutionProvider

pico-type 🔍

A tiny byte-level multi-head content classifier — ~1.5M params, ~209KB ONNX, <6ms inference.

Classifies any content into 7 categories from raw bytes in a single forward pass.

Built by eulogik — AI infrastructure for developers.


✨ Features

  • No tokenizer — operates directly on raw UTF-8 bytes (supports all languages, zero pre-processing)
  • 7 heads, one forward pass — coarse type, modality, subtype, code lang, text lang, file MIME, risk flags
  • 4 Matryoshka tiers — tiny (16d) → small (64d) → base (192d) → pro (576d)
  • ~200KB ONNX — deploy on edge devices, serverless functions, browser (WebAssembly)
  • <6ms inference on CPU via ONNX Runtime (base tier, 1024 bytes)
  • CLI, Gradio Space, MCP server — ready for any integration
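
To make the "no tokenizer" point concrete, here is a minimal sketch of how raw bytes can be turned into model inputs. The 1024-byte window is borrowed from the benchmark figure above; the padding value, the mask convention, and the name `encode_bytes` are assumptions for illustration, not the library's actual preprocessing.

```python
from typing import List, Tuple

MAX_LEN = 1024  # assumed window, taken from the "1024 bytes" benchmark figure
PAD_ID = 0      # hypothetical padding value; the real model may pad differently


def encode_bytes(data: bytes, max_len: int = MAX_LEN) -> Tuple[List[int], List[int]]:
    """Turn raw bytes into fixed-length (ids, mask) lists.

    ids  : byte values in 0..255, right-padded with PAD_ID
    mask : 1 for real bytes, 0 for padding
    """
    window = data[:max_len]          # truncate long inputs
    ids = list(window)               # iterating over bytes yields ints
    mask = [1] * len(ids)
    pad = max_len - len(ids)
    return ids + [PAD_ID] * pad, mask + [0] * pad


ids, mask = encode_bytes("héllo".encode("utf-8"), max_len=8)
print(ids)   # [104, 195, 169, 108, 108, 111, 0, 0]
print(mask)  # [1, 1, 1, 1, 1, 1, 0, 0]
```

The non-ASCII character occupies two positions, which is why no language-specific vocabulary is needed: every input, text or binary, maps onto the same 256 possible values.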

📊 Performance

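| Head | Classes | Synthetic Accuracy | Real-World Accuracy |
|---|---|---|---|
| coarse | 12 | 100% | 86% |
| modality | 8 | 100% | 100% |
| subtype | 24 | 95% | - |
| code_lang | 62 | 39% | - |
| text_lang | 30 | 99% | 100% |
| file_mime | 90 | 100% | - |
| risk (mAP) | 6 | 100% | - |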

Evaluated on 1000 synthetic samples + 21 hand-curated real-world inputs. Base tier, ~5ms inference.

Note: the low code_lang synthetic accuracy (39%) reflects the difficulty of 62-way classification with limited per-class support. Real-world accuracy across all heads is 52% (11/21 correct), up from a 23% baseline before training on more diverse data.

🚀 Quick Start

CLI

pip install picotype

printf 'def hello():\n    return 42\n' | picotype --pretty
picotype --file document.txt
picotype --clip

Python

from picotype import PicoType, PicoTypeConfig, decode_output

model = PicoType(PicoTypeConfig()).eval()
# ... load checkpoint ...
result = decode_output(model(b"input bytes"), tier="base")

MCP Server (Claude/Cursor)

PICOTYPE_MODEL_DIR=./checkpoints python -m model.pico_type.mcp_server

🏗 Architecture

Bytes → ByteEmbed(256→96d) → 3×Conv1D(k=3,5,7) → 2×BiAttention(RoPE) → Pool(mean‖max‖std) → 7×Matryoshka Heads
| Component | Description |
|---|---|
| ByteEmbed | nn.Embedding(256, 96): one learned 96-d vector per byte value, with no tokenizer or vocabulary lookup |
| Conv1D | 3 parallel kernels (width 3, 5, 7) with residual + LayerNorm + GELU |
| BiAttention | Bidirectional self-attention with Rotary Position Embeddings, 4 heads |
| Pool | Mean + Max + Std concatenation over masked positions |
| Matryoshka Heads | 4 tier slices of the pooled vector → 7 linear classifiers |
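
The Pool and Matryoshka steps can be sketched in dependency-free Python. This is an illustration under assumptions, not the project's code: it uses the population standard deviation, treats a tier slice as a plain prefix of the pooled vector, and the names `masked_pool` and `tier_slice` are hypothetical.

```python
import math
from typing import List, Sequence


def masked_pool(features: Sequence[Sequence[float]], mask: Sequence[int]) -> List[float]:
    """Concatenate mean, max and population std over positions where mask == 1."""
    kept = [row for row, m in zip(features, mask) if m]
    if not kept:
        raise ValueError("mask selects no positions")
    dim = len(kept[0])
    cols = [[row[j] for row in kept] for j in range(dim)]  # one list per channel
    mean = [sum(c) / len(c) for c in cols]
    peak = [max(c) for c in cols]
    std = [
        math.sqrt(sum((x - mu) ** 2 for x in c) / len(c))
        for c, mu in zip(cols, mean)
    ]
    return mean + peak + std


def tier_slice(pooled: Sequence[float], dim: int) -> List[float]:
    """Assumed Matryoshka slice: a tier sees only the first `dim` pooled values."""
    return list(pooled[:dim])


features = [[1.0, 2.0], [3.0, 6.0], [100.0, 100.0]]  # last row is padding
mask = [1, 1, 0]
pooled = masked_pool(features, mask)
print(pooled)                 # [2.0, 4.0, 3.0, 6.0, 1.0, 2.0]
print(tier_slice(pooled, 4))  # [2.0, 4.0, 3.0, 6.0]
```

The padded row never reaches the statistics, which is the point of pooling over masked positions: short inputs are not skewed by padding.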

Total parameters: 1.43M (tiny) / 1.45M (small) / 1.48M (base) / 1.56M (pro)

🔧 Model Tiers

| Tier | Dim | Params | ONNX Size | Speed |
|---|---|---|---|---|
| tiny | 16 | 1.43M | 207 KB | ~3ms |
| small | 64 | 1.45M | 207 KB | ~4ms |
| base | 192 | 1.48M | 209 KB | ~5ms |
| pro | 576 | 1.56M | 206 KB | ~12ms |

All tiers share the same trunk; only the final linear layers differ. Switch tiers at inference with zero overhead.
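
A quick back-of-the-envelope check of the "only the final linear layers differ" statement: assuming each head is a single linear layer with bias mapping the tier dimension to its class count (an assumption, the actual head layout is not documented here), the head parameters per tier work out as follows.

```python
# Class counts per head and tier dimensions, as listed in this README.
HEAD_CLASSES = {
    "coarse": 12, "modality": 8, "subtype": 24, "code_lang": 62,
    "text_lang": 30, "file_mime": 90, "risk": 6,
}
TIER_DIMS = {"tiny": 16, "small": 64, "base": 192, "pro": 576}


def head_params(dim: int) -> int:
    """Weights (dim * classes) plus biases (classes), summed over all heads.

    Assumes one Linear(dim, classes) layer per head.
    """
    return sum(dim * n + n for n in HEAD_CLASSES.values())


for tier, dim in TIER_DIMS.items():
    print(tier, head_params(dim))
# tiny 3944
# small 15080
# base 44776
# pro 133864

print(head_params(576) - head_params(16))  # 129920
```

Under that assumption the pro tier carries about 0.13M more head parameters than tiny, which is in line with the 1.56M versus 1.43M totals in the table.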

🧪 Classification Heads

| Head | Classes | Gated By | Examples |
|---|---|---|---|
| coarse | 12 | - | text, code, link, image, file, config, markup, data, error, secret, archive, binary |
| modality | 8 | - | textual, binary_image, binary_archive, binary_executable, binary_document, binary_audio, binary_video, binary_other |
| subtype | 24 | config, markup, data | json, yaml, toml, csv, html, markdown, sql, log, dockerfile |
| code_lang | 62 | code | python, javascript, typescript, java, c, cpp, go, rust, kotlin, swift, bash, sql |
| text_lang | 30 | text | en, es, fr, de, it, pt, ru, zh, ja, ko, ar, hi |
| file_mime | 90 | image, file | text/html, application/json, application/pdf, image/png, video/mp4 |
| risk | 6 | - | api_key, jwt, password, email, phone, ssh_key (probabilities) |
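
One way to read the "Gated By" column is that a head's prediction is only meaningful when the coarse label falls in its gate set. The sketch below applies that reading to a plain dictionary of per-head predictions. The dictionary shape, the 0.5 risk threshold, and the names `apply_gates` and `flagged_risks` are assumptions for illustration; they are not the return format of `decode_output`.

```python
from typing import Dict, List, Set

# Gate sets copied from the "Gated By" column; heads not listed are ungated.
GATES: Dict[str, Set[str]] = {
    "subtype": {"config", "markup", "data"},
    "code_lang": {"code"},
    "text_lang": {"text"},
    "file_mime": {"image", "file"},
}


def apply_gates(pred: Dict[str, object]) -> Dict[str, object]:
    """Drop gated heads whose gate does not include the coarse label."""
    coarse = pred["coarse"]
    kept = {}
    for head, value in pred.items():
        allowed = GATES.get(head)
        if allowed is not None and coarse not in allowed:
            continue  # this head is not meaningful for this coarse type
        kept[head] = value
    return kept


def flagged_risks(risk_probs: Dict[str, float], threshold: float = 0.5) -> List[str]:
    """Return risk labels whose probability meets the (assumed) threshold."""
    return sorted(name for name, p in risk_probs.items() if p >= threshold)


raw = {
    "coarse": "code",
    "modality": "textual",
    "subtype": "json",
    "code_lang": "python",
    "text_lang": "en",
    "file_mime": "text/html",
    "risk": {"api_key": 0.91, "email": 0.02},
}
gated = apply_gates(raw)
print(sorted(gated))                 # ['coarse', 'code_lang', 'modality', 'risk']
print(flagged_risks(gated["risk"]))  # ['api_key']
```

Because risk is reported as independent probabilities rather than a single class, several flags can fire for the same input, and the threshold is a deployment choice.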

🌐 Deployment

| Platform | URL |
|---|---|
| HuggingFace Space | eulogik/pico-type |
| HuggingFace Model | eulogik/pico-type |
| GitHub | eulogik/pico-type |
| PyPI | pip install picotype |
| Zenodo | 10.5281/zenodo.20758542 |

📚 Documentation

📄 License

Apache 2.0 — free for commercial and personal use.


Built with ❤️ by eulogik
