Why Pico-Type?

⚡

Byte-Level

Operates directly on raw bytes. No tokenizer, no parser, no language-specific pre-processing. Feed it anything from a Python script to a ZIP file.
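To make "byte-level" concrete, here is a minimal sketch of turning arbitrary data into a fixed-length sequence of integer ids. The 2048-byte window mirrors the "first 2KB" mentioned in the accuracy notes; the function name `bytes_to_ids`, the padding id, and the right-padding scheme are assumptions for illustration, not Pico-Type's actual preprocessing.

```python
from typing import List

WINDOW = 2048  # assumed input window: the first 2 KB of the payload
PAD_ID = 256   # hypothetical padding id, chosen outside the 0..255 byte range


def bytes_to_ids(data: bytes, window: int = WINDOW, pad_id: int = PAD_ID) -> List[int]:
    """Truncate or right-pad raw bytes to a fixed-length list of integer ids."""
    ids = list(data[:window])                   # each byte becomes an int in 0..255
    ids.extend([pad_id] * (window - len(ids)))  # pad inputs shorter than the window
    return ids


# The same function handles source text and binary data alike.
python_ids = bytes_to_ids(b"print('hello')\n")
zip_ids = bytes_to_ids(b"PK\x03\x04" + bytes(4000))  # ZIP magic bytes plus zero filler
```

No decoding step is involved, so invalid UTF-8 or binary content cannot raise an error at this stage.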

★

Multi-Head Architecture

Seven specialized classification heads share one backbone: coarse type, modality, subtype, code language, text language, file MIME, and risk.

🚀

Tiny & Fast

Just 1.5 million parameters. The ONNX export is ~210 KB. Runs inference in milliseconds on CPU. Deploy anywhere — edge, browser, serverless.

🌐

62 Languages

Detects 62 programming languages from bytes alone: Python, JavaScript, Rust, Go, C++, Java, SQL, Bash, and 54 more. No shebang or extension required.

📦

Beyond Code

Classifies text (EN, FR, HI, more), config files (JSON, YAML, TOML, env), markup (HTML, Markdown, LaTeX), images (PNG, GIF), and binary archives.

⚙

ONNX + CLI

Built-in export to ONNX for cross-platform deployment. Comes with a CLI tool and an optional MCP server. Zero-config setup for CI/CD pipelines.

Seven Heads, One Model

Coarse

code / text / markup / config / image / binary / error

Modality

source / script / markup / prose / serialization / bitmap / archive / traceback

Subtype

general / web / data / system / shell / build / doc …

Code Language

62 languages: Python, JS, Rust, Go, Java, C++, SQL, Bash & more

Text Language

English, French, Hindi, Czech, Polish, …

File MIME

text/x-python, image/png, application/zip, …

Risk

safe / unknown / error
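As a rough sketch of how per-head outputs could be turned into labels, the example below takes one score vector per head and picks the highest-scoring class. Only the three heads whose label sets are listed in full above (coarse, modality, risk) are included. The index order of the labels, the `decode_heads` helper, and the dictionary-of-scores input format are assumptions, not the project's actual API.

```python
from typing import Dict, List, Sequence

# Label sets copied from the head descriptions above.
# The index order is an assumption made for this sketch.
LABELS: Dict[str, List[str]] = {
    "coarse": ["code", "text", "markup", "config", "image", "binary", "error"],
    "modality": ["source", "script", "markup", "prose",
                 "serialization", "bitmap", "archive", "traceback"],
    "risk": ["safe", "unknown", "error"],
}


def decode_heads(scores_by_head: Dict[str, Sequence[float]]) -> Dict[str, str]:
    """Return the highest-scoring label for every head in `scores_by_head`."""
    decoded = {}
    for head, scores in scores_by_head.items():
        labels = LABELS[head]
        if len(scores) != len(labels):
            raise ValueError(f"{head}: expected {len(labels)} scores, got {len(scores)}")
        best = max(range(len(scores)), key=lambda i: scores[i])  # argmax
        decoded[head] = labels[best]
    return decoded


# Made-up scores for illustration only.
example = decode_heads({
    "coarse": [2.1, 0.3, -1.0, 0.0, -2.0, -1.5, 0.2],
    "modality": [1.7, 0.9, -0.5, -1.0, 0.1, -2.0, -2.2, 0.0],
    "risk": [3.0, 0.5, -1.0],
})
```

Because every head reads the same shared representation, one forward pass yields all of these answers at once.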

Real-World Accuracy: 20/21 (95.2%)

Tested against a hand-curated set covering every coarse type — from Python and JavaScript to ZIP binaries and Hindi text.

Input	Coarse	Modality	Code/Text Lang	File MIME	Risk
Python	code	source	Python	text/x-python	safe
JavaScript	code	source	JavaScript	text/javascript	safe
C	code	source	C	text/x-c	safe
Java	code	source	Java	text/x-java	safe
SQL	code	script	SQL	text/x-sql	safe
Bash	code	script	Bash	text/x-shellscript	safe
Rust	code	source	Rust	text/x-rust	safe
Go	code	source	Go	text/x-go	safe
CSS	code	source	CSS	text/css	safe
HTML	markup	markup	—	text/html	safe
English text	text	prose	English	text/plain	safe
French text	text	prose	French	text/plain	safe
Hindi text	text	prose	Hindi	text/plain	safe
JSON	config	serialization	—	application/json	safe
YAML	error (expected: config)	serialization	—	text/yaml	safe
.env	config	config	—	text/plain	safe
Error traceback	error	traceback	—	text/plain	error
PNG	image	bitmap	—	image/png	safe
GIF	image	bitmap	—	image/gif	safe
ZIP archive	binary	archive	—	application/zip	safe

The only miss: YAML is classified as error instead of config for the coarse type (subtype and MIME are correct). This is a fundamental byte-level ambiguity: YAML content and error messages share byte patterns in the first 2 KB.

📄 Read the Paper on Zenodo

Use Cases

📜

Content Classification

Sort files, API payloads, or raw network data by type, language, and risk in CI/CD pipelines, upload services, or security tools.

💻

IDE Integration

Detect the programming language of unsaved buffers or snippet pastes without relying on file extensions. Works with any editor via the CLI or MCP server.

📊

Data Pipeline Routing

Route incoming data to the correct processing pipeline based on byte-level content analysis — no need to trust Content-Type headers.
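Routing on the predicted coarse type can be as simple as a dispatch table. The sketch below assumes the classifier has already produced a coarse label as a string; the handler functions, queue names, and the `route` helper are hypothetical and stand in for whatever pipelines you actually run.

```python
from typing import Callable, Dict


# Hypothetical handlers: each returns the name of the pipeline that takes the payload.
def to_code_pipeline(payload: bytes) -> str:
    return "static-analysis"


def to_text_pipeline(payload: bytes) -> str:
    return "text-indexing"


def to_media_pipeline(payload: bytes) -> str:
    return "media-store"


def to_quarantine(payload: bytes) -> str:
    return "manual-review"


# Keys are the coarse types listed on this page; the mapping itself is illustrative.
ROUTES: Dict[str, Callable[[bytes], str]] = {
    "code": to_code_pipeline,
    "config": to_code_pipeline,
    "text": to_text_pipeline,
    "markup": to_text_pipeline,
    "image": to_media_pipeline,
    "binary": to_quarantine,
    "error": to_quarantine,
}


def route(coarse: str, payload: bytes) -> str:
    """Send the payload to the handler for its predicted coarse type."""
    handler = ROUTES.get(coarse, to_quarantine)  # unknown labels fall back to review
    return handler(payload)
```

The declared Content-Type never enters the decision; only the label predicted from the bytes does.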

🏆

Educational & Research

A production-grade reference for multi-task learning, distillation, and byte-level modeling. Study the architecture, extend it, or distill it further.

Pico-Type: Bytes In, Answers Out