ⓘ by Eulogik · Open Source · Apache 2.0

Pico-Type: Bytes In, Answers Out

A 1.5-million-parameter, byte-level, multi-head classifier that identifies code language, content type, modality, text language, and risk from raw bytes — no parsing, no dependencies.

$ pip install pico-type

GitHub · HuggingFace · Paper

1.5M Parameters
95.2% Real-World Accuracy
62 Code Languages
7 Classification Heads
2 KB Context Window

Why Pico-Type?

Byte-Level

Operates directly on raw bytes. No tokenizer, no parser, no language-specific pre-processing. Feed it anything from a Python script to a ZIP file.
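
As a rough illustration of what "raw bytes" means here, the sketch below takes at most the first 2 KB of an input and turns it into a fixed-length list of byte values. The helper names `byte_window` and `file_window` and the zero-padding choice are assumptions made for this example; they are not part of the pico-type API, and the model's own padding behavior is not described on this page.

```python
WINDOW = 2048  # 2 KB context window, as stated on this page


def byte_window(data: bytes, window: int = WINDOW, pad: int = 0) -> list:
    """Return exactly `window` integers in 0..255 taken from the start of `data`.

    Hypothetical helper: longer inputs are truncated, shorter inputs are
    right-padded with `pad` (an assumed convention).
    """
    head = data[:window]
    return list(head) + [pad] * (window - len(head))


def file_window(path: str, window: int = WINDOW) -> list:
    """Read only the first `window` bytes of a file; no decoding, no parsing."""
    with open(path, "rb") as fh:
        return byte_window(fh.read(window), window)
```

Because the input is opened in binary mode and never decoded, the same path handles a Python script, UTF-8 Hindi text, or a ZIP archive.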

Multi-Head Architecture

Seven specialized classification heads share one backbone: coarse type, modality, subtype, code language, text language, file MIME, and risk.
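
To make the "one backbone, seven heads" idea concrete, here is a toy sketch in plain Python. It is not the Pico-Type architecture: the backbone is replaced by a simple byte histogram, the weights are random, and the sizes of the subtype, text-language, and MIME heads are placeholders, since this page does not enumerate those label sets. Only the shape of the computation (features computed once, then reused by every head) is the point.

```python
import random

# Sizes for coarse (7), modality (8), code language (62) and risk (3) follow
# the label lists on this page; the other three are placeholder assumptions.
HEAD_SIZES = {
    "coarse": 7,
    "modality": 8,
    "subtype": 8,         # assumption
    "code_language": 62,
    "text_language": 16,  # assumption
    "file_mime": 32,      # assumption
    "risk": 3,
}
FEATURES = 16  # hypothetical width of the shared representation


def backbone(data: bytes) -> list:
    """Stand-in for the shared backbone: a normalized 16-bin byte histogram."""
    hist = [0.0] * FEATURES
    for b in data[:2048]:
        hist[b * FEATURES // 256] += 1.0
    total = sum(hist) or 1.0
    return [h / total for h in hist]


def make_heads(seed: int = 0) -> dict:
    """One random weight matrix per head; a real model would train these."""
    rng = random.Random(seed)
    return {
        name: [[rng.uniform(-1, 1) for _ in range(FEATURES)] for _ in range(n)]
        for name, n in HEAD_SIZES.items()
    }


def forward(data: bytes, heads: dict) -> dict:
    """Run the backbone once, then apply every head to the same features."""
    feats = backbone(data)
    return {
        name: [sum(w * f for w, f in zip(row, feats)) for row in rows]
        for name, rows in heads.items()
    }
```

Sharing the backbone means the per-head cost is a single small linear layer, which is one way a seven-output classifier can stay within a small parameter budget.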

Tiny & Fast

Just 1.5 million parameters. The ONNX export is ~210 KB. Runs inference in milliseconds on CPU. Deploy anywhere — edge, browser, serverless.

62 Languages

Detects 62 programming languages from bytes alone: Python, JavaScript, Rust, Go, C++, Java, SQL, Bash, and 54 more. No shebang or extension required.

Beyond Code

Classifies text (English, French, Hindi, and more), config files (JSON, YAML, TOML, env), markup (HTML, Markdown, LaTeX), images (PNG, GIF), and binary archives.

ONNX + CLI

Built-in export to ONNX for cross-platform deployment. Comes with a CLI tool and an optional MCP server. Zero-config setup for CI/CD pipelines.

Seven Heads, One Model

Coarse

code / text / markup / config / image / binary / error

Modality

source / script / markup / prose / serialization / bitmap / archive / traceback

Subtype

general / web / data / system / shell / build / doc …

Code Language

62 languages: Python, JS, Rust, Go, Java, C++, SQL, Bash & more

Text Language

English, French, Hindi, Czech, Polish, …

File MIME

text/x-python, image/png, application/zip, …

Risk

safe / unknown / error
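
A minimal sketch of how per-head scores could be turned into the labels listed above, using softmax and argmax. Only the three heads whose label sets are fully listed on this page are included. The label order, the `decode` function, and the input format (a dict of logit lists) are assumptions for illustration, not the library's actual output format.

```python
import math

# Label sets copied from this page; the index order is an assumption.
LABELS = {
    "coarse": ["code", "text", "markup", "config", "image", "binary", "error"],
    "modality": ["source", "script", "markup", "prose",
                 "serialization", "bitmap", "archive", "traceback"],
    "risk": ["safe", "unknown", "error"],
}


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def decode(head_logits):
    """Map each head's logits to a (label, probability) pair."""
    result = {}
    for head, logits in head_logits.items():
        probs = softmax(logits)
        best = max(range(len(probs)), key=probs.__getitem__)
        result[head] = (LABELS[head][best], probs[best])
    return result
```

Each head is decoded independently, so one head can be wrong while the others are right for the same input.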

Real-World Accuracy: 20/21 (95.2%)

Tested against a hand-curated set covering every coarse type — from Python and JavaScript to ZIP binaries and Hindi text.

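| Input | Coarse | Modality | Code/Text Lang | File MIME | Risk |
|---|---|---|---|---|---|
| Python | code | source | Python | text/x-python | safe |
| JavaScript | code | source | JavaScript | text/javascript | safe |
| C | code | source | C | text/x-c | safe |
| Java | code | source | Java | text/x-java | safe |
| SQL | code | script | SQL | text/x-sql | safe |
| Bash | code | script | Bash | text/x-shellscript | safe |
| Rust | code | source | Rust | text/x-rust | safe |
| Go | code | source | Go | text/x-go | safe |
| CSS | code | source | CSS | text/css | safe |
| HTML | markup | markup | | text/html | safe |
| English text | text | prose | English | text/plain | safe |
| French text | text | prose | French | text/plain | safe |
| Hindi text | text | prose | Hindi | text/plain | safe |
| JSON | config | serialization | | application/json | safe |
| YAML | error (expected: config) | serialization | | text/yaml | safe |
| .env | config | config | | text/plain | safe |
| Error traceback | error | traceback | | text/plain | error |
| PNG | image | bitmap | | image/png | safe |
| GIF | image | bitmap | | image/gif | safe |
| ZIP archive | binary | archive | | application/zip | safe |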

The only miss: YAML is labeled error instead of config on the coarse head (subtype and MIME are correct). This is a fundamental byte-level ambiguity: YAML content and error messages share byte patterns within the first 2 KB.

📄 Read the Paper on Zenodo

Use Cases

Content Classification

Sort files, API payloads, or raw network data by type, language, and risk in CI/CD pipelines, upload services, or security tools.

IDE Integration

Detect the programming language of unsaved buffers or snippet pastes without relying on file extensions. Works with any editor via the CLI or MCP server.

Data Pipeline Routing

Route incoming data to the correct processing pipeline based on byte-level content analysis — no need to trust Content-Type headers.
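
The routing idea can be sketched as a lookup from a coarse label to a destination. In the example below, `sniff_coarse` is a stand-in that checks a few well-known file signatures (PNG, GIF, ZIP); it is not the Pico-Type model and exists only so the sketch runs without it. The destination names and the `route` function are hypothetical; in practice the `classify` argument would be whatever callable wraps the real classifier.

```python
def sniff_coarse(data: bytes) -> str:
    """Stand-in classifier based on magic numbers, NOT the model."""
    if data.startswith(b"\x89PNG\r\n\x1a\n") or data[:6] in (b"GIF87a", b"GIF89a"):
        return "image"
    if data.startswith(b"PK\x03\x04"):  # ZIP local file header
        return "binary"
    return "text"


# Hypothetical destinations keyed by coarse label.
ROUTES = {
    "image": "thumbnailer",
    "binary": "archive-scanner",
    "text": "indexer",
}


def route(data: bytes, classify=sniff_coarse, default: str = "quarantine") -> str:
    """Pick a destination from the payload bytes, ignoring any declared type."""
    return ROUTES.get(classify(data), default)
```

Sending unrecognized labels to a default destination keeps the router safe when the classifier returns a label that has no configured route.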

Educational & Research

A production-grade reference for multi-task learning, distillation, and byte-level modeling. Study the architecture, extend it, or distill it further.

Get Started in One Command

Install from PyPI and classify any file in seconds. No GPU required. No cloud service. Fully open source.

$ pip install pico-type

View on GitHub