Skylar — local, sovereign, from-scratch LLMs. CLI + loader for the Skylar model family: generative chat, embeddings, and a COBOL specialist.

skylar

A tiny runtime + CLI for the Skylar model family — local, sovereign, from-scratch LLMs you load, run, and serve with one pip install. It covers generative chat, embeddings / retrieval, and a COBOL code specialist — 236M–390M class, runnable on a single GPU or CPU, no data leaving your machine.

Models live under Sophia-AI on HuggingFace: Skylar-236M-Base · Skylar-236M-Chat · Skylar-236M-Embed · Skylar-390M-Cobol.

Install

pip install skylar
# optional HTTP server:
pip install "skylar[serve]"

Use it — CLI

# chat with any Skylar generative model (no forced persona — steer it with --system)
skylar chat --model Sophia-AI/Skylar-236M-Chat --system "Sei un assistente che risponde dal contesto."

# embeddings / retrieval (any SkylarEmbedder model)
skylar embed --model Sophia-AI/Skylar-236M-Embed --query "prestito casa" --docs "mutuo" "meteo"

# one-shot generation (HF repo id or a local checkpoint dir)
skylar generate --model Sophia-AI/Skylar-236M-Chat --prompt "..."

# the COBOL specialist — completes a COBOL stub into a full, compilable program
#   (auto-downloads Skylar-390M-Cobol; it's a stub completer, not a chatbot)
skylar cobol --example
skylar cobol --stub-file my_task.cbl --compile        # your own stub + GnuCOBOL check

# multi-user OpenAI-compatible server (needs the [serve] extra) — full details in "Serve it" below
skylar serve --model <any-skylar-model> --port 8000     # interactive docs at http://localhost:8000/docs

Decoding is greedy by default (--temperature 0.0); there is no forced system prompt — pass --system "..." to steer a chat model. (The skylar cobol subcommand handles the COBOL prompt format for you.)
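
To make the default concrete, here is a minimal sketch of how a temperature of 0.0 reduces to greedy (argmax) decoding while a positive temperature samples from a softmax. The helper name `pick_token` and the toy logits are assumptions for illustration, not the package's internals.

```python
import math
import random

def pick_token(logits, temperature=0.0, rng=None):
    """Return the index of the next token (hypothetical helper, not skylar's API)."""
    if temperature <= 0.0:
        # Greedy decoding: always take the highest-scoring token.
        return max(range(len(logits)), key=lambda i: logits[i])
    rng = rng or random.Random()
    # Temperature-scaled softmax weights, shifted by the max for numerical stability.
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    weights = [math.exp(s - peak) for s in scaled]
    return rng.choices(range(len(logits)), weights=weights, k=1)[0]

logits = [1.0, 3.5, 0.2]
greedy = pick_token(logits)  # deterministic: index of the largest logit
sampled = pick_token(logits, temperature=0.8, rng=random.Random(0))
```

With greedy decoding the same prompt yields the same output on every run, which is usually what you want for retrieval-grounded answers and code completion.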

Use it — Python

import skylar

# generative chat — pass your own system prompt (no forced persona)
m = skylar.load("Sophia-AI/Skylar-236M-Chat")          # HF repo id or a local dir
print(m.generate("Domanda: dove ha sede la Banca d'Italia?",
                 system="Rispondi solo dal contesto fornito."))
for delta in m.stream("..."):                          # streaming
    print(delta, end="", flush=True)

# embeddings / retrieval
e = skylar.load_embedder("Sophia-AI/Skylar-236M-Embed")
ranked = e.rank("costo del denaro", ["la BCE alza i tassi", "ricetta pizza"])

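# the COBOL specialist: a stub completer, not a chatbot
from pathlib import Path
c = skylar.load("Sophia-AI/Skylar-390M-Cobol")
my_stub = Path("my_task.cbl").read_text()              # your COBOL stub (same file as in the CLI example)
print(c.complete_cobol(my_stub))                       # -> full, compilable COBOL program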
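
Conceptually, retrieval with an embedder comes down to scoring each document vector against the query vector. The sketch below ranks toy vectors by cosine similarity. The vectors, the `rank_by_cosine` helper, and its return shape are assumptions for illustration and may differ from what `e.rank` actually computes or returns.

```python
import math

def cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def rank_by_cosine(query_vec, doc_vecs):
    """Return (doc_index, score) pairs, best match first (hypothetical helper)."""
    scored = [(i, cosine(query_vec, v)) for i, v in enumerate(doc_vecs)]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)

# Toy 3-dimensional embeddings; real embeddings have many more dimensions.
query = [0.9, 0.1, 0.0]
docs = [[0.0, 0.2, 0.9], [0.8, 0.3, 0.1]]
ranking = rank_by_cosine(query, docs)
```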

skylar also registers the architecture with 🤗 Transformers, so this works too:

import skylar  # registers nano-transformer
from transformers import AutoModelForCausalLM
model = AutoModelForCausalLM.from_pretrained("Sophia-AI/Skylar-236M-Chat")

Serve it — multi-user HTTP API (`skylar serve`)

skylar serve --model <id> turns any Skylar generative model into an OpenAI-compatible HTTP server built for concurrent users. Requests from many clients are fused into dynamic micro-batches on a single worker that owns the model, so one GPU (or CPU) can serve a whole demo without per-request OOM or GPU contention. Each request can stream its tokens independently.

pip install "skylar[serve]"
skylar serve --model Sophia-AI/Skylar-236M-Chat          # swap the id for ANY Skylar model
#  → http://127.0.0.1:8000   ·   interactive docs: http://127.0.0.1:8000/docs

Open /docs for the auto-generated Swagger UI — every endpoint, schema, and example is described there (or /redoc for ReDoc). The model is whatever you pass to --model (an HF repo id or a local checkpoint dir); an embedder model is auto-detected and served at /v1/embeddings instead.

Method & path	What it does
`POST /v1/chat/completions`	OpenAI chat format. `"stream": true` → Server-Sent Events.
`POST /generate`	One `prompt` → one `completion`.
`GET /health`	Liveness + which model/device is loaded.
`GET /metrics`	Throughput, batch sizes, queue depth.

# one-shot completion
curl localhost:8000/generate -H 'content-type: application/json' \
  -d '{"prompt": "Dove ha sede la Banca d'\''Italia?", "max_new_tokens": 64}'

# OpenAI chat format (+ "stream": true for SSE)
curl -N localhost:8000/v1/chat/completions -H 'content-type: application/json' -d '{
  "messages": [{"role":"system","content":"Sei un esperto COBOL."},
               {"role":"user","content":"Somma due campi PIC 9(4)."}],
  "max_tokens": 256, "stream": true
}'
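
The one-shot curl call above can also be made from Python with only the standard library. This is a sketch under assumptions: the helper names are hypothetical, the base URL is the default from the examples, and the response is returned as parsed JSON without assuming its exact fields.

```python
import json
import urllib.request

def build_generate_request(prompt, max_new_tokens=64, base_url="http://localhost:8000"):
    """Build the POST /generate request (hypothetical helper mirroring the curl call)."""
    payload = json.dumps({"prompt": prompt, "max_new_tokens": max_new_tokens})
    return urllib.request.Request(
        f"{base_url}/generate",
        data=payload.encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )

def generate(prompt, **kwargs):
    """Send the request; needs a running `skylar serve` instance."""
    request = build_generate_request(prompt, **kwargs)
    with urllib.request.urlopen(request, timeout=60) as response:
        return json.loads(response.read().decode("utf-8"))

# Building the request does not touch the network, so it is safe to inspect.
req = build_generate_request("Dove ha sede la Banca d'Italia?")
```

Calling `generate(...)` returns the decoded JSON body; per the endpoint table, the generated text is expected under a completion field.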

Drop-in with the official OpenAI client — just point base_url at the server:

from openai import OpenAI
client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed")
r = client.chat.completions.create(
    model="Sophia-AI/Skylar-236M-Chat",
    messages=[{"role": "user", "content": "Spiega cosa fa questo COBOL ..."}],
    stream=True,
)
for chunk in r:
    print(chunk.choices[0].delta.content or "", end="", flush=True)
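
If you consume the stream without the OpenAI client, you have to parse the Server-Sent Events yourself. The sketch below assumes the server follows the usual OpenAI wire format (`data: {json}` lines ending with `data: [DONE]`); that format and the `iter_sse_deltas` helper are assumptions, so check /docs for the actual schema.

```python
import json

def iter_sse_deltas(lines):
    """Yield text deltas from OpenAI-style SSE lines (assumed wire format)."""
    for raw in lines:
        line = raw.strip()
        if not line.startswith("data:"):
            continue  # skip blank separators and comment lines
        data = line[len("data:"):].strip()
        if data == "[DONE]":
            break  # end-of-stream sentinel
        chunk = json.loads(data)
        choices = chunk.get("choices") or []
        if not choices:
            continue
        delta = choices[0].get("delta", {}).get("content")
        if delta:
            yield delta

# Sample lines as they might arrive over the wire.
sample = [
    'data: {"choices": [{"delta": {"role": "assistant"}}]}',
    "",
    'data: {"choices": [{"delta": {"content": "Hel"}}]}',
    'data: {"choices": [{"delta": {"content": "lo"}}]}',
    "data: [DONE]",
]
text = "".join(iter_sse_deltas(sample))
```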

Tuning concurrency

Flag	Default	Meaning
`--max-batch`	`8`	Max requests fused into one forward pass. Raise for more throughput until VRAM/latency says stop.
`--max-wait-ms`	`15`	How long to wait for stragglers before launching a batch. Higher = bigger batches, slightly more latency.
`--max-queue`	`256`	Input-queue depth; beyond it the server returns 503 (backpressure instead of OOM).
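
Because a full queue answers with 503 rather than queuing indefinitely, clients should be ready to retry. A minimal sketch of exponential backoff follows; `call_with_backoff` and the `send` callable are hypothetical names, and the retry count and delays are illustrative rather than recommended values.

```python
import time

def call_with_backoff(send, retries=4, base_delay=0.25, sleep=time.sleep):
    """Call send() until it stops returning 503 or the retries run out.

    `send` is a hypothetical zero-argument callable returning (status, body).
    """
    for attempt in range(retries + 1):
        status, body = send()
        if status != 503 or attempt == retries:
            return status, body
        # Double the wait after every rejected attempt.
        sleep(base_delay * (2 ** attempt))

# Demo with a fake server that is saturated twice, then accepts the request.
responses = iter([(503, None), (503, None), (200, "ok")])
delays = []
result = call_with_backoff(lambda: next(responses), sleep=delays.append)
```

In real use, `send` would wrap the HTTP call and `sleep` would stay at its default; the demo records the delays instead of actually waiting.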

How it works (for implementers / a future maintainer)

The server is skylar/serve.py — the model code (decoder.py / attention.py) is left untouched:

One worker owns the model. Async routes enqueue requests; a single background thread pulls a micro-batch (up to --max-batch, waiting --max-wait-ms) and runs it. No two CUDA calls race, and there is exactly one set of KV-caches in flight — so concurrency can't OOM the box.
True batched decoding. generate_batch() left-pads ragged prompts and builds a 4D additive mask (causal + pad) that NanoTransformer.forward already accepts (its dense-mask SDPA path), so prompts of different lengths decode together with a shared KV-cache and per-row EOS stop. Per-row sampling mirrors NanoTransformer.generate exactly, so batched output is token-for-token identical to single-stream output (checked by tests/test_batch_equiv.py).
Batching + streaming coexist. Each request carries its own queue; the worker pushes text deltas into it as tokens are produced, so every request in a batch streams independently.
Current limits (PoC). Static micro-batching (a batch starts and finishes together). For heavy, time-skewed load the next step is continuous batching (adding requests to an in-flight batch). One model per process; greedy is the default, sampling params are per-request.
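
The "one worker pulls a micro-batch" step described above can be sketched with the standard library alone. This is not the code in skylar/serve.py; `collect_batch` is a hypothetical helper, and only the three defaults (8, 15 ms, 256) are taken from the flag table.

```python
import queue
import time

def collect_batch(q, max_batch=8, max_wait_ms=15):
    """Block for the first request, then gather stragglers until the batch
    is full or the wait window closes."""
    batch = [q.get()]  # wait for at least one request
    deadline = time.monotonic() + max_wait_ms / 1000.0
    while len(batch) < max_batch:
        remaining = deadline - time.monotonic()
        if remaining <= 0:
            break
        try:
            batch.append(q.get(timeout=remaining))
        except queue.Empty:
            break  # window closed with no further requests
    return batch

requests = queue.Queue(maxsize=256)  # bounded, mirroring --max-queue
for prompt in ["a", "b", "c"]:
    requests.put_nowait(prompt)

first = collect_batch(requests, max_batch=2)  # capped by max_batch
second = collect_batch(requests)              # the leftover request, after the wait window
```

In a server built this way, `put_nowait` raising `queue.Full` is the natural place to return the 503 backpressure response.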

python tests/test_batch_equiv.py     # run after touching batching/masking: batched == single-stream
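
To illustrate the left-padding and additive mask described under "True batched decoding", here is a pure-Python sketch. It is not the package's generate_batch(); the helper names and the pad id of 0 are assumptions, and nested lists stand in for a [batch, 1, query, key] tensor.

```python
NEG_INF = float("-inf")

def left_pad(prompts, pad_id=0):
    """Left-pad token-id lists to a common width; also return a validity flag per slot."""
    width = max(len(p) for p in prompts)
    ids = [[pad_id] * (width - len(p)) + list(p) for p in prompts]
    valid = [[False] * (width - len(p)) + [True] * len(p) for p in prompts]
    return ids, valid

def additive_mask(valid):
    """Build a [batch][1][query][key] mask: 0.0 where attention is allowed, -inf elsewhere."""
    mask = []
    for row in valid:
        n = len(row)
        # A key is visible if it is not in the future (causal) and is not padding.
        grid = [[0.0 if (k <= q and row[k]) else NEG_INF for k in range(n)]
                for q in range(n)]
        mask.append([grid])  # singleton head dimension, broadcast over heads
    return mask

ids, valid = left_pad([[5, 6, 7], [9]])
mask = additive_mask(valid)
```

Query rows that belong to pad positions end up fully masked in this sketch. A real implementation has to keep softmax from producing NaNs on those rows, which is not addressed here.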

What's inside

The Skylar models use a custom decoder (NanoTransformer, Qwen3-style: RMSNorm + RoPE + GQA + QK-Norm + SwiGLU), trained 100% from scratch (no third-party pretrained weights). This package vendors the architecture so the published weights load anywhere — no private framework needed.
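
For readers unfamiliar with two of the building blocks named above, here is a dependency-free sketch of RMSNorm and the SwiGLU gate using their standard textbook definitions. It is not the vendored NanoTransformer code; the epsilon value and the function names are assumptions.

```python
import math

def rms_norm(x, weight, eps=1e-6):
    """Scale x by the reciprocal of its root-mean-square, then by a learned gain."""
    mean_sq = sum(v * v for v in x) / len(x)
    inv_rms = 1.0 / math.sqrt(mean_sq + eps)
    return [v * inv_rms * w for v, w in zip(x, weight)]

def silu(v):
    """SiLU (swish) activation: v * sigmoid(v)."""
    return v / (1.0 + math.exp(-v))

def swiglu(gate, up):
    """Elementwise SiLU(gate) * up, the gating inside a SwiGLU feed-forward block."""
    return [silu(g) * u for g, u in zip(gate, up)]

normed = rms_norm([3.0, 4.0], [1.0, 1.0], eps=0.0)
gated = swiglu([0.0, 1.0], [2.0, 2.0])
```

In a full block the `gate` and `up` vectors come from two separate linear projections of the same input, and a third projection maps the gated result back to the model width.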

License

Apache-2.0. Models & code IP: A. Ivanovitch (Sophia AI).

Release history

Version	Released
0.3.2	Jun 16, 2026
0.3.1	Jun 16, 2026
0.3.0 (this version)	Jun 16, 2026
0.2.3	Jun 15, 2026
0.2.1	Jun 15, 2026
0.2.0	Jun 15, 2026
0.1.0	Jun 15, 2026

Download files

Source Distribution

skylar-0.3.0.tar.gz (41.1 kB)

Uploaded Jun 16, 2026 Source

Built Distribution

skylar-0.3.0-py3-none-any.whl (43.4 kB)

Uploaded Jun 16, 2026 Python 3

File details

Details for the file skylar-0.3.0.tar.gz.

File metadata

Download URL: skylar-0.3.0.tar.gz
Upload date: Jun 16, 2026
Size: 41.1 kB
Tags: Source
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.2.0 CPython/3.12.3

File hashes

Hashes for skylar-0.3.0.tar.gz
Algorithm	Hash digest
SHA256	`ceddcdd9165630a1ce977f09cc4ee76ae03907ee5849c32de7131616fb8b42df`
MD5	`31a470da0ad2089903bc499b7739c1e1`
BLAKE2b-256	`5f9944033208e59992bbaee9432b134c11febd5decb66513701e57eab5e8cc26`

File details

Details for the file skylar-0.3.0-py3-none-any.whl.

File metadata

Download URL: skylar-0.3.0-py3-none-any.whl
Upload date: Jun 16, 2026
Size: 43.4 kB
Tags: Python 3
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.2.0 CPython/3.12.3

File hashes

Hashes for skylar-0.3.0-py3-none-any.whl
Algorithm	Hash digest
SHA256	`58447241e9c98dc99f8a1659446e27c8038e0b980d0a78c139ba3fbcb6d9868f`
MD5	`85c27f3af606f1a4b2075d6ac47d23d3`
BLAKE2b-256	`16e0b8e4866bb595d77a1edc3367dc8b4b83f9643748b19a8da7828c8606c4ba`
