Skylar — local, sovereign, from-scratch LLMs. CLI + loader for the Skylar model family: generative chat, embeddings, and a COBOL specialist.
skylar
A tiny runtime + CLI for the Skylar model family — local, sovereign, from-scratch LLMs you
load, run, and serve with one pip install. It covers generative chat, embeddings /
retrieval, and a COBOL code specialist — 236M–390M class, runnable on a single GPU or CPU,
no data leaving your machine.
Models live under Sophia-AI on HuggingFace:
Skylar-236M-Base · Skylar-236M-Chat · Skylar-236M-Embed · Skylar-390M-Cobol.
Install
pip install skylar
# optional HTTP server:
pip install "skylar[serve]"
Use it — CLI
# chat with any Skylar generative model (no forced persona — steer it with --system)
skylar chat --model Sophia-AI/Skylar-236M-Chat --system "Sei un assistente che risponde dal contesto."
# embeddings / retrieval (any SkylarEmbedder model)
skylar embed --model Sophia-AI/Skylar-236M-Embed --query "prestito casa" --docs "mutuo" "meteo"
# one-shot generation (HF repo id or a local checkpoint dir)
skylar generate --model Sophia-AI/Skylar-236M-Chat --prompt "..."
# the COBOL specialist — completes a COBOL stub into a full, compilable program
# (auto-downloads Skylar-390M-Cobol; it's a stub completer, not a chatbot)
skylar cobol --example
skylar cobol --stub-file my_task.cbl --compile # your own stub + GnuCOBOL check
# multi-user OpenAI-compatible server (needs the [serve] extra) — full details in "Serve it" below
skylar serve --model <any-skylar-model> --port 8000 # interactive docs at http://localhost:8000/docs
Decoding is greedy by default (--temperature 0.0); there is no forced system prompt — pass
--system "..." to steer a chat model. (The skylar cobol subcommand handles the COBOL prompt
format for you.)
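To make the greedy default concrete, here is a minimal sketch of how a temperature of 0.0 is commonly mapped to argmax decoding, with temperature-scaled sampling otherwise. The function name pick_token and its signature are hypothetical and not taken from the skylar source; the package's actual decoding loop may differ.

```python
import math
import random

def pick_token(logits, temperature=0.0, rng=None):
    """Pick the next token id from raw logits.

    Assumption: temperature 0.0 means plain argmax (greedy). This mirrors the
    documented default but is not copied from the skylar implementation.
    """
    if temperature <= 0.0:
        # Greedy: always the highest-scoring token, fully deterministic.
        return max(range(len(logits)), key=lambda i: logits[i])
    rng = rng or random.Random()
    # Temperature-scaled softmax weights; subtract the max for stability.
    scaled = [x / temperature for x in logits]
    top = max(scaled)
    weights = [math.exp(x - top) for x in scaled]
    return rng.choices(range(len(logits)), weights=weights, k=1)[0]

logits = [0.1, 2.0, -1.0]
greedy_id = pick_token(logits)                           # argmax of the logits
sampled_id = pick_token(logits, 0.8, random.Random(0))   # seeded sample
```

With greedy decoding the same prompt yields the same output on every run, which is usually what you want for a stub completer; a non-zero temperature trades that repeatability for variety.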
Use it — Python
import skylar
# generative chat — pass your own system prompt (no forced persona)
m = skylar.load("Sophia-AI/Skylar-236M-Chat") # HF repo id or a local dir
print(m.generate("Domanda: dove ha sede la Banca d'Italia?",
                 system="Rispondi solo dal contesto fornito."))
for delta in m.stream("..."):  # streaming
    print(delta, end="", flush=True)
# embeddings / retrieval
e = skylar.load_embedder("Sophia-AI/Skylar-236M-Embed")
ranked = e.rank("costo del denaro", ["la BCE alza i tassi", "ricetta pizza"])
# the COBOL specialist — a stub completer (not a chatbot)
c = skylar.load("Sophia-AI/Skylar-390M-Cobol")
print(c.complete_cobol(my_stub)) # -> full, compilable COBOL program
skylar also registers the architecture with 🤗 Transformers, so this works too:
import skylar # registers nano-transformer
from transformers import AutoModelForCausalLM
model = AutoModelForCausalLM.from_pretrained("Sophia-AI/Skylar-236M-Chat")
Serve it — multi-user HTTP API (skylar serve)
skylar serve --model <id> turns any Skylar generative model into an OpenAI-compatible HTTP
server built for concurrent users. Requests from many clients are fused into dynamic
micro-batches on a single worker that owns the model, so one GPU (or CPU) can serve a whole
demo without per-request OOM or GPU contention. Each request can also stream its tokens.
pip install "skylar[serve]"
skylar serve --model Sophia-AI/Skylar-236M-Chat # swap the id for ANY Skylar model
# → http://127.0.0.1:8000 · interactive docs: http://127.0.0.1:8000/docs
Open /docs for the auto-generated Swagger UI — every endpoint, schema, and example is
described there (or /redoc for ReDoc). The model is whatever you pass to --model (an HF
repo id or a local checkpoint dir); an embedder model is auto-detected and served at
/v1/embeddings instead.
| Method & path | What it does |
|---|---|
| POST /v1/chat/completions | OpenAI chat format. "stream": true → Server-Sent Events. |
| POST /generate | One prompt → one completion. |
| GET /health | Liveness + which model/device is loaded. |
| GET /metrics | Throughput, batch sizes, queue depth. |
# one-shot completion
curl localhost:8000/generate -H 'content-type: application/json' \
-d '{"prompt": "Dove ha sede la Banca d'\''Italia?", "max_new_tokens": 64}'
# OpenAI chat format (+ "stream": true for SSE)
curl -N localhost:8000/v1/chat/completions -H 'content-type: application/json' -d '{
"messages": [{"role":"system","content":"Sei un esperto COBOL."},
{"role":"user","content":"Somma due campi PIC 9(4)."}],
"max_tokens": 256, "stream": true
}'
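If you consume the stream without an SDK, the Server-Sent Events have to be parsed by hand. The sketch below assumes the server follows the usual OpenAI streaming shape (one data: line per JSON chunk, text under choices[0].delta.content, and a final data: [DONE] sentinel); those details are assumptions about the wire format, and iter_sse_deltas is a hypothetical helper, not part of skylar.

```python
import json

def iter_sse_deltas(lines):
    """Yield text deltas from OpenAI-style SSE lines.

    Assumptions: each event is a single `data: <json>` line and the stream
    ends with `data: [DONE]`, as in the OpenAI streaming format.
    """
    for raw in lines:
        line = raw.strip()
        if not line.startswith("data:"):
            continue  # skip blank separators and comment lines
        payload = line[len("data:"):].strip()
        if payload == "[DONE]":
            break
        chunk = json.loads(payload)
        delta = chunk["choices"][0].get("delta", {})
        text = delta.get("content")
        if text:
            yield text

# Hypothetical sample of what a streamed response could look like.
sample = [
    'data: {"choices": [{"delta": {"role": "assistant"}}]}',
    "",
    'data: {"choices": [{"delta": {"content": "ADD A "}}]}',
    "",
    'data: {"choices": [{"delta": {"content": "TO B."}}]}',
    "",
    "data: [DONE]",
]
text = "".join(iter_sse_deltas(sample))
```

Against a live server you would pass the decoded lines of the HTTP response body instead of the sample list.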
Drop-in with the official OpenAI client — just point base_url at the server:
from openai import OpenAI
client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed")
r = client.chat.completions.create(
    model="Sophia-AI/Skylar-236M-Chat",
    messages=[{"role": "user", "content": "Spiega cosa fa questo COBOL ..."}],
    stream=True,
)
for chunk in r:
    print(chunk.choices[0].delta.content or "", end="", flush=True)
Tuning concurrency
| Flag | Default | Meaning |
|---|---|---|
| --max-batch | 8 | Max requests fused into one forward pass. Raise for more throughput until VRAM/latency says stop. |
| --max-wait-ms | 15 | How long to wait for stragglers before launching a batch. Higher = bigger batches, slightly more latency. |
| --max-queue | 256 | Input-queue depth; beyond it the server returns 503 (backpressure instead of OOM). |
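The interplay of the three flags can be illustrated with a small standard-library sketch. It is not the code in skylar/serve.py: submit and collect_batch are hypothetical names, and the real server presumably does this inside its background worker thread.

```python
import queue
import time

def submit(q, request):
    """Try to enqueue a request; False means the queue is full.

    In the documented behaviour a full queue is answered with HTTP 503.
    """
    try:
        q.put_nowait(request)
        return True
    except queue.Full:
        return False

def collect_batch(q, max_batch=8, max_wait_ms=15):
    """Block for the first request, then wait up to max_wait_ms for stragglers."""
    batch = [q.get()]  # wait indefinitely for at least one request
    deadline = time.monotonic() + max_wait_ms / 1000.0
    while len(batch) < max_batch:
        remaining = deadline - time.monotonic()
        if remaining <= 0:
            break
        try:
            batch.append(q.get(timeout=remaining))
        except queue.Empty:
            break
    return batch

inbox = queue.Queue(maxsize=4)  # a tiny --max-queue for the demo
accepted = [submit(inbox, f"req-{i}") for i in range(6)]
batch = collect_batch(inbox, max_batch=3, max_wait_ms=5)
```

In this run the last two submissions are rejected because the queue holds only four items, and the first batch takes three of the four queued requests, leaving one for the next round.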
How it works (for implementers / a future maintainer)
The server is skylar/serve.py — the model code (decoder.py / attention.py) is left
untouched:
- One worker owns the model. Async routes enqueue requests; a single background thread pulls a micro-batch (up to --max-batch, waiting --max-wait-ms) and runs it. No two CUDA calls race, and there is exactly one set of KV-caches in flight, so concurrency can't OOM the box.
- True batched decoding. generate_batch() left-pads ragged prompts and builds a 4D additive mask (causal + pad) that NanoTransformer.forward already accepts (its dense-mask SDPA path), so prompts of different lengths decode together with a shared KV-cache and per-row EOS stop. Per-row sampling mirrors NanoTransformer.generate exactly → batched output is token-for-token identical to single-stream (proven by tests/test_batch_equiv.py).
- Batching + streaming coexist. Each request carries its own queue; the worker pushes text deltas into it as tokens are produced, so every request in a batch streams independently.
- Current limits (PoC). Static micro-batching (a batch starts and finishes together). For heavy, time-skewed load the next step is continuous batching (adding requests to an in-flight batch). One model per process; greedy is the default, sampling params are per-request.
python tests/test_batch_equiv.py # run after touching batching/masking: batched == single-stream
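For readers new to the left-pad plus additive-mask idea, here is a dependency-free sketch using nested lists in place of tensors. It is an illustration only: left_pad, additive_mask, and pad_id=0 are assumptions, and the real generate_batch() may build its mask differently (for example, in how it treats pad query rows).

```python
NEG_INF = float("-inf")

def left_pad(prompts, pad_id=0):
    """Left-pad token-id lists to a common length; return (ids, pad_flags)."""
    width = max(len(p) for p in prompts)
    ids = [[pad_id] * (width - len(p)) + list(p) for p in prompts]
    is_pad = [[True] * (width - len(p)) + [False] * len(p) for p in prompts]
    return ids, is_pad

def additive_mask(is_pad):
    """Build a [batch][1][query][key] mask: 0.0 = attend, -inf = blocked."""
    mask = []
    for row in is_pad:
        t = len(row)
        grid = [
            [
                # Causal (key <= query) and not a pad key. The diagonal stays
                # open so a pad query never ends up with an all -inf row.
                0.0 if (k <= q and not row[k]) or k == q else NEG_INF
                for k in range(t)
            ]
            for q in range(t)
        ]
        mask.append([grid])  # singleton head axis, broadcast over heads
    return mask

ids, is_pad = left_pad([[5, 6, 7], [9]])
mask = additive_mask(is_pad)
```

The mask is added to the attention scores before the softmax, so blocked positions receive zero weight. Left padding keeps the last real token of every row in the final column, which is where the next token is read from during batched decoding.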
What's inside
The Skylar models use a custom decoder (NanoTransformer, Qwen3-style: RMSNorm + RoPE + GQA +
QK-Norm + SwiGLU), trained 100% from scratch (no third-party pretrained weights). This package
vendors the architecture so the published weights load anywhere — no private framework needed.
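Two of the building blocks named above are small enough to sketch in plain Python. This shows the textbook definitions of RMSNorm and SwiGLU gating on lists of floats, not the vendored NanoTransformer code; the linear projections that normally produce the gate and up inputs are omitted, and the eps default is an assumption.

```python
import math

def rms_norm(x, weight, eps=1e-6):
    """RMSNorm: divide x by its root-mean-square, then apply a learned scale."""
    rms = math.sqrt(sum(v * v for v in x) / len(x) + eps)
    return [w * v / rms for v, w in zip(x, weight)]

def silu(v):
    """SiLU (swish) activation: v * sigmoid(v)."""
    return v / (1.0 + math.exp(-v))

def swiglu(gate, up):
    """SwiGLU gating: SiLU(gate) multiplied elementwise with up."""
    return [silu(g) * u for g, u in zip(gate, up)]

normed = rms_norm([3.0, 4.0], weight=[1.0, 1.0], eps=0.0)
gated = swiglu([0.0, 1.0], [2.0, 2.0])
```

After rms_norm with unit weights and eps set to zero, the mean of the squared outputs is 1, which is the property the normalization is named for.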
License
Apache-2.0. Models & code IP: A. Ivanovitch (Sophia AI).
File details
Details for the file skylar-0.3.0.tar.gz.
File metadata
- Download URL: skylar-0.3.0.tar.gz
- Upload date:
- Size: 41.1 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.12.3
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | ceddcdd9165630a1ce977f09cc4ee76ae03907ee5849c32de7131616fb8b42df |
| MD5 | 31a470da0ad2089903bc499b7739c1e1 |
| BLAKE2b-256 | 5f9944033208e59992bbaee9432b134c11febd5decb66513701e57eab5e8cc26 |
File details
Details for the file skylar-0.3.0-py3-none-any.whl.
File metadata
- Download URL: skylar-0.3.0-py3-none-any.whl
- Upload date:
- Size: 43.4 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.12.3
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 58447241e9c98dc99f8a1659446e27c8038e0b980d0a78c139ba3fbcb6d9868f |
| MD5 | 85c27f3af606f1a4b2075d6ac47d23d3 |
| BLAKE2b-256 | 16e0b8e4866bb595d77a1edc3367dc8b4b83f9643748b19a8da7828c8606c4ba |