Skylar — local, sovereign, from-scratch LLMs. CLI + loader for the Skylar model family: generative chat, embeddings, and a COBOL specialist.

skylar

A tiny runtime and CLI for the Skylar model family: local, sovereign, from-scratch LLMs that you load, run, and serve with one pip install. It covers generative chat, embeddings / retrieval, and a COBOL code specialist. The models are in the 236M–390M class and run on a single GPU or CPU, with no data leaving your machine.

Models live under Sophia-AI on HuggingFace: Skylar-236M-Base · Skylar-236M-Chat · Skylar-236M-Embed · Skylar-390M-Cobol.

Install

pip install skylar
# optional HTTP server:
pip install "skylar[serve]"

Use it — CLI

# chat with any Skylar generative model (no forced persona — steer it with --system)
skylar chat --model Sophia-AI/Skylar-236M-Chat --system "Sei un assistente che risponde dal contesto."

# embeddings / retrieval (any SkylarEmbedder model)
skylar embed --model Sophia-AI/Skylar-236M-Embed --query "prestito casa" --docs "mutuo" "meteo"

# one-shot generation (HF repo id or a local checkpoint dir)
skylar generate --model Sophia-AI/Skylar-236M-Chat --prompt "..."

# the COBOL specialist — completes a COBOL stub into a full, compilable program
#   (auto-downloads Skylar-390M-Cobol; it's a stub completer, not a chatbot)
skylar cobol --example
skylar cobol --stub-file my_task.cbl --compile        # your own stub + GnuCOBOL check

# multi-user OpenAI-compatible server (needs the [serve] extra) — full details in "Serve it" below
skylar serve --model <any-skylar-model> --port 8000     # interactive docs at http://localhost:8000/docs

Decoding is greedy by default (--temperature 0.0); there is no forced system prompt — pass --system "..." to steer a chat model. (The skylar cobol subcommand handles the COBOL prompt format for you.)
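
To make the greedy default concrete, here is a minimal sketch of what a temperature of 0.0 usually means compared with a positive temperature. It is a generic illustration, not code from the package: pick_next_token is a hypothetical helper, and treating a temperature of 0.0 as plain argmax is an assumption about how skylar interprets the flag.

```python
import math
import random

def pick_next_token(logits, temperature=0.0, rng=None):
    """Return the index of the next token given raw logits (hypothetical helper)."""
    if temperature <= 0.0:
        # Greedy: always take the highest-scoring token, so output is deterministic.
        return max(range(len(logits)), key=lambda i: logits[i])
    rng = rng or random.Random()
    # Temperature sampling: scale the logits, exponentiate, then draw one index.
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    weights = [math.exp(x - peak) for x in scaled]
    return rng.choices(range(len(logits)), weights=weights, k=1)[0]

logits = [1.2, 3.4, 0.5]
print(pick_next_token(logits))  # greedy: index of the largest logit
print(pick_next_token(logits, temperature=0.8, rng=random.Random(0)))
```

With greedy decoding the same prompt always yields the same completion, which is convenient for tests and demos; a positive temperature trades that repeatability for variety.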

Use it — Python

import skylar

# generative chat — pass your own system prompt (no forced persona)
m = skylar.load("Sophia-AI/Skylar-236M-Chat")          # HF repo id or a local dir
print(m.generate("Domanda: dove ha sede la Banca d'Italia?",
                 system="Rispondi solo dal contesto fornito."))
for delta in m.stream("..."):                          # streaming
    print(delta, end="", flush=True)

# embeddings / retrieval
e = skylar.load_embedder("Sophia-AI/Skylar-236M-Embed")
ranked = e.rank("costo del denaro", ["la BCE alza i tassi", "ricetta pizza"])

# the COBOL specialist: a stub completer (not a chatbot)
c = skylar.load("Sophia-AI/Skylar-390M-Cobol")
with open("my_task.cbl") as f:                         # your own COBOL stub (illustrative path)
    my_stub = f.read()
print(c.complete_cobol(my_stub))                       # -> full, compilable COBOL program
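
For readers new to embedding retrieval, the sketch below shows one common way a ranking call like e.rank(query, docs) can work: embed everything, then sort documents by cosine similarity to the query. Whether skylar uses cosine similarity is an assumption; cosine and rank_by_similarity are hypothetical helpers, and the three-dimensional vectors are made up for illustration.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def rank_by_similarity(query_vec, doc_vecs, docs):
    """Return (score, doc) pairs sorted from most to least similar."""
    scored = [(cosine(query_vec, v), d) for v, d in zip(doc_vecs, docs)]
    return sorted(scored, key=lambda pair: pair[0], reverse=True)

# Toy vectors standing in for real embeddings (invented values).
query = [0.9, 0.1, 0.0]
docs = ["la BCE alza i tassi", "ricetta pizza"]
vecs = [[0.8, 0.2, 0.1], [0.0, 0.1, 0.9]]

for score, doc in rank_by_similarity(query, vecs, docs):
    print(f"{score:.3f}  {doc}")
```

In real use the vectors would come from the embedder model rather than being typed by hand.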

skylar also registers the architecture with 🤗 Transformers, so this works too:

import skylar  # registers nano-transformer
from transformers import AutoModelForCausalLM
model = AutoModelForCausalLM.from_pretrained("Sophia-AI/Skylar-236M-Chat")

Serve it — multi-user HTTP API (skylar serve)

skylar serve --model <id> turns any Skylar generative model into an OpenAI-compatible HTTP server built for concurrent users. Requests from many clients are fused into dynamic micro-batches on a single worker that owns the model, so one GPU (or CPU) can serve a whole demo without per-request OOM or GPU contention. Each request can still stream its own tokens.

pip install "skylar[serve]"
skylar serve --model Sophia-AI/Skylar-236M-Chat          # swap the id for ANY Skylar model
#  → http://127.0.0.1:8000   ·   interactive docs: http://127.0.0.1:8000/docs

Open /docs for the auto-generated Swagger UI — every endpoint, schema, and example is described there (or /redoc for ReDoc). The model is whatever you pass to --model (an HF repo id or a local checkpoint dir); an embedder model is auto-detected and served at /v1/embeddings instead.

Method & path             | What it does
POST /v1/chat/completions | OpenAI chat format. "stream": true → Server-Sent Events.
POST /generate            | One prompt → one completion.
GET /health               | Liveness + which model/device is loaded.
GET /metrics              | Throughput, batch sizes, queue depth.

# one-shot completion
curl localhost:8000/generate -H 'content-type: application/json' \
  -d '{"prompt": "Dove ha sede la Banca d'\''Italia?", "max_new_tokens": 64}'

# OpenAI chat format (+ "stream": true for SSE)
curl -N localhost:8000/v1/chat/completions -H 'content-type: application/json' -d '{
  "messages": [{"role":"system","content":"Sei un esperto COBOL."},
               {"role":"user","content":"Somma due campi PIC 9(4)."}],
  "max_tokens": 256, "stream": true
}'

Drop-in with the official OpenAI client — just point base_url at the server:

from openai import OpenAI
client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed")
r = client.chat.completions.create(
    model="Sophia-AI/Skylar-236M-Chat",
    messages=[{"role": "user", "content": "Spiega cosa fa questo COBOL ..."}],
    stream=True,
)
for chunk in r:
    print(chunk.choices[0].delta.content or "", end="", flush=True)
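
If you would rather not depend on the openai package, the streamed response can be parsed by hand. The sketch below assumes the server follows the usual OpenAI chunk layout ("data: {json}" lines with choices[0].delta.content, ended by "data: [DONE]"); that the Skylar server emits the [DONE] sentinel is an assumption, and iter_sse_deltas is a hypothetical helper.

```python
import json

def iter_sse_deltas(lines):
    """Yield text deltas from an iterable of SSE lines in OpenAI chat-chunk format."""
    for raw in lines:
        line = raw.decode("utf-8") if isinstance(raw, bytes) else raw
        line = line.strip()
        if not line.startswith("data:"):
            continue  # skip blank separators and comment lines
        payload = line[len("data:"):].strip()
        if payload == "[DONE]":
            break  # assumed end-of-stream sentinel
        choices = json.loads(payload).get("choices") or []
        if not choices:
            continue
        delta = choices[0].get("delta", {}).get("content")
        if delta:
            yield delta

# Hand-written sample lines, standing in for a real response body.
sample = [
    b'data: {"choices":[{"delta":{"role":"assistant"}}]}\n',
    b'\n',
    b'data: {"choices":[{"delta":{"content":"ADD "}}]}\n',
    b'data: {"choices":[{"delta":{"content":"A TO B."}}]}\n',
    b'data: [DONE]\n',
]
print("".join(iter_sse_deltas(sample)))
```

Because the function only needs an iterable of lines, a response object from urllib.request.urlopen can be passed in directly, since it iterates over the body line by line.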

Tuning concurrency

Flag          | Default | Meaning
--max-batch   | 8       | Max requests fused into one forward pass. Raise for more throughput until VRAM/latency says stop.
--max-wait-ms | 15      | How long to wait for stragglers before launching a batch. Higher = bigger batches, slightly more latency.
--max-queue   | 256     | Input-queue depth; beyond it the server returns 503 (backpressure instead of OOM).
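
The way these three flags interact can be sketched with a bounded queue and a deadline. This is an illustration under stated assumptions, not the code in skylar/serve.py: enqueue and collect_batch are hypothetical names, and a False return from enqueue stands for the point where the server would answer 503.

```python
import queue
import time

def enqueue(q, request):
    """Try to queue a request; False means the queue is full (the 503 case)."""
    try:
        q.put_nowait(request)
        return True
    except queue.Full:
        return False

def collect_batch(q, max_batch=8, max_wait_ms=15):
    """Block for the first request, then wait up to max_wait_ms for stragglers."""
    batch = [q.get()]  # a batch always has at least one request
    deadline = time.monotonic() + max_wait_ms / 1000.0
    while len(batch) < max_batch:
        remaining = deadline - time.monotonic()
        if remaining <= 0:
            break
        try:
            batch.append(q.get(timeout=remaining))
        except queue.Empty:
            break  # deadline passed with no new arrivals
    return batch

requests = queue.Queue(maxsize=256)  # plays the role of --max-queue
accepted = [enqueue(requests, f"prompt-{i}") for i in range(10)]
first = collect_batch(requests)   # fills up to max_batch
second = collect_batch(requests)  # takes the leftovers once the wait expires
print(len(first), len(second))
```

A larger max_wait_ms gives stragglers more time to join a batch, at the cost of extra latency for the request that arrived first.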

How it works (for implementers / a future maintainer)

The server is skylar/serve.py — the model code (decoder.py / attention.py) is left untouched:

  • One worker owns the model. Async routes enqueue requests; a single background thread pulls a micro-batch (up to --max-batch, waiting --max-wait-ms) and runs it. No two CUDA calls race, and there is exactly one set of KV-caches in flight — so concurrency can't OOM the box.
  • True batched decoding. generate_batch() left-pads ragged prompts and builds a 4D additive mask (causal + pad) that NanoTransformer.forward already accepts (its dense-mask SDPA path), so prompts of different lengths decode together with a shared KV-cache and per-row EOS stop. Per-row sampling mirrors NanoTransformer.generate exactly → batched output is token-for-token identical to single-stream (proven by tests/test_batch_equiv.py).
  • Batching + streaming coexist. Each request carries its own queue; the worker pushes text deltas into it as tokens are produced, so every request in a batch streams independently.
  • Current limits (PoC). Micro-batching is static: a batch starts and finishes together. Under heavy, time-skewed load the next step would be continuous batching (adding requests to an in-flight batch). One model per process. Greedy decoding is the default; sampling parameters are set per request.

python tests/test_batch_equiv.py     # run after touching batching/masking: batched == single-stream
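
The left-padding and additive-mask idea from the batched-decoding bullet can be sketched in plain Python. This is not generate_batch() itself, whose code is not shown here: left_pad and additive_mask are hypothetical helpers, pad_id=0 is an assumed padding id, and letting each padded position attend to itself (so no mask row is entirely -inf) is a design choice of this sketch.

```python
NEG_INF = float("-inf")

def left_pad(prompts, pad_id=0):
    """Left-pad token-id lists to a common length; return ids and a keep-mask."""
    width = max(len(p) for p in prompts)
    ids = [[pad_id] * (width - len(p)) + list(p) for p in prompts]
    keep = [[False] * (width - len(p)) + [True] * len(p) for p in prompts]
    return ids, keep

def additive_mask(keep):
    """Build a [batch][1][query][key] mask: 0.0 where attention is allowed, -inf elsewhere."""
    mask = []
    for row in keep:
        n = len(row)
        grid = [
            # Allowed: key is not in the future and is a real token, or key == query.
            [0.0 if (k <= q and row[k]) or k == q else NEG_INF for k in range(n)]
            for q in range(n)
        ]
        mask.append([grid])  # singleton axis that broadcasts over attention heads
    return mask

ids, keep = left_pad([[5, 6, 7], [8, 9]])
mask = additive_mask(keep)
print(ids)
for query_row in mask[1][0]:  # mask for the shorter, padded prompt
    print(query_row)
```

The mask is added to the attention scores before softmax, so a -inf entry removes that key from the query's attention. Left padding keeps the last real token of every prompt in the final column, which is where the next token is predicted.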

What's inside

The Skylar models use a custom decoder (NanoTransformer, Qwen3-style: RMSNorm + RoPE + GQA + QK-Norm + SwiGLU), trained 100% from scratch (no third-party pretrained weights). This package vendors the architecture so the published weights load anywhere — no private framework needed.
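
Two of the building blocks named above have short textbook definitions, sketched here in plain Python. These are the standard formulas, not the vendored NanoTransformer code; the eps value and the helper names are assumptions, and the learned projections that surround these functions in a real layer are omitted.

```python
import math

def rms_norm(x, weight, eps=1e-6):
    """Divide x by its root-mean-square, then scale by a learned per-channel weight."""
    rms = math.sqrt(sum(v * v for v in x) / len(x) + eps)
    return [w * v / rms for v, w in zip(x, weight)]

def silu(v):
    """SiLU (swish) activation: v * sigmoid(v)."""
    return v / (1.0 + math.exp(-v))

def swiglu(gate, up):
    """Elementwise SiLU(gate) * up, the gating step inside a SwiGLU feed-forward block."""
    return [silu(g) * u for g, u in zip(gate, up)]

x = [3.0, 4.0]
print(rms_norm(x, weight=[1.0, 1.0]))
print(swiglu([0.0, 2.0], [1.0, 1.0]))
```

In a full feed-forward block, gate and up are two linear projections of the same input, and the gated product is passed through a third projection back to the model width.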

License

Apache-2.0. Models & code IP: A. Ivanovitch (Sophia AI).
