Skylar — local, sovereign, from-scratch LLMs. CLI + loader for the Skylar model family: generative chat, embeddings, and a COBOL specialist.

skylar

A tiny runtime and CLI for the Skylar model family: local, sovereign, from-scratch LLMs that you load, run, and serve after a single pip install. It covers generative chat, embeddings / retrieval, and a COBOL code specialist. The models are in the 236M–390M class, run on a single GPU or on CPU, and no data leaves your machine.

Models live under Sophia-AI on Hugging Face: Skylar-236M-Base · Skylar-236M-Chat · Skylar-236M-Embed · Skylar-390M-Cobol.

Install

pip install skylar
# optional HTTP server:
pip install "skylar[serve]"

Use it — CLI

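# chat with any Skylar generative model (no forced persona; steer it with --system)
#   (the Italian system prompt means "You are an assistant that answers from the context.")
skylar chat --model Sophia-AI/Skylar-236M-Chat --system "Sei un assistente che risponde dal contesto."

# embeddings / retrieval (any SkylarEmbedder model)
#   (Italian query "prestito casa" = "home loan"; docs "mutuo" = "mortgage", "meteo" = "weather")
skylar embed --model Sophia-AI/Skylar-236M-Embed --query "prestito casa" --docs "mutuo" "meteo"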

# one-shot generation (HF repo id or a local checkpoint dir)
skylar generate --model Sophia-AI/Skylar-236M-Chat --prompt "..."

# the COBOL specialist — completes a COBOL stub into a full, compilable program
#   (auto-downloads Skylar-390M-Cobol; it's a stub completer, not a chatbot)
skylar cobol --example
skylar cobol --stub-file my_task.cbl --compile        # your own stub + GnuCOBOL check

# multi-user OpenAI-compatible server (needs the [serve] extra) — full details in "Serve it" below
skylar serve --model <any-skylar-model> --port 8000     # interactive docs at http://localhost:8000/docs

Decoding is greedy by default (--temperature 0.0); there is no forced system prompt — pass --system "..." to steer a chat model. (The skylar cobol subcommand handles the COBOL prompt format for you.)
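
To make the default concrete, here is a minimal sketch of what greedy decoding at temperature 0.0 usually means, next to temperature sampling. It is an illustration only: pick_token and its arguments are hypothetical names, not part of the skylar API, and the package's own sampling code may differ in detail.

```python
import math
import random


def pick_token(logits, temperature=0.0, rng=None):
    """Pick the next token id from raw logits (hypothetical helper, not skylar's API)."""
    if temperature <= 0.0:
        # Greedy: always take the highest-scoring token, so output is deterministic.
        return max(range(len(logits)), key=lambda i: logits[i])
    rng = rng or random.Random()
    # Temperature sampling: softmax over logits / temperature, then draw one token.
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    weights = [math.exp(x - peak) for x in scaled]
    return rng.choices(range(len(logits)), weights=weights, k=1)[0]


logits = [0.1, 2.5, -1.0]
print(pick_token(logits))                                          # greedy -> 1
print(pick_token(logits, temperature=0.8, rng=random.Random(0)))   # sampled id
```

With temperature 0.0 the same prompt always yields the same tokens, which is why it is a convenient default for reproducible demos and tests.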

Use it — Python

import skylar

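# generative chat: pass your own system prompt (no forced persona)
m = skylar.load("Sophia-AI/Skylar-236M-Chat")          # HF repo id or a local dir
# Italian prompt: "Question: where is the Banca d'Italia headquartered?"
# Italian system prompt: "Answer only from the provided context."
print(m.generate("Domanda: dove ha sede la Banca d'Italia?",
                 system="Rispondi solo dal contesto fornito."))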
for delta in m.stream("..."):                          # streaming
    print(delta, end="", flush=True)

# embeddings / retrieval
# Italian query: "cost of money"; docs: "the ECB raises rates", "pizza recipe"
e = skylar.load_embedder("Sophia-AI/Skylar-236M-Embed")
ranked = e.rank("costo del denaro", ["la BCE alza i tassi", "ricetta pizza"])

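# the COBOL specialist: a stub completer (not a chatbot)
c = skylar.load("Sophia-AI/Skylar-390M-Cobol")
with open("my_task.cbl", encoding="utf-8") as f:       # your COBOL stub, e.g. the file from the CLI example
    my_stub = f.read()
print(c.complete_cobol(my_stub))                       # -> full, compilable COBOL program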
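
For readers new to embedding retrieval: ranking usually means scoring each document vector against the query vector and sorting by that score. The sketch below assumes cosine similarity and uses hand-made toy vectors. How e.rank scores documents internally, and what it returns, is not specified here, so cosine_rank is a hypothetical stand-in rather than the library's method.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def cosine_rank(query_vec, doc_vecs):
    """Return (doc_index, score) pairs, best match first (hypothetical helper)."""
    scored = [(i, cosine(query_vec, d)) for i, d in enumerate(doc_vecs)]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Toy 3-d vectors standing in for real embeddings (made-up numbers).
query = [1.0, 0.2, 0.0]
docs = [[0.9, 0.1, 0.0],   # points almost the same way as the query
        [0.0, 0.1, 1.0]]   # nearly orthogonal to the query
for index, score in cosine_rank(query, docs):
    print(index, round(score, 3))
```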

skylar also registers the architecture with 🤗 Transformers, so this works too:

import skylar  # registers nano-transformer
from transformers import AutoModelForCausalLM
model = AutoModelForCausalLM.from_pretrained("Sophia-AI/Skylar-236M-Chat")

Serve it — multi-user HTTP API (skylar serve)

skylar serve --model <id> turns any Skylar generative model into an OpenAI-compatible HTTP server built for concurrent users. Requests from many clients are fused into dynamic micro-batches on a single worker that owns the model, so one GPU (or CPU) can serve a whole demo without per-request OOM or GPU contention. Each request can still stream its own tokens.

pip install "skylar[serve]"
skylar serve --model Sophia-AI/Skylar-236M-Chat          # swap the id for ANY Skylar model
#  → http://127.0.0.1:8000   ·   interactive docs: http://127.0.0.1:8000/docs

Open /docs for the auto-generated Swagger UI — every endpoint, schema, and example is described there (or /redoc for ReDoc). The model is whatever you pass to --model (an HF repo id or a local checkpoint dir); an embedder model is auto-detected and served at /v1/embeddings instead.

Method & path               What it does
POST /v1/chat/completions   OpenAI chat format. "stream": true → Server-Sent Events.
POST /generate              One prompt → one completion.
GET  /health                Liveness + which model/device is loaded.
GET  /metrics               Throughput, batch sizes, queue depth.

# one-shot completion (the Italian prompt asks where the Banca d'Italia is headquartered)
curl localhost:8000/generate -H 'content-type: application/json' \
  -d '{"prompt": "Dove ha sede la Banca d'\''Italia?", "max_new_tokens": 64}'

# OpenAI chat format (+ "stream": true for SSE)
#   (Italian messages: "You are a COBOL expert." / "Add two PIC 9(4) fields.")
curl -N localhost:8000/v1/chat/completions -H 'content-type: application/json' -d '{
  "messages": [{"role":"system","content":"Sei un esperto COBOL."},
               {"role":"user","content":"Somma due campi PIC 9(4)."}],
  "max_tokens": 256, "stream": true
}'
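
If you read the stream without the OpenAI client, the response has to be parsed by hand. The sketch below assumes the server emits the usual OpenAI-style Server-Sent Events, that is, `data: {json}` lines ending with `data: [DONE]`; that wire format is an assumption based on the "OpenAI-compatible" description, and iter_deltas is a hypothetical helper name.

```python
import json


def iter_deltas(lines):
    """Yield text deltas from OpenAI-style SSE lines (assumed wire format)."""
    for raw in lines:
        line = raw.strip()
        if not line.startswith("data:"):
            continue  # skip blank separator lines and SSE comments
        payload = line[len("data:"):].strip()
        if payload == "[DONE]":
            break  # OpenAI-style end-of-stream sentinel
        chunk = json.loads(payload)
        for choice in chunk.get("choices", []):
            text = choice.get("delta", {}).get("content")
            if text:
                yield text


# Hand-written sample lines, not captured from a real skylar server.
sample = [
    'data: {"choices": [{"delta": {"role": "assistant"}}]}',
    "",
    'data: {"choices": [{"delta": {"content": "ADD A "}}]}',
    'data: {"choices": [{"delta": {"content": "TO B."}}]}',
    "data: [DONE]",
]
print("".join(iter_deltas(sample)))  # ADD A TO B.
```

In real use, `lines` would be the decoded lines of the HTTP response body read incrementally.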

Drop-in with the official OpenAI client — just point base_url at the server:

from openai import OpenAI
client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed")
r = client.chat.completions.create(
    model="Sophia-AI/Skylar-236M-Chat",
    # Italian user message: "Explain what this COBOL does ..."
    messages=[{"role": "user", "content": "Spiega cosa fa questo COBOL ..."}],
    stream=True,
)
for chunk in r:
    print(chunk.choices[0].delta.content or "", end="", flush=True)

Tuning concurrency

Flag            Default   Meaning
--max-batch     8         Max requests fused into one forward pass. Raise for more throughput until VRAM/latency says stop.
--max-wait-ms   15        How long to wait for stragglers before launching a batch. Higher = bigger batches, slightly more latency.
--max-queue     256       Input-queue depth; beyond it the server returns 503 (backpressure instead of OOM).
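
As a rough mental model of how these three flags interact, the sketch below uses only the standard library. It is an illustration under assumptions, not the code in skylar/serve.py: try_enqueue and collect_batch are hypothetical names, and where the real server would run the model this sketch simply returns the collected batch.

```python
import queue
import time


def try_enqueue(inbox, request):
    """Return True if queued, False if the queue is full (the server would answer 503)."""
    try:
        inbox.put_nowait(request)
        return True
    except queue.Full:
        return False  # backpressure: reject instead of growing memory


def collect_batch(inbox, max_batch=8, max_wait_ms=15):
    """Block for one request, then wait up to max_wait_ms for stragglers."""
    batch = [inbox.get()]  # wait for the first request
    deadline = time.monotonic() + max_wait_ms / 1000.0
    while len(batch) < max_batch:
        remaining = deadline - time.monotonic()
        if remaining <= 0:
            break
        try:
            batch.append(inbox.get(timeout=remaining))
        except queue.Empty:
            break  # window closed, run what we have
    return batch


inbox = queue.Queue(maxsize=4)  # stands in for --max-queue
statuses = [try_enqueue(inbox, f"req-{i}") for i in range(6)]
print(statuses)                            # four accepted, two rejected
print(collect_batch(inbox, max_batch=3))   # at most three fused together
```

The trade-off shows up in collect_batch: a longer wait window gives later arrivals a chance to join the batch, at the cost of delaying the request that arrived first.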

How it works (for implementers / a future maintainer)

The server is skylar/serve.py — the model code (decoder.py / attention.py) is left untouched:

  • One worker owns the model. Async routes enqueue requests; a single background thread pulls a micro-batch (up to --max-batch, waiting --max-wait-ms) and runs it. No two CUDA calls race, and there is exactly one set of KV-caches in flight — so concurrency can't OOM the box.
  • True batched decoding. generate_batch() left-pads ragged prompts and builds a 4D additive mask (causal + pad) that NanoTransformer.forward already accepts (its dense-mask SDPA path), so prompts of different lengths decode together with a shared KV-cache and a per-row EOS stop. Per-row sampling mirrors NanoTransformer.generate exactly, so batched output is token-for-token identical to single-stream output (checked by tests/test_batch_equiv.py).
  • Batching + streaming coexist. Each request carries its own queue; the worker pushes text deltas into it as tokens are produced, so every request in a batch streams independently.
  • Current limits (PoC). Micro-batching is static: a batch starts and finishes together. For heavy, time-skewed load the next step is continuous batching (adding requests to an in-flight batch). There is one model per process. Greedy decoding is the default, and sampling params are set per request.

python tests/test_batch_equiv.py     # run after touching batching/masking: batched == single-stream
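
The left-padding and mask construction described in the list above can be illustrated with plain Python lists. This is a sketch under stated assumptions, not generate_batch() itself: left_pad and additive_mask are hypothetical helpers, pad id 0 is a placeholder, and keeping the diagonal open (so a padded query row is never fully masked) is a choice made for this sketch only.

```python
NEG_INF = float("-inf")


def left_pad(prompts, pad_id=0):
    """Left-pad token-id lists to a common length; also return a 0/1 'real token' map."""
    width = max(len(p) for p in prompts)
    ids = [[pad_id] * (width - len(p)) + list(p) for p in prompts]
    real = [[0] * (width - len(p)) + [1] * len(p) for p in prompts]
    return ids, real


def additive_mask(real):
    """Build a [batch][1][query][key] mask: 0.0 = may attend, -inf = blocked."""
    mask = []
    for row in real:
        width = len(row)
        # A key is visible if it is a real token at or before the query (causal + pad),
        # or if it is the query position itself (keeps padded rows from being all -inf).
        grid = [[0.0 if (k <= q and row[k]) or k == q else NEG_INF
                 for k in range(width)]
                for q in range(width)]
        mask.append([grid])  # singleton head axis, broadcast over heads
    return mask


ids, real = left_pad([[11, 12, 13], [21]])
mask = additive_mask(real)
print(ids)            # [[11, 12, 13], [0, 0, 21]]
print(mask[1][0][2])  # last query of the short prompt: [-inf, -inf, 0.0]
```

Because the mask is added to the attention scores before the softmax, a -inf entry drives that key's weight to zero, so the short prompt never attends to its padding.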

What's inside

The Skylar models use a custom decoder (NanoTransformer, Qwen3-style: RMSNorm + RoPE + GQA + QK-Norm + SwiGLU), trained 100% from scratch (no third-party pretrained weights). This package vendors the architecture so the published weights load anywhere — no private framework needed.
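
Two of the building blocks named above, RMSNorm and SwiGLU, are small enough to show in plain Python. The sketch follows their standard textbook definitions for a single vector. It is not the vendored implementation: the function names and the eps value are illustrative, and the linear projections around SwiGLU are omitted.

```python
import math


def rms_norm(x, weight, eps=1e-6):
    """RMSNorm: divide x by its root-mean-square, then apply a learned per-channel gain."""
    rms = math.sqrt(sum(v * v for v in x) / len(x) + eps)
    return [w * v / rms for v, w in zip(x, weight)]


def silu(v):
    """SiLU (swish) activation: v * sigmoid(v)."""
    return v / (1.0 + math.exp(-v))


def swiglu(gate, up):
    """SwiGLU gating: silu(gate) * up, element-wise."""
    return [silu(g) * u for g, u in zip(gate, up)]


x = [3.0, 4.0]
print(rms_norm(x, weight=[1.0, 1.0]))           # output has mean square close to 1
print(swiglu(gate=[0.0, 2.0], up=[5.0, 1.0]))   # a zero gate closes that channel
```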

License

Apache-2.0. Models & code IP: A. Ivanovitch (Sophia AI).
