
Skylar — local, sovereign, from-scratch LLMs. CLI + loader for the Skylar model family: generative chat, embeddings, and a COBOL specialist.


skylar

A tiny runtime and CLI for the Skylar model family: local, sovereign, from-scratch LLMs that you load, run, and serve after a single pip install. It covers generative chat, embeddings / retrieval, and a COBOL code specialist. The models are in the 236M–390M parameter class, run on a single GPU or on CPU, and keep all data on your machine.

Models live under Sophia-AI on HuggingFace: Skylar-236M-Base · Skylar-236M-Chat · Skylar-236M-Embed · Skylar-390M-Cobol.

Install

pip install skylar
# optional HTTP server:
pip install "skylar[serve]"

Use it — CLI

# chat with any Skylar generative model (no forced persona; steer it with --system)
#   (the Italian system prompt means "You are an assistant that answers from the context.")
skylar chat --model Sophia-AI/Skylar-236M-Chat --system "Sei un assistente che risponde dal contesto."

# embeddings / retrieval (any SkylarEmbedder model)
#   (Italian example: query "home loan", docs "mortgage" and "weather")
skylar embed --model Sophia-AI/Skylar-236M-Embed --query "prestito casa" --docs "mutuo" "meteo"

# one-shot generation (HF repo id or a local checkpoint dir)
skylar generate --model Sophia-AI/Skylar-236M-Chat --prompt "..."

# the COBOL specialist — completes a COBOL stub into a full, compilable program
#   (auto-downloads Skylar-390M-Cobol; it's a stub completer, not a chatbot)
skylar cobol --example
skylar cobol --stub-file my_task.cbl --compile        # your own stub + GnuCOBOL check

# multi-user OpenAI-compatible server (needs the [serve] extra) — full details in "Serve it" below
skylar serve --model <any-skylar-model> --port 8000     # interactive docs at http://localhost:8000/docs

Decoding is greedy by default (--temperature 0.0); there is no forced system prompt — pass --system "..." to steer a chat model. (The skylar cobol subcommand handles the COBOL prompt format for you.)
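
To make the decoding default concrete, here is a minimal sketch of what "greedy at temperature 0.0" means compared with temperature sampling. The helper pick_next_token and the toy logits are hypothetical and are not part of the skylar API; the sketch only illustrates the general technique.

```python
import math
import random

def pick_next_token(logits, temperature=0.0, rng=None):
    """Pick a token index from raw logits (hypothetical helper, not skylar's code).

    temperature <= 0.0 is treated as greedy decoding (argmax).
    """
    if temperature <= 0.0:
        # Greedy: always the highest-scoring token, so output is deterministic.
        return max(range(len(logits)), key=lambda i: logits[i])
    rng = rng or random.Random()
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    weights = [math.exp(x - peak) for x in scaled]
    # Sample an index in proportion to the softmax weights.
    return rng.choices(range(len(logits)), weights=weights, k=1)[0]

logits = [1.0, 3.5, 0.2]  # made-up scores for a 3-token vocabulary
print(pick_next_token(logits))                                   # greedy pick
print(pick_next_token(logits, temperature=0.8, rng=random.Random(0)))
```

Greedy decoding returns the same completion on every run, which is why it is a convenient default for reproducible demos; a higher temperature trades that determinism for variety.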

Use it — Python

import skylar

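# generative chat: pass your own system prompt (no forced persona)
m = skylar.load("Sophia-AI/Skylar-236M-Chat")          # HF repo id or a local dir
# Italian prompt: "Question: where is the Bank of Italy headquartered?"
# Italian system prompt: "Answer only from the provided context."
print(m.generate("Domanda: dove ha sede la Banca d'Italia?",
                 system="Rispondi solo dal contesto fornito."))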
for delta in m.stream("..."):                          # streaming
    print(delta, end="", flush=True)

# embeddings / retrieval
# Italian example: query "cost of money"; docs "the ECB raises rates", "pizza recipe"
e = skylar.load_embedder("Sophia-AI/Skylar-236M-Embed")
ranked = e.rank("costo del denaro", ["la BCE alza i tassi", "ricetta pizza"])

# the COBOL specialist: a stub completer (not a chatbot)
c = skylar.load("Sophia-AI/Skylar-390M-Cobol")
with open("my_task.cbl") as f:                         # your own COBOL stub file
    my_stub = f.read()
print(c.complete_cobol(my_stub))                       # -> full, compilable COBOL program
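
The internals of e.rank are not documented here. A common way to rank documents against a query is cosine similarity between embedding vectors, and the sketch below shows that idea under that assumption. The functions cosine and rank_by_cosine and the 3-dimensional toy vectors are illustrative stand-ins, not the package's implementation.

```python
import math

def cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def rank_by_cosine(query_vec, doc_vecs):
    """Return (doc_index, score) pairs, best match first."""
    scored = [(i, cosine(query_vec, v)) for i, v in enumerate(doc_vecs)]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)

# Toy 3-d vectors standing in for real embeddings (made up for illustration).
query = [0.9, 0.1, 0.0]
docs = [[0.0, 0.2, 0.9],   # unrelated document
        [0.8, 0.3, 0.1]]   # related document
for idx, score in rank_by_cosine(query, docs):
    print(idx, round(score, 3))
```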

skylar also registers the architecture with 🤗 Transformers, so this works too:

import skylar  # registers nano-transformer
from transformers import AutoModelForCausalLM
model = AutoModelForCausalLM.from_pretrained("Sophia-AI/Skylar-236M-Chat")

Serve it — multi-user HTTP API (skylar serve)

skylar serve --model <id> turns any Skylar generative model into an OpenAI-compatible HTTP server built for concurrent users. Requests from many clients are fused into dynamic micro-batches on a single worker that owns the model, so one GPU (or CPU) can serve a whole demo without per-request OOM or GPU contention. Each request can still stream its own tokens.

pip install "skylar[serve]"
skylar serve --model Sophia-AI/Skylar-236M-Chat          # swap the id for ANY Skylar model
#  → http://127.0.0.1:8000   ·   interactive docs: http://127.0.0.1:8000/docs

Open /docs for the auto-generated Swagger UI — every endpoint, schema, and example is described there (or /redoc for ReDoc). The model is whatever you pass to --model (an HF repo id or a local checkpoint dir); an embedder model is auto-detected and served at /v1/embeddings instead.

Method & path              What it does
POST /v1/chat/completions  OpenAI chat format. "stream": true → Server-Sent Events.
POST /generate             One prompt → one completion.
GET /health                Liveness + which model/device is loaded.
GET /metrics               Throughput, batch sizes, queue depth.

# one-shot completion
#   (Italian prompt: "Where is the Bank of Italy headquartered?")
curl localhost:8000/generate -H 'content-type: application/json' \
  -d '{"prompt": "Dove ha sede la Banca d'\''Italia?", "max_new_tokens": 64}'

# OpenAI chat format (+ "stream": true for SSE)
#   (Italian messages: system "You are a COBOL expert.", user "Add two PIC 9(4) fields.")
curl -N localhost:8000/v1/chat/completions -H 'content-type: application/json' -d '{
  "messages": [{"role":"system","content":"Sei un esperto COBOL."},
               {"role":"user","content":"Somma due campi PIC 9(4)."}],
  "max_tokens": 256, "stream": true
}'
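
For clients that do not use an SDK, the streamed response can be consumed by parsing the Server-Sent Events lines directly. The sketch below assumes the OpenAI framing (one `data: <json>` line per event, a final `data: [DONE]` line); whether the Skylar server emits exactly this framing is an assumption, and iter_sse_deltas is a hypothetical helper. The sample events are hand-written, not captured from a real server.

```python
import json

def iter_sse_deltas(lines):
    """Yield text deltas from OpenAI-style SSE lines.

    Assumes each event is a single `data: <json>` line and that the stream
    ends with `data: [DONE]`, as in the OpenAI API.
    """
    for raw in lines:
        line = raw.strip()
        if not line.startswith("data:"):
            continue  # skip blank separators and comment lines
        payload = line[len("data:"):].strip()
        if payload == "[DONE]":
            break
        chunk = json.loads(payload)
        delta = chunk["choices"][0].get("delta", {}).get("content")
        if delta:
            yield delta

# Hand-written sample events for illustration only.
sample = [
    'data: {"choices": [{"delta": {"role": "assistant"}}]}',
    "",
    'data: {"choices": [{"delta": {"content": "ADD A "}}]}',
    'data: {"choices": [{"delta": {"content": "TO B."}}]}',
    "data: [DONE]",
]
print("".join(iter_sse_deltas(sample)))
```

Against a live server, the same generator could be fed the decoded lines of the HTTP response body, for example by iterating over the object returned by urllib.request.urlopen and calling .decode() on each line.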

Drop-in with the official OpenAI client — just point base_url at the server:

from openai import OpenAI
client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed")
r = client.chat.completions.create(
    model="Sophia-AI/Skylar-236M-Chat",
    # Italian prompt: "Explain what this COBOL does ..."
    messages=[{"role": "user", "content": "Spiega cosa fa questo COBOL ..."}],
    stream=True,
)
for chunk in r:
    print(chunk.choices[0].delta.content or "", end="", flush=True)

Tuning concurrency

Flag           Default  Meaning
--max-batch    8        Max requests fused into one forward pass. Raise for more throughput until VRAM/latency says stop.
--max-wait-ms  15       How long to wait for stragglers before launching a batch. Higher = bigger batches, slightly more latency.
--max-queue    256      Input-queue depth; beyond it the server returns 503 (backpressure instead of OOM).
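
How the three flags interact is easier to see in code. The following is a minimal sketch of micro-batch collection with a bounded queue, using only the standard library; collect_batch and try_enqueue are hypothetical names and this is not the code in skylar/serve.py.

```python
import queue
import time

def collect_batch(q, max_batch=8, max_wait_ms=15):
    """Block for the first request, then gather stragglers until the batch
    is full or the wait budget is spent."""
    batch = [q.get()]                       # wait indefinitely for request #1
    deadline = time.monotonic() + max_wait_ms / 1000.0
    while len(batch) < max_batch:
        remaining = deadline - time.monotonic()
        if remaining <= 0:
            break                           # wait budget spent: launch what we have
        try:
            batch.append(q.get(timeout=remaining))
        except queue.Empty:
            break
    return batch

def try_enqueue(q, request):
    """Return True if queued, False if the queue is full.

    A server would answer HTTP 503 in the False case (backpressure).
    """
    try:
        q.put_nowait(request)
        return True
    except queue.Full:
        return False

pending = queue.Queue(maxsize=4)            # stands in for --max-queue
accepted = [try_enqueue(pending, f"req-{i}") for i in range(6)]
print(accepted)                             # the last two are rejected
print(collect_batch(pending, max_batch=3, max_wait_ms=50))
```

Raising max_wait_ms gives stragglers more time to join a batch, which is the throughput-versus-latency trade-off described in the table.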

How it works (for implementers / a future maintainer)

The server is skylar/serve.py — the model code (decoder.py / attention.py) is left untouched:

  • One worker owns the model. Async routes enqueue requests; a single background thread pulls a micro-batch (up to --max-batch, waiting --max-wait-ms) and runs it. No two CUDA calls race, and there is exactly one set of KV-caches in flight — so concurrency can't OOM the box.
  • True batched decoding. generate_batch() left-pads ragged prompts and builds a 4D additive mask (causal + pad) that NanoTransformer.forward already accepts through its dense-mask SDPA path, so prompts of different lengths decode together with a shared KV-cache and a per-row EOS stop. Per-row sampling mirrors NanoTransformer.generate exactly, so batched output is token-for-token identical to single-stream output (verified by tests/test_batch_equiv.py).
  • Batching + streaming coexist. Each request carries its own queue; the worker pushes text deltas into it as tokens are produced, so every request in a batch streams independently.
  • Current limits (PoC). Micro-batching is static: a batch starts and finishes together. For heavy, time-skewed load the next step is continuous batching (adding requests to an in-flight batch). There is one model per process. Greedy decoding is the default, and sampling parameters are set per request.

python tests/test_batch_equiv.py     # run after touching batching/masking: batched == single-stream
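
The left-padding and additive-mask step described in the list above can be sketched in plain Python. This is an illustration of the general technique, not the generate_batch() implementation: left_pad and additive_mask are hypothetical helpers, pad id 0 is an assumption, and keeping the diagonal open for pad rows is a design choice made here to avoid all -inf rows.

```python
NEG_INF = float("-inf")

def left_pad(prompts, pad_id=0):
    """Left-pad token-id lists to a common length; return ids and pad counts."""
    width = max(len(p) for p in prompts)
    pads = [width - len(p) for p in prompts]
    ids = [[pad_id] * n + list(p) for n, p in zip(pads, prompts)]
    return ids, pads

def additive_mask(pads, width):
    """Build a [batch][1][query][key] mask: 0.0 where attention is allowed,
    -inf where the key is in the future or is a pad slot."""
    mask = []
    for n_pad in pads:
        rows = []
        for q in range(width):
            row = []
            for k in range(width):
                allowed = k <= q and k >= n_pad
                # Keep the diagonal open so pad query rows are never all -inf,
                # which would turn the softmax into NaN.
                row.append(0.0 if allowed or k == q else NEG_INF)
            rows.append(row)
        mask.append([rows])  # singleton head dimension, broadcast over heads
    return mask

ids, pads = left_pad([[11, 12, 13], [21]])
mask = additive_mask(pads, width=len(ids[0]))
print(ids)
for row in mask[1][0]:   # mask rows for the shorter, left-padded prompt
    print(row)
```

Left-padding keeps the last real token of every prompt in the final column, so one decode step can append the next token to all rows at the same position.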

What's inside

The Skylar models use a custom decoder (NanoTransformer, Qwen3-style: RMSNorm + RoPE + GQA + QK-Norm + SwiGLU), trained 100% from scratch (no third-party pretrained weights). This package vendors the architecture so the published weights load anywhere — no private framework needed.
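
Two of the named building blocks are small enough to sketch in a few lines. The code below shows the textbook form of RMSNorm and the query-to-KV head mapping used by grouped-query attention (GQA); the eps value and head counts are illustrative assumptions, not Skylar's actual configuration, and this is not the vendored model code.

```python
import math

def rms_norm(x, weight, eps=1e-6):
    """RMSNorm: scale x by the reciprocal of its root-mean-square, then by a
    learned per-channel weight. No mean subtraction, unlike LayerNorm."""
    rms = math.sqrt(sum(v * v for v in x) / len(x) + eps)
    return [w * v / rms for v, w in zip(x, weight)]

def kv_head_for(query_head, n_query_heads, n_kv_heads):
    """GQA: consecutive query heads share one key/value head."""
    group_size = n_query_heads // n_kv_heads
    return query_head // group_size

print(rms_norm([3.0, 4.0], weight=[1.0, 1.0]))
# Illustrative head counts: 8 query heads sharing 2 key/value heads.
print([kv_head_for(h, n_query_heads=8, n_kv_heads=2) for h in range(8)])
```

Sharing key/value heads across groups of query heads shrinks the KV-cache, which matters for a server that holds one cache set per batch.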

License

Apache-2.0. Models & code IP: A. Ivanovitch (Sophia AI).
