
LLaVA-style graft adding vision-language capability to Mistral-family decoders (Schneewolf Labs — Project Artemis).


Artemis — Schneewolf Labs

License: Apache 2.0

A LLaVA-style graft that adds vision-language capability to any Mistral-family text decoder without modifying the decoder. It was built originally for the Schneewolf Labs A-series, but the architecture is not tied to any one Mistral-Nemo checkpoint: point it at any Mistral-class checkpoint (A2, A3, Mahou, Flammades, etc.) and you get an ArtemisVLM around it.

Path B by design

       PIL Image
           │
           ▼
   ┌───────────────────────┐
   │  Qwen3-VL ViT         │  patches → ViT layers → merger
   │  (FROZEN, pixels only)│
   └───────────────────────┘
           │  N vectors of dim out_hidden_size
           ▼
   ┌───────────────────────┐
   │  Projector (trained)  │  2-layer MLP, out_hidden → text_hidden
   │  ~45M params          │
   └───────────────────────┘
           │  N vectors in the text decoder's hidden space
           ▼
   ┌───────────────────────────────────────────────────────────────┐
   │  Mistral-family decoder (FROZEN in Stage-1, full-FT Stage-2)   │
   │  At each <|image_pad|> position, OVERWRITE the embedding with │
   │  the next projector vector. Then run as a normal decoder.     │
   └───────────────────────────────────────────────────────────────┘
           │
           ▼
       text output (decoder's own vocab — Qwen vocab never seen)

The vision tower processes pixels only (no text tokens). The projector bridges hidden spaces, not token spaces. At graft time and throughout Stage-1 the decoder is byte-identical to the underlying Mistral checkpoint, so its vocab, weights, chat template, reasoning, tool calling, and identity are preserved by construction; Stage-2 full fine-tuning then updates the decoder weights.
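To make the "bridges hidden spaces" point concrete, the sketch below counts the parameters of a two-layer MLP that maps a vision width to a text width. The function name `mlp_projector_params`, the widths `vision_dim=2048` and `text_dim=5120`, the hidden width defaulting to `text_dim`, and the bias terms are all illustrative assumptions rather than values read from an Artemis checkpoint, so the result is not expected to reproduce the ~45M figure exactly.

```python
from typing import Optional


def mlp_projector_params(
    vision_dim: int,
    text_dim: int,
    hidden_dim: Optional[int] = None,
    bias: bool = True,
) -> int:
    """Parameter count of a 2-layer MLP: vision_dim -> hidden_dim -> text_dim."""
    # Assumption: the hidden width equals the text width unless given explicitly.
    hidden_dim = text_dim if hidden_dim is None else hidden_dim
    first = vision_dim * hidden_dim + (hidden_dim if bias else 0)
    second = hidden_dim * text_dim + (text_dim if bias else 0)
    return first + second


# Hypothetical widths, chosen only to show the order of magnitude.
n = mlp_projector_params(vision_dim=2048, text_dim=5120)
print(f"{n / 1e6:.1f}M parameters")
```

The count depends only on the two hidden widths, which is why the same projector design can sit between different vision towers and decoders.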

Install

pip install artemis-vlm

Or, from source:

git clone https://github.com/Schneewolf-Labs/Artemis.git
cd Artemis
pip install -e .

Requires transformers>=5.0.0, torch>=2.5.0, Pillow.

Quick start — load a pretrained Artemis checkpoint

import torch
from PIL import Image
from transformers import AutoTokenizer, AutoModelForCausalLM
import artemis_vlm  # registers ArtemisVLM with AutoConfig / AutoModel

REPO = "schneewolflabs/A3-preview"  # or any ArtemisVLM checkpoint

# Load the grafted model and its tokenizer.
model = AutoModelForCausalLM.from_pretrained(REPO, dtype=torch.bfloat16).to("cuda").eval()
tok = AutoTokenizer.from_pretrained(REPO)
# Build the processor from the model's own vision config.
processor = artemis_vlm.ArtemisVLMProcessor(
    tokenizer=tok, vision_config=model.visual.config,
    min_pixels=32 * 32, max_pixels=512 * 512,
)

# Load the image to describe.
image = Image.open("photo.jpg")
messages = [{"role": "user", "content": [
    {"type": "image"},
    {"type": "text", "text": "Describe this image in detail."},
]}]
text = processor.apply_chat_template(messages, add_generation_prompt=True, tokenize=False)
batch = processor(text=text, images=[image], return_tensors="pt").to("cuda")
with torch.no_grad():
    out = model.generate(**batch, max_new_tokens=200, do_sample=False)
print(tok.decode(out[0][batch["input_ids"].shape[1]:], skip_special_tokens=True))
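The `min_pixels` / `max_pixels` budget above bounds how many `<|image_pad|>` positions a single image can occupy. The sketch below estimates that count under stated assumptions: a Qwen-style resize that snaps both sides to multiples of `patch_size * merge_size`, `patch_size=16` as quoted in this README, and `merge_size=2`, which is assumed here rather than read from `vision_config`. The helper names `smart_resize` and `image_pad_count` are illustrative, and the real `ArtemisVLMProcessor` may differ in detail.

```python
import math


def smart_resize(height, width, factor, min_pixels, max_pixels):
    """Snap both sides to multiples of `factor`, keeping the area within the budget."""
    h = max(factor, round(height / factor) * factor)
    w = max(factor, round(width / factor) * factor)
    if h * w > max_pixels:
        # Too large: shrink while preserving the aspect ratio, rounding down.
        beta = math.sqrt((height * width) / max_pixels)
        h = max(factor, math.floor(height / beta / factor) * factor)
        w = max(factor, math.floor(width / beta / factor) * factor)
    elif h * w < min_pixels:
        # Too small: grow while preserving the aspect ratio, rounding up.
        beta = math.sqrt(min_pixels / (height * width))
        h = math.ceil(height * beta / factor) * factor
        w = math.ceil(width * beta / factor) * factor
    return h, w


def image_pad_count(height, width, patch_size=16, merge_size=2,
                    min_pixels=32 * 32, max_pixels=512 * 512):
    """Estimated number of merged vision vectors (one per <|image_pad|> slot)."""
    factor = patch_size * merge_size  # assumption: merge_size=2
    h, w = smart_resize(height, width, factor, min_pixels, max_pixels)
    grid_h, grid_w = h // patch_size, w // patch_size
    # The merger collapses each merge_size x merge_size block of patches into one vector.
    return (grid_h * grid_w) // (merge_size ** 2)


print(image_pad_count(512, 512))
print(image_pad_count(1080, 1920))
```

Under these assumptions, raising `max_pixels` increases the image's share of the context window roughly linearly with pixel area.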

Quick start — build a new graft from your own checkpoints

import torch
import artemis_vlm
from transformers import Qwen3VLForConditionalGeneration

# Take the vision tower from a pretrained Qwen3-VL checkpoint
qv = Qwen3VLForConditionalGeneration.from_pretrained(
    "Qwen/Qwen3-VL-2B-Instruct", dtype=torch.bfloat16,
)
vision = qv.model.visual
del qv  # free the Qwen3-VL decoder we don't need

# Graft onto any Mistral-class text checkpoint
model = artemis_vlm.ArtemisVLMForConditionalGeneration.from_a2_and_vision(
    "schneewolflabs/A2",  # or any Mistral-Nemo finetune
    vision_model=vision,
    image_token_id=22,    # repurposed <|image_pad|> in A-series Tekken vocab
    torch_dtype=torch.bfloat16,
)

# Stage-1: train only the projector (~45M params)
trainable, total = model.set_training_stage("stage1")
print(f"Stage-1: trainable={trainable/1e6:.1f}M / total={total/1e9:.2f}B")

Training (Stage-1 / Stage-2)

set_training_stage("stage1") freezes the ViT and the decoder, leaving only the projector trainable — the "alignment" phase. set_training_stage("stage2") unfreezes the decoder for the visual-instruction phase.
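The staged freeze can be pictured as a lookup from stage name to the set of trainable components. The sketch below is a torch-free illustration that mirrors the `(trainable, total)` return shape shown in the quick start. The names `TRAINABLE_BY_STAGE` and `stage_summary` are hypothetical, the component sizes are made up apart from the ~45M projector, and it assumes that Stage-2 keeps the projector trainable and the ViT frozen, which this README does not state explicitly.

```python
# Which components receive gradients in each stage.
# Assumption: Stage-2 trains projector + decoder and leaves the ViT frozen.
TRAINABLE_BY_STAGE = {
    "stage1": {"projector"},
    "stage2": {"projector", "decoder"},
}


def stage_summary(param_counts, stage):
    """Return (trainable, total) parameter counts for the given stage."""
    if stage not in TRAINABLE_BY_STAGE:
        raise ValueError(f"unknown stage: {stage!r}")
    active = TRAINABLE_BY_STAGE[stage]
    trainable = sum(n for name, n in param_counts.items() if name in active)
    total = sum(param_counts.values())
    return trainable, total


# Hypothetical component sizes, for illustration only.
counts = {"vision": 400_000_000, "projector": 45_000_000, "decoder": 12_000_000_000}
for stage in ("stage1", "stage2"):
    trainable, total = stage_summary(counts, stage)
    print(f"{stage}: trainable={trainable / 1e6:.1f}M / total={total / 1e9:.2f}B")
```

In a real model the same decision would be applied by toggling `requires_grad` on each component's parameters.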

The recommended trainer is Schneewolf-Labs/Merlina, which exposes Artemis training as training_mode: "vlm_stage1" / "vlm_stage2" on its REST API. The ArtemisDataCollator in this package can be passed as data_collator= to any trainer that accepts a custom collator (Grimoire, accelerate-driven loops, HF Trainer).

Key implementation notes

  • Merged vision features. Qwen3VLVisionModel.forward() returns pre-merge features in last_hidden_state and merged features in pooler_output. We use pooler_output, because the merged features are what consumers downstream of the merger expect.
  • Patch / merge sizes come from vision_config. Qwen3-VL uses patch_size=16; Qwen2-VL's image processor defaults to patch_size=14. The processor sources patch / temporal / merge from vision_config so the <|image_pad|> expansion count can never drift from the model's merged feature count.
  • Image token splice. At each <|image_pad|> position in the prompt, the input embedding is overwritten with the next projector vector (via masked_scatter). The decoder sees a normal token sequence where some embeddings happen to come from vision instead of embed_tokens.
  • DeepStack / Interleaved-MRoPE are intentionally NOT used. Those are decoder-modification ("Path A") tricks. We chose Path B (composition).
  • Untied weights. A-series decoders have untied embed_tokens and lm_head. ArtemisVLMForConditionalGeneration.all_tied_weights_keys = {} is declared explicitly for transformers 5.x compatibility.
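The image token splice described in the notes above can be sketched in plain Python. This is an illustration of the overwrite semantics only: the package itself is described as using `masked_scatter` on tensors, whereas the function below works on lists. The name `splice_image_features` is hypothetical, `IMAGE_TOKEN_ID = 22` is the id quoted in this README for the A-series vocab, and the toy embeddings are made up.

```python
IMAGE_TOKEN_ID = 22  # <|image_pad|> id quoted in this README for the A-series vocab


def splice_image_features(input_ids, token_embeds, image_feats,
                          image_token_id=IMAGE_TOKEN_ID):
    """Overwrite the embedding at each image-pad position with the next vision vector."""
    n_slots = sum(1 for tok in input_ids if tok == image_token_id)
    if n_slots != len(image_feats):
        # The pad expansion count must equal the merged feature count.
        raise ValueError(f"{n_slots} image slots but {len(image_feats)} vision vectors")
    feats = iter(image_feats)
    return [
        next(feats) if tok == image_token_id else emb
        for tok, emb in zip(input_ids, token_embeds)
    ]


# Toy prompt: one text token, two image-pad slots, one text token.
input_ids = [1, 22, 22, 7]
token_embeds = [[0.1, 0.1], [0.0, 0.0], [0.0, 0.0], [0.7, 0.7]]
image_feats = [[1.0, 1.0], [2.0, 2.0]]  # projector outputs, in order

spliced = splice_image_features(input_ids, token_embeds, image_feats)
print(spliced)
```

The length check is the reason the processor and the model need to agree on patch and merge sizes: a mismatch would leave slots unfilled or vision vectors unused.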

Tests

Four hardware-bound smoke tests live in tests/. Each needs a real checkpoint on disk, a working ML stack, and a CUDA device, so they skip cleanly under pytest (CI won't try to run them) and are meant to be invoked as python tests/test_artemis_<name>.py on the development machine.

python tests/test_artemis_vlm.py        # model assembly + forward
python tests/test_artemis_processor.py  # chat template ↔ pad expansion
python tests/test_artemis_collator.py   # multimodal batching
python tests/test_artemis_stage_gen.py  # staged-freeze + generate()
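One way a hardware-bound script can decide whether to run is a small precondition check, sketched below. How the Artemis tests actually skip is not documented here, so this is an assumption throughout: the function name `skip_reason` and the environment variable `ARTEMIS_CHECKPOINT` are hypothetical and not part of the package.

```python
import importlib.util
import os
from typing import Optional


def skip_reason(checkpoint_env: str = "ARTEMIS_CHECKPOINT") -> Optional[str]:
    """Return why the smoke test cannot run, or None if all preconditions hold."""
    path = os.environ.get(checkpoint_env)
    if not path or not os.path.isdir(path):
        return f"set {checkpoint_env} to a checkpoint directory on disk"
    if importlib.util.find_spec("torch") is None:
        return "torch is not installed"
    import torch  # imported lazily so the check itself needs no ML stack

    if not torch.cuda.is_available():
        return "no CUDA device available"
    return None


if __name__ == "__main__":
    reason = skip_reason()
    print(f"SKIP: {reason}" if reason else "environment OK, running smoke test")
```

Under pytest, the same function could feed a skip marker so that collection on CI machines reports a skip instead of a failure.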

Published checkpoints

Checkpoint                | Status                         | Notes
schneewolflabs/A3-preview | public, apache-2.0             | 25k-sample Stage-1 smoke (proof-of-concept)
schneewolflabs/A3         | training (Stage-1, 1M samples) | first real release
schneewolflabs/Artemis    | planned (post Stage-2)         | named flagship

License

Apache 2.0 — see LICENSE.


