LlamaMlx-RS

High-performance MLX models in Rust for Apple Silicon

Overview

LlamaMlx-RS is a comprehensive Rust ecosystem for running MLX models on Apple Silicon devices. It provides efficient, type-safe Rust bindings to Apple's MLX framework along with high-level libraries for different ML tasks.

The ecosystem consists of the following components:

  • Core Library: llamamlx-rs - Rust bindings to MLX with tensor operations, device management, and model loading
  • ML Libraries:
    • llama-textgen-rs - Text generation with LLMs
    • llama-embed-rs - Text embedding generation
    • llama-image-rs - Computer vision tasks (classification, detection, segmentation)
    • llama-vlm-rs - Vision-language models for multimodal processing
  • Utility Libraries:
    • llama-shard-rs - Model sharding for distributed inference
    • llama-arxiv-rs - ArXiv paper downloading and processing
    • llama-moonlight-rs - Web scraping and CAPTCHA solving
  • Integration Tools:
    • Server applications
    • CLI tools
    • Example applications

Features

  • 🚀 High Performance: Optimized for Apple Silicon M1/M2/M3 chips
  • 🔄 Easy Conversion: Utilities for converting models from PyTorch/ONNX to MLX
  • 📦 Production-Ready: Comprehensive error handling, performance monitoring, and testing
  • 🌐 Distributed Inference: Shard large models across multiple devices
  • 🔌 API Compatibility: Drop-in replacement for popular APIs like OpenAI
  • 📊 Flexible I/O: Load and save models, weights, and tensors in various formats
  • 📈 Visualization: Rich tools for visualizing tensors, model outputs, and performance metrics
  • 🧩 Modular Design: Use only the components you need

Installation

Prerequisites

  • macOS 13+ with Apple Silicon (M1/M2/M3)
  • Rust 1.75+
  • Xcode Command Line Tools
  • Python 3.9+ (for model conversion)

Setting up the Ecosystem

# Clone the repository
git clone https://github.com/llamamlx-rs/llamamlx-rs.git
cd llamamlx-rs

# Run the setup script
./setup-ecosystem.sh

# Build all components
cargo build --release

Quick Start

Text Generation with Llama 3

use llamamlx_rs::device::Device;
use llama_textgen::{TextGenerator, GenerationOptions};

fn main() -> Result<(), Box<dyn std::error::Error>> {
    // Create a text generator with Llama 3
    let generator = TextGenerator::new_from_path(
        "models/Llama-3-8B-iq4", 
        &Device::gpu(0)
    )?;

    // Generate text
    let options = GenerationOptions {
        temperature: 0.7,
        top_p: 0.9,
        max_tokens: 100,
        stop_sequences: vec!["\n\n".to_string()],
    };
    
    let result = generator.generate(
        "Explain quantum computing in simple terms:", 
        &options
    )?;
    
    println!("{}", result.text);

    Ok(())
}
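
The `temperature` and `top_p` fields above control how the next token is sampled. The following is a minimal, standard-library-only sketch of the usual technique (temperature-scaled softmax followed by nucleus filtering). It is not taken from the llama-textgen-rs source; the function names `softmax_with_temperature` and `top_p_filter` are hypothetical and used only for illustration.

```rust
/// Convert raw logits into probabilities, scaled by `temperature`.
/// Lower temperatures sharpen the distribution; higher ones flatten it.
/// (Hypothetical helper, not part of the LlamaMlx-RS API.)
fn softmax_with_temperature(logits: &[f32], temperature: f32) -> Vec<f32> {
    assert!(temperature > 0.0, "temperature must be positive");
    // Subtract the maximum logit for numerical stability.
    let max = logits.iter().copied().fold(f32::NEG_INFINITY, f32::max);
    let exps: Vec<f32> = logits
        .iter()
        .map(|&l| ((l - max) / temperature).exp())
        .collect();
    let sum: f32 = exps.iter().sum();
    exps.iter().map(|&e| e / sum).collect()
}

/// Keep the smallest set of tokens whose cumulative probability reaches
/// `top_p`, then renormalize. Returns (token index, probability) pairs
/// sorted from most to least likely.
/// (Hypothetical helper, not part of the LlamaMlx-RS API.)
fn top_p_filter(probs: &[f32], top_p: f32) -> Vec<(usize, f32)> {
    let mut indexed: Vec<(usize, f32)> = probs.iter().copied().enumerate().collect();
    indexed.sort_by(|a, b| b.1.total_cmp(&a.1));

    let mut kept = Vec::new();
    let mut cumulative = 0.0;
    for (idx, p) in indexed {
        kept.push((idx, p));
        cumulative += p;
        if cumulative >= top_p {
            break;
        }
    }
    // Renormalize so the surviving probabilities sum to 1.
    let total: f32 = kept.iter().map(|&(_, p)| p).sum();
    kept.into_iter().map(|(i, p)| (i, p / total)).collect()
}
```

With `top_p = 0.9`, a distribution of `[0.5, 0.3, 0.15, 0.05]` keeps the first three tokens, since the fourth is only needed to go beyond 0.95. A sampler would then draw from the renormalized survivors.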

Image Classification

use llamamlx_rs::device::Device;
use llama_image::{
    image::Image,
    classification::ImageClassifier,
};

fn main() -> Result<(), Box<dyn std::error::Error>> {
    // Load an image
    let image = Image::from_file("examples/cat.jpg")?;
    
    // Create a classifier with MobileNet
    let classifier = ImageClassifier::new_from_path(
        "models/mobilenet-v2-mlx",
        Some("models/mobilenet-v2-mlx/labels.txt"),
        &Device::gpu(0)
    )?;
    
    // Classify the image
    let result = classifier.classify(&image)?;
    
    println!("Class: {} ({:.2}% confidence)", 
        result.class_name, 
        result.confidence * 100.0
    );

    Ok(())
}
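
The classifier above takes an optional `labels.txt` and reports a class name with a confidence. As a rough illustration of that last step, the sketch below maps a probability vector onto labels and returns the top results. It assumes the labels file holds one class name per line and that the model already outputs probabilities; both are assumptions, and `parse_labels` and `top_k` are hypothetical names rather than llama-image-rs functions.

```rust
/// Parse a labels file with one class name per line (assumed format).
/// Blank lines are skipped.
fn parse_labels(contents: &str) -> Vec<String> {
    contents
        .lines()
        .map(str::trim)
        .filter(|line| !line.is_empty())
        .map(String::from)
        .collect()
}

/// Return the `k` most likely (label, probability) pairs, best first.
/// Indices without a matching label are dropped.
fn top_k<'a>(probs: &[f32], labels: &'a [String], k: usize) -> Vec<(&'a str, f32)> {
    let mut ranked: Vec<(usize, f32)> = probs.iter().copied().enumerate().collect();
    // Sort by probability, highest first.
    ranked.sort_by(|a, b| b.1.total_cmp(&a.1));
    ranked
        .into_iter()
        .take(k)
        .filter_map(|(i, p)| labels.get(i).map(|name| (name.as_str(), p)))
        .collect()
}
```

Calling `top_k(&probs, &labels, 1)` gives the single best class, which corresponds to the `class_name` and `confidence` pair printed in the example.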

Visualization

use llamamlx_rs::{
    tensor::Array,
    visualization::{
        terminal::print_tensor_heatmap,
        file::save_classification_tsv,
    },
};

fn main() -> Result<(), Box<dyn std::error::Error>> {
    // Create a sample 2D tensor
    let data = vec![
        0.1, 0.2, 0.3, 0.4,
        0.5, 0.9, 0.8, 0.7,
        0.2, 0.3, 0.8, 0.5,
        0.4, 0.5, 0.6, 0.1,
    ];
    let tensor = Array::from_slice(&data, [4, 4]);
    
    // Display as a heatmap in the terminal
    print_tensor_heatmap(&tensor, Some("Sample Heatmap"), None, None)?;
    
    // Create classification results
    let categories = vec![
        ("Cat".to_string(), 0.85),
        ("Dog".to_string(), 0.12), 
        ("Bird".to_string(), 0.03),
    ];
    
    // Save classification results to TSV file
    save_classification_tsv(&categories, "classification.tsv")?;
    
    Ok(())
}
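
To make the two visualization steps more concrete, here is a small standard-library-only sketch of what a terminal heatmap and a TSV export could look like. It is an illustration under assumptions, not the library's implementation: the shade characters, the `label`/`score` header, and the function names `ascii_heatmap` and `classification_tsv` are all hypothetical.

```rust
/// Render a row-major matrix as lines of shade characters. Values are
/// scaled between the minimum and maximum, and larger values map to
/// denser glyphs. (Hypothetical helper for illustration.)
fn ascii_heatmap(data: &[f32], rows: usize, cols: usize) -> Vec<String> {
    assert!(cols > 0, "need at least one column");
    assert_eq!(data.len(), rows * cols, "shape does not match data length");
    const SHADES: [char; 5] = [' ', '.', ':', '*', '#'];
    let min = data.iter().copied().fold(f32::INFINITY, f32::min);
    let max = data.iter().copied().fold(f32::NEG_INFINITY, f32::max);
    // Avoid dividing by zero when every value is identical.
    let range = if max > min { max - min } else { 1.0 };
    data.chunks(cols)
        .map(|row| {
            row.iter()
                .map(|&v| {
                    let t = (v - min) / range; // normalized to 0.0..=1.0
                    let idx = (t * (SHADES.len() - 1) as f32).round() as usize;
                    SHADES[idx.min(SHADES.len() - 1)]
                })
                .collect()
        })
        .collect()
}

/// Format (label, score) pairs as tab-separated text with a header row.
/// The column names are an assumption, not the library's file format.
fn classification_tsv(rows: &[(String, f32)]) -> String {
    let mut out = String::from("label\tscore\n");
    for (label, score) in rows {
        out.push_str(&format!("{}\t{:.2}\n", label, score));
    }
    out
}
```

The string returned by `classification_tsv` can be written to disk with `std::fs::write`, and each line from `ascii_heatmap` can be printed with `println!`.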

Using the CLI for Visualization

# Generate a heatmap visualization of a tensor from a CSV file
llamamlx visualize --input tensor_data.csv --viz-type heatmap --terminal

# Generate a classification visualization from a JSON file
llamamlx visualize --input classification.json --viz-type classification --terminal

# Create an HTML report from model results
llamamlx visualize --input results.json --viz-type report --html --output report.html

# Create a PNG plot with a specific color scheme
llamamlx visualize --input tensor_data.csv --viz-type heatmap --output heatmap.png --color-scheme viridis

Distributed Inference

use llama_shard::{
    config::ShardingConfig,
    coordinator::Coordinator,
    sharding::ShardStrategy,
};

#[tokio::main]
async fn main() -> Result<(), Box<dyn std::error::Error>> {
    // Create sharding configuration
    let config = ShardingConfig::new(
        "models/Llama-3-8B-iq4".into(),
        2,  // number of shards
        ShardStrategy::LayerSharding,
    );
    
    // Create and start coordinator
    let coordinator = Coordinator::new(config)?;
    coordinator.start().await?;
    
    // In a production setup, you would run workers on different machines.
    // For this example, start local workers with the commands printed below.
    
    println!("Coordinator ready at localhost:50051");
    println!("Run worker instances with:");
    println!("  cargo run --bin llamashard -- worker --shard-id 0 --coordinator localhost:50051");
    println!("  cargo run --bin llamashard -- worker --shard-id 1 --coordinator localhost:50051");
    
    // Wait for Ctrl+C
    tokio::signal::ctrl_c().await?;
    coordinator.shutdown().await?;

    Ok(())
}
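
The example above requests two shards with `ShardStrategy::LayerSharding`. One plausible reading of that strategy is that each worker owns a contiguous range of transformer layers. The sketch below shows that partitioning under this assumption; it is not taken from llama-shard-rs, and `layer_ranges` is a hypothetical name.

```rust
use std::ops::Range;

/// Split `num_layers` layers into `num_shards` contiguous ranges.
/// Earlier shards receive one extra layer when the split is uneven.
/// (Hypothetical helper; the real sharding logic may differ.)
fn layer_ranges(num_layers: usize, num_shards: usize) -> Vec<Range<usize>> {
    assert!(num_shards > 0, "need at least one shard");
    let base = num_layers / num_shards;
    let extra = num_layers % num_shards;
    let mut start = 0;
    (0..num_shards)
        .map(|shard| {
            // The first `extra` shards take one additional layer.
            let len = base + usize::from(shard < extra);
            let range = start..start + len;
            start += len;
            range
        })
        .collect()
}
```

For a 32-layer model split across two shards, this yields `0..16` for shard 0 and `16..32` for shard 1. Activations would then flow from one worker to the next in shard order.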

Available Models

Text Models

| Model | Size | Quantization | Performance (tokens/sec) |
|---|---|---|---|
| Llama 3 Instruct | 8B | Q4 | ~30 (M2 Pro) |
| Llama 3 Instruct | 8B | Q8 | ~25 (M2 Pro) |
| Llama 2 Chat | 7B | Q4 | ~35 (M2 Pro) |
| Mistral Instruct | 7B | Q4 | ~32 (M2 Pro) |
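
As a rough guide to reading the table, the sketch below estimates weight memory from parameter count and quantization width, and generation time from throughput. These are back-of-the-envelope formulas, not measurements from the project: the weight estimate ignores quantization scales, the KV cache, and activations, and both function names are hypothetical.

```rust
/// Approximate size of the weights alone, in gigabytes (10^9 bytes).
/// Billions of parameters times bits per weight, divided by 8 bits per byte.
fn weight_gigabytes(params_billions: f64, bits_per_weight: f64) -> f64 {
    params_billions * bits_per_weight / 8.0
}

/// Approximate wall-clock time to generate `tokens` tokens at a steady
/// `tokens_per_sec` rate (prompt processing time is not included).
fn generation_seconds(tokens: u32, tokens_per_sec: f64) -> f64 {
    f64::from(tokens) / tokens_per_sec
}
```

By this estimate an 8B model needs about 4 GB for weights at Q4 and about 8 GB at Q8, and generating 100 tokens at ~30 tokens/sec takes a little over 3 seconds.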

Vision Models

| Model | Task | Size | Performance (images/sec) |
|---|---|---|---|
| MobileNet V2 | Classification | 14MB | ~90 (M2 Pro) |
| YOLOv8n | Detection | 25MB | ~45 (M2 Pro) |
| SegFormer-B0 | Segmentation | 14MB | ~30 (M2 Pro) |

Multimodal Models

| Model | Tasks | Size | Performance |
|---|---|---|---|
| LLaVA 1.6 | VQA, Captioning | 8.5GB | ~5 img/sec (M2 Pro) |
| MobileVLM | VQA, Captioning | 1.5GB | ~12 img/sec (M2 Pro) |

Documentation

Architecture

The LlamaMlx-RS ecosystem is designed with a modular architecture:

┌────────────────────────────────────────────────────────────┐
│                      Applications                          │
│   ┌─────────────┐  ┌──────────────┐  ┌───────────────┐     │
│   │ REST Server │  │ CLI Tools    │  │ GUI Apps      │     │
│   └─────────────┘  └──────────────┘  └───────────────┘     │
└────────────────────────────────────────────────────────────┘
               │              │              │
┌────────────────────────────────────────────────────────────┐
│                     ML Libraries                           │
│ ┌──────────┐ ┌─────────┐ ┌─────────┐ ┌─────┐ ┌──────────┐  │
│ │TextGen   │ │Embedding│ │Image    │ │VLM  │ │Sharding  │  │
│ └──────────┘ └─────────┘ └─────────┘ └─────┘ └──────────┘  │
└────────────────────────────────────────────────────────────┘
                           │
┌────────────────────────────────────────────────────────────┐
│                       Core Library                         │
│  ┌────────┐ ┌─────────┐ ┌────────┐ ┌────────┐ ┌────────┐   │
│  │Tensor  │ │Device   │ │Model   │ │Graph   │ │Ops     │   │
│  └────────┘ └─────────┘ └────────┘ └────────┘ └────────┘   │
└────────────────────────────────────────────────────────────┘
                           │
┌────────────────────────────────────────────────────────────┐
│                     Apple MLX Framework                    │
└────────────────────────────────────────────────────────────┘

Performance

LlamaMlx-RS is designed to take full advantage of Apple Silicon. In the project's reported benchmarks, it runs roughly 6-9% faster than the equivalent Python MLX implementation:

| Model | Task | LlamaMlx-RS | Python MLX | LlamaMlx-RS vs Python |
|---|---|---|---|---|
| Llama 3 (8B) | Generation | 30 tok/s | 28 tok/s | 1.07x faster |
| MobileNet | Image | 90 img/s | 85 img/s | 1.06x faster |
| Embedding | Embed | 250 txt/s | 230 txt/s | 1.09x faster |
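
The last column is the ratio of the two throughput figures. A minimal sketch of that arithmetic, with hypothetical function names, looks like this:

```rust
/// Ratio of a candidate throughput to a baseline throughput.
/// A value above 1.0 means the candidate is faster.
fn speedup(candidate: f64, baseline: f64) -> f64 {
    candidate / baseline
}

/// Format the ratio with two decimal places, e.g. "1.07x".
fn speedup_label(candidate: f64, baseline: f64) -> String {
    format!("{:.2}x", speedup(candidate, baseline))
}
```

For example, `speedup_label(30.0, 28.0)` gives `1.07x`, which matches the Llama 3 row.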

Contributing

Contributions are welcome! Please check out our contribution guidelines for details.

License

LlamaMlx-RS is licensed under the MIT License.
