
RK-Transformers: Accelerate Hugging Face Transformers on Rockchip NPUs


RK-Transformers is a runtime library that seamlessly integrates Hugging Face transformers and sentence-transformers with Rockchip's RKNN Neural Processing Units (NPUs). It enables efficient, straightforward deployment of transformer models on edge devices powered by Rockchip SoCs (RK3588, RK3576, etc.).

✨ Key Features

🔄 Model Export & Conversion

  • Automatic ONNX Export: Converts Hugging Face models to ONNX, detecting model inputs automatically
  • RKNN Optimization: Exports to RKNN format with configurable optimization levels (0-3)
  • Quantization: INT8 (w8a8) quantization with calibration dataset support
  • Push to Hub: Direct integration with Hugging Face Hub for model versioning

⚡ High-Performance Inference

  • NPU Acceleration: Leverages Rockchip's NPU hardware for up to a 10-20x speedup
  • Multi-Core Support: Automatic core selection and load balancing across NPU cores
  • Memory Efficient: Optimized for edge devices with limited RAM

🧩 Framework Integration

  • Sentence Transformers: Drop-in replacement with RKSentenceTransformer and RKCrossEncoder
  • Transformers API: Compatible with standard Hugging Face pipelines

📦 Installation

Prerequisites

  • Python 3.10 - 3.12
  • Linux-based OS (Ubuntu 24.04+ recommended)
  • For export: PC with x86_64/arm64 architecture
  • For inference: Rockchip device with RKNPU2 support (RK3588, RK3576, etc.)

Quick Install

uv is recommended for faster installation and a smaller environment footprint.

For Inference (on Rockchip devices [arm64])

uv venv
uv pip install rk-transformers[inference]

This installs runtime dependencies including:

  • rknn-toolkit-lite2 (2.3.2)
  • sentence-transformers (5.x)
  • numpy, torch, transformers

For Model Export (on development machines [x86_64, arm64])

uv venv
uv pip install rk-transformers[dev,export]
uv pip install torch==2.6.0+cpu --index-url https://download.pytorch.org/whl/cpu # workaround for rknn-toolkit2 dependency

This installs export dependencies including:

  • rknn-toolkit2 (2.3.2)
  • sentence-transformers (5.x)
  • numpy, torch, transformers, optimum[onnx], datasets

For Development (on development machines [x86_64, arm64])

# Clone the repository
git clone https://github.com/emapco/rk-transformers.git
cd rk-transformers

# Install with development tools
uv venv
uv pip install -e .[dev,export]
uv pip install torch==2.6.0+cpu --index-url https://download.pytorch.org/whl/cpu # workaround for rknn-toolkit2 dependency

🎯 Quick Start

1. Export a Model to RKNN

# Display help message with available options
rk-transformers-cli export -h 

# Export a Sentence Transformer model from Hugging Face Hub (float16)
rk-transformers-cli export \
  --model sentence-transformers/all-MiniLM-L6-v2 \
  --platform rk3588 \
  --flash-attention \
  --optimization-level 3

# Export with custom dataset for quantization (int8)
rk-transformers-cli export \
  --model sentence-transformers/all-MiniLM-L6-v2 \
  --platform rk3588 \
  --flash-attention \
  --quantize \
  --dtype w8a8 \
  --dataset sentence-transformers/natural-questions \
  --dataset-split train \
  --dataset-columns answer \
  --dataset-size 128 \
  --max-seq-length 128 # Default is 512

# Export a local ONNX model
rk-transformers-cli export \
  --model ./my-model/model.onnx \
  --platform rk3588 \
  --flash-attention \
  --batch-size 4 # Default is 1

2. Run Inference with Sentence Transformers

SentenceTransformer

from rktransformers import RKSentenceTransformer

model = RKSentenceTransformer(
    "rk-transformers/all-MiniLM-L6-v2",
    model_kwargs={
        "platform": "rk3588",
        "core_mask": "all",
    },
)

sentences = ["This is a test sentence", "Another example"]
embeddings = model.encode(sentences)
print(embeddings.shape)  # (2, 384)

# Load specific quantized model file
model = RKSentenceTransformer(
    "rk-transformers/all-MiniLM-L6-v2",
    model_kwargs={
        "platform": "rk3588",
        "file_name": "rknn/model_w8a8.rknn",
    },
)
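Because encode() returns plain NumPy arrays, similarity search needs no extra machinery. A minimal sketch (cosine_similarity is an illustrative helper, not part of the library; the stand-in vectors replace real encode() outputs):

```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> np.ndarray:
    """Pairwise cosine similarity between two batches of row vectors."""
    a = a / np.linalg.norm(a, axis=1, keepdims=True)
    b = b / np.linalg.norm(b, axis=1, keepdims=True)
    return a @ b.T

# Stand-in vectors; in practice these would be model.encode(...) outputs.
query = np.array([[1.0, 0.0, 1.0]])
docs = np.array([[1.0, 0.0, 1.0], [0.0, 1.0, 0.0]])
scores = cosine_similarity(query, docs)
print(scores)  # first document scores 1.0, second 0.0
```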

CrossEncoder

from rktransformers import RKCrossEncoder

model = RKCrossEncoder(
    "rk-transformers/ms-marco-MiniLM-L12-v2",
    model_kwargs={"platform": "rk3588", "core_mask": "auto"},
)

pairs = [
    ["How old are you?", "What is your age?"],
    ["Hello world", "Hi there!"],
    ["What is RKNN?", "This is a test."],
]
scores = model.predict(pairs)
print(scores)

query = "Hi there!"
documents = [
    "What is going on?",
    "I am 25 years old.",
    "This is a test.",
    "RKNN is a neural network toolkit.",
]
results = model.rank(query, documents)
print(results)

# Load specific quantized model file
model = RKCrossEncoder(
    "rk-transformers/ms-marco-MiniLM-L12-v2",
    model_kwargs={
        "platform": "rk3588",
        "file_name": "rknn/model_w8a8.rknn",
    },
)

3. Use RK-Transformers API

View the docs for all supported models and their example usage.

from transformers import AutoTokenizer

from rktransformers import RKModelForFeatureExtraction

# Load tokenizer and model
tokenizer = AutoTokenizer.from_pretrained("rk-transformers/all-MiniLM-L6-v2")
model = RKModelForFeatureExtraction.from_pretrained("rk-transformers/all-MiniLM-L6-v2", platform="rk3588", core_mask="auto")

# Tokenize and run inference
inputs = tokenizer(
    ["Sample text for embedding"],
    padding="max_length",
    truncation=True,
    return_tensors="np",
)

outputs = model(**inputs)
embeddings = outputs.last_hidden_state.mean(axis=1)  # Mean pooling
print(embeddings.shape)  # (1, 384)

# Load specific quantized model file
model = RKModelForFeatureExtraction.from_pretrained(
    "rk-transformers/all-MiniLM-L6-v2", platform="rk3588", file_name="rknn/model_w8a8.rknn"
)
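Note that the plain mean over axis 1 above also averages the padding positions introduced by padding="max_length". A mask-aware mean, sketched here in plain NumPy (masked_mean_pool is an illustrative helper, not a library function), excludes them by weighting with the tokenizer's attention_mask:

```python
import numpy as np

def masked_mean_pool(last_hidden_state: np.ndarray, attention_mask: np.ndarray) -> np.ndarray:
    """Mean-pool token embeddings, ignoring padding positions.

    last_hidden_state: (batch, seq_len, hidden)
    attention_mask:    (batch, seq_len), 1 for real tokens, 0 for padding
    """
    mask = attention_mask[..., None].astype(last_hidden_state.dtype)  # (batch, seq_len, 1)
    summed = (last_hidden_state * mask).sum(axis=1)                   # (batch, hidden)
    counts = np.clip(mask.sum(axis=1), 1e-9, None)                    # avoid divide-by-zero
    return summed / counts

# Toy check: the third position is padding and must not affect the mean.
hidden = np.array([[[1.0, 1.0], [3.0, 3.0], [100.0, 100.0]]])
mask = np.array([[1, 1, 0]])
print(masked_mean_pool(hidden, mask))  # [[2. 2.]]
```

With real inputs, pass outputs.last_hidden_state and inputs["attention_mask"] from the example above.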

4. Use Transformers Pipelines

from transformers import pipeline

from rktransformers import RKModelForMaskedLM

# Load the RKNN model
model = RKModelForMaskedLM.from_pretrained(
    "rk-transformers/bert-base-uncased", platform="rk3588", file_name="rknn/model_w8a8.rknn"
)

# Create a fill-mask pipeline with the RKNN-accelerated model
fill_mask = pipeline(
    "fill-mask",
    model=model,
    tokenizer="rk-transformers/bert-base-uncased",
    framework="pt",  # required for RKNN
)

# Run inference
results = fill_mask("Paris is the [MASK] of France.")
print(results)

⚙️ NPU Core Configuration

Rockchip SoCs with multiple NPU cores (like RK3588 with 3 cores or RK3576 with 2 cores) support flexible core allocation strategies through the core_mask parameter. Choosing the right core mask can optimize performance based on your workload and system conditions. For more details, refer to the RK-Transformers docs.

Available Core Mask Options

Note: core_mask is specified at inference time.

  • "auto": Automatic mode that dynamically selects idle cores. Recommended for most scenarios; the RKNN runtime provides load balancing.
  • "0": NPU core 0 only (fixed core assignment).
  • "1": NPU core 1 only (fixed core assignment).
  • "2": NPU core 2 only (fixed core assignment; RK3588 only).
  • "0_1": NPU cores 0 and 1 simultaneously; parallel execution across two cores for larger models.
  • "0_1_2": NPU cores 0, 1, and 2 simultaneously; maximum parallelism (RK3588 only) for demanding models.
  • "all": All available NPU cores; equivalent to "0_1_2" on RK3588 and "0_1" on RK3576.

Usage Examples

RK-Transformers API

from rktransformers import RKModelForFeatureExtraction

# Auto-select idle cores (recommended for production)
model = RKModelForFeatureExtraction.from_pretrained("rk-transformers/all-MiniLM-L6-v2", platform="rk3588", core_mask="auto")

# Use specific core for dedicated workloads
model = RKModelForFeatureExtraction.from_pretrained(
    "rk-transformers/all-MiniLM-L6-v2",
    platform="rk3588",
    core_mask="1",  # Reserve core 0 for other tasks
)

# Use all cores for maximum performance
model = RKModelForFeatureExtraction.from_pretrained("rk-transformers/all-MiniLM-L6-v2", platform="rk3588", core_mask="all")

Sentence Transformers Integration

from rktransformers import RKSentenceTransformer, RKCrossEncoder

model = RKSentenceTransformer(
    "rk-transformers/all-MiniLM-L6-v2",
    model_kwargs={
        "platform": "rk3588",
        "core_mask": "auto",
    },
)

model = RKCrossEncoder(
    "rk-transformers/ms-marco-MiniLM-L12-v2",
    model_kwargs={
        "platform": "rk3588",
        "core_mask": "auto",
    },
)

Architecture

Runtime Loading Workflow

  1. Model Discovery: RKModel.from_pretrained() searches for .rknn files
  2. Config Matching: Reads the rknn config in config.json to match platform and constraints
  3. Platform Validation: Checks compatibility with RKNNLite.list_support_target_platform()
  4. Runtime Init: Loads model to NPU with specified core mask
  5. Inference: Runs forward pass with automatic input/output handling

Cross-Component Communication

graph TB
    subgraph "Export Pipeline"
        HF[Hugging Face Model]
        OPT[Optimum ONNX Export]
        ONNX[ONNX Model]
        RKNN_TK[RKNN Toolkit]
        RKNN_FILE[.rknn File]
        
        HF -->|main_export| OPT
        OPT -->|ONNX graph| ONNX
        ONNX -->|load_onnx| RKNN_TK
        RKNN_TK -->|build/export| RKNN_FILE
    end
    
    subgraph "Inference Pipeline"
        RKNN_FILE -->|load| RKNN_LITE[RKNNLite Runtime]
        RKNN_LITE -->|init_runtime| NPU[RKNPU2 Hardware]
        NPU -->|inference| RESULTS[Model Outputs]
    end
    
    subgraph "Framework Integration"
        ST[Sentence Transformers]
        RKST[RKSentenceTransformer]
        RKCE[RKCrossEncoder]
        RKRT[RKModel Classes]
        HFT[Hugging Face Transformers]
        
        ST -->|subclasses| RKST
        ST -->|subclasses| RKCE
        RKST -->|load_rknn_model| RKRT
        RKCE -->|load_rknn_model| RKRT
        RKRT -->|inference| RKNN_LITE
        HFT -->|pipeline| RKRT
    end
    
    style NPU fill:#ff9900
    style RKNN_TK fill:#66ccff
    style RKNN_LITE fill:#66ccff

Configuration Files

config.json

The RKNN configuration is stored within the model's config.json file under the "rknn" key:

{
  "architectures": ["BertModel"],
  ...
  "rknn": {
    "model.rknn": {
      "platform": "rk3588",
      "batch_size": 1,
      "max_seq_length": 128,
      "model_input_names": ["input_ids", "attention_mask"],
      "quantized_dtype": "w8a8",
      "optimization_level": 3,
      ...
    },
    "rknn/optimized.rknn": {
      ...
    }
  }
}

The keys in the "rknn" object are relative paths to .rknn files, allowing multiple optimized variants per model.
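The variant selection this layout enables can be sketched as follows; find_rknn_variant is a hypothetical helper (not the library's actual loader) and assumes a locally downloaded model directory:

```python
import json
from pathlib import Path

def find_rknn_variant(model_dir: str, platform: str) -> Path:
    """Return the first .rknn file whose config entry matches the target platform.

    The "rknn" key in config.json maps relative .rknn paths to their
    export settings, so one repository can ship several variants.
    """
    root = Path(model_dir)
    config = json.loads((root / "config.json").read_text())
    for rel_path, meta in config.get("rknn", {}).items():
        candidate = root / rel_path
        if meta.get("platform") == platform and candidate.exists():
            return candidate
    raise FileNotFoundError(f"no .rknn model for platform {platform!r} in {model_dir}")
```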

🤝 Contributing

We welcome contributions! Please see CONTRIBUTING.md for guidelines.

📄 License

This project is licensed under the Apache License 2.0.

🙏 Acknowledgments

  • Hugging Face for the transformers, sentence-transformers and optimum libraries
  • Rockchip for RKNN toolkit and NPU hardware
