NPU Server (`npuserver`)

A high-performance Python library and Flask backend tailored specifically for running Large Language Models (LLMs) locally on Intel NPUs using OpenVINO GenAI.

This server provides an OpenAI-compatible API for integration with existing tools, explicit NPU memory management, and on-the-fly compilation of Hugging Face models into optimized NPU blobs.

Core Features

🚀 OpenAI-Compatible API: Integrate with existing LLM tooling (such as LangChain, AutoGen, or custom frontends) through the standard /v1/chat/completions endpoint, including real-time Server-Sent Events (SSE) streaming.
🧠 Strict Memory Management: The Intel NPU has limited, highly specialized memory, and this server gives you explicit control over it. Load and unload models programmatically while the server garbage-collects aggressively to prevent NPU memory leaks.
⚡ On-The-Fly Compilation: If a downloaded Hugging Face model has not yet been compiled for the NPU, the server intercepts the load request and compiles an optimized OpenVINO .blob before serving it.
🚫 No Background Downloads: To prevent runaway bandwidth usage and unexpected latency, the server requires models to be downloaded locally before it will load or compile them.
🌐 Static Model Registry: Ships with a decoupled, static HTML dashboard (serve/index.html), designed for hosting on GitHub Pages, that displays your available models without exposing backend connections.

📦 Installation

Ensure you have Python installed and your Intel NPU drivers configured properly on Windows.

# 1. Clone the repository
git clone https://github.com/durgasai299792458/npuserver.git
cd npuserver

# 2. Setup a virtual environment
python -m venv venv
venv\Scripts\activate

# 3. Install the package locally
pip install -e .

Required Core Dependencies: openvino-genai, flask, huggingface-hub

🛠️ Usage Guide

1. Starting the Server

The server runs on Flask. You can spin it up programmatically using the library:

from npuserver import run_server

# Starts the NPU backend on port 8080
run_server(port=8080)

2. Using the Python Client Library

You can remotely control the server's NPU memory directly from your Python scripts using the built-in client functions, without needing to write raw HTTP requests manually.

import npuserver

# 1. Fetch a list of all available models
models = npuserver.get_models_status(api_base_url="http://localhost:8080")
for m in models:
    print(f"{m['name']} is {m['status']}")

# 2. Explicitly load a model into NPU memory
print("Loading model...")
npuserver.load_model("durgasai299792458/Qwen3-4B-OpenVINO-INT4-npu-i")

# 3. Completely wipe the NPU memory and free resources
npuserver.unload_model()

# 4. Delete only compiled OpenVINO NPU cache files for a model
npuserver.delete_compiled("durgasai299792458/Qwen3-4B-OpenVINO-INT4-npu-i")

# 5. Completely delete compiled files AND downloaded HuggingFace snapshots
npuserver.delete("durgasai299792458/Qwen3-4B-OpenVINO-INT4-npu-i")
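
Because NPU memory is scarce, it can help to guarantee that a model is unloaded even when the code using it raises an exception. The following is a minimal sketch of that pattern, not part of the library: `loaded_model` is a hypothetical helper, and the load and unload functions are passed in as arguments so the helper makes no assumptions about their signatures beyond what the example above shows.

```python
from contextlib import contextmanager
from typing import Callable, Iterator


@contextmanager
def loaded_model(
    model_id: str,
    load: Callable[[str], object],
    unload: Callable[[], object],
) -> Iterator[str]:
    """Load a model on entry and always unload it on exit (hypothetical helper)."""
    load(model_id)
    try:
        yield model_id
    finally:
        # Runs on normal exit and on exceptions, so NPU memory is released either way.
        unload()
```

With the client functions shown above, this could be used as `with loaded_model("durgasai299792458/Qwen3-4B-OpenVINO-INT4-npu-i", npuserver.load_model, npuserver.unload_model): ...`.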

3. Downloading Models Manually

Because the server refuses to download models in the background, the model must already exist in the local cache before you load it. Download compatible models (such as Qwen2.5-3B-OpenVINO-INT4-npu or gemma-4-E2B-OpenVINO-INT4) with the Hugging Face utilities; the server also reads models that are already cached.
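
One possible way to pre-download a model is `snapshot_download` from `huggingface_hub`. This is a sketch under a stated assumption: it points `cache_dir` at the `hf` folder described under Cache & File Structure, but this README does not confirm that the server resolves snapshots from that exact layout, so verify it against your installation. `hf_cache_dir` and `download_model` are hypothetical helper names.

```python
from pathlib import Path
from typing import Optional


def hf_cache_dir(home: Optional[Path] = None) -> Path:
    """Return the hf/ folder under npuserver's default cache root."""
    return (home or Path.home()) / ".cache" / "npuserver" / "hf"


def download_model(repo_id: str, home: Optional[Path] = None) -> str:
    """Download a full model snapshot and return its local path."""
    # Imported lazily so this module loads even without huggingface-hub installed.
    from huggingface_hub import snapshot_download

    # Assumption: npuserver reads Hugging Face snapshots from its hf/ folder.
    return snapshot_download(repo_id=repo_id, cache_dir=str(hf_cache_dir(home)))
```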

📡 HTTP API Reference

If you are building your own frontend or using standard REST clients (like curl or Postman), use these endpoints:

🧠 Model Memory Management

`POST /load`

Loads a specific model into NPU memory. If another model is currently active, it is unloaded and garbage-collected first. If the requested model is downloaded but not yet compiled, the server compiles it before loading it.

curl -X POST http://localhost:8080/load \
     -H "Content-Type: application/json" \
     -d '{"model": "durgasai299792458/Qwen3-4B-OpenVINO-INT4-npu-i"}'
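
The same request can be built with the Python standard library, which avoids shell quoting differences between platforms. This is a minimal sketch: `build_load_request` and `send` are hypothetical helpers, and the long default timeout is an assumption made because compilation may take a while (this README does not say how long).

```python
import json
import urllib.request


def build_load_request(base_url: str, model_id: str) -> urllib.request.Request:
    """Build (but do not send) the POST /load request for a model."""
    body = json.dumps({"model": model_id}).encode("utf-8")
    return urllib.request.Request(
        url=base_url.rstrip("/") + "/load",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def send(request: urllib.request.Request, timeout: float = 600.0) -> bytes:
    """Send a request and return the raw response body.

    The timeout value is an assumption, chosen to leave room for compilation.
    """
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.read()
```

Calling `send(build_load_request("http://localhost:8080", "durgasai299792458/Qwen3-4B-OpenVINO-INT4-npu-i"))` would then issue the same request as the curl command.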

`POST /unload`

Explicitly unloads the currently active model and triggers Python garbage collection to free NPU memory immediately.

curl -X POST http://localhost:8080/unload

`GET /models`

Returns a consolidated JSON array of all models (active, compiled, and available remotely on your GitHub Pages registry).

`GET /health`

Returns the server's status (e.g., whether it is idle or currently processing a generation task).

`POST /delete` or `POST /v1/models/delete`

Deletes a model's compiled cache files and/or downloaded Hugging Face weights from the server.

curl -X POST http://localhost:8080/delete \
     -H "Content-Type: application/json" \
     -d '{"model": "durgasai299792458/Qwen3-4B-OpenVINO-INT4-npu-i", "compiled_only": true}'

`DELETE /v1/models/<model_name>`

Deletes a model's files from disk. Setting the query parameter ?compiled_only=true deletes only the compiled blob folder.

curl -X DELETE "http://localhost:8080/v1/models/durgasai299792458/Qwen3-4B-OpenVINO-INT4-npu-i?compiled_only=false"
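
When building this URL in code, the model ID and the query string need slightly different handling. The sketch below uses a hypothetical helper, `model_delete_url`, and assumes the route accepts the model ID with its slash left unescaped, as the curl example suggests.

```python
from urllib.parse import quote, urlencode


def model_delete_url(base_url: str, model_id: str, compiled_only: bool = False) -> str:
    """Build the URL for DELETE /v1/models/<model_name>."""
    # Keep "/" unescaped: the "owner/name" ID spans two path segments in the curl example.
    path = "/v1/models/" + quote(model_id, safe="/")
    query = urlencode({"compiled_only": "true" if compiled_only else "false"})
    return f"{base_url.rstrip('/')}{path}?{query}"
```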

💬 Inference

`POST /v1/chat/completions`

An OpenAI-compatible chat completions endpoint. Strict Execution Note: You MUST call /load before sending a chat completion request. If no model is loaded into NPU memory, the server rejects the request with 400 Bad Request.

curl -X POST http://localhost:8080/v1/chat/completions \
     -H "Content-Type: application/json" \
     -d '{
       "messages": [
         {"role": "system", "content": "You are a helpful AI assistant running on an Intel NPU."},
         {"role": "user", "content": "Write a short poem about microprocessors."}
       ],
       "max_tokens": 2048,
       "temperature": 0.7,
       "stream": true
     }'
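
With "stream": true, the response arrives as Server-Sent Events. The sketch below shows one way to extract the generated text from such a stream. It assumes the server emits chunks in the OpenAI streaming shape (`choices[0].delta.content`, terminated by `data: [DONE]`), which this README implies but does not spell out; `iter_stream_text` is a hypothetical helper.

```python
import json
from typing import Iterable, Iterator


def iter_stream_text(lines: Iterable[str]) -> Iterator[str]:
    """Yield text fragments from OpenAI-style SSE lines."""
    for raw in lines:
        line = raw.strip()
        if not line.startswith("data:"):
            continue  # skip blank separator lines and other SSE fields
        payload = line[len("data:"):].strip()
        if payload == "[DONE]":
            break  # OpenAI's end-of-stream sentinel
        chunk = json.loads(payload)
        for choice in chunk.get("choices", []):
            text = choice.get("delta", {}).get("content")
            if text:
                yield text
```

The function takes any iterable of decoded lines, so it can be fed from an HTTP response read line by line or from a recorded transcript when testing.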

📂 Cache & File Structure

By default, npuserver neatly organizes heavy model files in your user cache directory so your code repository stays clean and lightweight.

On Windows, models are stored at: C:\Users\<username>\.cache\npuserver\

...\hf\: Stores the raw weights and safetensors downloaded directly from the Hugging Face hub.
...\compiled\: Stores the optimized .blob files successfully compiled by OpenVINO GenAI.

Troubleshooting Tip: If a compilation fails, is interrupted, or leaves corrupted output, navigate to the compiled\ directory and delete that model's folder. The server will then attempt to recompile it from scratch on your next /load request.
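
To see what is currently in the compiled cache before deleting anything, a short script can list its folders. This is a sketch based only on the default layout described above; `cache_root` and `list_compiled_entries` are hypothetical helpers, and how the server names each model's folder is not documented here, so the script only lists names and does not map them to model IDs.

```python
from pathlib import Path
from typing import List, Optional


def cache_root(home: Optional[Path] = None) -> Path:
    """Return <home>/.cache/npuserver, the default location described above."""
    return (home or Path.home()) / ".cache" / "npuserver"


def list_compiled_entries(home: Optional[Path] = None) -> List[str]:
    """List the folder names under compiled/, or [] if it does not exist yet."""
    compiled = cache_root(home) / "compiled"
    if not compiled.is_dir():
        return []
    return sorted(p.name for p in compiled.iterdir() if p.is_dir())
```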

Release history

1.4.1: Jun 16, 2026
1.4.0: Jun 16, 2026
1.3.1: Jun 10, 2026
1.3.0: Jun 10, 2026
1.2.0: Jun 10, 2026 (this version)
1.1.3: Jun 10, 2026
1.1.2: Jun 9, 2026
1.1.1: Jun 9, 2026
1.0.2: Jun 2, 2026
0.1.0: May 23, 2026

Download files

Source Distribution

npuserver-1.2.0.tar.gz (16.9 kB), uploaded Jun 10, 2026

Built Distribution

npuserver-1.2.0-py3-none-any.whl (16.2 kB), uploaded Jun 10, 2026, Python 3

File details

Details for the file npuserver-1.2.0.tar.gz.

File metadata

Download URL: npuserver-1.2.0.tar.gz
Upload date: Jun 10, 2026
Size: 16.9 kB
Tags: Source
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.1.0 CPython/3.13.12

File hashes

Hashes for npuserver-1.2.0.tar.gz
Algorithm	Hash digest
SHA256	`bb0bcd98171dc879385b03268b902822a4dbfce7c0c45e308f164ebd2b824316`
MD5	`0c8d90bd23420048fee69a4b86b05989`
BLAKE2b-256	`d95a0d2c8df0a73df1dfbf221feb801eaec2eaef62be5c083ee20a845e1e0cf0`

File details

Details for the file npuserver-1.2.0-py3-none-any.whl.

File metadata

Download URL: npuserver-1.2.0-py3-none-any.whl
Upload date: Jun 10, 2026
Size: 16.2 kB
Tags: Python 3
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.1.0 CPython/3.13.12

File hashes

Hashes for npuserver-1.2.0-py3-none-any.whl
Algorithm	Hash digest
SHA256	`ab80443d48f8773fd4ec39d3a4081b530537f6d6f110da57826441709740bf90`
MD5	`7840af8ac6ddb7a4e993913799b9a073`
BLAKE2b-256	`3d86b2fa55bf4576ec7318f8a415dcd74a2f669592e85f69482254dac64f5641`
