A shared FastAPI embedding service and LangChain-compatible remote client for reusing one embedding model across multiple applications and lowering VRAM usage on limited GPUs.
remote-embedding
remote-embedding packages two things together:
- A FastAPI server that exposes an `/embed` API backed by local Hugging Face models.
- A LangChain-compatible `RemoteEmbeddings` client that calls that server remotely.
This lets multiple applications share a single loaded embedding model instance instead of each process loading its own copy. On constrained GPUs, that reduces duplicated VRAM usage and makes it easier to serve embeddings from limited hardware.
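As an illustration of what a direct call to the server might look like, here is a sketch of an `/embed` request body. The field names are assumptions mirroring the `RemoteEmbeddings` constructor arguments described below; the FastAPI-generated interactive docs (served at `/docs` by default) are the authoritative schema.

```python
# Hypothetical /embed request body; field names are assumed to mirror
# the RemoteEmbeddings client parameters, not confirmed against the server.
payload = {
    "texts": ["hello world", "remote embeddings"],  # strings to embed
    "model_name": "BAAI/bge-base-en-v1.5",          # optional per-request model
}
# e.g. with the requests library, against a locally running server:
# requests.post("http://127.0.0.1:5055/embed", json=payload, timeout=120)
```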
Install
pip install remote-embedding
Package Layout
The import package is remote_embedding.
from remote_embedding import RemoteEmbeddings
Run The Server
Set the environment variables your model needs. You can copy values from .env.example into your own .env file, or set them directly in the shell.
PowerShell:
$env:EMBEDDING_MODEL_NAME="BAAI/bge-base-en-v1.5"
$env:EMBEDDING_DIR="C:\\path\\to\\model-cache"
$env:DEVICE="cpu"
$env:MODEL_KWARGS='{"local_files_only": true, "trust_remote_code": true}'
$env:ENCODE_KWARGS='{"normalize_embeddings": true}'
Bash:
export EMBEDDING_MODEL_NAME=BAAI/bge-base-en-v1.5
export EMBEDDING_DIR=/path/to/model-cache
export DEVICE=cpu
export MODEL_KWARGS='{"local_files_only": true, "trust_remote_code": true}'
export ENCODE_KWARGS='{"normalize_embeddings": true}'
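The same settings can live in a `.env` file, as suggested by `.env.example`. A minimal sketch (note the JSON values use lowercase `true`):

```
EMBEDDING_MODEL_NAME=BAAI/bge-base-en-v1.5
EMBEDDING_DIR=/path/to/model-cache
DEVICE=cpu
MODEL_KWARGS={"local_files_only": true, "trust_remote_code": true}
ENCODE_KWARGS={"normalize_embeddings": true}
```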
You can also configure the server with CLI flags:
remote-embedding-server \
--host 0.0.0.0 \
--port 5055 \
--model-name BAAI/bge-base-en-v1.5 \
--embedding-dir /path/to/model-cache \
--device cuda \
--model-kwargs '{"local_files_only": true, "trust_remote_code": true}' \
--encode-kwargs '{"normalize_embeddings": true}'
Start the API:
remote-embedding-server
Or:
python -m remote_embedding
Defaults:
- `HOST=0.0.0.0`
- `PORT=5055`
CLI flags override environment variables for the current process.
Configuration
Server configuration:
- `HOST`: bind address for the FastAPI server
- `PORT`: bind port for the FastAPI server
- `EMBEDDING_MODEL_NAME`: default model to preload and use when a request does not pass `model_name`
- `EMBEDDING_DIR`: optional local cache/model directory for Hugging Face downloads or local files
- `DEVICE`: device passed to `HuggingFaceEmbeddings`, such as `cpu` or `cuda`
- `MODEL_KWARGS`: JSON object merged into `HuggingFaceEmbeddings(..., model_kwargs=...)`
- `ENCODE_KWARGS`: JSON object passed to `HuggingFaceEmbeddings(..., encode_kwargs=...)`
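Because `MODEL_KWARGS` and `ENCODE_KWARGS` must be JSON strings, a quick sanity check before launching the server is to parse them with Python's `json` module. A common mistake is Python-style `True` instead of JSON's lowercase `true`:

```python
import json

# Values must be valid JSON objects; parsing them up front catches
# quoting mistakes before the server ever starts.
raw_model_kwargs = '{"local_files_only": true, "trust_remote_code": true}'
raw_encode_kwargs = '{"normalize_embeddings": true}'

model_kwargs = json.loads(raw_model_kwargs)
encode_kwargs = json.loads(raw_encode_kwargs)
print(model_kwargs)  # {'local_files_only': True, 'trust_remote_code': True}
```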
Client configuration through RemoteEmbeddings(...):
- `base_url`: full server URL, such as `http://127.0.0.1:5055`
- `timeout`: request timeout in seconds
- `expected_dimensions`: optional validation for returned vector size
- `model_name`: optional per-client default model name sent with each request
- `embedding_dir`: optional per-client cache/model directory override sent with each request
- `model_kwargs`: optional JSON-serializable dict sent to the server and merged into `model_kwargs`
- `encode_kwargs`: optional JSON-serializable dict sent to the server as `encode_kwargs`
If EMBEDDING_MODEL_NAME is configured on the server, the server can preload one shared embedding model instance and let multiple applications reuse it. That is what saves VRAM versus loading the same model separately in each application process.
model_kwargs and encode_kwargs become part of the server-side model cache key. That means different combinations can create different loaded embedding instances, which is flexible but can reduce the VRAM-sharing benefit if overused.
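One way to picture the cache-key behavior, as a simplified sketch rather than the package's actual implementation: the key is effectively the model name plus the canonicalized kwargs, so two clients share a loaded instance only when all of those match.

```python
import json

def cache_key(model_name, model_kwargs=None, encode_kwargs=None):
    """Illustrative cache key: model name plus canonicalized kwargs.
    The server's real key derivation may differ."""
    return (
        model_name,
        json.dumps(model_kwargs or {}, sort_keys=True),
        json.dumps(encode_kwargs or {}, sort_keys=True),
    )

a = cache_key("BAAI/bge-base-en-v1.5", {"local_files_only": True})
b = cache_key("BAAI/bge-base-en-v1.5", {"local_files_only": True})
c = cache_key("BAAI/bge-base-en-v1.5", {"local_files_only": False})
print(a == b)  # True  -> the same loaded instance is reused
print(a == c)  # False -> a second instance would be loaded
```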
Use The Client
from remote_embedding import RemoteEmbeddings
embeddings = RemoteEmbeddings(
base_url="http://127.0.0.1:5055",
timeout=120,
expected_dimensions=768,
model_name="BAAI/bge-base-en-v1.5",
embedding_dir="C:/models/cache",
model_kwargs={"local_files_only": True, "trust_remote_code": True},
encode_kwargs={"normalize_embeddings": True},
)
docs = embeddings.embed_documents(["hello world", "remote embeddings"])
query = embeddings.embed_query("search text")
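The `expected_dimensions` option can be pictured like this; the check below uses a dummy vector and is only an illustration of the idea, not the client's actual validation code:

```python
# Illustrative version of what an expected_dimensions check might do.
def check_dimensions(vector, expected):
    if expected is not None and len(vector) != expected:
        raise ValueError(
            f"server returned {len(vector)}-dim vector, expected {expected}"
        )
    return vector

dummy = [0.0] * 768  # stand-in for a BGE-base embedding
check_dimensions(dummy, 768)  # passes silently; a mismatch would raise
```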
RAG Pipeline Usage
If your RAG pipeline currently loads a local embedding model inside each application process, you can replace that with RemoteEmbeddings and route embedding calls to one shared server.
Before:
from langchain_huggingface import HuggingFaceEmbeddings
embed_model = HuggingFaceEmbeddings(
model_name="Qwen/Qwen3-Embedding-0.6B",
model_kwargs={"device": "cuda", "local_files_only": True},
cache_folder=EMBEDDING_DIR,
)
After:
from remote_embedding import RemoteEmbeddings
embed_model = RemoteEmbeddings(
base_url="http://127.0.0.1:5055",
model_name="Qwen/Qwen3-Embedding-0.6B",
embedding_dir="C:/models/cache",
encode_kwargs={"normalize_embeddings": True},
)
This makes it easier for multiple RAG applications, workers, or services to share the same loaded embedding model instead of each loading its own copy into GPU memory.
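The swap works because LangChain's embeddings interface only requires `embed_documents` and `embed_query`. The stub below is hypothetical, but it illustrates the contract that `RemoteEmbeddings` fulfills over HTTP, and a stub like this can stand in for the server when testing a pipeline offline:

```python
from typing import List

class StubEmbeddings:
    """Hypothetical stand-in exposing the same two-method interface
    that RemoteEmbeddings implements, useful for offline pipeline tests."""

    def embed_documents(self, texts: List[str]) -> List[List[float]]:
        # Fixed-size zero vectors stand in for real embeddings.
        return [[0.0] * 768 for _ in texts]

    def embed_query(self, text: str) -> List[float]:
        return [0.0] * 768

stub = StubEmbeddings()
vectors = stub.embed_documents(["a", "b"])
print(len(vectors), len(vectors[0]))  # 2 768
```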
Build For PyPI
Build distributions locally:
python -m pip install --upgrade build
python -m build
This creates:
- `dist/*.tar.gz`
- `dist/*.whl`
Upload with Twine:
python -m pip install --upgrade twine
python -m twine upload dist/*
Contributing
Contributions are welcome through issues and pull requests.
Typical local workflow:
git clone git@github.com:MeshkatShB/remote-embedding.git
cd remote-embedding
python -m pip install --upgrade build
python -m build
If you change packaging metadata, rebuild dist/ before opening a release-oriented pull request.
License
This project is licensed under the MIT License. See LICENSE for the full text.
Citation
If you use this project in research, infrastructure, or published work, cite the repository:
@software{bagheri_remote_embedding_2026,
author = {Bagheri, Meshkat Shariat},
title = {remote-embedding},
year = {2026},
url = {https://github.com/MeshkatShB/remote-embedding}
}