
remote-embedding

remote-embedding packages two things together:

  • A FastAPI server that exposes a /embed API backed by local Hugging Face models.
  • A LangChain-compatible RemoteEmbeddings client that calls that server remotely.

This lets multiple applications share a single loaded embedding model instance instead of each process loading its own copy. On constrained GPUs, that reduces duplicated VRAM usage and makes it easier to serve embeddings from limited hardware.

Install

pip install remote-embedding

Package Layout

The import package is remote_embedding.

from remote_embedding import RemoteEmbeddings

Run The Server

Set the environment variables your model needs. You can copy values from .env.example into your own .env file, or set them directly in the shell.

PowerShell:

$env:EMBEDDING_MODEL_NAME="BAAI/bge-base-en-v1.5"
$env:EMBEDDING_DIR="C:\path\to\model-cache"
$env:DEVICE="cpu"
$env:MODEL_KWARGS='{"local_files_only": true, "trust_remote_code": true}'
$env:ENCODE_KWARGS='{"normalize_embeddings": true}'

Bash:

export EMBEDDING_MODEL_NAME=BAAI/bge-base-en-v1.5
export EMBEDDING_DIR=/path/to/model-cache
export DEVICE=cpu
export MODEL_KWARGS='{"local_files_only": true, "trust_remote_code": true}'
export ENCODE_KWARGS='{"normalize_embeddings": true}'

You can also configure the server with CLI flags:

remote-embedding-server \
  --host 0.0.0.0 \
  --port 5055 \
  --model-name BAAI/bge-base-en-v1.5 \
  --embedding-dir /path/to/model-cache \
  --device cuda \
  --model-kwargs '{"local_files_only": true, "trust_remote_code": true}' \
  --encode-kwargs '{"normalize_embeddings": true}'

Start the API:

remote-embedding-server

Or:

python -m remote_embedding

Defaults:

  • HOST=0.0.0.0
  • PORT=5055

CLI flags override environment variables for the current process.
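One way to picture that precedence (a simplified sketch, not the package's actual argument handling):

```python
import os

def resolve_setting(flag_value, env_name: str, default=None):
    """CLI flag wins; otherwise fall back to the environment, then the default."""
    if flag_value is not None:
        return flag_value
    return os.environ.get(env_name, default)

os.environ["PORT"] = "5055"
print(resolve_setting(None, "PORT", "8000"))    # no flag: env value 5055
print(resolve_setting("6000", "PORT", "8000"))  # flag overrides env: 6000
```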

Configuration

Server configuration:

  • HOST: bind address for the FastAPI server
  • PORT: bind port for the FastAPI server
  • EMBEDDING_MODEL_NAME: default model to preload and use when a request does not pass model_name
  • EMBEDDING_DIR: optional local cache/model directory for Hugging Face downloads or local files
  • DEVICE: device passed to HuggingFaceEmbeddings, such as cpu or cuda
  • MODEL_KWARGS: JSON object merged into HuggingFaceEmbeddings(..., model_kwargs=...)
  • ENCODE_KWARGS: JSON object passed to HuggingFaceEmbeddings(..., encode_kwargs=...)
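Taken together, the server plausibly assembles its HuggingFaceEmbeddings arguments along these lines, with DEVICE merged into model_kwargs (as the "Before" example in the RAG section does). This helper is an illustrative sketch of that assembly, not the package's code:

```python
def build_embedding_kwargs(model_name, device, embedding_dir=None,
                           model_kwargs=None, encode_kwargs=None):
    """Assemble keyword arguments for a HuggingFaceEmbeddings-style constructor.

    DEVICE is merged into model_kwargs; explicit model_kwargs entries win.
    """
    merged = {"device": device, **(model_kwargs or {})}
    kwargs = {
        "model_name": model_name,
        "model_kwargs": merged,
        "encode_kwargs": encode_kwargs or {},
    }
    if embedding_dir:
        kwargs["cache_folder"] = embedding_dir
    return kwargs

print(build_embedding_kwargs(
    "BAAI/bge-base-en-v1.5", "cpu",
    model_kwargs={"local_files_only": True},
    encode_kwargs={"normalize_embeddings": True},
))
```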

Client configuration through RemoteEmbeddings(...):

  • base_url: full server URL, such as http://127.0.0.1:5055
  • timeout: request timeout in seconds
  • expected_dimensions: optional validation for returned vector size
  • model_name: optional per-client default model name sent with each request
  • embedding_dir: optional per-client cache/model directory override sent with each request
  • model_kwargs: optional JSON-serializable dict sent to the server and merged into model_kwargs
  • encode_kwargs: optional JSON-serializable dict sent to the server as encode_kwargs

If EMBEDDING_MODEL_NAME is configured on the server, the server can preload one shared embedding model instance and let multiple applications reuse it. That is what saves VRAM versus loading the same model separately in each application process.

model_kwargs and encode_kwargs become part of the server-side model cache key. That means different combinations can create different loaded embedding instances, which is flexible but can reduce the VRAM-sharing benefit if overused.
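A common way to key a cache on kwargs dicts is to serialize them canonically; a sketch of how such a key could be built (the actual package may do this differently):

```python
import json

def cache_key(model_name, embedding_dir=None, model_kwargs=None, encode_kwargs=None):
    """Build a hashable cache key from everything that affects the loaded model.

    json.dumps with sort_keys=True makes the key independent of dict insertion order.
    """
    return (
        model_name,
        embedding_dir,
        json.dumps(model_kwargs or {}, sort_keys=True),
        json.dumps(encode_kwargs or {}, sort_keys=True),
    )

a = cache_key("BAAI/bge-base-en-v1.5", model_kwargs={"local_files_only": True})
b = cache_key("BAAI/bge-base-en-v1.5", model_kwargs={"local_files_only": True})
c = cache_key("BAAI/bge-base-en-v1.5", model_kwargs={})
assert a == b  # same settings reuse one loaded model
assert a != c  # different model_kwargs load a separate instance
```

The practical takeaway: keep model_kwargs and encode_kwargs identical across clients that should share one model instance.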

Use The Client

from remote_embedding import RemoteEmbeddings

embeddings = RemoteEmbeddings(
    base_url="http://127.0.0.1:5055",
    timeout=120,
    expected_dimensions=768,
    model_name="BAAI/bge-base-en-v1.5",
    embedding_dir="C:/models/cache",
    model_kwargs={"local_files_only": True, "trust_remote_code": True},
    encode_kwargs={"normalize_embeddings": True},
)

docs = embeddings.embed_documents(["hello world", "remote embeddings"])
query = embeddings.embed_query("search text")
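The client returns plain Python float lists, so downstream similarity can be computed with nothing but the standard library. With normalize_embeddings enabled, cosine similarity reduces to a dot product; the toy vectors below stand in for embed_query / embed_documents output:

```python
import math

def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(x * x for x in b))
    return dot / norm

query_vec = [0.6, 0.8]
doc_vecs = [[0.6, 0.8], [0.8, -0.6]]
scores = [cosine_similarity(query_vec, d) for d in doc_vecs]
# scores[0] ≈ 1.0 (same direction), scores[1] ≈ 0.0 (orthogonal)
print(scores)
```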

RAG Pipeline Usage

If your RAG pipeline currently loads a local embedding model inside each application process, you can replace that with RemoteEmbeddings and route embedding calls to one shared server.

Before:

import os

from langchain_huggingface import HuggingFaceEmbeddings

# local model cache directory, configured per application process
EMBEDDING_DIR = os.environ.get("EMBEDDING_DIR")

embed_model = HuggingFaceEmbeddings(
    model_name="Qwen/Qwen3-Embedding-0.6B",
    model_kwargs={"device": "cuda", "local_files_only": True},
    cache_folder=EMBEDDING_DIR,
)

After:

from remote_embedding import RemoteEmbeddings

embed_model = RemoteEmbeddings(
    base_url="http://127.0.0.1:5055",
    model_name="Qwen/Qwen3-Embedding-0.6B",
    embedding_dir="C:/models/cache",
    encode_kwargs={"normalize_embeddings": True},
)

This makes it easier for multiple RAG applications, workers, or services to share the same loaded embedding model instead of each loading its own copy into GPU memory.

Build For PyPI

Build distributions locally:

python -m pip install --upgrade build
python -m build

This creates:

  • dist/*.tar.gz
  • dist/*.whl

Upload with Twine:

python -m pip install --upgrade twine
python -m twine upload dist/*

Contributing

Contributions are welcome through issues and pull requests.

Typical local workflow:

git clone git@github.com:MeshkatShB/remote-embedding.git
cd remote-embedding
python -m pip install --upgrade build
python -m build

If you change packaging metadata, rebuild dist/ before opening a release-oriented pull request.

License

This project is licensed under the MIT License. See LICENSE for the full text.

Citation

If you use this project in research, infrastructure, or published work, cite the repository:

@software{bagheri_remote_embedding_2026,
  author = {Bagheri, Meshkat Shariat},
  title = {remote-embedding},
  year = {2026},
  url = {https://github.com/MeshkatShB/remote-embedding}
}
