
A shared FastAPI embedding service and LangChain-compatible remote client for reusing one embedding model across multiple applications and lowering VRAM usage on limited GPUs.


remote-embedding

remote-embedding packages two things together:

  • A FastAPI server that exposes a /embed API backed by local Hugging Face models.
  • A LangChain-compatible RemoteEmbeddings client that calls that server remotely.

This lets multiple applications share a single loaded embedding model instance instead of each process loading its own copy. On constrained GPUs, that reduces duplicated VRAM usage and makes it easier to serve embeddings from limited hardware.
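
To make the wire format concrete, an /embed call can be sketched as a plain JSON payload. The field names below (`inputs`, `model_name`) are assumptions for illustration only; the authoritative request schema is the OpenAPI page that FastAPI serves at /docs on a running server.

```python
import json

def build_embed_payload(texts, model_name=None):
    """Sketch of a JSON body for POST /embed.

    Field names are illustrative assumptions; check /docs on the
    running server for the actual schema.
    """
    payload = {"inputs": list(texts)}
    if model_name is not None:
        payload["model_name"] = model_name
    return payload

payload = build_embed_payload(["hello world"], model_name="BAAI/bge-base-en-v1.5")
body = json.dumps(payload)  # send with any HTTP client, e.g. requests.post(url, data=body)
```

In practice you would not build this payload by hand; the RemoteEmbeddings client described below does it for you.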

Install

pip install remote-embedding

Package Layout

The import package is remote_embedding.

from remote_embedding import RemoteEmbeddings

Run The Server

Set the environment variables your model needs. You can copy values from .env.example into your own .env file, or set them directly in the shell.

PowerShell:

$env:EMBEDDING_MODEL_NAME="BAAI/bge-base-en-v1.5"
$env:EMBEDDING_DIR="C:\path\to\model-cache"
$env:DEVICE="cpu"
$env:MAX_LOADED_MODELS="1"
$env:MAX_INPUTS_PER_REQUEST="128"
$env:EMBEDDING_BATCH_SIZE="32"
$env:MODEL_KWARGS='{"local_files_only": true, "trust_remote_code": true}'
$env:ENCODE_KWARGS='{"normalize_embeddings": true}'

Bash:

export EMBEDDING_MODEL_NAME=BAAI/bge-base-en-v1.5
export EMBEDDING_DIR=/path/to/model-cache
export DEVICE=cpu
export MAX_LOADED_MODELS=1
export MAX_INPUTS_PER_REQUEST=128
export EMBEDDING_BATCH_SIZE=32
export MODEL_KWARGS='{"local_files_only": true, "trust_remote_code": true}'
export ENCODE_KWARGS='{"normalize_embeddings": true}'

Start the API:

remote-embedding-server

Or:

python -m remote_embedding

You can also configure the server with CLI flags instead of environment variables:

remote-embedding-server \
  --host 0.0.0.0 \
  --port 5055 \
  --model-name BAAI/bge-base-en-v1.5 \
  --embedding-dir /path/to/model-cache \
  --device cuda \
  --max-loaded-models 1 \
  --max-inputs-per-request 128 \
  --embedding-batch-size 32 \
  --model-kwargs '{"local_files_only": true, "trust_remote_code": true}' \
  --encode-kwargs '{"normalize_embeddings": true}'

Defaults:

  • HOST=0.0.0.0
  • PORT=5055

CLI flags override environment variables for the current process.

Configuration

Server configuration:

  • HOST: bind address for the FastAPI server
  • PORT: bind port for the FastAPI server
  • EMBEDDING_MODEL_NAME: default model to preload and use when a request does not pass model_name
  • EMBEDDING_DIR: optional local cache/model directory for Hugging Face downloads or local files
  • DEVICE: device passed to HuggingFaceEmbeddings, such as cpu or cuda
  • MAX_LOADED_MODELS: maximum number of embedding model instances kept in memory, default 1
  • MAX_INPUTS_PER_REQUEST: maximum number of strings accepted in one /embed request, default 128
  • EMBEDDING_BATCH_SIZE: default encoder batch_size, default 32
  • MODEL_KWARGS: JSON object merged into HuggingFaceEmbeddings(..., model_kwargs=...)
  • ENCODE_KWARGS: JSON object passed to HuggingFaceEmbeddings(..., encode_kwargs=...)
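
MODEL_KWARGS and ENCODE_KWARGS hold JSON objects, so JSON booleans become Python booleans after parsing. A minimal sketch of that parsing step (the environment assignment stands in for the shell exports above; the real server's parsing code may differ):

```python
import json
import os

# Stand-in for `export MODEL_KWARGS=...` in the shell.
os.environ["MODEL_KWARGS"] = '{"local_files_only": true, "trust_remote_code": true}'

# Parse JSON-valued settings, falling back to empty dicts when unset.
model_kwargs = json.loads(os.environ.get("MODEL_KWARGS", "{}"))
encode_kwargs = json.loads(os.environ.get("ENCODE_KWARGS", "{}"))

# JSON `true` becomes Python True, ready to pass as keyword arguments.
print(model_kwargs)
```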

Client configuration through RemoteEmbeddings(...):

  • base_url: full server URL, such as http://127.0.0.1:5055
  • timeout: request timeout in seconds
  • expected_dimensions: optional validation for returned vector size
  • model_name: optional per-client default model name sent with each request
  • embedding_dir: optional per-client cache/model directory override sent with each request
  • model_kwargs: optional JSON-serializable dict sent to the server and merged into model_kwargs
  • encode_kwargs: optional JSON-serializable dict sent to the server as encode_kwargs
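
The expected_dimensions option guards against a server that is silently serving a different model than you expect. A sketch of the kind of check it enables (illustrative only, not the package's actual implementation):

```python
def check_dimensions(vectors, expected_dimensions):
    """Raise if any returned vector has an unexpected length."""
    for i, vec in enumerate(vectors):
        if len(vec) != expected_dimensions:
            raise ValueError(
                f"vector {i} has {len(vec)} dimensions, expected {expected_dimensions}"
            )
    return vectors

# A 768-dimensional model should return 768-wide vectors.
check_dimensions([[0.0] * 768], expected_dimensions=768)
```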

If EMBEDDING_MODEL_NAME is configured on the server, the server can preload one shared embedding model instance and let multiple applications reuse it. That is what saves VRAM versus loading the same model separately in each application process.

model_kwargs and encode_kwargs become part of the server-side model cache key. Different combinations can create different embedding instances. The server evicts older instances once MAX_LOADED_MODELS is exceeded, and defaults to keeping one model loaded to protect GPU memory.
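
The caching behavior described above can be sketched as a small least-recently-used cache whose key folds in the kwargs. This is an illustrative model, not the server's actual code; the real cache key and eviction policy may differ.

```python
import json
from collections import OrderedDict

class ModelCache:
    """LRU cache keyed by (model_name, model_kwargs, encode_kwargs)."""

    def __init__(self, max_loaded_models=1):
        self.max_loaded_models = max_loaded_models
        self._models = OrderedDict()

    def _key(self, model_name, model_kwargs, encode_kwargs):
        # Serialize kwargs deterministically so equal configs share a key.
        return (
            model_name,
            json.dumps(model_kwargs or {}, sort_keys=True),
            json.dumps(encode_kwargs or {}, sort_keys=True),
        )

    def get(self, model_name, model_kwargs=None, encode_kwargs=None, loader=None):
        key = self._key(model_name, model_kwargs, encode_kwargs)
        if key in self._models:
            self._models.move_to_end(key)       # mark most recently used
        else:
            self._models[key] = loader()        # expensive load in real life
            while len(self._models) > self.max_loaded_models:
                self._models.popitem(last=False)  # evict least recently used
        return self._models[key]

cache = ModelCache(max_loaded_models=1)
a = cache.get("model-a", loader=lambda: "instance-a")
b = cache.get("model-b", loader=lambda: "instance-b")  # evicts model-a
```

Note that the same model name with different model_kwargs produces a different key, which is why varying kwargs across clients can trigger extra loads and evictions.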

Use The Client

from remote_embedding import RemoteEmbeddings

embeddings = RemoteEmbeddings(
    base_url="http://127.0.0.1:5055",
    timeout=120,
    expected_dimensions=768,
    model_name="BAAI/bge-base-en-v1.5",
    embedding_dir="C:/models/cache",
    model_kwargs={"local_files_only": True, "trust_remote_code": True},
    encode_kwargs={"normalize_embeddings": True},
)

docs = embeddings.embed_documents(["hello world", "remote embeddings"])
query = embeddings.embed_query("search text")
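
With normalize_embeddings enabled, returned vectors are unit length, so a dot product gives cosine similarity directly. A small sketch with stand-in vectors (in practice you would use the embed_documents / embed_query outputs above):

```python
import math

def cosine(u, v):
    """Cosine similarity between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Stand-in unit vectors; real embeddings are much wider (e.g. 768-d).
doc_vec = [0.6, 0.8]
query_vec = [0.8, 0.6]
print(round(cosine(doc_vec, query_vec), 4))  # 0.96
```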

RAG Pipeline Usage

If your RAG pipeline currently loads a local embedding model inside each application process, you can replace that with RemoteEmbeddings and route embedding calls to one shared server.

Before:

from langchain_huggingface import HuggingFaceEmbeddings

embed_model = HuggingFaceEmbeddings(
    model_name="Qwen/Qwen3-Embedding-0.6B",
    model_kwargs={"device": "cuda", "local_files_only": True},
    cache_folder=EMBEDDING_DIR,  # your local model cache directory
)

After:

from remote_embedding import RemoteEmbeddings

embed_model = RemoteEmbeddings(
    base_url="http://127.0.0.1:5055",
    model_name="Qwen/Qwen3-Embedding-0.6B",
    embedding_dir="C:/models/cache",
    encode_kwargs={"normalize_embeddings": True},
)

This makes it easier for multiple RAG applications, workers, or services to share the same loaded embedding model instead of each loading its own copy into GPU memory.

Build For PyPI

Build distributions locally:

python -m pip install --upgrade build
python -m build

This creates:

  • dist/*.tar.gz
  • dist/*.whl

Upload with Twine:

python -m pip install --upgrade twine
python -m twine upload dist/*

Contributing

Contributions are welcome through issues and pull requests.

Typical local workflow:

git clone git@github.com:MeshkatShB/remote-embedding.git
cd remote-embedding
python -m pip install --upgrade build
python -m build

If you change packaging metadata, rebuild dist/ before opening a release-oriented pull request.

License

This project is licensed under the MIT License. See LICENSE for the full text.

Citation

If you use this project in research, infrastructure, or published work, cite the repository:

@software{bagheri_remote_embedding_2026,
  author = {Bagheri, Meshkat Shariat},
  title = {remote-embedding},
  year = {2026},
  url = {https://github.com/MeshkatShB/remote-embedding}
}

