A shared FastAPI embedding service and LangChain-compatible remote client for reusing one embedding model across multiple applications and lowering VRAM usage on limited GPUs.

remote-embedding

remote-embedding packages two things together:

  • A FastAPI server that exposes a /embed API backed by local Hugging Face models.
  • A LangChain-compatible RemoteEmbeddings client that calls that server remotely.

This lets multiple applications share a single loaded embedding model instance instead of each process loading its own copy. On constrained GPUs, that reduces duplicated VRAM usage and makes it easier to serve embeddings from limited hardware.

Install

pip install remote-embedding

Package Layout

The import package is remote_embedding.

from remote_embedding import RemoteEmbeddings

Run The Server

Set the environment variables your model needs. You can copy values from .env.example into your own .env file, or set them directly in the shell.

PowerShell:

$env:EMBEDDING_MODEL_NAME="BAAI/bge-base-en-v1.5"
$env:EMBEDDING_DIR="C:\path\to\model-cache"
$env:DEVICE="cpu"
$env:MAX_LOADED_MODELS="1"
$env:MAX_INPUTS_PER_REQUEST="128"
$env:EMBEDDING_BATCH_SIZE="32"
$env:CLEAR_CUDA_CACHE_AFTER_REQUEST="true"
$env:MODEL_KWARGS='{"local_files_only": true, "trust_remote_code": true}'
$env:ENCODE_KWARGS='{"normalize_embeddings": true}'

Bash:

export EMBEDDING_MODEL_NAME=BAAI/bge-base-en-v1.5
export EMBEDDING_DIR=/path/to/model-cache
export DEVICE=cpu
export MAX_LOADED_MODELS=1
export MAX_INPUTS_PER_REQUEST=128
export EMBEDDING_BATCH_SIZE=32
export CLEAR_CUDA_CACHE_AFTER_REQUEST=true
export MODEL_KWARGS='{"local_files_only": true, "trust_remote_code": true}'
export ENCODE_KWARGS='{"normalize_embeddings": true}'

You can also configure the server with CLI flags:

remote-embedding-server \
  --host 0.0.0.0 \
  --port 5055 \
  --model-name BAAI/bge-base-en-v1.5 \
  --embedding-dir /path/to/model-cache \
  --device cuda \
  --max-loaded-models 1 \
  --max-inputs-per-request 128 \
  --embedding-batch-size 32 \
  --clear-cuda-cache-after-request \
  --model-kwargs '{"local_files_only": true, "trust_remote_code": true}' \
  --encode-kwargs '{"normalize_embeddings": true}'

Start the API:

remote-embedding-server

Or:

python -m remote_embedding

Defaults:

  • HOST=0.0.0.0
  • PORT=5055

CLI flags override environment variables for the current process.
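
The precedence rule above (CLI flag first, then environment variable, then built-in default) can be sketched as follows. This is a minimal illustration and not the package's actual implementation: resolve_setting, parse_args, and load_bind_address are hypothetical names, and only the HOST and PORT settings are shown.

```python
import argparse
import os


def resolve_setting(cli_value, env_name, default, environ=None):
    """Return the CLI value if given, else the environment value, else the default."""
    environ = os.environ if environ is None else environ
    if cli_value is not None:
        return cli_value
    return environ.get(env_name, default)


def parse_args(argv):
    # Flags default to None so "flag not passed" is distinguishable from a real value.
    parser = argparse.ArgumentParser(prog="remote-embedding-server")
    parser.add_argument("--host", default=None)
    parser.add_argument("--port", type=int, default=None)
    return parser.parse_args(argv)


def load_bind_address(argv, environ=None):
    """Resolve (host, port) using the documented defaults 0.0.0.0 and 5055."""
    args = parse_args(argv)
    host = resolve_setting(args.host, "HOST", "0.0.0.0", environ)
    port = int(resolve_setting(args.port, "PORT", 5055, environ))
    return host, port
```

Keeping every flag's argparse default at None is the design choice that matters here: a flag that was not passed never masks a value set in the environment.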

Configuration

Server configuration:

  • HOST: bind address for the FastAPI server
  • PORT: bind port for the FastAPI server
  • EMBEDDING_MODEL_NAME: default model to preload and use when a request does not pass model_name
  • EMBEDDING_DIR: optional local cache/model directory for Hugging Face downloads or local files
  • DEVICE: device passed to HuggingFaceEmbeddings, such as cpu or cuda
  • MAX_LOADED_MODELS: maximum number of embedding model instances kept in memory, default 1
  • MAX_INPUTS_PER_REQUEST: maximum number of strings accepted in one /embed request, default 128
  • EMBEDDING_BATCH_SIZE: default encoder batch_size, default 32
  • CLEAR_CUDA_CACHE_AFTER_REQUEST: clears unused CUDA allocator memory after each embedding request, default true
  • MODEL_KWARGS: JSON object merged into HuggingFaceEmbeddings(..., model_kwargs=...)
  • ENCODE_KWARGS: JSON object passed to HuggingFaceEmbeddings(..., encode_kwargs=...)
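
MODEL_KWARGS and ENCODE_KWARGS hold JSON objects, so a JSON true becomes a Python True once parsed. The sketch below shows one way such variables could be read and validated. load_json_object and build_model_kwargs are hypothetical helpers, and the merge order (DEVICE first, then MODEL_KWARGS on top) is an assumption rather than documented server behavior.

```python
import json
import os


def load_json_object(env_name, environ=None):
    """Parse an environment variable as a JSON object, returning {} when unset."""
    environ = os.environ if environ is None else environ
    raw = environ.get(env_name, "").strip()
    if not raw:
        return {}
    value = json.loads(raw)  # malformed JSON raises json.JSONDecodeError (a ValueError)
    if not isinstance(value, dict):
        raise ValueError(f"{env_name} must be a JSON object, got {type(value).__name__}")
    return value


def build_model_kwargs(environ=None):
    """Merge DEVICE with MODEL_KWARGS; assumed order, later keys win."""
    environ = os.environ if environ is None else environ
    merged = {"device": environ.get("DEVICE", "cpu")}
    merged.update(load_json_object("MODEL_KWARGS", environ))
    return merged
```

Rejecting anything that is not a JSON object early gives a clear error message instead of a failure deep inside model loading.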

Client configuration through RemoteEmbeddings(...):

  • base_url: full server URL, such as http://127.0.0.1:5055
  • timeout: request timeout in seconds
  • expected_dimensions: optional validation for returned vector size
  • model_name: optional per-client default model name sent with each request
  • embedding_dir: optional per-client cache/model directory override sent with each request
  • model_kwargs: optional JSON-serializable dict sent to the server and merged into model_kwargs
  • encode_kwargs: optional JSON-serializable dict sent to the server as encode_kwargs
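
As an illustration of what the expected_dimensions check guards against, the sketch below validates the length of every returned vector. check_dimensions is a hypothetical helper written for this example; the client's real validation logic and error type may differ.

```python
def check_dimensions(vectors, expected_dimensions=None):
    """Raise ValueError if any vector's length differs from expected_dimensions."""
    if expected_dimensions is None:
        return vectors  # validation is optional, so skip it when nothing is expected
    for index, vector in enumerate(vectors):
        if len(vector) != expected_dimensions:
            raise ValueError(
                f"vector {index} has {len(vector)} dimensions, "
                f"expected {expected_dimensions}"
            )
    return vectors
```

A check like this catches a mismatch between the model the server actually loaded and the vector size an existing index was built with.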

Call embeddings.close() when you are done with a long-lived client, or use RemoteEmbeddings as a context manager. This closes the client's HTTP connection pool. GPU memory is owned by the server process and is released when models are evicted or the server shuts down.

If EMBEDDING_MODEL_NAME is configured on the server, the server can preload one shared embedding model instance and let multiple applications reuse it. That is what saves VRAM versus loading the same model separately in each application process.

model_kwargs and encode_kwargs become part of the server-side model cache key. Different combinations can create different embedding instances. The server evicts older instances once MAX_LOADED_MODELS is exceeded, and defaults to keeping one model loaded to protect GPU memory.
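
The interaction between the cache key and MAX_LOADED_MODELS can be sketched with a small bounded cache. Everything here is an assumption made for illustration: ModelCache and make_cache_key are hypothetical names, the key is assumed to include model_name and embedding_dir alongside the two kwargs dicts, and least-recently-used is one plausible reading of "evicts older instances".

```python
import json
from collections import OrderedDict


def make_cache_key(model_name, embedding_dir, model_kwargs, encode_kwargs):
    """Build a hashable key; sort_keys makes dict key order irrelevant."""
    return (
        model_name,
        embedding_dir,
        json.dumps(model_kwargs or {}, sort_keys=True),
        json.dumps(encode_kwargs or {}, sort_keys=True),
    )


class ModelCache:
    """Keep at most max_loaded_models instances, evicting the least recently used."""

    def __init__(self, loader, max_loaded_models=1):
        self._loader = loader  # callable that builds a model instance
        self._max = max_loaded_models
        self._models = OrderedDict()

    def get(self, model_name, embedding_dir=None, model_kwargs=None, encode_kwargs=None):
        key = make_cache_key(model_name, embedding_dir, model_kwargs, encode_kwargs)
        if key in self._models:
            self._models.move_to_end(key)  # mark as most recently used
            return self._models[key]
        model = self._loader(model_name, embedding_dir, model_kwargs or {}, encode_kwargs or {})
        self._models[key] = model
        while len(self._models) > self._max:
            self._models.popitem(last=False)  # drop the oldest entry
        return model

    def __len__(self):
        return len(self._models)
```

With the default limit of one, two clients that alternate between different encode_kwargs would force a reload on every switch, so aligning kwargs across applications keeps the shared instance warm.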

Use The Client

from remote_embedding import RemoteEmbeddings

embeddings = RemoteEmbeddings(
    base_url="http://127.0.0.1:5055",
    timeout=120,
    expected_dimensions=768,
    model_name="BAAI/bge-base-en-v1.5",
    embedding_dir="C:/models/cache",
    model_kwargs={"local_files_only": True, "trust_remote_code": True},
    encode_kwargs={"normalize_embeddings": True},
)

docs = embeddings.embed_documents(["hello world", "remote embeddings"])
query = embeddings.embed_query("search text")
embeddings.close()

Or:

from remote_embedding import RemoteEmbeddings

with RemoteEmbeddings(base_url="http://127.0.0.1:5055") as embeddings:
    docs = embeddings.embed_documents(["hello world", "remote embeddings"])
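
As a possible next step, the vectors returned by embed_documents and embed_query can be compared directly. The sketch below ranks documents by dot product, which equals cosine similarity when the server encodes with normalize_embeddings set to true. rank_by_similarity is a hypothetical helper and not part of the package.

```python
def rank_by_similarity(query_vector, doc_vectors):
    """Return document indices sorted from most to least similar to the query."""
    scores = [
        sum(q * d for q, d in zip(query_vector, vector))  # dot product
        for vector in doc_vectors
    ]
    return sorted(range(len(doc_vectors)), key=lambda i: scores[i], reverse=True)
```

For example, rank_by_similarity(query, docs) applied to the values from the client snippets above would return the positions of the documents in order of relevance to the query.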

RAG Pipeline Usage

If your RAG pipeline currently loads a local embedding model inside each application process, you can replace that with RemoteEmbeddings and route embedding calls to one shared server.

Before:

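from langchain_huggingface import HuggingFaceEmbeddings

# Local model cache directory; each process loads its own copy of the model.
EMBEDDING_DIR = "C:/models/cache"

embed_model = HuggingFaceEmbeddings(
    model_name="Qwen/Qwen3-Embedding-0.6B",
    model_kwargs={"device": "cuda", "local_files_only": True},
    cache_folder=EMBEDDING_DIR,
)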

After:

from remote_embedding import RemoteEmbeddings

embed_model = RemoteEmbeddings(
    base_url="http://127.0.0.1:5055",
    model_name="Qwen/Qwen3-Embedding-0.6B",
    embedding_dir="C:/models/cache",
    encode_kwargs={"normalize_embeddings": True},
)

This makes it easier for multiple RAG applications, workers, or services to share the same loaded embedding model instead of each loading its own copy into GPU memory.
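
RAG indexing jobs often embed more documents than the server's MAX_INPUTS_PER_REQUEST limit (default 128) accepts in one call. The sketch below splits a large list into slices that stay under the limit. It assumes the client does not already chunk requests on its own, which this README does not state, and embed_in_chunks is a hypothetical helper that works with any object exposing embed_documents.

```python
def embed_in_chunks(embeddings, texts, max_inputs=128):
    """Embed texts in slices of at most max_inputs, preserving input order."""
    if max_inputs < 1:
        raise ValueError("max_inputs must be at least 1")
    vectors = []
    for start in range(0, len(texts), max_inputs):
        chunk = texts[start:start + max_inputs]
        vectors.extend(embeddings.embed_documents(chunk))
    return vectors
```

Passing the RemoteEmbeddings instance as the first argument and keeping max_inputs at or below the server's configured limit should keep each request within bounds.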

Build For PyPI

Build distributions locally:

python -m pip install --upgrade build
python -m build

This creates:

  • dist/*.tar.gz
  • dist/*.whl

Upload with Twine:

python -m pip install --upgrade twine
python -m twine upload dist/*

Contributing

Contributions are welcome through issues and pull requests.

Typical local workflow:

git clone git@github.com:MeshkatShB/remote-embedding.git
cd remote-embedding
python -m pip install --upgrade build
python -m build

If you change packaging metadata, rebuild dist/ before opening a release-oriented pull request.

License

This project is licensed under the MIT License. See LICENSE for the full text.

Citation

If you use this project in research, infrastructure, or published work, cite the repository:

@software{bagheri_remote_embedding_2026,
  author = {Bagheri, Meshkat Shariat},
  title = {remote-embedding},
  year = {2026},
  url = {https://github.com/MeshkatShB/remote-embedding}
}
