A shared FastAPI embedding service and LangChain-compatible remote client for reusing one embedding model across multiple applications and lowering VRAM usage on limited GPUs.

remote-embedding

remote-embedding packages two things together:

A FastAPI server that exposes a /embed API backed by local Hugging Face models.
A LangChain-compatible RemoteEmbeddings client that calls that server remotely.

This lets multiple applications share a single loaded embedding model instance instead of each process loading its own copy. On constrained GPUs, that reduces duplicated VRAM usage and makes it easier to serve embeddings from limited hardware.

Install

pip install remote-embedding

Package Layout

The import package is remote_embedding.

from remote_embedding import RemoteEmbeddings

Run The Server

Set the environment variables your model needs. You can copy values from .env.example into your own .env file, or set them directly in the shell.

PowerShell:

$env:EMBEDDING_MODEL_NAME="BAAI/bge-base-en-v1.5"
$env:EMBEDDING_DIR="C:\path\to\model-cache"
$env:DEVICE="cpu"
$env:MAX_LOADED_MODELS="1"
$env:MAX_INPUTS_PER_REQUEST="128"
$env:EMBEDDING_BATCH_SIZE="32"
$env:CLEAR_CUDA_CACHE_AFTER_REQUEST="true"
$env:MODEL_KWARGS='{"local_files_only": true, "trust_remote_code": true}'
$env:ENCODE_KWARGS='{"normalize_embeddings": true}'

Bash:

export EMBEDDING_MODEL_NAME=BAAI/bge-base-en-v1.5
export EMBEDDING_DIR=/path/to/model-cache
export DEVICE=cpu
export MAX_LOADED_MODELS=1
export MAX_INPUTS_PER_REQUEST=128
export EMBEDDING_BATCH_SIZE=32
export CLEAR_CUDA_CACHE_AFTER_REQUEST=true
export MODEL_KWARGS='{"local_files_only": true, "trust_remote_code": true}'
export ENCODE_KWARGS='{"normalize_embeddings": true}'
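
As a rough illustration of how variables like these could be read, the sketch below parses the same names from a mapping and falls back to the defaults documented under Configuration. The Settings dataclass and the load_settings, _json_object, and _flag helpers are hypothetical names chosen for this example; they are not the package's actual implementation, and the accepted boolean spellings are an assumption.

```python
import json
import os
from dataclasses import dataclass, field
from typing import Mapping, Optional


@dataclass
class Settings:
    """Hypothetical container for the documented server variables."""
    model_name: Optional[str] = None
    embedding_dir: Optional[str] = None
    device: Optional[str] = None
    max_loaded_models: int = 1
    max_inputs_per_request: int = 128
    embedding_batch_size: int = 32
    clear_cuda_cache_after_request: bool = True
    model_kwargs: dict = field(default_factory=dict)
    encode_kwargs: dict = field(default_factory=dict)


def _json_object(raw: Optional[str], name: str) -> dict:
    """Parse a JSON object from an environment value; empty means {}."""
    if not raw:
        return {}
    value = json.loads(raw)
    if not isinstance(value, dict):
        raise ValueError(f"{name} must be a JSON object")
    return value


def _flag(raw: Optional[str], default: bool) -> bool:
    """Interpret common truthy spellings (an assumption for this sketch)."""
    if raw is None:
        return default
    return raw.strip().lower() in {"1", "true", "yes", "on"}


def load_settings(env: Mapping[str, str] = os.environ) -> Settings:
    """Build Settings from a mapping shaped like os.environ."""
    return Settings(
        model_name=env.get("EMBEDDING_MODEL_NAME"),
        embedding_dir=env.get("EMBEDDING_DIR"),
        device=env.get("DEVICE"),
        max_loaded_models=int(env.get("MAX_LOADED_MODELS", "1")),
        max_inputs_per_request=int(env.get("MAX_INPUTS_PER_REQUEST", "128")),
        embedding_batch_size=int(env.get("EMBEDDING_BATCH_SIZE", "32")),
        clear_cuda_cache_after_request=_flag(
            env.get("CLEAR_CUDA_CACHE_AFTER_REQUEST"), True
        ),
        model_kwargs=_json_object(env.get("MODEL_KWARGS"), "MODEL_KWARGS"),
        encode_kwargs=_json_object(env.get("ENCODE_KWARGS"), "ENCODE_KWARGS"),
    )
```

Passing a plain dict instead of os.environ makes the parsing easy to check in isolation, which is why the mapping is a parameter here.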

You can also configure the server with CLI flags:

remote-embedding-server \
  --host 0.0.0.0 \
  --port 5055 \
  --model-name BAAI/bge-base-en-v1.5 \
  --embedding-dir /path/to/model-cache \
  --device cuda \
  --max-loaded-models 1 \
  --max-inputs-per-request 128 \
  --embedding-batch-size 32 \
  --clear-cuda-cache-after-request \
  --model-kwargs '{"local_files_only": true, "trust_remote_code": true}' \
  --encode-kwargs '{"normalize_embeddings": true}'

Start the API:

remote-embedding-server

Or:

python -m remote_embedding

Defaults:

HOST=0.0.0.0
PORT=5055

CLI flags override environment variables for the current process.
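
The precedence rule above (flag first, then environment variable, then built-in default) can be sketched with argparse. The resolve_bind function is a hypothetical helper written for this example and covers only --host and --port; the real entry point may be organized differently.

```python
import argparse
import os
from typing import Mapping, Optional, Sequence, Tuple


def resolve_bind(
    argv: Optional[Sequence[str]] = None,
    env: Mapping[str, str] = os.environ,
) -> Tuple[str, int]:
    """Resolve host and port: CLI flag, then environment, then default."""
    parser = argparse.ArgumentParser(prog="remote-embedding-server")
    # default=None lets us tell "flag omitted" apart from "flag given".
    parser.add_argument("--host", default=None)
    parser.add_argument("--port", type=int, default=None)
    args = parser.parse_args(argv)

    host = args.host if args.host is not None else env.get("HOST", "0.0.0.0")
    port = args.port if args.port is not None else int(env.get("PORT", "5055"))
    return host, port
```

For example, resolve_bind(["--port", "9000"], {"PORT": "8000"}) picks port 9000 because the flag wins, while resolve_bind([], {"PORT": "8000"}) falls back to the environment value.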

Configuration

Server configuration:

HOST: bind address for the FastAPI server
PORT: bind port for the FastAPI server
EMBEDDING_MODEL_NAME: default model to preload and use when a request does not pass model_name
EMBEDDING_DIR: optional local cache/model directory for Hugging Face downloads or local files
DEVICE: device passed to HuggingFaceEmbeddings, such as cpu or cuda
MAX_LOADED_MODELS: maximum number of embedding model instances kept in memory, default 1
MAX_INPUTS_PER_REQUEST: maximum number of strings accepted in one /embed request, default 128
EMBEDDING_BATCH_SIZE: default encoder batch_size, default 32
CLEAR_CUDA_CACHE_AFTER_REQUEST: clears unused CUDA allocator memory after each embedding request, default true
MODEL_KWARGS: JSON object merged into HuggingFaceEmbeddings(..., model_kwargs=...)
ENCODE_KWARGS: JSON object passed to HuggingFaceEmbeddings(..., encode_kwargs=...)
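
MAX_INPUTS_PER_REQUEST and EMBEDDING_BATCH_SIZE limit different things: the first caps how many strings one request may carry, the second controls how many of the accepted strings the encoder processes at a time. The sketch below illustrates that distinction with two hypothetical helpers, check_request_size and batched. In practice the encoder usually performs the batching itself once it is given a batch_size, so this is only a model of the behavior, not the server's code.

```python
from typing import Iterator, List, Sequence


def check_request_size(texts: Sequence[str], max_inputs: int = 128) -> None:
    """Reject a request that carries more strings than the configured cap."""
    if len(texts) > max_inputs:
        raise ValueError(
            f"request has {len(texts)} inputs, limit is {max_inputs}"
        )


def batched(texts: Sequence[str], batch_size: int = 32) -> Iterator[List[str]]:
    """Yield consecutive slices of at most batch_size strings."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    for start in range(0, len(texts), batch_size):
        yield list(texts[start:start + batch_size])
```

With the documented defaults, a request of 70 strings passes the size check and is split into batches of 32, 32, and 6.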

Client configuration through RemoteEmbeddings(...):

base_url: full server URL, such as http://127.0.0.1:5055
timeout: request timeout in seconds
expected_dimensions: optional validation for returned vector size
model_name: optional per-client default model name sent with each request
embedding_dir: optional per-client cache/model directory override sent with each request
model_kwargs: optional JSON-serializable dict sent to the server and merged into model_kwargs
encode_kwargs: optional JSON-serializable dict sent to the server as encode_kwargs
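
To make the expected_dimensions option concrete, here is a minimal sketch of the kind of check it implies: every returned vector must have the expected length, and no check runs when the option is unset. The validate_dimensions function and its error message are hypothetical; the client's real validation and exception type may differ.

```python
from typing import Optional, Sequence


def validate_dimensions(
    vectors: Sequence[Sequence[float]],
    expected: Optional[int],
) -> None:
    """Raise ValueError if any vector's length differs from expected."""
    if expected is None:
        return  # Validation is optional; nothing to check.
    for index, vector in enumerate(vectors):
        if len(vector) != expected:
            raise ValueError(
                f"vector {index} has {len(vector)} dimensions, "
                f"expected {expected}"
            )
```

A check like this catches a mismatch between the model a client thinks it is using and the one the server actually loaded before the vectors reach a vector store.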

Call embeddings.close() when you are done with a long-lived client, or use RemoteEmbeddings as a context manager. This closes the client's HTTP connection pool. GPU memory is owned by the server process and is released when models are evicted or the server shuts down.

If EMBEDDING_MODEL_NAME is configured on the server, the server can preload one shared embedding model instance and let multiple applications reuse it. That is what saves VRAM versus loading the same model separately in each application process.

model_kwargs and encode_kwargs become part of the server-side model cache key. Different combinations can create different embedding instances. The server evicts older instances once MAX_LOADED_MODELS is exceeded, and defaults to keeping one model loaded to protect GPU memory.
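
The cache behavior described above can be sketched as a small bounded cache. Everything here is an assumption made for illustration: the ModelCache class, the make_key helper, the choice to include model_name in the key, and the least-recently-used eviction order (the README only says that older instances are evicted). The loader argument stands in for whatever actually constructs the model.

```python
import json
from collections import OrderedDict
from typing import Any, Callable, Optional, Tuple

CacheKey = Tuple[str, str, str]


def make_key(
    model_name: str,
    model_kwargs: Optional[dict],
    encode_kwargs: Optional[dict],
) -> CacheKey:
    """Build a hashable key; sort_keys makes dict ordering irrelevant."""
    return (
        model_name,
        json.dumps(model_kwargs or {}, sort_keys=True),
        json.dumps(encode_kwargs or {}, sort_keys=True),
    )


class ModelCache:
    """Keep at most max_loaded model instances, evicting the oldest used."""

    def __init__(self, loader: Callable[[str, dict, dict], Any],
                 max_loaded: int = 1) -> None:
        self._loader = loader
        self._max_loaded = max_loaded
        self._models: "OrderedDict[CacheKey, Any]" = OrderedDict()

    def get(self, model_name: str,
            model_kwargs: Optional[dict] = None,
            encode_kwargs: Optional[dict] = None) -> Any:
        key = make_key(model_name, model_kwargs, encode_kwargs)
        if key in self._models:
            # Cache hit: mark as most recently used and reuse the instance.
            self._models.move_to_end(key)
            return self._models[key]
        model = self._loader(model_name, model_kwargs or {}, encode_kwargs or {})
        self._models[key] = model
        # Drop least recently used entries until within the limit.
        while len(self._models) > self._max_loaded:
            self._models.popitem(last=False)
        return model

    def __len__(self) -> int:
        return len(self._models)
```

The practical consequence is that two clients sending the same model name but different encode_kwargs map to different keys, so with a limit of one they would evict each other's instance on every alternating request. Keeping kwargs identical across clients avoids that churn.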

Use The Client

from remote_embedding import RemoteEmbeddings

embeddings = RemoteEmbeddings(
    base_url="http://127.0.0.1:5055",
    timeout=120,
    expected_dimensions=768,
    model_name="BAAI/bge-base-en-v1.5",
    embedding_dir="C:/models/cache",
    model_kwargs={"local_files_only": True, "trust_remote_code": True},
    encode_kwargs={"normalize_embeddings": True},
)

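# Embed a batch of documents, then a single query, through the shared server.
docs = embeddings.embed_documents(["hello world", "remote embeddings"])
query = embeddings.embed_query("search text")
# Release the client's HTTP connection pool when finished.
embeddings.close()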

Or:

from remote_embedding import RemoteEmbeddings

with RemoteEmbeddings(base_url="http://127.0.0.1:5055") as embeddings:
    docs = embeddings.embed_documents(["hello world", "remote embeddings"])

RAG Pipeline Usage

If your RAG pipeline currently loads a local embedding model inside each application process, you can replace that with RemoteEmbeddings and route embedding calls to one shared server.

Before:

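from langchain_huggingface import HuggingFaceEmbeddings

# Local cache directory; each process loads its own copy of the model from here.
EMBEDDING_DIR = "C:/models/cache"

embed_model = HuggingFaceEmbeddings(
    model_name="Qwen/Qwen3-Embedding-0.6B",
    model_kwargs={"device": "cuda", "local_files_only": True},
    cache_folder=EMBEDDING_DIR,
)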

After:

from remote_embedding import RemoteEmbeddings

embed_model = RemoteEmbeddings(
    base_url="http://127.0.0.1:5055",
    model_name="Qwen/Qwen3-Embedding-0.6B",
    embedding_dir="C:/models/cache",
    encode_kwargs={"normalize_embeddings": True},
)

This makes it easier for multiple RAG applications, workers, or services to share the same loaded embedding model instead of each loading its own copy into GPU memory.
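
Because the swap only changes where vectors come from, retrieval code that depends on embed_documents and embed_query does not need to change. The sketch below shows a small in-memory cosine-similarity search written against that two-method interface. The Embedder protocol, the top_k function, and the CharCountEmbedder stand-in are hypothetical names for this example; the stand-in is a toy that counts letters so the code can run without a server or a model.

```python
import math
from typing import List, Protocol, Sequence, Tuple


class Embedder(Protocol):
    """The two methods this sketch relies on."""
    def embed_documents(self, texts: List[str]) -> List[List[float]]: ...
    def embed_query(self, text: str) -> List[float]: ...


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity; returns 0.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def top_k(embedder: Embedder, query: str, documents: Sequence[str],
          k: int = 3) -> List[Tuple[str, float]]:
    """Return the k documents most similar to the query, best first."""
    doc_vectors = embedder.embed_documents(list(documents))
    query_vector = embedder.embed_query(query)
    scored = [
        (doc, cosine(query_vector, vec))
        for doc, vec in zip(documents, doc_vectors)
    ]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:k]


class CharCountEmbedder:
    """Toy stand-in for offline runs: counts a-z letters. Not a real model."""
    _ALPHABET = "abcdefghijklmnopqrstuvwxyz"

    def embed_query(self, text: str) -> List[float]:
        lowered = text.lower()
        return [float(lowered.count(ch)) for ch in self._ALPHABET]

    def embed_documents(self, texts: List[str]) -> List[List[float]]:
        return [self.embed_query(text) for text in texts]
```

With a server running, passing a RemoteEmbeddings instance in place of CharCountEmbedder() should work the same way, since it exposes the same two methods.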

Build For PyPI

Build distributions locally:

python -m pip install --upgrade build
python -m build

This creates:

dist/*.tar.gz
dist/*.whl

Upload with Twine:

python -m pip install --upgrade twine
python -m twine upload dist/*

Contributing

Contributions are welcome through issues and pull requests.

Typical local workflow:

git clone git@github.com:MeshkatShB/remote-embedding.git
cd remote-embedding
python -m pip install --upgrade build
python -m build

If you change packaging metadata, rebuild dist/ before opening a release-oriented pull request.

License

This project is licensed under the MIT License. See LICENSE for the full text.

Citation

If you use this project in research, infrastructure, or published work, cite the repository:

@software{bagheri_remote_embedding_2026,
  author = {Bagheri, Meshkat Shariat},
  title = {remote-embedding},
  year = {2026},
  url = {https://github.com/MeshkatShB/remote-embedding}
}

Release history

0.3.1: May 23, 2026 (this version)
0.3.0: Apr 20, 2026
0.2.1: Apr 19, 2026
0.2.0: Apr 19, 2026
0.1.0: Apr 18, 2026

