embedding-ingestion
embedding-ingestion is a packaged ingestion pipeline that loads a text dataset from MinIO, generates embeddings through an OpenAI-compatible embedding endpoint, and stores vectors in Qdrant.
The project is intentionally config-driven. The runner, loader, dataset, embedder, and processor classes are all resolved from module_path values, which makes the package reusable across datasets and model backends without changing application code.
What it does
Current built-in pipeline:
- Load a TextDataset subclass from MinIO.
- Convert dataset rows into LangChain Document objects.
- Generate embeddings asynchronously.
- Recreate a Qdrant collection.
- Upsert document vectors and metadata into Qdrant.
- Verify that points were written successfully.
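The flow above can be sketched as a single orchestration method. The method names here are illustrative, not the package's actual API; the real implementation lives in runners.py:

```python
class SketchRunner:
    """Illustrative end-to-end flow; method names are assumptions."""

    def __init__(self, loader, store):
        self.loader = loader
        self.store = store

    def run(self):
        docs = self.loader.load()        # MinIO rows -> LangChain Documents
        self.store.create()              # drop and recreate the Qdrant collection
        self.store.write(docs)           # embed asynchronously, then upsert
        return self.store.verify(docs)   # confirm points were written
```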
Core components:
- MinioLoader in src/embedding_ingestion/loaders.py
- QdrantStore in src/embedding_ingestion/store.py
- MinioVLLMQdrantRunner in src/embedding_ingestion/runners.py
- CLI entrypoint in src/embedding_ingestion/main.py
Architecture
The package defines three extension points:
- DocumentLoader: responsible for fetching and optionally filtering documents.
- Store: responsible for embedding, persistence, and post-write verification.
- Runner: orchestrates the end-to-end ingestion flow.
At runtime, the CLI resolves the runner from the YAML config file at /config/config.yaml, then instantiates nested loader/store settings through Pydantic models.
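The nested instantiation can be illustrated with stdlib dataclasses standing in for the Pydantic models (field names follow the example configuration shown later; the actual models live in settings.py):

```python
from dataclasses import dataclass

@dataclass
class LoaderSettings:
    module_path: str
    bucket: str

@dataclass
class RunnerSettings:
    module_path: str
    loader: LoaderSettings

def parse(raw: dict) -> RunnerSettings:
    # Nested dicts in the parsed YAML become nested settings objects,
    # mirroring the shape of the config file.
    return RunnerSettings(
        module_path=raw["module_path"],
        loader=LoaderSettings(**raw["loader"]),
    )
```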
Requirements
- Python >=3.11,<3.13
- A reachable MinIO-compatible object store
- A reachable Qdrant instance
- An OpenAI-compatible embeddings endpoint
- A dataset class that subclasses retrievalbase.dataset.TextDataset
Installation
Local
Production dependencies:
make install
Development environment:
make dev-install
If you prefer using uv directly:
uv sync --group dev --all-extras
Configuration
The application expects a YAML config at /config/config.yaml by default.
Example:
module_path: embedding_ingestion.runners.MinioVLLMQdrantRunner
loader:
module_path: embedding_ingestion.loaders.MinioLoader
endpoint: minio:9000
access_key: ${MINIO_ACCESS_KEY}
secret_key: ${MINIO_SECRET_KEY}
bucket: datasets
dataset_module_path: your_project.datasets.MyTextDataset
dataset_minio_path: corpora/my-dataset.parquet
store:
module_path: embedding_ingestion.store.QdrantStore
url: http://qdrant:6333
collection_name: my_embeddings
distance: cosine
embedder:
module_path: retrievalbase.evaluation.openai_compatible.OpenAICompatibleEmbedder
model_name: text-embedding-3-large
base_url: http://vllm:8000/v1
processor:
module_path: retrievalbase.evaluation.nomic.NomicProcessor
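The ${MINIO_ACCESS_KEY}-style placeholders suggest environment-variable substitution. Whether the package expands these itself is not documented here; if you need to perform the expansion before handing the config over, a minimal pass looks like:

```python
import os
import re

_PLACEHOLDER = re.compile(r"\$\{([A-Z0-9_]+)\}")

def expand_env(value: str) -> str:
    """Replace each ${NAME} with os.environ['NAME']; fail loudly if unset."""
    def repl(match):
        name = match.group(1)
        if name not in os.environ:
            raise KeyError(f"environment variable {name} is not set")
        return os.environ[name]
    return _PLACEHOLDER.sub(repl, value)
```

Failing loudly on an unset variable is deliberate: a silently empty secret_key produces a confusing authentication error much later in the run.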
Running
CLI
After the config file is mounted or created at /config/config.yaml:
embedding-ingestion
Equivalent:
python -m embedding_ingestion.main
Docker
Build:
docker build -t embedding-ingestion .
Run:
docker run --rm \
-v /absolute/path/to/config:/config \
embedding-ingestion
The image entrypoint runs:
python -m embedding_ingestion.main
Best practices
Treat ingestion as destructive by default
QdrantStore.create() deletes and recreates the target collection before writing. Use a dedicated collection per run or environment, and do not point this job at a production collection unless full replacement is intended.
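One defensive pattern, sketched here as a hypothetical pre-flight guard (not part of the package), is to refuse collection names that look like live production targets before the destructive recreate runs:

```python
def assert_safe_collection(name: str, protected: tuple = ("prod", "production")) -> None:
    """Abort before a destructive recreate if the name looks like a live collection."""
    lowered = name.lower()
    if any(marker in lowered for marker in protected):
        raise RuntimeError(f"refusing to recreate protected collection: {name}")
```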
Use stable dataset classes
dataset_module_path must point to a concrete TextDataset subclass with a working from_minio(...) implementation. Keep that class in a versioned package so ingest behavior remains reproducible.
Validate non-app dependencies before running
Before executing the pipeline, confirm:
- the MinIO bucket and object key exist
- the Qdrant URL is reachable
- the embedding endpoint is healthy
- the selected embedding model returns the expected vector dimension
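The dimension check in particular is cheap to automate. This sketch assumes you can obtain a single embedding via some client call; embed_one is a placeholder for whatever your client exposes:

```python
def check_dimension(embed_one, expected_dim: int, probe_text: str = "healthcheck") -> None:
    """embed_one: callable returning one embedding vector for a text (client-specific)."""
    vector = embed_one(probe_text)
    if len(vector) != expected_dim:
        raise ValueError(
            f"embedding endpoint returned dimension {len(vector)}, expected {expected_dim}"
        )
```

Catching a dimension mismatch here is far cheaper than discovering it after the Qdrant collection has already been recreated with the wrong vector size.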
Watch for deduplication behavior
The built-in loader groups rows by page_content and keeps the first metadata entry. If duplicate text with different metadata matters in your use case, change that behavior before relying on the default loader.
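The first-wins grouping can be sketched as follows; this is behavior inferred from the description above, and the real code lives in loaders.py:

```python
def dedupe_first_wins(rows):
    """rows: iterable of (page_content, metadata). Keep the first metadata per text."""
    seen = {}
    for content, metadata in rows:
        if content not in seen:  # later duplicates are dropped, metadata and all
            seen[content] = metadata
    return [(content, metadata) for content, metadata in seen.items()]
```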
Make runs idempotent where possible
Point IDs are generated deterministically from page_content and metadata. That is good for stable re-ingestion semantics, but only if your upstream dataset normalization is also stable.
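Deterministic IDs of this kind are commonly built with uuid5 over a canonical serialization; a sketch of the idea (the package's exact scheme in utils.py may differ):

```python
import json
import uuid

def point_id(page_content: str, metadata: dict) -> str:
    """Stable UUID derived from content plus canonically serialized metadata."""
    canonical = json.dumps(
        {"page_content": page_content, "metadata": metadata},
        sort_keys=True,
        ensure_ascii=False,
    )
    return str(uuid.uuid5(uuid.NAMESPACE_URL, canonical))
```

Because the serialization is canonical (sorted keys), re-ingesting an identical document yields the identical point ID, so an upsert overwrites rather than duplicates.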
Extending the package
Add a custom implementation when you need a different source or vector store:
- subclass DocumentLoader for a new ingestion source
- subclass Store for a new destination
- subclass Runner if orchestration needs to change
Then point the YAML module_path values at your custom classes.
Repository layout
src/embedding_ingestion/
__init__.py # abstract loader/store/runner contracts
loaders.py # MinIO dataset loader
store.py # Qdrant embedding store
runners.py # concrete pipeline runner
settings.py # config models
utils.py # runner loading and deterministic document IDs
main.py # CLI entrypoint
Project details
File details
Details for the file embedding_ingestion-1.0.0.tar.gz.
File metadata
- Download URL: embedding_ingestion-1.0.0.tar.gz
- Upload date:
- Size: 115.3 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: uv/0.11.7 (Debian GNU/Linux 12 bookworm, CI build)
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 334b66e2612f32fbaa2c407df77f60673ddb988a23de8872b4b2b5784bbe7542 |
| MD5 | 24ba9c82c627e63c59dca8a964a5b95a |
| BLAKE2b-256 | c0f5fbd51957d70bc3929e7e224e063b997c4375721681b99d96243b2ee01302 |
File details
Details for the file embedding_ingestion-1.0.0-py3-none-any.whl.
File metadata
- Download URL: embedding_ingestion-1.0.0-py3-none-any.whl
- Upload date:
- Size: 9.1 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: uv/0.11.7 (Debian GNU/Linux 12 bookworm, CI build)
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | de44bccd5f17fd42bc7025b7886a46abde570ec2f989f2a5bd3ab19849dc70fb |
| MD5 | d033f74b6ea60655bcd325c005c0b749 |
| BLAKE2b-256 | 3fd8e9f9a546dfc1e88a171c04b8241d96c3d4aec82669855dc7a1540126db47 |