
embedding-ingestion

embedding-ingestion is a packaged ingestion pipeline that loads a text dataset from MinIO, generates embeddings through an OpenAI-compatible embedding endpoint, and stores vectors in Qdrant.

The project is intentionally config-driven. The runner, loader, dataset, embedder, and processor classes are all resolved from module_path values, which makes the package reusable across datasets and model backends without changing application code.
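As a sketch of what resolving a class from a module_path string typically looks like (the helper name `resolve_class` is illustrative, not the package's actual utility):

```python
# Hypothetical sketch of module_path resolution: "pkg.mod.ClassName" is
# split into a module and an attribute, then imported dynamically.
import importlib


def resolve_class(module_path: str) -> type:
    """Import 'pkg.mod.ClassName' and return the ClassName object."""
    module_name, _, class_name = module_path.rpartition(".")
    module = importlib.import_module(module_name)
    return getattr(module, class_name)


# Example: resolve a standard-library class the same way a config entry would.
OrderedDict = resolve_class("collections.OrderedDict")
```

This is the pattern that lets the YAML swap loaders, stores, and embedders without touching application code.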

What it does

Current built-in pipeline:

  1. Load a TextDataset subclass from MinIO.
  2. Convert dataset rows into LangChain Document objects.
  3. Generate embeddings asynchronously.
  4. Recreate a Qdrant collection.
  5. Upsert document vectors and metadata into Qdrant.
  6. Verify that points were written successfully.
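The flow above can be sketched against hypothetical loader/store contracts. The `DocumentLoader`/`Store` names come from the Architecture section; the method names (`load`, `create`, `write`, `verify`) are assumptions, not the package's actual API:

```python
# A minimal, hypothetical orchestration mirroring the six steps above.
from typing import Any, Protocol


class DocumentLoader(Protocol):
    def load(self) -> list[Any]: ...


class Store(Protocol):
    def create(self) -> None: ...
    def write(self, docs: list[Any]) -> int: ...
    def verify(self, expected: int) -> bool: ...


def run_pipeline(loader: DocumentLoader, store: Store) -> bool:
    docs = loader.load()          # steps 1-2: fetch dataset, build Documents
    store.create()                # step 4: recreate the target collection
    written = store.write(docs)   # steps 3 and 5: embed, then upsert vectors
    return store.verify(written)  # step 6: confirm the points were written
```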

Architecture

The package defines three extension points:

  • DocumentLoader: responsible for fetching and optionally filtering documents.
  • Store: responsible for embedding, persistence, and post-write verification.
  • Runner: orchestrates the end-to-end ingestion flow.

At runtime, the CLI resolves the runner from the YAML config file at /config/config.yaml, then instantiates nested loader/store settings through Pydantic models.
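A hedged sketch of what those nested settings models might look like. The field names follow the example config in the Configuration section; the package's real models may differ:

```python
# Hypothetical Pydantic settings shapes; only a subset of fields is shown.
from pydantic import BaseModel


class LoaderSettings(BaseModel):
    module_path: str
    bucket: str
    dataset_module_path: str


class StoreSettings(BaseModel):
    module_path: str
    url: str
    collection_name: str


class RunnerSettings(BaseModel):
    module_path: str
    loader: LoaderSettings   # nested models are built from nested YAML maps
    store: StoreSettings
```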

Requirements

  • Python >=3.11,<3.13
  • A reachable MinIO-compatible object store
  • A reachable Qdrant instance
  • An OpenAI-compatible embeddings endpoint
  • A dataset class that subclasses retrievalbase.dataset.TextDataset

Installation

Local

Production dependencies:

make install

Development environment:

make dev-install

If you prefer using uv directly:

uv sync --group dev --all-extras

Configuration

The application expects a YAML config at /config/config.yaml by default.

Example:

module_path: embedding_ingestion.runners.MinioVLLMQdrantRunner

loader:
  module_path: embedding_ingestion.loaders.MinioLoader
  endpoint: minio:9000
  access_key: ${MINIO_ACCESS_KEY}
  secret_key: ${MINIO_SECRET_KEY}
  bucket: datasets
  dataset_module_path: your_project.datasets.MyTextDataset
  dataset_minio_path: corpora/my-dataset.parquet

store:
  module_path: embedding_ingestion.store.QdrantStore
  url: http://qdrant:6333
  collection_name: my_embeddings
  distance: cosine
  embedder:
    module_path: retrievalbase.evaluation.openai_compatible.OpenAICompatibleEmbedder
    model_name: text-embedding-3-large
    base_url: http://vllm:8000/v1
  processor:
    module_path: retrievalbase.evaluation.nomic.NomicProcessor
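Whether the application itself expands `${VAR}` placeholders such as `${MINIO_ACCESS_KEY}`, or relies on an external templating step, is not specified here. One minimal approach using only the standard library:

```python
# Expand ${NAME} placeholders from the environment before parsing the YAML.
# Whether the real application does this internally is an assumption.
import os


def expand_env(raw_yaml: str) -> str:
    """Replace $NAME / ${NAME} placeholders with environment values."""
    return os.path.expandvars(raw_yaml)


os.environ["MINIO_ACCESS_KEY"] = "example-key"
expanded = expand_env("access_key: ${MINIO_ACCESS_KEY}")
```

Note that `os.path.expandvars` leaves unset variables untouched, so a missing secret surfaces as a literal `${...}` string rather than an error.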

Running

CLI

After the config file is mounted or created at /config/config.yaml:

embedding-ingestion

Equivalent:

python -m embedding_ingestion.main

Docker

Build:

docker build -t embedding-ingestion .

Run:

docker run --rm \
  -v /absolute/path/to/config:/config \
  embedding-ingestion

The image entrypoint runs:

python -m embedding_ingestion.main

Best practices

Treat ingestion as destructive by default

QdrantStore.create() deletes and recreates the target collection before writing. Use a dedicated collection per run or environment, and do not point this job at a production collection unless full replacement is intended.
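One way to guard against accidental replacement is an explicit allow-list check before the destructive call. The guard function below is hypothetical, and the `qdrant_client` usage in the comments is shown only to illustrate the recreate-before-write pattern:

```python
# Refuse to recreate collections that are not explicitly marked replaceable.
def confirm_replacement(collection_name: str, allow_replace: set[str]) -> bool:
    """Return True only for collections on the replaceable allow-list."""
    return collection_name in allow_replace


# Usage against a live Qdrant (illustrative, not executed here):
# from qdrant_client import QdrantClient
# from qdrant_client.models import Distance, VectorParams
#
# client = QdrantClient(url="http://qdrant:6333")
# if confirm_replacement("my_embeddings", {"my_embeddings"}):
#     client.recreate_collection(
#         collection_name="my_embeddings",
#         vectors_config=VectorParams(size=3072, distance=Distance.COSINE),
#     )
```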

Use stable dataset classes

dataset_module_path must point to a concrete TextDataset subclass with a working from_minio(...) implementation. Keep that class in a versioned package so ingest behavior remains reproducible.
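A hedged sketch of such a class. The real base is retrievalbase.dataset.TextDataset; a stand-in base is stubbed here so the example stays self-contained, and the `from_minio(...)` signature is an assumption:

```python
# Hypothetical dataset class shape; the real TextDataset base and from_minio
# contract live in retrievalbase and may differ.
from dataclasses import dataclass


@dataclass
class TextDataset:  # stand-in for retrievalbase.dataset.TextDataset
    rows: list[dict]


class MyTextDataset(TextDataset):
    @classmethod
    def from_minio(cls, client, bucket: str, path: str) -> "MyTextDataset":
        # The real implementation would stream the object (e.g. Parquet)
        # from MinIO and parse it into structured rows.
        payload = client.get_object(bucket, path).read()
        return cls(rows=[{"text": line} for line in payload.decode().splitlines()])
```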

Validate non-app dependencies before running

Before executing the pipeline, confirm:

  • the MinIO bucket and object key exist
  • the Qdrant URL is reachable
  • the embedding endpoint is healthy
  • the selected embedding model returns the expected vector dimension
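A minimal preflight sketch for the last check. The endpoint URLs and the expected dimension are deployment-specific assumptions, and the HTTP calls are shown only as comments:

```python
# Fail fast if the embedding model's output dimension does not match the
# collection's vector size.
def check_vector_dim(vector: list[float], expected_dim: int) -> None:
    if len(vector) != expected_dim:
        raise ValueError(
            f"embedding dim {len(vector)} != expected {expected_dim}"
        )


# Usage against live services (illustrative, not executed here):
# import httpx
# resp = httpx.post(
#     "http://vllm:8000/v1/embeddings",
#     json={"model": "text-embedding-3-large", "input": "ping"},
# )
# check_vector_dim(resp.json()["data"][0]["embedding"], expected_dim=3072)
```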

Watch for deduplication behavior

The built-in loader groups rows by page_content and keeps the first metadata entry. If duplicate text with different metadata matters in your use case, change that behavior before relying on the default loader.
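The first-wins grouping can be approximated as follows (a sketch of the described behavior, not the loader's actual code):

```python
# Keep only the first row seen for each distinct page_content value;
# later duplicates and their metadata are dropped.
def dedupe_first_wins(rows: list[dict]) -> list[dict]:
    seen: dict[str, dict] = {}
    for row in rows:
        seen.setdefault(row["page_content"], row)
    return list(seen.values())
```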

Make runs idempotent where possible

Point IDs are generated deterministically from page_content and metadata. That is good for stable re-ingestion semantics, but only if your upstream dataset normalization is also stable.
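One deterministic scheme looks like this; whether the package uses exactly UUIDv5 over content plus sorted metadata is an assumption:

```python
# Derive a stable point ID from page_content and metadata. Sorting metadata
# keys makes the ID independent of dict insertion order.
import json
import uuid


def point_id(page_content: str, metadata: dict) -> str:
    payload = page_content + json.dumps(metadata, sort_keys=True)
    return str(uuid.uuid5(uuid.NAMESPACE_URL, payload))
```

Any nondeterminism upstream (unsorted tags, timestamps in metadata) changes the IDs and silently breaks re-ingestion idempotency.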

Extending the package

Add a custom implementation when you need a different source or vector store:

  • subclass DocumentLoader for a new ingestion source
  • subclass Store for a new destination
  • subclass Runner if orchestration needs to change

Then point the YAML module_path values at your custom classes.
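For example, a hypothetical loader that reads local text files instead of MinIO. The DocumentLoader base and its load() method name are assumptions based on the Architecture section, stubbed here so the sketch is self-contained:

```python
# Hypothetical custom loader: reads every .txt file under a directory.
from abc import ABC, abstractmethod
from pathlib import Path


class DocumentLoader(ABC):  # stand-in for the package's abstract contract
    @abstractmethod
    def load(self) -> list[dict]: ...


class LocalDirectoryLoader(DocumentLoader):
    """Loads .txt files from a local directory instead of MinIO."""

    def __init__(self, root: str) -> None:
        self.root = Path(root)

    def load(self) -> list[dict]:
        return [
            {"page_content": p.read_text(), "metadata": {"source": p.name}}
            for p in sorted(self.root.glob("*.txt"))
        ]
```

The loader's `module_path` in the YAML would then point at this class, e.g. a hypothetical `your_project.loaders.LocalDirectoryLoader`.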

Repository layout

src/embedding_ingestion/
  __init__.py      # abstract loader/store/runner contracts
  loaders.py       # MinIO dataset loader
  store.py         # Qdrant embedding store
  runners.py       # concrete pipeline runner
  settings.py      # config models
  utils.py         # runner loading and deterministic document IDs
  main.py          # CLI entrypoint
