
embedding-ingestion

embedding-ingestion is a packaged ingestion pipeline that loads a text dataset from MinIO, generates embeddings through an OpenAI-compatible embedding endpoint, and stores vectors in Qdrant.

The project is intentionally config-driven. The runner, loader, dataset, embedder, and processor classes are all resolved from module_path values, which makes the package reusable across datasets and model backends without changing application code.
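As a sketch of what resolving a class from a module_path string typically looks like (the helper name `resolve_class` is illustrative, not the package's actual utility):

```python
# Hypothetical sketch of module_path resolution: "pkg.mod.ClassName" is
# split into a module and an attribute, then imported dynamically.
import importlib


def resolve_class(module_path: str) -> type:
    """Import 'pkg.mod.ClassName' and return the ClassName object."""
    module_name, _, class_name = module_path.rpartition(".")
    module = importlib.import_module(module_name)
    return getattr(module, class_name)


# Example: resolve a standard-library class the same way a config entry would.
OrderedDict = resolve_class("collections.OrderedDict")
```

This is the pattern that lets the YAML swap loaders, stores, and embedders without touching application code.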

What it does

Current built-in pipeline:

  1. Load a TextDataset subclass from MinIO.
  2. Convert dataset rows into LangChain Document objects.
  3. Generate embeddings asynchronously.
  4. Recreate a Qdrant collection.
  5. Upsert document vectors and metadata into Qdrant.
  6. Verify that points were written successfully.
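The flow above can be sketched against hypothetical loader/store contracts. The `DocumentLoader`/`Store` names come from the Architecture section; the method names (`load`, `create`, `write`, `verify`) are assumptions, not the package's actual API:

```python
# A minimal, hypothetical orchestration mirroring the six steps above.
from typing import Any, Protocol


class DocumentLoader(Protocol):
    def load(self) -> list[Any]: ...


class Store(Protocol):
    def create(self) -> None: ...
    def write(self, docs: list[Any]) -> int: ...
    def verify(self, expected: int) -> bool: ...


def run_pipeline(loader: DocumentLoader, store: Store) -> bool:
    docs = loader.load()          # steps 1-2: fetch dataset, build Documents
    store.create()                # step 4: recreate the target collection
    written = store.write(docs)   # steps 3 and 5: embed, then upsert vectors
    return store.verify(written)  # step 6: confirm the points were written
```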

Architecture

The package defines three extension points:

  • DocumentLoader: responsible for fetching and optionally filtering documents.
  • Store: responsible for embedding, persistence, and post-write verification.
  • Runner: orchestrates the end-to-end ingestion flow.

At runtime, the CLI resolves the runner from the YAML config file at /config/config.yaml, then instantiates nested loader/store settings through Pydantic models.
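A hedged sketch of what those nested settings models might look like. The field names follow the example config in the Configuration section; the package's real models may differ:

```python
# Hypothetical Pydantic settings shapes; only a subset of fields is shown.
from pydantic import BaseModel


class LoaderSettings(BaseModel):
    module_path: str
    bucket: str
    dataset_module_path: str


class StoreSettings(BaseModel):
    module_path: str
    url: str
    collection_name: str


class RunnerSettings(BaseModel):
    module_path: str
    loader: LoaderSettings   # nested models are built from nested YAML maps
    store: StoreSettings
```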

Requirements

  • Python >=3.11,<3.13
  • A reachable MinIO-compatible object store
  • A reachable Qdrant instance
  • An OpenAI-compatible embeddings endpoint
  • A dataset class that subclasses retrievalbase.dataset.TextDataset

Installation

Local

Production dependencies:

make install

Development environment:

make dev-install

If you prefer using uv directly:

uv sync --group dev --all-extras

Configuration

The application expects a YAML config at /config/config.yaml by default.

Example:

module_path: embedding_ingestion.runners.MinioVLLMQdrantRunner

loader:
  module_path: embedding_ingestion.loaders.MinioLoader
  endpoint: minio:9000
  access_key: ${MINIO_ACCESS_KEY}
  secret_key: ${MINIO_SECRET_KEY}
  bucket: datasets
  dataset_module_path: your_project.datasets.MyTextDataset
  dataset_minio_path: corpora/my-dataset.parquet

store:
  module_path: embedding_ingestion.store.QdrantStore
  url: http://qdrant:6333
  collection_name: my_embeddings
  distance: cosine
  embedder:
    module_path: retrievalbase.evaluation.openai_compatible.OpenAICompatibleEmbedder
    model_name: text-embedding-3-large
    base_url: http://vllm:8000/v1
  processor:
    module_path: retrievalbase.evaluation.nomic.NomicProcessor
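Whether the application itself expands `${VAR}` placeholders such as `${MINIO_ACCESS_KEY}`, or relies on an external templating step, is not specified here. One minimal approach using only the standard library:

```python
# Expand ${NAME} placeholders from the environment before parsing the YAML.
# Whether the real application does this internally is an assumption.
import os


def expand_env(raw_yaml: str) -> str:
    """Replace $NAME / ${NAME} placeholders with environment values."""
    return os.path.expandvars(raw_yaml)


os.environ["MINIO_ACCESS_KEY"] = "example-key"
expanded = expand_env("access_key: ${MINIO_ACCESS_KEY}")
```

Note that `os.path.expandvars` leaves unset variables untouched, so a missing secret surfaces as a literal `${...}` string rather than an error.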

Running

CLI

After the config file is mounted or created at /config/config.yaml:

embedding-ingestion

Equivalent:

python -m embedding_ingestion.main

Docker

Build:

docker build -t embedding-ingestion .

Run:

docker run --rm \
  -v /absolute/path/to/config:/config \
  embedding-ingestion

The image entrypoint runs:

python -m embedding_ingestion.main

Best practices

Treat ingestion as destructive by default

QdrantStore.create() deletes and recreates the target collection before writing. Use a dedicated collection per run or environment, and do not point this job at a production collection unless full replacement is intended.
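One way to guard against accidental replacement is an explicit allow-list check before the destructive call. The guard function below is hypothetical, and the `qdrant_client` usage in the comments is shown only to illustrate the recreate-before-write pattern:

```python
# Refuse to recreate collections that are not explicitly marked replaceable.
def confirm_replacement(collection_name: str, allow_replace: set[str]) -> bool:
    """Return True only for collections on the replaceable allow-list."""
    return collection_name in allow_replace


# Usage against a live Qdrant (illustrative, not executed here):
# from qdrant_client import QdrantClient
# from qdrant_client.models import Distance, VectorParams
#
# client = QdrantClient(url="http://qdrant:6333")
# if confirm_replacement("my_embeddings", {"my_embeddings"}):
#     client.recreate_collection(
#         collection_name="my_embeddings",
#         vectors_config=VectorParams(size=3072, distance=Distance.COSINE),
#     )
```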

Use stable dataset classes

dataset_module_path must point to a concrete TextDataset subclass with a working from_minio(...) implementation. Keep that class in a versioned package so ingest behavior remains reproducible.
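A hedged sketch of such a class. The real base is retrievalbase.dataset.TextDataset; a stand-in base is stubbed here so the example stays self-contained, and the `from_minio(...)` signature is an assumption:

```python
# Hypothetical dataset class shape; the real TextDataset base and from_minio
# contract live in retrievalbase and may differ.
from dataclasses import dataclass


@dataclass
class TextDataset:  # stand-in for retrievalbase.dataset.TextDataset
    rows: list[dict]


class MyTextDataset(TextDataset):
    @classmethod
    def from_minio(cls, client, bucket: str, path: str) -> "MyTextDataset":
        # The real implementation would stream the object (e.g. Parquet)
        # from MinIO and parse it into structured rows.
        payload = client.get_object(bucket, path).read()
        return cls(rows=[{"text": line} for line in payload.decode().splitlines()])
```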

Validate non-app dependencies before running

Before executing the pipeline, confirm:

  • the MinIO bucket and object key exist
  • the Qdrant URL is reachable
  • the embedding endpoint is healthy
  • the selected embedding model returns the expected vector dimension
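A minimal preflight sketch for the last check. The endpoint URLs and the expected dimension are deployment-specific assumptions, and the HTTP calls are shown only as comments:

```python
# Fail fast if the embedding model's output dimension does not match the
# collection's vector size.
def check_vector_dim(vector: list[float], expected_dim: int) -> None:
    if len(vector) != expected_dim:
        raise ValueError(
            f"embedding dim {len(vector)} != expected {expected_dim}"
        )


# Usage against live services (illustrative, not executed here):
# import httpx
# resp = httpx.post(
#     "http://vllm:8000/v1/embeddings",
#     json={"model": "text-embedding-3-large", "input": "ping"},
# )
# check_vector_dim(resp.json()["data"][0]["embedding"], expected_dim=3072)
```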

Watch for deduplication behavior

The built-in loader groups rows by page_content and keeps the first metadata entry. If duplicate text with different metadata matters in your use case, change that behavior before relying on the default loader.
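The first-wins grouping can be approximated as follows (a sketch of the described behavior, not the loader's actual code):

```python
# Keep only the first row seen for each distinct page_content value;
# later duplicates and their metadata are dropped.
def dedupe_first_wins(rows: list[dict]) -> list[dict]:
    seen: dict[str, dict] = {}
    for row in rows:
        seen.setdefault(row["page_content"], row)
    return list(seen.values())
```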

Make runs idempotent where possible

Point IDs are generated deterministically from page_content and metadata. That is good for stable re-ingestion semantics, but only if your upstream dataset normalization is also stable.
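One deterministic scheme looks like this; whether the package uses exactly UUIDv5 over content plus sorted metadata is an assumption:

```python
# Derive a stable point ID from page_content and metadata. Sorting metadata
# keys makes the ID independent of dict insertion order.
import json
import uuid


def point_id(page_content: str, metadata: dict) -> str:
    payload = page_content + json.dumps(metadata, sort_keys=True)
    return str(uuid.uuid5(uuid.NAMESPACE_URL, payload))
```

Any nondeterminism upstream (unsorted tags, timestamps in metadata) changes the IDs and silently breaks re-ingestion idempotency.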

Extending the package

Add a custom implementation when you need a different source or vector store:

  • subclass DocumentLoader for a new ingestion source
  • subclass Store for a new destination
  • subclass Runner if orchestration needs to change

Then point the YAML module_path values at your custom classes.
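For example, a hypothetical loader that reads local text files instead of MinIO. The DocumentLoader base and its load() method name are assumptions based on the Architecture section, stubbed here so the sketch is self-contained:

```python
# Hypothetical custom loader: reads every .txt file under a directory.
from abc import ABC, abstractmethod
from pathlib import Path


class DocumentLoader(ABC):  # stand-in for the package's abstract contract
    @abstractmethod
    def load(self) -> list[dict]: ...


class LocalDirectoryLoader(DocumentLoader):
    """Loads .txt files from a local directory instead of MinIO."""

    def __init__(self, root: str) -> None:
        self.root = Path(root)

    def load(self) -> list[dict]:
        return [
            {"page_content": p.read_text(), "metadata": {"source": p.name}}
            for p in sorted(self.root.glob("*.txt"))
        ]
```

The loader's `module_path` in the YAML would then point at this class, e.g. a hypothetical `your_project.loaders.LocalDirectoryLoader`.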

Repository layout

src/embedding_ingestion/
  __init__.py      # abstract loader/store/runner contracts
  loaders.py       # MinIO dataset loader
  store.py         # Qdrant embedding store
  runners.py       # concrete pipeline runner
  settings.py      # config models
  utils.py         # runner loading and deterministic document IDs
  main.py          # CLI entrypoint
