
retrievalbase

retrievalbase is a typed Python toolkit for building retrieval and evaluation workflows around structured text datasets.

It provides:

  • dataset connectors for loading and saving text corpora,
  • Polars-based dataset abstractions,
  • configurable preprocessing pipelines,
  • retrieval components such as BM25, dense retrieval, reranking, and vector stores,
  • evaluation components for scoring retrieval quality,
  • a config-driven runtime model based on Pydantic settings and dynamic component loading.

The project is designed around explicit component contracts rather than a single monolithic pipeline.

Why This Project Exists

Retrieval systems usually become brittle and hard to evolve when data loading, preprocessing, indexing, retrieval, reranking, and evaluation are tightly coupled.

This repository separates those concerns into components with clear interfaces:

  • connectors handle storage and transport,
  • datasets handle schema-aware tabular text data,
  • preprocessors transform text datasets,
  • retrievers execute candidate selection,
  • rerankers refine candidate ordering,
  • evaluators measure retrieval quality.

That separation makes it easier to:

  • swap backends without rewriting orchestration,
  • test behavior in isolation,
  • drive runtime composition from config,
  • keep experimentation reproducible.

Core Ideas

1. Config-Driven Components

Most runtime objects are built from Pydantic settings models derived from FromConfigMixinSettings.

Each config carries a module_path pointing to the concrete runtime class. The class is resolved dynamically with retrievalbase.utils.load_class(...) and instantiated through FromConfigMixin.

This pattern is used across:

  • connectors,
  • preprocessors,
  • token counters,
  • embedders,
  • vector stores,
  • rerankers,
  • retrievers,
  • evaluators,
  • ingestion pipelines.

2. Typed Interfaces

The repository uses abstract base classes to define stable contracts for component categories. Concrete implementations extend those contracts and provide backend-specific behavior.
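A minimal sketch of what such a contract looks like; the method name and classes below are illustrative assumptions, not the repository's actual definitions:

```python
from abc import ABC, abstractmethod


class Reranker(ABC):
    """Abstract contract: callers depend only on this interface."""

    @abstractmethod
    def rerank(self, query: str, candidates: list[str]) -> list[str]: ...


class IdentityReranker(Reranker):
    """A backend-specific implementation extends the contract."""

    def rerank(self, query: str, candidates: list[str]) -> list[str]:
        # Trivial behavior: keep the candidate order unchanged.
        return list(candidates)
```

Because the abstract class cannot be instantiated, a missing method in a new backend fails loudly at construction time rather than deep inside a pipeline.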

3. Polars As The Dataset Backbone

Datasets are represented with Polars DataFrame or LazyFrame values wrapped in repository dataset abstractions.

4. Text Dataset Contract

Text datasets are expected to contain:

  • page_content
  • metadata

Many higher-level components assume that schema.
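A minimal schema guard makes the contract concrete. This is a sketch; the real validation lives inside the dataset classes:

```python
# Columns required by the text dataset contract.
REQUIRED_COLUMNS = {"page_content", "metadata"}


def validate_text_schema(columns: set[str]) -> None:
    """Fail early with an actionable message when required columns are absent."""
    missing = REQUIRED_COLUMNS - columns
    if missing:
        raise ValueError(f"text dataset is missing columns: {sorted(missing)}")
```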

Repository Layout

.
├── AGENTS.md
├── Makefile
├── README.md
├── pyproject.toml
├── src/
│   └── retrievalbase/
│       ├── connector/
│       ├── dataset/
│       │   └── preprocess/
│       ├── evaluation/
│       │   ├── evaluators/
│       │   │   └── python/
│       │   ├── retrievers/
│       │   │   └── dense/
│       │   ├── async_batcher.py
│       │   ├── embedders.py
│       │   ├── processors.py
│       │   ├── rerankers.py
│       │   ├── settings.py
│       │   └── vector_stores.py
│       ├── ingestion/
│       ├── constants.py
│       ├── enums.py
│       ├── exceptions.py
│       ├── mixins.py
│       ├── settings.py
│       ├── types.py
│       └── utils.py
└── tests/
    ├── conftest.py
    ├── fixtures/
    │   ├── components.py
    │   └── data.py
    ├── integration/
    │   ├── test_dataset/
    │   └── test_evaluation/
    └── unit/
        ├── test_config/
        ├── test_connector/
        ├── test_dataset/
        ├── test_evaluation/
        ├── test_ingestion/
        └── test_utils/

High-level responsibility split:

  • connector/: load datasets from and persist them to external systems such as Parquet files and MinIO object storage.
  • dataset/: base dataset abstractions, Polars adapters, Hugging Face adapter, preprocessing, token counting.
  • evaluation/: embedders, processors, async batching, vector stores, rerankers, retrievers, Python evaluators.
  • ingestion/: ingestion pipelines that combine connectors and preprocessors.
  • tests/fixtures/: reusable test data builders, fake components, and component factories.
  • tests/conftest.py: global test setup shared across the suite.
  • tests/unit/test_*/: source-aligned unit test groups for isolated behavior and edge cases.
  • tests/integration/test_*/: multi-component integration tests grouped by module area.

Testing layout conventions:

  • mirror source areas with module-oriented test directories such as tests/unit/test_dataset and tests/integration/test_evaluation,
  • keep reusable component setup out of individual tests and build test components through shared factories in tests/fixtures,
  • add a local conftest.py only when a test group shares setup that should not be global,
  • prefer parametrized tests when the same behavior should be validated across multiple inputs or component variants.

Installation

Requirements

  • Python >=3.11,<3.13
  • uv, recommended for dependency management and command execution

Install Production Dependencies

make install

Install Developer Environment

make dev-install

This installs:

  • development dependencies,
  • optional extras,
  • pre-commit hooks.

Development Commands

The Makefile is the source of truth for local development tasks.

make format
make lint
make type-check
make security
make test
make test-cov
make ci
make ci-fast
make clean

Command meaning:

  • make format: run ruff format and ruff check --fix
  • make lint: run Ruff lint checks
  • make type-check: run ty check
  • make security: run Bandit
  • make test: run the test suite
  • make test-cov: run tests with coverage and enforce 80% minimum coverage
  • make ci: local CI equivalent
  • make ci-fast: faster loop without security gate

For narrow test runs during development, prefer targeting the relevant module directory, for example:

uv run pytest tests/unit/test_dataset
uv run pytest tests/integration/test_evaluation

Architecture Overview

Shared Infrastructure

Shared infrastructure lives in:

  • retrievalbase.mixins
  • retrievalbase.settings
  • retrievalbase.types
  • retrievalbase.utils

These modules provide:

  • config loading,
  • runtime factories,
  • reusable type variables,
  • dynamic module resolution,
  • shared schema helpers.

Connectors

Connectors are the storage boundary.

Base contract:

  • DatasetConnector

Current implementations:

  • ParquetDatasetConnector
  • MinioDatasetConnector

Connector rules:

  • _load() returns Polars data,
  • to(ds) persists a dataset,
  • connectors should not contain retrieval or preprocessing business logic.
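The shape of the contract can be sketched as follows. DatasetConnector, _load(), and to(ds) come from the rules above; the toy in-memory backend is invented purely to show the round trip:

```python
from abc import ABC, abstractmethod
from typing import Any


class DatasetConnector(ABC):
    """Storage boundary only: no retrieval or preprocessing logic."""

    @abstractmethod
    def _load(self) -> Any:
        """Return the stored data (Polars data in the real contract)."""

    @abstractmethod
    def to(self, ds: Any) -> None:
        """Persist a dataset."""


class InMemoryConnector(DatasetConnector):
    """Hypothetical backend used here only to demonstrate the round trip."""

    def __init__(self) -> None:
        self._store: Any = None

    def _load(self) -> Any:
        return self._store

    def to(self, ds: Any) -> None:
        self._store = ds
```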

Datasets

Base contracts:

  • Dataset
  • TextDataset

Concrete Polars implementations:

  • PolarsDataset
  • PolarsTextDataset

Dataset responsibilities:

  • expose Polars-backed operations,
  • validate required schema for text data,
  • provide convenience conversions and iteration helpers.

Preprocessing

Base contracts:

  • TextPreprocessor
  • TokenCounter

Current preprocessing components include token-based filters and preprocess pipelines.

Design rule:

  • preprocessors accept a TextDataset and return a TextDataset,
  • token counters stay focused on counting,
  • pipelines compose preprocessing steps instead of duplicating orchestration.
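The composition rule can be illustrated with plain functions; a list of document strings stands in for a TextDataset, and the helper names are hypothetical:

```python
from collections.abc import Callable

# Stand-in for a TextDataset: a list of document strings.
Dataset = list[str]


def token_filter(max_tokens: int) -> Callable[[Dataset], Dataset]:
    """Preprocessor: dataset in, dataset out."""
    def step(docs: Dataset) -> Dataset:
        # Whitespace tokenization keeps the sketch dependency-free.
        return [d for d in docs if len(d.split()) <= max_tokens]
    return step


def pipeline(*steps: Callable[[Dataset], Dataset]) -> Callable[[Dataset], Dataset]:
    """Compose preprocessing steps instead of duplicating orchestration."""
    def run(docs: Dataset) -> Dataset:
        for step in steps:
            docs = step(docs)
        return docs
    return run
```

Each step honors the dataset-in, dataset-out rule, so pipelines stay order-independent of backend details.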

Ingestion

Base runtime:

  • TextIngestionPipeline

Typical flow:

DatasetConnector -> TextDataset -> TextPreprocessor -> TextDataset

Evaluation Stack

Important contracts:

  • Processor
  • Embedder
  • VectorStore
  • Reranker
  • Retriever
  • Evaluator

Typical dense retrieval flow:

query -> Processor -> Embedder -> VectorStore -> Reranker -> results

Typical BM25 flow:

query -> Retriever over TextDataset -> optional Reranker -> results

Typical evaluation flow:

dataset + retriever -> evaluator -> scores
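A toy recall@k evaluator illustrates the dataset + retriever -> scores flow. Every name here is a hypothetical stand-in, not an evaluator from the codebase:

```python
from collections.abc import Callable


def recall_at_k(
    retrieve: Callable[[str], list[str]],
    queries: dict[str, set[str]],
    k: int = 5,
) -> float:
    """Fraction of queries whose top-k results contain a relevant doc id.

    `retrieve` maps a query to an ordered list of doc ids;
    `queries` maps each query to its set of relevant doc ids.
    """
    hits = sum(
        1 for q, relevant in queries.items() if relevant & set(retrieve(q)[:k])
    )
    return hits / len(queries)


# Toy retriever: a fixed ranking per query.
rankings = {"apples": ["d1", "d9"], "pears": ["d7", "d3"]}
score = recall_at_k(lambda q: rankings[q], {"apples": {"d1"}, "pears": {"d4"}}, k=1)
```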

Current evaluation coverage in the codebase includes:

  • async batching helpers,
  • BM25, dense, and hybrid retriever behavior,
  • reranker and vector store contracts,
  • Python evaluator runtime and score calculation paths.

How Components Are Composed

The system uses configuration to compose components instead of hard-coding most concrete classes.

Common pattern:

  1. Define a settings model.
  2. Include module_path.
  3. Validate config with Pydantic.
  4. Resolve the runtime class dynamically.
  5. Instantiate the runtime object from config.

This allows nested configuration.

For example:

  • a retriever config can include a reranker config,
  • an evaluator config can include a retriever config and a dataset connector config,
  • an ingestion pipeline can include both connector and preprocessor configs.

Minimal Example: Build A Text Dataset

from retrievalbase.dataset.polars import PolarsTextDataset

ds = PolarsTextDataset.from_records(
    [
        ("hello world", {"doc_id": "1"}),
        ("retrieval base", {"doc_id": "2"}),
    ]
)

print(ds.polars)

Minimal Example: Load Text Data From Parquet

from retrievalbase.dataset.polars import PolarsTextDataset

ds = PolarsTextDataset.from_parquet("data/corpus.parquet", lazy=True)
print(len(ds))

Minimal Example: Config-Driven Component Instantiation

from retrievalbase.utils import comp

component = comp("config/component.yaml", key="retriever")

The YAML entry must include a valid module_path.

Best Practices

Code Design

  • Prefer composition over deep inheritance.
  • Use inheritance only for stable contracts such as connectors, retrievers, rerankers, and evaluators.
  • Keep settings validation in settings models, not scattered through runtime logic.
  • Keep external I/O at the boundaries. Storage code belongs in connectors, not datasets or retrievers.
  • Keep public APIs typed and explicit.
  • Make failure modes clear and actionable.

Config Design

  • Always include module_path for dynamically loaded components.
  • Keep nested configs explicit instead of passing untyped dicts deep into the system.
  • Put environment-sensitive values such as secrets in settings-compatible sources rather than hard-coding them.
  • Reuse existing settings hierarchies before introducing parallel config models.

Dataset Design

  • Preserve the text dataset contract: page_content and metadata.
  • Validate schema as early as possible.
  • Prefer Polars-native transformations over row-by-row Python loops when possible.
  • Use lazy execution when loading large parquet corpora unless the operation requires eager materialization.

Retrieval And Evaluation

  • Keep embedding, vector search, reranking, and scoring as separate concerns.
  • Preserve batch ordering in async batch APIs.
  • Close async resources when implementations own clients or sockets.
  • Add tests for limit semantics, ordering guarantees, and empty input behavior.
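The ordering guarantee can be demonstrated with a small asyncio sketch. The embedder here is fake and exists only to show that results come back in submission order even when completion order differs:

```python
import asyncio


async def _embed_one(i: int, text: str) -> str:
    # Simulate out-of-order completion: earlier items take longer.
    await asyncio.sleep(0.003 * (8 - min(i, 7)))
    return text.upper()


async def embed_batch(texts: list[str]) -> list[str]:
    # asyncio.gather returns results in submission order, not completion
    # order, which is exactly the "preserve batch ordering" guarantee.
    return list(await asyncio.gather(*(
        _embed_one(i, t) for i, t in enumerate(texts)
    )))
```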

Testing

  • Put fast isolated logic under tests/unit.
  • Put multi-component behavior under tests/integration.
  • Test contracts, not just implementation details.
  • Add regression tests when fixing a bug.
  • Use fixtures and fakes to isolate external systems.

Dependency Hygiene

  • Avoid circular dependencies between feature modules.
  • Keep abstract interfaces backend-agnostic.
  • Add optional backend imports lazily and raise helpful installation errors.
  • Do not bypass the config-driven architecture with hard-coded concrete imports in orchestration layers unless there is a narrow local reason.

Recommended Workflow For Contributors

  1. Install the dev environment with make dev-install.
  2. Read AGENTS.md before making structural changes.
  3. Make focused changes in the relevant package slice.
  4. Add or update tests near the changed behavior.
  5. Run make ci before considering the change done.

Quality Bar

Changes should be considered complete only when they:

  • follow the typed component architecture,
  • preserve clean dependency direction,
  • include tests for changed behavior,
  • pass local CI expectations,
  • remain understandable without hidden assumptions.

Current Toolchain

Configured in pyproject.toml and Makefile:

  • Ruff for formatting and linting
  • Ty for static type checking
  • Pytest for tests
  • Pytest coverage with 80% minimum threshold
  • Bandit for security scanning
  • Hatchling for packaging
  • UV for environment and command management

Notes

  • The default YAML config path in shared settings is /config/config.yaml.
  • Some optional components require extra dependencies such as transformers or torch.
  • When adding new backends, keep those dependencies optional and fail lazily with actionable guidance.
