
retrievalbase

retrievalbase is a typed Python toolkit for building retrieval and evaluation workflows around structured text datasets.

It provides:

  • dataset connectors for loading and saving text corpora,
  • Polars-based dataset abstractions,
  • configurable preprocessing pipelines,
  • retrieval components such as BM25, dense retrieval, reranking, and vector stores,
  • evaluation components for scoring retrieval quality,
  • a config-driven runtime model based on Pydantic settings and dynamic component loading.

The project is designed around explicit component contracts rather than a single monolithic pipeline.

Why This Project Exists

Retrieval systems usually become brittle and hard to evolve when data loading, preprocessing, indexing, retrieval, reranking, and evaluation are tightly coupled.

This repository separates those concerns into components with clear interfaces:

  • connectors handle storage and transport,
  • datasets handle schema-aware tabular text data,
  • preprocessors transform text datasets,
  • retrievers execute candidate selection,
  • rerankers refine candidate ordering,
  • evaluators measure retrieval quality.

That separation makes it easier to:

  • swap backends without rewriting orchestration,
  • test behavior in isolation,
  • drive runtime composition from config,
  • keep experimentation reproducible.

Core Ideas

1. Config-Driven Components

Most runtime objects are built from Pydantic settings models derived from FromConfigMixinSettings.

Each config carries a module_path pointing to the concrete runtime class. The class is resolved dynamically with retrievalbase.utils.load_class(...) and instantiated through FromConfigMixin.

This pattern is used across:

  • connectors,
  • preprocessors,
  • token counters,
  • embedders,
  • vector stores,
  • rerankers,
  • retrievers,
  • evaluators,
  • ingestion pipelines.

2. Typed Interfaces

The repository uses abstract base classes to define stable contracts for component categories. Concrete implementations extend those contracts and provide backend-specific behavior.
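A minimal sketch of what such a contract looks like; the method name and classes below are illustrative assumptions, not the repository's actual definitions:

```python
from abc import ABC, abstractmethod


class Reranker(ABC):
    """Abstract contract: callers depend only on this interface."""

    @abstractmethod
    def rerank(self, query: str, candidates: list[str]) -> list[str]: ...


class IdentityReranker(Reranker):
    """A backend-specific implementation extends the contract."""

    def rerank(self, query: str, candidates: list[str]) -> list[str]:
        # Trivial behavior: keep the candidate order unchanged.
        return list(candidates)
```

Because the abstract class cannot be instantiated, a missing method in a new backend fails loudly at construction time rather than deep inside a pipeline.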

3. Polars As The Dataset Backbone

Datasets are represented with Polars DataFrame or LazyFrame values wrapped in repository dataset abstractions.

4. Text Dataset Contract

Text datasets are expected to contain:

  • page_content
  • metadata

Many higher-level components assume that schema.
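A minimal schema guard makes the contract concrete. This is a sketch; the real validation lives inside the dataset classes:

```python
# Columns required by the text dataset contract.
REQUIRED_COLUMNS = {"page_content", "metadata"}


def validate_text_schema(columns: set[str]) -> None:
    """Fail early with an actionable message when required columns are absent."""
    missing = REQUIRED_COLUMNS - columns
    if missing:
        raise ValueError(f"text dataset is missing columns: {sorted(missing)}")
```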

Repository Layout

.
├── AGENTS.md
├── Makefile
├── README.md
├── pyproject.toml
├── src/
│   └── retrievalbase/
│       ├── connector/
│       ├── dataset/
│       │   └── preprocess/
│       ├── evaluation/
│       │   ├── evaluators/
│       │   │   └── python/
│       │   ├── retrievers/
│       │   │   └── dense/
│       │   ├── async_batcher.py
│       │   ├── embedders.py
│       │   ├── processors.py
│       │   ├── rerankers.py
│       │   ├── settings.py
│       │   └── vector_stores.py
│       ├── ingestion/
│       ├── constants.py
│       ├── enums.py
│       ├── exceptions.py
│       ├── mixins.py
│       ├── settings.py
│       ├── types.py
│       └── utils.py
└── tests/
    ├── conftest.py
    ├── fixtures/
    │   ├── components.py
    │   └── data.py
    ├── integration/
    │   ├── test_dataset/
    │   └── test_evaluation/
    └── unit/
        ├── test_config/
        ├── test_connector/
        ├── test_dataset/
        ├── test_evaluation/
        ├── test_ingestion/
        └── test_utils/

High-level responsibility split:

  • connector/: load datasets from and persist them to external systems such as Parquet files and MinIO object storage.
  • dataset/: base dataset abstractions, Polars adapters, Hugging Face adapter, preprocessing, token counting.
  • evaluation/: embedders, processors, async batching, vector stores, rerankers, retrievers, Python evaluators.
  • ingestion/: ingestion pipelines that combine connectors and preprocessors.
  • tests/fixtures/: reusable test data builders, fake components, and component factories.
  • tests/conftest.py: global test setup shared across the suite.
  • tests/unit/test_*/: source-aligned unit test groups for isolated behavior and edge cases.
  • tests/integration/test_*/: multi-component integration tests grouped by module area.

Testing layout conventions:

  • mirror source areas with module-oriented test directories such as tests/unit/test_dataset and tests/integration/test_evaluation,
  • keep reusable component setup out of individual tests and build test components through shared factories in tests/fixtures,
  • add a local conftest.py only when a test group shares setup that should not be global,
  • prefer parametrized tests when the same behavior should be validated across multiple inputs or component variants.

Installation

Requirements

  • Python >=3.11,<3.13
  • uv, recommended for dependency management and command execution

Install Production Dependencies

make install

Install Developer Environment

make dev-install

This installs:

  • development dependencies,
  • optional extras,
  • pre-commit hooks.

Development Commands

The Makefile is the source of truth for local development tasks.

make format
make lint
make type-check
make security
make test
make test-cov
make ci
make ci-fast
make clean

Command meaning:

  • make format: run ruff format and ruff check --fix
  • make lint: run Ruff lint checks
  • make type-check: run ty check
  • make security: run Bandit
  • make test: run the test suite
  • make test-cov: run tests with coverage and enforce 80% minimum coverage
  • make ci: local CI equivalent
  • make ci-fast: faster loop without security gate

For narrow test runs during development, prefer targeting the relevant module directory, for example:

uv run pytest tests/unit/test_dataset
uv run pytest tests/integration/test_evaluation

Architecture Overview

Shared Infrastructure

Shared infrastructure lives in:

  • retrievalbase.mixins
  • retrievalbase.settings
  • retrievalbase.types
  • retrievalbase.utils

These modules provide:

  • config loading,
  • runtime factories,
  • reusable type variables,
  • dynamic module resolution,
  • shared schema helpers.

Connectors

Connectors are the storage boundary.

Base contract:

  • DatasetConnector

Current implementations:

  • ParquetDatasetConnector
  • MinioDatasetConnector

Connector rules:

  • _load() returns Polars data,
  • to(ds) persists a dataset,
  • connectors should not contain retrieval or preprocessing business logic.
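The shape of the contract can be sketched as follows. DatasetConnector, _load(), and to(ds) come from the rules above; the toy in-memory backend is invented purely to show the round trip:

```python
from abc import ABC, abstractmethod
from typing import Any


class DatasetConnector(ABC):
    """Storage boundary only: no retrieval or preprocessing logic."""

    @abstractmethod
    def _load(self) -> Any:
        """Return the stored data (Polars data in the real contract)."""

    @abstractmethod
    def to(self, ds: Any) -> None:
        """Persist a dataset."""


class InMemoryConnector(DatasetConnector):
    """Hypothetical backend used here only to demonstrate the round trip."""

    def __init__(self) -> None:
        self._store: Any = None

    def _load(self) -> Any:
        return self._store

    def to(self, ds: Any) -> None:
        self._store = ds
```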

Datasets

Base contracts:

  • Dataset
  • TextDataset

Concrete Polars implementations:

  • PolarsDataset
  • PolarsTextDataset

Dataset responsibilities:

  • expose Polars-backed operations,
  • validate required schema for text data,
  • provide convenience conversions and iteration helpers.

Preprocessing

Base contracts:

  • TextPreprocessor
  • TokenCounter

Current preprocessing components include token-based filters and preprocess pipelines.

Design rule:

  • preprocessors accept a TextDataset and return a TextDataset,
  • token counters stay focused on counting,
  • pipelines compose preprocessing steps instead of duplicating orchestration.
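The composition rule can be illustrated with plain functions; a list of document strings stands in for a TextDataset, and the helper names are hypothetical:

```python
from collections.abc import Callable

# Stand-in for a TextDataset: a list of document strings.
Dataset = list[str]


def token_filter(max_tokens: int) -> Callable[[Dataset], Dataset]:
    """Preprocessor: dataset in, dataset out."""
    def step(docs: Dataset) -> Dataset:
        # Whitespace tokenization keeps the sketch dependency-free.
        return [d for d in docs if len(d.split()) <= max_tokens]
    return step


def pipeline(*steps: Callable[[Dataset], Dataset]) -> Callable[[Dataset], Dataset]:
    """Compose preprocessing steps instead of duplicating orchestration."""
    def run(docs: Dataset) -> Dataset:
        for step in steps:
            docs = step(docs)
        return docs
    return run
```

Each step honors the dataset-in, dataset-out rule, so pipelines stay order-independent of backend details.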

Ingestion

Base runtime:

  • TextIngestionPipeline

Typical flow:

DatasetConnector -> TextDataset -> TextPreprocessor -> TextDataset

Evaluation Stack

Important contracts:

  • Processor
  • Embedder
  • VectorStore
  • Reranker
  • Retriever
  • Evaluator

Typical dense retrieval flow:

query -> Processor -> Embedder -> VectorStore -> Reranker -> results

Typical BM25 flow:

query -> Retriever over TextDataset -> optional Reranker -> results

Typical evaluation flow:

dataset + retriever -> evaluator -> scores
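A toy recall@k evaluator illustrates the dataset + retriever -> scores flow. Every name here is a hypothetical stand-in, not an evaluator from the codebase:

```python
from collections.abc import Callable


def recall_at_k(
    retrieve: Callable[[str], list[str]],
    queries: dict[str, set[str]],
    k: int = 5,
) -> float:
    """Fraction of queries whose top-k results contain a relevant doc id.

    `retrieve` maps a query to an ordered list of doc ids;
    `queries` maps each query to its set of relevant doc ids.
    """
    hits = sum(
        1 for q, relevant in queries.items() if relevant & set(retrieve(q)[:k])
    )
    return hits / len(queries)


# Toy retriever: a fixed ranking per query.
rankings = {"apples": ["d1", "d9"], "pears": ["d7", "d3"]}
score = recall_at_k(lambda q: rankings[q], {"apples": {"d1"}, "pears": {"d4"}}, k=1)
```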

Current evaluation coverage in the codebase includes:

  • async batching helpers,
  • BM25, dense, and hybrid retriever behavior,
  • reranker and vector store contracts,
  • Python evaluator runtime and score calculation paths.

How Components Are Composed

The system uses configuration to compose components instead of hard-coding most concrete classes.

Common pattern:

  1. Define a settings model.
  2. Include module_path.
  3. Validate config with Pydantic.
  4. Resolve the runtime class dynamically.
  5. Instantiate the runtime object from config.

This allows nested configuration.

For example:

  • a retriever config can include a reranker config,
  • an evaluator config can include a retriever config and a dataset connector config,
  • an ingestion pipeline can include both connector and preprocessor configs.

Minimal Example: Build A Text Dataset

from retrievalbase.dataset.polars import PolarsTextDataset

ds = PolarsTextDataset.from_records(
    [
        ("hello world", {"doc_id": "1"}),
        ("retrieval base", {"doc_id": "2"}),
    ]
)

print(ds.polars)

Minimal Example: Load Text Data From Parquet

from retrievalbase.dataset.polars import PolarsTextDataset

ds = PolarsTextDataset.from_parquet("data/corpus.parquet", lazy=True)
print(len(ds))

Minimal Example: Config-Driven Component Instantiation

from retrievalbase.utils import comp

component = comp("config/component.yaml", key="retriever")

The YAML entry must include a valid module_path.

Best Practices

Code Design

  • Prefer composition over deep inheritance.
  • Use inheritance only for stable contracts such as connectors, retrievers, rerankers, and evaluators.
  • Keep settings validation in settings models, not scattered through runtime logic.
  • Keep external I/O at the boundaries. Storage code belongs in connectors, not datasets or retrievers.
  • Keep public APIs typed and explicit.
  • Make failure modes clear and actionable.

Config Design

  • Always include module_path for dynamically loaded components.
  • Keep nested configs explicit instead of passing untyped dicts deep into the system.
  • Put environment-sensitive values such as secrets in settings-compatible sources rather than hard-coding them.
  • Reuse existing settings hierarchies before introducing parallel config models.

Dataset Design

  • Preserve the text dataset contract: page_content and metadata.
  • Validate schema as early as possible.
  • Prefer Polars-native transformations over row-by-row Python loops when possible.
  • Use lazy execution when loading large parquet corpora unless the operation requires eager materialization.

Retrieval And Evaluation

  • Keep embedding, vector search, reranking, and scoring as separate concerns.
  • Preserve batch ordering in async batch APIs.
  • Close async resources when implementations own clients or sockets.
  • Add tests for limit semantics, ordering guarantees, and empty input behavior.
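The ordering guarantee can be demonstrated with a small asyncio sketch. The embedder here is fake and exists only to show that results come back in submission order even when completion order differs:

```python
import asyncio


async def _embed_one(i: int, text: str) -> str:
    # Simulate out-of-order completion: earlier items take longer.
    await asyncio.sleep(0.003 * (8 - min(i, 7)))
    return text.upper()


async def embed_batch(texts: list[str]) -> list[str]:
    # asyncio.gather returns results in submission order, not completion
    # order, which is exactly the "preserve batch ordering" guarantee.
    return list(await asyncio.gather(*(
        _embed_one(i, t) for i, t in enumerate(texts)
    )))
```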

Testing

  • Put fast isolated logic under tests/unit.
  • Put multi-component behavior under tests/integration.
  • Test contracts, not just implementation details.
  • Add regression tests when fixing a bug.
  • Use fixtures and fakes to isolate external systems.

Dependency Hygiene

  • Avoid circular dependencies between feature modules.
  • Keep abstract interfaces backend-agnostic.
  • Add optional backend imports lazily and raise helpful installation errors.
  • Do not bypass the config-driven architecture with hard-coded concrete imports in orchestration layers unless there is a narrow local reason.

Recommended Workflow For Contributors

  1. Install the dev environment with make dev-install.
  2. Read AGENTS.md before making structural changes.
  3. Make focused changes in the relevant package slice.
  4. Add or update tests near the changed behavior.
  5. Run make ci before considering the change done.

Quality Bar

Changes should be considered complete only when they:

  • follow the typed component architecture,
  • preserve clean dependency direction,
  • include tests for changed behavior,
  • pass local CI expectations,
  • remain understandable without hidden assumptions.

Current Toolchain

Configured in pyproject.toml and Makefile:

  • Ruff for formatting and linting
  • Ty for static type checking
  • Pytest for tests
  • Pytest coverage with 80% minimum threshold
  • Bandit for security scanning
  • Hatchling for packaging
  • UV for environment and command management

Notes

  • The default YAML config path in shared settings is /config/config.yaml.
  • Some optional components require extra dependencies such as transformers or torch.
  • When adding new backends, keep those dependencies optional and fail lazily with actionable guidance.
