retrievalbase
retrievalbase is a typed Python toolkit for building retrieval and evaluation workflows around
structured text datasets.
It provides:
- dataset connectors for loading and saving text corpora,
- Polars-based dataset abstractions,
- configurable preprocessing pipelines,
- retrieval components such as BM25, dense retrieval, reranking, and vector stores,
- evaluation components for scoring retrieval quality,
- a config-driven runtime model based on Pydantic settings and dynamic component loading.
The project is designed around explicit component contracts rather than a single monolithic pipeline.
Why This Project Exists
Retrieval systems usually degrade when data loading, preprocessing, indexing, retrieval, reranking, and evaluation are tightly coupled.
This repository separates those concerns into components with clear interfaces:
- connectors handle storage and transport,
- datasets handle schema-aware tabular text data,
- preprocessors transform text datasets,
- retrievers execute candidate selection,
- rerankers refine candidate ordering,
- evaluators measure retrieval quality.
That separation makes it easier to:
- swap backends without rewriting orchestration,
- test behavior in isolation,
- drive runtime composition from config,
- keep experimentation reproducible.
Core Ideas
1. Config-Driven Components
Most runtime objects are built from Pydantic settings models derived from
FromConfigMixinSettings.
Each config carries a module_path pointing to the concrete runtime class.
The class is resolved dynamically with retrievalbase.utils.load_class(...) and instantiated
through FromConfigMixin.
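As an illustration only, the resolution step might look roughly like the sketch below. It is a simplified stand-in for retrievalbase.utils.load_class and FromConfigMixin, not their actual implementation, and the config entry and module path shown are hypothetical.
import importlib

def load_class(module_path: str) -> type:
    # Split "package.module.ClassName" into module and class name,
    # import the module, and return the class object.
    module_name, _, class_name = module_path.rpartition(".")
    module = importlib.import_module(module_name)
    return getattr(module, class_name)

# Hypothetical config entry; the real settings models validate this with Pydantic.
config = {"module_path": "retrievalbase.connector.ParquetDatasetConnector"}
runtime_cls = load_class(config["module_path"])
# FromConfigMixin would then instantiate runtime_cls from the validated settings.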
This pattern is used across:
- connectors,
- preprocessors,
- token counters,
- embedders,
- vector stores,
- rerankers,
- retrievers,
- evaluators,
- ingestion pipelines.
2. Typed Interfaces
The repository uses abstract base classes to define stable contracts for component categories. Concrete implementations extend those contracts and provide backend-specific behavior.
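A generic sketch of that style is shown below; the method name and signature are chosen for illustration only and are not the library's actual contract.
from abc import ABC, abstractmethod

class Retriever(ABC):
    # Stable contract; concrete backends (BM25, dense, hybrid) extend it.
    @abstractmethod
    def retrieve(self, query: str, k: int = 10) -> list[dict]:
        """Return up to k candidate documents for the query."""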
3. Polars As The Dataset Backbone
Datasets are represented with Polars DataFrame or LazyFrame values wrapped in repository
dataset abstractions.
4. Text Dataset Contract
Text datasets are expected to contain:
- a page_content column,
- a metadata column.
Many higher-level components assume that schema.
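For example, a minimal check of that contract over a Polars frame could look like this; the helper name is hypothetical.
import polars as pl

def validate_text_schema(frame: pl.DataFrame) -> None:
    # The text dataset contract: a page_content column and a metadata column.
    missing = {"page_content", "metadata"} - set(frame.columns)
    if missing:
        raise ValueError(f"missing required columns: {sorted(missing)}")

frame = pl.DataFrame({
    "page_content": ["hello world", "retrieval base"],
    "metadata": [{"doc_id": "1"}, {"doc_id": "2"}],
})
validate_text_schema(frame)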
Repository Layout
.
├── AGENTS.md
├── Makefile
├── README.md
├── pyproject.toml
├── src/
│   └── retrievalbase/
│       ├── connector/
│       ├── dataset/
│       │   └── preprocess/
│       ├── evaluation/
│       │   ├── evaluators/
│       │   │   └── python/
│       │   ├── retrievers/
│       │   │   └── dense/
│       │   ├── async_batcher.py
│       │   ├── embedders.py
│       │   ├── processors.py
│       │   ├── rerankers.py
│       │   ├── settings.py
│       │   └── vector_stores.py
│       ├── ingestion/
│       ├── constants.py
│       ├── enums.py
│       ├── exceptions.py
│       ├── mixins.py
│       ├── settings.py
│       ├── types.py
│       └── utils.py
└── tests/
    ├── conftest.py
    ├── fixtures/
    │   ├── components.py
    │   └── data.py
    ├── integration/
    │   ├── test_dataset/
    │   └── test_evaluation/
    └── unit/
        ├── test_config/
        ├── test_connector/
        ├── test_dataset/
        ├── test_evaluation/
        ├── test_ingestion/
        └── test_utils/
High-level responsibility split:
- connector/: load and persist datasets from external systems such as parquet and MinIO.
- dataset/: base dataset abstractions, Polars adapters, Hugging Face adapter, preprocessing, token counting.
- evaluation/: embedders, processors, async batching, vector stores, rerankers, retrievers, Python evaluators.
- ingestion/: ingestion pipelines that combine connectors and preprocessors.
- tests/fixtures/: reusable test data builders, fake components, and component factories.
- tests/conftest.py: global test setup shared across the suite.
- tests/unit/test_*/: source-aligned unit test groups for isolated behavior and edge cases.
- tests/integration/test_*/: multi-component integration tests grouped by module area.
Testing layout conventions:
- mirror source areas with module-oriented test directories such as tests/unit/test_dataset and tests/integration/test_evaluation,
- keep reusable component setup out of individual tests and build test components through shared factories in tests/fixtures,
- add a local conftest.py only when a test group shares setup that should not be global,
- prefer parametrized tests when the same behavior should be validated across multiple inputs or component variants.
Installation
Requirements
- Python >=3.11,<3.13
- uv recommended for dependency management and command execution
Install Production Dependencies
make install
Install Developer Environment
make dev-install
This installs:
- development dependencies,
- optional extras,
- pre-commit hooks.
Development Commands
The Makefile is the source of truth for local development tasks.
make format
make lint
make type-check
make security
make test
make test-cov
make ci
make ci-fast
make clean
Command meaning:
- make format: run ruff format and ruff check --fix
- make lint: run Ruff lint checks
- make type-check: run ty check
- make security: run Bandit
- make test: run the test suite
- make test-cov: run tests with coverage and enforce 80% minimum coverage
- make ci: local CI equivalent
- make ci-fast: faster loop without the security gate
For narrow test runs during development, prefer targeting the relevant module directory, for example:
uv run pytest tests/unit/test_dataset
uv run pytest tests/integration/test_evaluation
Architecture Overview
Shared Infrastructure
Shared infrastructure lives in:
- retrievalbase.mixins,
- retrievalbase.settings,
- retrievalbase.types,
- retrievalbase.utils.
These modules provide:
- config loading,
- runtime factories,
- reusable type variables,
- dynamic module resolution,
- shared schema helpers.
Connectors
Connectors are the storage boundary.
Base contract:
DatasetConnector
Current implementations:
- ParquetDatasetConnector,
- MinioDatasetConnector.
Connector rules:
- _load() returns Polars data,
- to(ds) persists a dataset,
- connectors should not contain retrieval or preprocessing business logic.
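Purely as an illustration of those rules, a toy connector might look like the sketch below; it does not use the real DatasetConnector base class or its exact signatures.
import polars as pl

class InMemoryDatasetConnector:
    # Toy stand-in for a DatasetConnector: storage boundary only,
    # with no retrieval or preprocessing logic.
    def __init__(self, frame: pl.DataFrame):
        self._frame = frame

    def _load(self) -> pl.DataFrame:
        # _load() returns Polars data.
        return self._frame

    def to(self, ds) -> None:
        # to(ds) persists a dataset; here "persistence" is just an attribute.
        self._frame = ds.polars if hasattr(ds, "polars") else ds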
Datasets
Base contracts:
- Dataset,
- TextDataset.
Concrete Polars implementations:
- PolarsDataset,
- PolarsTextDataset.
Dataset responsibilities:
- expose Polars-backed operations,
- validate required schema for text data,
- provide convenience conversions and iteration helpers.
Preprocessing
Base contracts:
- TextPreprocessor,
- TokenCounter.
Current preprocessing components include token-based filters and preprocess pipelines.
Design rules:
- preprocessors accept a TextDataset and return a TextDataset,
- token counters stay focused on counting,
- pipelines compose preprocessing steps instead of duplicating orchestration.
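As a rough illustration of a length-style filter over the text dataset schema, here is a standalone function rather than the actual TextPreprocessor interface; the function name is hypothetical.
import polars as pl

def filter_short_documents(frame: pl.DataFrame, min_chars: int = 10) -> pl.DataFrame:
    # Keep only rows whose page_content has at least min_chars characters.
    # A real TextPreprocessor would wrap logic like this behind the base contract.
    return frame.filter(pl.col("page_content").str.len_chars() >= min_chars)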
Ingestion
Base runtime:
TextIngestionPipeline
Typical flow:
DatasetConnector -> TextDataset -> TextPreprocessor -> TextDataset
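A minimal sketch of that shape, with hypothetical callables standing in for the connector and preprocessors:
def run_ingestion(load_dataset, preprocessors):
    # load_dataset plays the connector role and returns a TextDataset-like object;
    # each preprocessor takes a dataset and returns a dataset.
    dataset = load_dataset()
    for step in preprocessors:
        dataset = step(dataset)
    return dataset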
Evaluation Stack
Important contracts:
- Processor,
- Embedder,
- VectorStore,
- Reranker,
- Retriever,
- Evaluator.
Typical dense retrieval flow:
query -> Processor -> Embedder -> VectorStore -> Reranker -> results
Typical BM25 flow:
query -> Retriever over TextDataset -> optional Reranker -> results
Typical evaluation flow:
dataset + retriever -> evaluator -> scores
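The dense flow above, sketched with hypothetical callables; the real component interfaces may differ.
def dense_retrieve(query, processor, embedder, vector_store, reranker, k=10):
    processed = processor(query)                        # Processor
    query_vector = embedder(processed)                  # Embedder
    candidates = vector_store.search(query_vector, k)   # VectorStore: top-k candidates
    return reranker(query, candidates)                  # Reranker: refined ordering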
Current evaluation coverage in the codebase includes:
- async batching helpers,
- BM25, dense, and hybrid retriever behavior,
- reranker and vector store contracts,
- Python evaluator runtime and score calculation paths.
How Components Are Composed
The system uses configuration to compose components instead of hard-coding most concrete classes.
Common pattern:
- Define a settings model.
- Include module_path.
- Validate config with Pydantic.
- Resolve the runtime class dynamically.
- Instantiate the runtime object from config.
This allows nested configuration.
For example:
- a retriever config can include a reranker config,
- an evaluator config can include a retriever config and a dataset connector config,
- an ingestion pipeline can include both connector and preprocessor configs.
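For illustration, the evaluator case above might be expressed as a nested config with a shape like the following Python dict; the keys and module paths are hypothetical, not the library's actual schema.
evaluator_config = {
    "module_path": "retrievalbase.evaluation.evaluators.python.SomeEvaluator",
    "retriever": {
        "module_path": "retrievalbase.evaluation.retrievers.dense.SomeDenseRetriever",
        "reranker": {"module_path": "retrievalbase.evaluation.rerankers.SomeReranker"},
    },
    "dataset_connector": {
        "module_path": "retrievalbase.connector.ParquetDatasetConnector",
    },
}
# Each nested entry carries its own module_path, so every component can be
# resolved and instantiated independently through the same pattern.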
Minimal Example: Build A Text Dataset
from retrievalbase.dataset.polars import PolarsTextDataset
ds = PolarsTextDataset.from_records(
    [
        ("hello world", {"doc_id": "1"}),
        ("retrieval base", {"doc_id": "2"}),
    ]
)
print(ds.polars)
Minimal Example: Load Text Data From Parquet
from retrievalbase.dataset.polars import PolarsTextDataset
ds = PolarsTextDataset.from_parquet("data/corpus.parquet", lazy=True)
print(len(ds))
Minimal Example: Config-Driven Component Instantiation
from retrievalbase.utils import comp
component = comp("config/component.yaml", key="retriever")
The YAML entry must include a valid module_path.
Best Practices
Code Design
- Prefer composition over deep inheritance.
- Use inheritance only for stable contracts such as connectors, retrievers, rerankers, and evaluators.
- Keep settings validation in settings models, not scattered through runtime logic.
- Keep external I/O at the boundaries. Storage code belongs in connectors, not datasets or retrievers.
- Keep public APIs typed and explicit.
- Make failure modes clear and actionable.
Config Design
- Always include module_path for dynamically loaded components.
- Keep nested configs explicit instead of passing untyped dicts deep into the system.
- Put environment-sensitive values such as secrets in settings-compatible sources rather than hard-coding them.
- Reuse existing settings hierarchies before introducing parallel config models.
Dataset Design
- Preserve the text dataset contract: page_content and metadata.
- Validate schema as early as possible.
- Prefer Polars-native transformations over row-by-row Python loops when possible.
- Use lazy execution when loading large parquet corpora unless the operation requires eager materialization.
Retrieval And Evaluation
- Keep embedding, vector search, reranking, and scoring as separate concerns.
- Preserve batch ordering in async batch APIs (see the sketch after this list).
- Close async resources when implementations own clients or sockets.
- Add tests for limit semantics, ordering guarantees, and empty input behavior.
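A minimal illustration of the batch-ordering rule, assuming a hypothetical embed_one coroutine that embeds a single text:
import asyncio

async def embed_in_order(texts, embed_one):
    # embed_one is a hypothetical coroutine that embeds a single text.
    # asyncio.gather returns results in the order the awaitables were passed,
    # regardless of which one finishes first, so batch ordering is preserved.
    tasks = [asyncio.create_task(embed_one(text)) for text in texts]
    return await asyncio.gather(*tasks)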
Testing
- Put fast isolated logic under tests/unit.
- Put multi-component behavior under tests/integration.
- Test contracts, not just implementation details.
- Add regression tests when fixing a bug.
- Use fixtures and fakes to isolate external systems.
Dependency Hygiene
- Avoid circular dependencies between feature modules.
- Keep abstract interfaces backend-agnostic.
- Add optional backend imports lazily and raise helpful installation errors.
- Do not bypass the config-driven architecture with hard-coded concrete imports in orchestration layers unless there is a narrow local reason.
Recommended Workflow For Contributors
- Install the dev environment with make dev-install.
- Read AGENTS.md before making structural changes.
- Make focused changes in the relevant package slice.
- Add or update tests near the changed behavior.
- Run make ci before considering the change done.
Quality Bar
Changes should be considered complete only when they:
- follow the typed component architecture,
- preserve clean dependency direction,
- include tests for changed behavior,
- pass local CI expectations,
- remain understandable without hidden assumptions.
Current Toolchain
Configured in pyproject.toml and Makefile:
- Ruff for formatting and linting
- Ty for static type checking
- Pytest for tests
- Pytest coverage with 80% minimum threshold
- Bandit for security scanning
- Hatchling for packaging
- UV for environment and command management
Notes
- The default YAML config path in shared settings is /config/config.yaml.
- Some optional components require extra dependencies such as transformers or torch.
- When adding new backends, keep those dependencies optional and fail lazily with actionable guidance.