ALIRA
Active Learning Iterative Retrieval Agent
ALIRA classifies a large text corpus according to a natural-language query when exhaustive LLM evaluation is impractical. It iteratively discovers relevant documents using active learning, LLM validation, and classifier refinement.
Overview
Given a text corpus and a natural-language query, ALIRA bootstraps a binary classifier using LLM-generated synthetic examples (HyDE), then iteratively:
- Evaluates candidate documents via an LLM (up to 5 concurrent calls)
- Trains a Logistic Regression classifier on the accumulated labels
- Predicts relevance scores for the full corpus
- Selects the next batch of candidates via stratified sampling across confidence zones
- Stops early when the positive-zone prediction drift (RMSE) falls below a threshold
The result is a ranked list of documents predicted to match the query, requiring far fewer LLM calls than exhaustive evaluation.
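The loop above can be sketched on toy data. This is a minimal illustration, not ALIRA's implementation: the LLM is replaced by a hypothetical `oracle` function that returns hidden labels, the embeddings are random vectors, and the zone boundaries (0.3 and 0.7), the 0.5 positive-zone cutoff, and the helper `select_stratified` are assumptions made for the example.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

# Toy stand-in for document embeddings: relevant documents are shifted by +2.
n_docs, dim = 2000, 8
truth = rng.random(n_docs) < 0.1                      # hidden relevance labels
X = rng.normal(size=(n_docs, dim)) + 2.0 * truth[:, None]

def oracle(indices):
    """Hypothetical stand-in for the LLM evaluation step."""
    return truth[indices]

def select_stratified(scores, labelled, n_per_zone, rng):
    """Sample unlabelled documents from three confidence zones (assumed bounds)."""
    zones = [(0.0, 0.3), (0.3, 0.7), (0.7, 1.01)]
    picked = []
    for lo, hi in zones:
        pool = np.flatnonzero((scores >= lo) & (scores < hi) & ~labelled)
        k = min(n_per_zone, pool.size)
        if k > 0:
            picked.extend(rng.choice(pool, size=k, replace=False))
    return np.array(picked, dtype=int)

# Stand-in for embedded synthetic (HyDE) texts, all treated as positives.
synthetic = rng.normal(size=(10, dim)) + 2.0

labelled = np.zeros(n_docs, dtype=bool)
labels = np.zeros(n_docs, dtype=bool)
batch = rng.choice(n_docs, size=30, replace=False)    # assumed random first batch
prev_scores = None
min_iterations, max_iterations, threshold = 3, 20, 0.01

for iteration in range(1, max_iterations + 1):
    # 1. "Evaluate" the candidates and accumulate their labels.
    labels[batch] = oracle(batch)
    labelled[batch] = True

    # 2. Train on synthetic positives plus everything labelled so far.
    X_train = np.vstack([synthetic, X[labelled]])
    y_train = np.concatenate([np.ones(len(synthetic), dtype=bool), labels[labelled]])
    clf = LogisticRegression(C=1.0, max_iter=1000).fit(X_train, y_train)

    # 3. Score the full corpus.
    scores = clf.predict_proba(X)[:, 1]

    # 4. Early stopping: RMSE between consecutive predictions in the positive zone.
    if prev_scores is not None and iteration >= min_iterations:
        zone = scores >= 0.5
        drift = (np.sqrt(np.mean((scores[zone] - prev_scores[zone]) ** 2))
                 if zone.any() else float("inf"))
        if drift < threshold:
            break
    prev_scores = scores

    # 5. Pick the next batch across confidence zones.
    batch = select_stratified(scores, labelled, n_per_zone=10, rng=rng)
    if batch.size == 0:
        break

print(f"stopped after {iteration} iterations, {labelled.sum()} documents labelled")
```

Only the labelled subset ever reaches the oracle, which is where the savings over exhaustive evaluation come from.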
Quick Start
Requirements
- Python >= 3.11
- LLM provider supporting the OpenAI API format (for both chat and embedding endpoints)
Installation
pip install -e .
Configuration
Set the following environment variables (or use a .env file loaded by your entrypoint script):
| Variable | Description |
|---|---|
| ALIRA_LLM_BASE_URL | Base URL of the LLM API |
| ALIRA_LLM_API_KEY | API key for authentication |
| ALIRA_LLM_EMBEDDING_MODEL | Model name for embedding requests |
| ALIRA_LLM_BASE_MODEL | Model name for chat/evaluation requests |
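Before starting a long run, it can help to check that all four variables are set. The snippet below is a small sketch of such a check; `LLMSettings` and `load_settings` are hypothetical names for this example and are not part of the alira package, whose own config.py may behave differently.

```python
import os
from dataclasses import dataclass

REQUIRED = (
    "ALIRA_LLM_BASE_URL",
    "ALIRA_LLM_API_KEY",
    "ALIRA_LLM_EMBEDDING_MODEL",
    "ALIRA_LLM_BASE_MODEL",
)

@dataclass(frozen=True)
class LLMSettings:
    """Hypothetical container for the four documented variables."""
    base_url: str
    api_key: str
    embedding_model: str
    base_model: str

def load_settings(env=None) -> LLMSettings:
    """Read the variables from `env` (defaults to os.environ) and fail fast."""
    env = os.environ if env is None else env
    missing = [name for name in REQUIRED if not env.get(name)]
    if missing:
        raise RuntimeError("Missing environment variables: " + ", ".join(missing))
    return LLMSettings(
        base_url=env["ALIRA_LLM_BASE_URL"],
        api_key=env["ALIRA_LLM_API_KEY"],
        embedding_model=env["ALIRA_LLM_EMBEDDING_MODEL"],
        base_model=env["ALIRA_LLM_BASE_MODEL"],
    )
```

Calling `load_settings()` at the top of an entrypoint script surfaces a missing variable immediately instead of midway through the first LLM call.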
Example
import pandas as pd
from alira import ActiveLearner
# Load corpus
df = pd.read_csv("data/movies.csv")
# Fit active learner
learner = ActiveLearner(corpus=df["text"])
learner.fit(query="sports")
# Get ranked results
df["score"] = learner.predict_proba()
results = df[df["score"] >= 0.5].sort_values("score", ascending=False)
See examples/demo.py for a complete runnable script with logging and result persistence.
API
ActiveLearner
Main entrypoint exported by the alira package.
ActiveLearner(
    corpus: list[str] | pd.Series | np.ndarray,
    embeddings: np.ndarray | pd.Series | None = None,
    n_synthetic: int = 10,
    min_iterations: int = 3,
    max_iterations: int = 20,
    n_eval_per_iteration: int = 30,
    c_value: float = 1.0,
    positive_zone_rmse_threshold: float = 0.01,
    cluster_candidates: bool = False,
    generation_prompt: str | None = None,
    evaluation_prompt: str | None = None,
)
| Parameter | Description |
|---|---|
| corpus | Collection of texts to search |
| embeddings | Optional pre-computed embeddings aligned 1-to-1 with corpus |
| n_synthetic | Number of synthetic texts to generate for bootstrapping (HyDE) |
| min_iterations | Minimum iterations before early stopping is evaluated |
| max_iterations | Maximum active learning iterations |
| n_eval_per_iteration | Number of texts evaluated per iteration |
| c_value | Inverse regularization strength for LogisticRegression |
| positive_zone_rmse_threshold | Early-stopping threshold for prediction drift in the positive zone |
| cluster_candidates | Whether to cluster candidates within each stratum for diversity |
| generation_prompt | Custom prompt for synthetic text generation |
| evaluation_prompt | Custom prompt for LLM evaluation |
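To give an intuition for what cluster_candidates might do, the sketch below picks diverse candidates from one stratum by clustering their embeddings and keeping the document nearest each centroid. The use of k-means and the helper name `pick_diverse` are assumptions for illustration; the README does not say which clustering method ALIRA uses.

```python
import numpy as np
from sklearn.cluster import KMeans

def pick_diverse(embeddings, candidate_idx, k, seed=0):
    """Return up to k candidate indices, one per k-means cluster (hypothetical helper)."""
    candidate_idx = np.asarray(candidate_idx)
    if candidate_idx.size <= k:
        return candidate_idx
    vectors = embeddings[candidate_idx]
    km = KMeans(n_clusters=k, n_init=10, random_state=seed).fit(vectors)
    chosen = []
    for cluster in range(k):
        members = np.flatnonzero(km.labels_ == cluster)
        if members.size == 0:
            continue
        # Keep the member closest to the cluster centre as its representative.
        dists = np.linalg.norm(vectors[members] - km.cluster_centers_[cluster], axis=1)
        chosen.append(candidate_idx[members[np.argmin(dists)]])
    return np.array(chosen, dtype=int)

rng = np.random.default_rng(0)
embeddings = rng.normal(size=(100, 4))
stratum = np.arange(20, 60)          # indices of candidates in one confidence zone
picked = pick_diverse(embeddings, stratum, k=5)
```

Compared with uniform random sampling inside a stratum, this avoids spending several LLM calls on near-duplicate documents.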
Methods
- fit(query: str) -> Self: Run the active-learning loop and train the classifier.
- predict_proba(corpus=None, embeddings=None) -> pd.Series: Return predicted probabilities of relevance.
- predict(corpus=None, embeddings=None) -> pd.Series: Return binary predictions.
Project Structure
.
├── src/
│   └── alira/
│       ├── __init__.py          # Package entrypoint, exports ActiveLearner
│       ├── active_learner.py    # Core ActiveLearner implementation
│       ├── classifiers.py       # LogisticRegressionClassifier
│       ├── evaluation.py        # LLM-based binary evaluation (async, max 5 concurrent)
│       ├── llms.py              # OpenAI API client for chat and embeddings
│       ├── synthetic.py         # Synthetic text generation via LLM (HyDE)
│       └── config.py            # Environment-based configuration
├── examples/
│   ├── demo.py                  # Example script with logging and CSV output
│   ├── lab_explorer.py          # Example using external data source
│   ├── compare.py               # Utility to compare result sets
│   ├── aists.py                 # Batch runner for AISTS themes
│   ├── embeddings.py            # Generate and cache embeddings
│   └── utils.py                 # Shared example utilities
├── pyproject.toml
└── README.md
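The tree notes that evaluation.py runs LLM evaluations asynchronously with at most 5 concurrent calls. One common way to enforce such a cap is an asyncio.Semaphore, sketched below. This is an illustration of the pattern only: `fake_llm_judge` is a hypothetical stand-in for a real chat request, and nothing here is taken from the actual module.

```python
import asyncio

MAX_CONCURRENT = 5  # matches the documented limit

async def fake_llm_judge(text: str, query: str) -> bool:
    """Hypothetical stand-in for an LLM call: a keyword match after a short delay."""
    await asyncio.sleep(0.01)
    return query.lower() in text.lower()

async def evaluate_all(texts, query, limit=MAX_CONCURRENT):
    """Judge every text, never running more than `limit` calls at once."""
    semaphore = asyncio.Semaphore(limit)

    async def bounded(text):
        async with semaphore:
            return await fake_llm_judge(text, query)

    # gather preserves input order, so results align with `texts`.
    return await asyncio.gather(*(bounded(t) for t in texts))

texts = ["A film about sports rivalries", "A quiet cooking documentary"]
verdicts = asyncio.run(evaluate_all(texts, "sports"))
```

The semaphore bounds in-flight requests without serializing them, which keeps throughput up while staying under provider rate limits.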
Dependencies
Core dependencies (see pyproject.toml):
- numpy
- openai
- pandas
- pydantic
- scikit-learn
Optional dependencies for the example scripts (install with pip install -e ".[demo]"):
- python-dotenv
- epfl-data-index