ALIRA

Python 3.11+ License: MIT

Active Learning Iterative Retrieval Agent

ALIRA classifies a large text corpus according to a natural-language query when exhaustive LLM evaluation is impractical. It iteratively discovers relevant documents using active learning, LLM validation, and classifier refinement.

Overview

Given a text corpus and a natural-language query, ALIRA bootstraps a binary classifier from LLM-generated synthetic examples (HyDE, Hypothetical Document Embeddings), then repeats the following loop:

  1. Evaluates candidate documents via an LLM (up to 5 concurrent calls)
  2. Trains a Logistic Regression classifier on the accumulated labels
  3. Predicts relevance scores for the full corpus
  4. Selects the next batch of candidates via stratified sampling across confidence zones
  5. Stops early when the positive-zone prediction drift (RMSE) falls below a threshold
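
Steps 4 and 5 can be illustrated without an LLM or a classifier. The following is a minimal sketch, not ALIRA's actual implementation: the zone boundaries (0.3 and 0.7), the definition of the positive zone as documents scored at least 0.5 in the previous iteration, and the helper names select_candidates and positive_zone_rmse are all assumptions made for illustration.

```python
import math
import random

# Illustrative confidence zones; ALIRA's real boundaries are not documented here.
DEFAULT_ZONES = ((0.0, 0.3), (0.3, 0.7), (0.7, 1.0))


def select_candidates(scores, labeled, n_eval, zones=DEFAULT_ZONES, seed=0):
    """Pick unlabeled document indices, spread evenly across score zones.

    Each zone contributes n_eval // len(zones) indices (fewer if the zone is
    small), so any remainder of the budget is simply left unused in this sketch.
    """
    rng = random.Random(seed)
    per_zone = n_eval // len(zones)
    chosen = []
    for low, high in zones:
        # The last zone includes its upper bound so a score of 1.0 is not lost.
        pool = [
            i for i, s in enumerate(scores)
            if i not in labeled and (low <= s < high or (high == 1.0 and s == 1.0))
        ]
        chosen.extend(rng.sample(pool, min(per_zone, len(pool))))
    return chosen


def positive_zone_rmse(previous, current, threshold=0.5):
    """RMSE between two score vectors, restricted to documents that were
    scored at or above the threshold in the previous iteration."""
    diffs = [c - p for p, c in zip(previous, current) if p >= threshold]
    if not diffs:
        return float("inf")  # no positive zone yet, so never stop on this signal
    return math.sqrt(sum(d * d for d in diffs) / len(diffs))


# Scores from two consecutive iterations (made-up numbers)
previous = [0.10, 0.55, 0.80, 0.95, 0.20, 0.65]
current = [0.30, 0.56, 0.80, 0.94, 0.05, 0.65]

batch = select_candidates(current, labeled={2}, n_eval=3)
drift = positive_zone_rmse(previous, current)
should_stop = drift < 0.01  # 0.01 mirrors the documented default threshold
```

In this toy run the low-confidence scores move a lot (0.10 to 0.30, 0.20 to 0.05), but the positive zone barely changes, so the drift stays below the threshold and the loop would stop.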

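The result is a ranked list of documents predicted to match the query. Because only the selected candidates are sent to the LLM, the number of evaluated documents is bounded by roughly max_iterations × n_eval_per_iteration (20 × 30 = 600 with the default settings) rather than by the size of the corpus.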
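
Step 1 above caps LLM evaluation at 5 concurrent calls. One common way to enforce such a cap in Python is an asyncio.Semaphore. The sketch below assumes that approach; it is not taken from ALIRA's source, and fake_llm_is_relevant is a hypothetical offline stand-in for a real LLM request.

```python
import asyncio

MAX_CONCURRENT = 5  # matches the limit stated in step 1


async def fake_llm_is_relevant(text: str, query: str) -> bool:
    """Hypothetical stand-in for an LLM call; a keyword check keeps the sketch offline."""
    await asyncio.sleep(0.01)  # simulate network latency
    return query.lower() in text.lower()


async def evaluate_all(texts: list[str], query: str) -> list[bool]:
    """Evaluate every text while never running more than MAX_CONCURRENT calls at once."""
    semaphore = asyncio.Semaphore(MAX_CONCURRENT)

    async def evaluate_one(text: str) -> bool:
        async with semaphore:  # waits here while 5 calls are already in flight
            return await fake_llm_is_relevant(text, query)

    # gather preserves input order, so labels[i] belongs to texts[i]
    return await asyncio.gather(*(evaluate_one(t) for t in texts))


texts = ["A sports drama about boxing", "A romantic comedy", "Sports documentary"]
labels = asyncio.run(evaluate_all(texts, "sports"))
```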

Quick Start

Requirements

  • Python >= 3.11
  • LLM provider supporting the OpenAI API format (for both chat and embedding endpoints)

Installation

pip install -e .

Configuration

Set the following environment variables (or use a .env file loaded by your entrypoint script):

Variable                    Description
ALIRA_LLM_BASE_URL          Base URL of the LLM API
ALIRA_LLM_API_KEY           API key for authentication
ALIRA_LLM_EMBEDDING_MODEL   Model name for embedding requests
ALIRA_LLM_BASE_MODEL        Model name for chat/evaluation requests
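
An entrypoint script may want to check these variables before starting a long run. The following is a minimal sketch of such a check, not the package's own config.py: the LLMSettings class, the load_settings helper, and every value in example_env are hypothetical placeholders.

```python
import os
from dataclasses import dataclass

REQUIRED_VARS = (
    "ALIRA_LLM_BASE_URL",
    "ALIRA_LLM_API_KEY",
    "ALIRA_LLM_EMBEDDING_MODEL",
    "ALIRA_LLM_BASE_MODEL",
)


@dataclass(frozen=True)
class LLMSettings:  # hypothetical container, not part of the alira package
    base_url: str
    api_key: str
    embedding_model: str
    base_model: str


def load_settings(env=None) -> LLMSettings:
    """Read the four documented variables and fail early if any is missing or empty."""
    env = os.environ if env is None else env
    missing = [name for name in REQUIRED_VARS if not env.get(name)]
    if missing:
        raise RuntimeError(f"Missing environment variables: {', '.join(missing)}")
    return LLMSettings(
        base_url=env["ALIRA_LLM_BASE_URL"],
        api_key=env["ALIRA_LLM_API_KEY"],
        embedding_model=env["ALIRA_LLM_EMBEDDING_MODEL"],
        base_model=env["ALIRA_LLM_BASE_MODEL"],
    )


# Placeholder values for illustration only
example_env = {
    "ALIRA_LLM_BASE_URL": "https://llm.example.com/v1",
    "ALIRA_LLM_API_KEY": "placeholder-key",
    "ALIRA_LLM_EMBEDDING_MODEL": "example-embedding-model",
    "ALIRA_LLM_BASE_MODEL": "example-chat-model",
}
settings = load_settings(example_env)
```

Calling load_settings() with no argument reads from the process environment instead of the example dictionary.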

Example

import pandas as pd
from alira import ActiveLearner

# Load corpus
df = pd.read_csv("data/movies.csv")

# Fit active learner
learner = ActiveLearner(corpus=df["text"])
learner.fit(query="sports")

# Get ranked results
df["score"] = learner.predict_proba()
results = df[df["score"] >= 0.5].sort_values("score", ascending=False)

See examples/demo.py for a complete runnable script with logging and result persistence.

API

ActiveLearner

Main entrypoint exported by the alira package.

ActiveLearner(
    corpus: list[str] | pd.Series | np.ndarray,
    embeddings: np.ndarray | pd.Series | None = None,
    n_synthetic: int = 10,
    min_iterations: int = 3,
    max_iterations: int = 20,
    n_eval_per_iteration: int = 30,
    c_value: float = 1.0,
    positive_zone_rmse_threshold: float = 0.01,
    cluster_candidates: bool = False,
    generation_prompt: str | None = None,
    evaluation_prompt: str | None = None,
)

Parameter                      Description
corpus                         Collection of texts to search
embeddings                     Optional pre-computed embeddings aligned 1-to-1 with corpus
n_synthetic                    Number of synthetic texts to generate for bootstrapping (HyDE)
min_iterations                 Minimum iterations before early stopping is evaluated
max_iterations                 Maximum active learning iterations
n_eval_per_iteration           Number of texts evaluated per iteration
c_value                        Inverse regularization strength for LogisticRegression
positive_zone_rmse_threshold   Early-stopping threshold for prediction drift in the positive zone
cluster_candidates             Whether to cluster candidates within each stratum for diversity
generation_prompt              Custom prompt for synthetic text generation
evaluation_prompt              Custom prompt for LLM evaluation

Methods

  • fit(query: str) -> Self — Run the active-learning loop and train the classifier.
  • predict_proba(corpus=None, embeddings=None) -> pd.Series — Return predicted probabilities of relevance.
  • predict(corpus=None, embeddings=None) -> pd.Series — Return binary predictions.
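
The relationship between probabilities, binary predictions, and a ranked result list can be sketched in plain Python. This is an illustration only: the 0.5 cut-off mirrors the Example section and is an assumption about predict, and the helpers to_binary and rank are hypothetical names, not part of the alira API.

```python
def to_binary(probabilities, threshold=0.5):
    """Turn relevance probabilities into 0/1 predictions (threshold is an assumption)."""
    return [int(p >= threshold) for p in probabilities]


def rank(probabilities, threshold=0.5):
    """Return (index, probability) pairs at or above the threshold, best first."""
    kept = [(i, p) for i, p in enumerate(probabilities) if p >= threshold]
    return sorted(kept, key=lambda pair: pair[1], reverse=True)


probabilities = [0.12, 0.91, 0.50, 0.47, 0.78]  # made-up scores
predictions = to_binary(probabilities)
ranked = rank(probabilities)
```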

Project Structure

.
├── src/
│   └── alira/
│       ├── __init__.py           # Package entrypoint, exports ActiveLearner
│       ├── active_learner.py     # Core ActiveLearner implementation
│       ├── classifiers.py        # LogisticRegressionClassifier
│       ├── evaluation.py         # LLM-based binary evaluation (async, max 5 concurrent)
│       ├── llms.py               # OpenAI API client for chat and embeddings
│       ├── synthetic.py          # Synthetic text generation via LLM (HyDE)
│       └── config.py             # Environment-based configuration
├── examples/
│   ├── demo.py                   # Example script with logging and CSV output
│   ├── lab_explorer.py           # Example using external data source
│   ├── compare.py                # Utility to compare result sets
│   ├── aists.py                  # Batch runner for AISTS themes
│   ├── embeddings.py             # Generate and cache embeddings
│   └── utils.py                  # Shared example utilities
├── pyproject.toml
└── README.md

Dependencies

Core dependencies (see pyproject.toml):

  • numpy
  • openai
  • pandas
  • pydantic
  • scikit-learn

Optional dependencies for the example scripts (install with pip install -e ".[demo]"):

  • python-dotenv
  • epfl-data-index
