ALIRA classifies a large text corpus according to a natural-language query when exhaustive LLM evaluation is impractical. It iteratively discovers relevant documents using active learning, LLM validation, and classifier refinement.

Maintainers

aitorperez

Project description

ALIRA

Active Learning Iterative Retrieval Agent

Overview

Given a text corpus and a natural-language query, ALIRA bootstraps a binary classifier from LLM-generated synthetic examples (HyDE, Hypothetical Document Embeddings), then repeats the following steps:

1. Evaluates candidate documents via an LLM (up to 5 concurrent calls)
2. Trains a Logistic Regression classifier on the accumulated labels
3. Predicts relevance scores for the full corpus
4. Selects the next batch of candidates via stratified sampling across confidence zones
5. Stops early when the positive-zone prediction drift (RMSE) falls below a threshold

The result is a ranked list of documents predicted to match the query, requiring far fewer LLM calls than exhaustive evaluation.
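
The candidate-selection step can be illustrated with a small, standard-library-only sketch. It is not ALIRA's implementation: the function name `stratified_sample`, the four equal-width score zones, and the round-robin pick are all assumptions made for illustration. The idea is simply that each batch draws from every confidence zone rather than only from the most confident predictions.

```python
import random


def stratified_sample(scores, labeled, batch_size,
                      edges=(0.0, 0.25, 0.5, 0.75, 1.0), seed=0):
    """Pick up to batch_size unlabeled indices, spread across score zones.

    Hypothetical helper: the zone edges are an assumption, not ALIRA's.
    """
    rng = random.Random(seed)
    n_zones = len(edges) - 1
    zones = [[] for _ in range(n_zones)]
    for idx, score in enumerate(scores):
        if idx in labeled:
            continue  # already evaluated by the LLM
        # Find the zone whose [low, high) interval contains the score;
        # a score equal to the last edge falls into the top zone.
        zone = n_zones - 1
        for z in range(n_zones):
            if score < edges[z + 1]:
                zone = z
                break
        zones[zone].append(idx)

    for zone in zones:
        rng.shuffle(zone)

    # Round-robin across zones so that sparse zones still contribute.
    picked = []
    while len(picked) < batch_size and any(zones):
        for zone in zones:
            if zone and len(picked) < batch_size:
                picked.append(zone.pop())
    return picked


scores = [0.05, 0.10, 0.30, 0.45, 0.55, 0.60, 0.80, 0.95]
batch = stratified_sample(scores, labeled={0}, batch_size=4)
print(sorted(batch))
```

With a batch of four and four zones, the sketch returns one document index per zone, skipping index 0 because it is already labeled.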

Quick Start

Requirements

Python >= 3.11
LLM provider supporting the OpenAI API format (for both chat and embedding endpoints)

Installation

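Install the released package from PyPI:

pip install alira

Or, from a clone of the repository, install in editable mode:

pip install -e .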

Configuration

Set the following environment variables (or use a .env file loaded by your entrypoint script):

Variable	Description
`ALIRA_LLM_BASE_URL`	Base URL of the LLM API
`ALIRA_LLM_API_KEY`	API key for authentication
`ALIRA_LLM_EMBEDDING_MODEL`	Model name for embedding requests
`ALIRA_LLM_BASE_MODEL`	Model name for chat/evaluation requests
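
Before starting a long run it can help to confirm that these variables are set. The check below is a minimal sketch using only the standard library; `missing_alira_vars` is a hypothetical helper and not part of the alira package, and treating all four variables as required is an assumption (for example, the embedding model may be unnecessary when pre-computed embeddings are supplied).

```python
import os

# Variable names taken from the table above.
REQUIRED_VARS = (
    "ALIRA_LLM_BASE_URL",
    "ALIRA_LLM_API_KEY",
    "ALIRA_LLM_EMBEDDING_MODEL",
    "ALIRA_LLM_BASE_MODEL",
)


def missing_alira_vars(environ=os.environ):
    """Return the names of variables that are unset or empty."""
    return [name for name in REQUIRED_VARS if not environ.get(name)]


missing = missing_alira_vars()
if missing:
    print("Missing ALIRA settings: " + ", ".join(missing))
```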

Example

import pandas as pd
from alira import ActiveLearner

# Load corpus
df = pd.read_csv("data/movies.csv")

# Fit active learner
learner = ActiveLearner(corpus=df["text"])
learner.fit(query="sports")

# Get ranked results
df["score"] = learner.predict_proba()
results = df[df["score"] >= 0.5].sort_values("score", ascending=False)

See examples/demo.py for a complete runnable script with logging and result persistence.

API

`ActiveLearner`

Main entrypoint exported by the alira package.

ActiveLearner(
    corpus: list[str] | pd.Series | np.ndarray,
    embeddings: np.ndarray | pd.Series | None = None,
    n_synthetic: int = 10,
    min_iterations: int = 3,
    max_iterations: int = 20,
    n_eval_per_iteration: int = 30,
    c_value: float = 1.0,
    positive_zone_rmse_threshold: float = 0.01,
    cluster_candidates: bool = False,
    generation_prompt: str | None = None,
    evaluation_prompt: str | None = None,
)

Parameter	Description
`corpus`	Collection of texts to search
`embeddings`	Optional pre-computed embeddings aligned 1-to-1 with `corpus`
`n_synthetic`	Number of synthetic texts to generate for bootstrapping (HyDE)
`min_iterations`	Minimum iterations before early stopping is evaluated
`max_iterations`	Maximum active learning iterations
`n_eval_per_iteration`	Number of texts evaluated per iteration
`c_value`	Inverse regularization strength for LogisticRegression
`positive_zone_rmse_threshold`	Early-stopping threshold for prediction drift in the positive zone
`cluster_candidates`	Whether to cluster candidates within each stratum for diversity
`generation_prompt`	Custom prompt for synthetic text generation
`evaluation_prompt`	Custom prompt for LLM evaluation
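
To make the interaction between `min_iterations` and `positive_zone_rmse_threshold` concrete, here is a rough sketch of an early-stopping check. It is an illustration under stated assumptions, not the package's code: the positive zone is taken to be every document whose current score is at least 0.5, iterations are counted from 1, and both function names are hypothetical.

```python
import math


def positive_zone_rmse(previous, current, zone_threshold=0.5):
    """RMSE between two prediction rounds, restricted to the positive zone.

    Assumption: the positive zone is every document whose current score is
    at least zone_threshold; ALIRA's own definition may differ.
    """
    pairs = [(p, c) for p, c in zip(previous, current) if c >= zone_threshold]
    if not pairs:
        return float("inf")  # nothing to compare yet, so never stop on this
    return math.sqrt(sum((c - p) ** 2 for p, c in pairs) / len(pairs))


def should_stop(iteration, previous, current,
                min_iterations=3, rmse_threshold=0.01):
    """Stop only after min_iterations rounds and once the drift is small."""
    if iteration < min_iterations:
        return False
    return positive_zone_rmse(previous, current) < rmse_threshold


stable_prev = [0.10, 0.60, 0.90]
stable_curr = [0.20, 0.60, 0.90]  # only a negative-zone score moved
print(should_stop(2, stable_prev, stable_curr))  # too early to stop
print(should_stop(3, stable_prev, stable_curr))  # positive zone is stable
```

In this sketch, movement among low-scoring documents does not delay stopping; only scores in the positive zone count toward the drift.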

Methods

fit(query: str) -> Self — Run the active-learning loop and train the classifier.
predict_proba(corpus=None, embeddings=None) -> pd.Series — Return predicted probabilities of relevance.
predict(corpus=None, embeddings=None) -> pd.Series — Return binary predictions.
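
As a rough guide to what a single `fit` call may cost, the defaults above bound the number of LLM evaluations. The arithmetic below assumes one LLM evaluation per evaluated text and ignores the synthetic-generation and embedding requests; the corpus size is a made-up figure for illustration.

```python
def max_llm_evaluations(max_iterations=20, n_eval_per_iteration=30):
    """Upper bound on documents sent to the LLM for evaluation in one fit."""
    return max_iterations * n_eval_per_iteration


corpus_size = 100_000  # hypothetical corpus size, for illustration only
budget = max_llm_evaluations()
print(budget)                # 600
print(budget / corpus_size)  # 0.006
```

Early stopping can only lower this figure, so under these assumptions the default settings evaluate at most 600 documents regardless of corpus size.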

Project Structure

.
├── src/
│   └── alira/
│       ├── __init__.py           # Package entrypoint, exports ActiveLearner
│       ├── active_learner.py     # Core ActiveLearner implementation
│       ├── classifiers.py        # LogisticRegressionClassifier
│       ├── evaluation.py         # LLM-based binary evaluation (async, max 5 concurrent)
│       ├── llms.py               # OpenAI API client for chat and embeddings
│       ├── synthetic.py          # Synthetic text generation via LLM (HyDE)
│       └── config.py             # Environment-based configuration
├── examples/
│   ├── demo.py                   # Example script with logging and CSV output
│   ├── lab_explorer.py           # Example using external data source
│   ├── compare.py                # Utility to compare result sets
│   ├── aists.py                  # Batch runner for AISTS themes
│   ├── embeddings.py             # Generate and cache embeddings
│   └── utils.py                  # Shared example utilities
├── pyproject.toml
└── README.md
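
The tree above notes that `evaluation.py` is asynchronous with at most 5 concurrent calls. One common way to get that behaviour is an `asyncio.Semaphore`; the sketch below shows the pattern only and is not taken from the package. `evaluate_all` and `fake_llm_judge` are hypothetical names, and the judge is a stand-in for a real chat-completion request.

```python
import asyncio


async def evaluate_all(texts, evaluate_one, max_concurrent=5):
    """Run evaluate_one on every text with at most max_concurrent in flight."""
    semaphore = asyncio.Semaphore(max_concurrent)

    async def guarded(text):
        async with semaphore:
            return await evaluate_one(text)

    # gather preserves input order, so results line up with texts.
    return await asyncio.gather(*(guarded(t) for t in texts))


async def fake_llm_judge(text):
    """Stand-in for a real LLM call; not part of ALIRA."""
    await asyncio.sleep(0.01)
    return "sport" in text.lower()


labels = asyncio.run(evaluate_all(["Sports drama", "Space opera"], fake_llm_judge))
print(labels)  # [True, False]
```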

Dependencies

Core dependencies (see pyproject.toml):

numpy
openai
pandas
pydantic
scikit-learn

Optional dependencies for the example scripts (install with pip install -e ".[demo]"):

python-dotenv
epfl-data-index

Project details

Release history

This version

0.1.0

Jun 18, 2026

Download files

Source Distribution

alira-0.1.0.tar.gz (889.5 kB view details)

Uploaded Jun 18, 2026 Source

Built Distribution

alira-0.1.0-py3-none-any.whl (12.5 kB view details)

Uploaded Jun 18, 2026 Python 3

File details

Details for the file alira-0.1.0.tar.gz.

File metadata

Download URL: alira-0.1.0.tar.gz
Upload date: Jun 18, 2026
Size: 889.5 kB
Tags: Source
Uploaded using Trusted Publishing? Yes
Uploaded via: twine/6.1.0 CPython/3.13.12

File hashes

Hashes for alira-0.1.0.tar.gz
Algorithm	Hash digest
SHA256	`242ad0af824ee61f20ce1269abd2847789abd126444449c53245c2dcaf17f353`
MD5	`54e361dc2dcb98ab948254209c54dd10`
BLAKE2b-256	`88326c72248cfb162ff80605a5321ad950d4e253d914ad2c41c2bc930a31351a`
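
A downloaded archive can be checked against the SHA256 digest listed above with the standard library. This is a minimal sketch: the helper names are hypothetical, and the local path in the usage note is an assumption about where the file was saved.

```python
import hashlib

# SHA256 digest published above for alira-0.1.0.tar.gz.
EXPECTED = "242ad0af824ee61f20ce1269abd2847789abd126444449c53245c2dcaf17f353"


def sha256_of_file(path, chunk_size=1 << 20):
    """Return the hex SHA256 digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_published_hash(path, expected=EXPECTED):
    """True when the file's SHA256 digest equals the expected value."""
    return sha256_of_file(path) == expected
```

For example, `matches_published_hash("alira-0.1.0.tar.gz")` should return `True` for an intact download saved under that name in the current directory.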

Provenance

The following attestation bundles were made for alira-0.1.0.tar.gz:

Publisher: publish.yml on epfl-p-data/alira

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

Statement:
- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: alira-0.1.0.tar.gz
- Subject digest: 242ad0af824ee61f20ce1269abd2847789abd126444449c53245c2dcaf17f353
- Sigstore transparency entry: 1859903142
- Sigstore integration time: Jun 18, 2026
Source repository:
- Permalink: epfl-p-data/alira@2761c62abe970590385aadffcb9c8b0486d8900a
- Branch / Tag: refs/tags/v0.1.0
- Owner: https://github.com/epfl-p-data
- Access: public
Publication detail:
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: publish.yml@2761c62abe970590385aadffcb9c8b0486d8900a
- Trigger Event: release

File details

Details for the file alira-0.1.0-py3-none-any.whl.

File metadata

Download URL: alira-0.1.0-py3-none-any.whl
Upload date: Jun 18, 2026
Size: 12.5 kB
Tags: Python 3
Uploaded using Trusted Publishing? Yes
Uploaded via: twine/6.1.0 CPython/3.13.12

File hashes

Hashes for alira-0.1.0-py3-none-any.whl
Algorithm	Hash digest
SHA256	`9901ce56d536ce777196c2537705fd8474f4709b7c67bfb9edee0ed08ea7c0ed`
MD5	`e352df67ea5bd4527574f8ba4110b185`
BLAKE2b-256	`42474a5acb8c942f914676e7973fafd2af52c66773ab712a9653310ea078fb90`

Provenance

The following attestation bundles were made for alira-0.1.0-py3-none-any.whl:

Publisher: publish.yml on epfl-p-data/alira

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

Statement:
- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: alira-0.1.0-py3-none-any.whl
- Subject digest: 9901ce56d536ce777196c2537705fd8474f4709b7c67bfb9edee0ed08ea7c0ed
- Sigstore transparency entry: 1859903152
- Sigstore integration time: Jun 18, 2026
Source repository:
- Permalink: epfl-p-data/alira@2761c62abe970590385aadffcb9c8b0486d8900a
- Branch / Tag: refs/tags/v0.1.0
- Owner: https://github.com/epfl-p-data
- Access: public
Publication detail:
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: publish.yml@2761c62abe970590385aadffcb9c8b0486d8900a
- Trigger Event: release
