A Learning-to-Rank library with LambdaMART, BM25, and MovieLens support
# Learning-to-Rank from Scratch
A complete implementation of a Learning-to-Rank system using LambdaMART with LightGBM for query-document ranking on the MovieLens dataset.
## 🎯 Overview
This project implements a state-of-the-art ranking system that learns to rank movies for users based on:
- Features: TF-IDF similarity, document popularity, engagement signals
- Model: LambdaMART using LightGBM with pairwise preference learning
- Baseline: BM25 for comparison
- Evaluation: NDCG@10, MAP (Mean Average Precision), Precision@K
- Validation: 5-fold cross-validation with comprehensive metric comparison
## 📊 Dataset

**MovieLens 100K** contains 100,000 ratings from 943 users on 1,682 movies:
- Ratings converted to relevance labels (0-3 scale)
- Query-document-relevance triplets created from user-movie interactions
- Rich metadata including genres, titles, and user demographics
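
The rating-to-label conversion can be sketched as a simple thresholded mapping. The exact thresholds here (1-2 stars → 0, 3 → 1, 4 → 2, 5 → 3) are an assumption for illustration; the notebook's mapping may differ:

```python
def rating_to_relevance(rating: int) -> int:
    """Map a 1-5 star MovieLens rating to a graded 0-3 relevance label.

    The thresholds used here are one plausible choice, not necessarily
    the ones in the notebook: 1-2 stars -> 0, 3 -> 1, 4 -> 2, 5 -> 3.
    """
    return {1: 0, 2: 0, 3: 1, 4: 2, 5: 3}[rating]
```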
## 🚀 Quick Start

### Prerequisites

```bash
pip install -r requirements.txt
```

### Run the Notebook

```bash
jupyter notebook learning_to_rank.ipynb
```
The notebook will:
- Download the MovieLens dataset automatically
- Engineer features from movie metadata and user interactions
- Train LambdaMART model with cross-validation
- Compare against BM25 baseline
- Generate metric comparison charts
- Analyze feature importance
## 🔧 Feature Engineering

### 1. TF-IDF Similarity Features
- User profiles created from highly-rated movies
- Cosine similarity between user profile and candidate movies
- Captures content-based relevance
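
A minimal sketch of this idea with scikit-learn, using a hypothetical mini-corpus (the movie texts and user-profile construction here are invented for illustration; the notebook builds these from real metadata):

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Hypothetical mini-corpus: each "document" is a movie's title plus genres,
# and the user profile concatenates text from the user's highly-rated movies.
movies = [
    "toy story animation children comedy",
    "heat action crime thriller",
    "aladdin animation children musical",
]
user_profile = ["toy story aladdin animation children"]

vec = TfidfVectorizer()
movie_vecs = vec.fit_transform(movies)          # fit vocabulary on movie texts
profile_vec = vec.transform(user_profile)       # project profile into same space
sims = cosine_similarity(profile_vec, movie_vecs).ravel()  # one score per movie
```

Movies sharing vocabulary with the user's profile (the animated titles here) score higher than unrelated ones.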
### 2. Document Popularity Features
- Number of ratings per movie
- Average rating and standard deviation
- Number of unique users
- Popularity score (composite metric)
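
These per-movie statistics amount to a groupby-aggregate over the ratings table. A sketch with pandas, assuming columns named `user_id`, `movie_id`, and `rating`, and one plausible composite popularity score (the notebook's composite may be defined differently):

```python
import numpy as np
import pandas as pd

def popularity_features(ratings: pd.DataFrame) -> pd.DataFrame:
    """Per-movie popularity stats; assumes columns user_id, movie_id, rating."""
    stats = ratings.groupby("movie_id").agg(
        n_ratings=("rating", "count"),
        avg_rating=("rating", "mean"),
        rating_std=("rating", "std"),
        n_users=("user_id", "nunique"),
    )
    # One plausible composite: log-damped rating volume times mean rating.
    stats["popularity"] = np.log1p(stats["n_ratings"]) * stats["avg_rating"]
    return stats.reset_index()
```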
### 3. Engagement Signal Features
- User activity level (number of ratings)
- User rating patterns (mean, std)
- User demographics (age, gender)
- Movie genre indicators (18 genres)
## 📈 Model Architecture

### LambdaMART Configuration

```python
{
    'objective': 'lambdarank',
    'metric': 'ndcg',
    'ndcg_eval_at': [10],
    'learning_rate': 0.05,
    'num_leaves': 31,
    'max_depth': 6,
    'min_data_in_leaf': 20,
    'feature_fraction': 0.8,
    'bagging_fraction': 0.8,
    'bagging_freq': 5
}
```
### Training Strategy
- Objective: Pairwise preference learning (lambdarank)
- Optimization: Directly optimizes NDCG
- Cross-validation: 5-fold GroupKFold (groups by user)
- Comparison: BM25 baseline on same splits
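
The grouping logic can be sketched as follows. The toy arrays stand in for the real feature matrix and per-row user ids, and the rows are assumed to be sorted by user (LightGBM's ranking objective needs per-query group sizes in row order); the LightGBM calls are shown only as comments:

```python
import numpy as np
from sklearn.model_selection import GroupKFold

# Toy stand-ins for the real feature matrix and per-row user ids.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))
users = np.repeat(np.arange(20), 5)   # 20 users, 5 candidate movies each

gkf = GroupKFold(n_splits=5)
for train_idx, test_idx in gkf.split(X, groups=users):
    # No user may appear on both sides of a split.
    assert set(users[train_idx]).isdisjoint(users[test_idx])
    # LightGBM's lambdarank objective needs the number of rows per query
    # (here: per user), in the order rows appear in the fold.
    _, group_sizes = np.unique(users[train_idx], return_counts=True)
    # train_set = lgb.Dataset(X[train_idx], label=y[train_idx], group=group_sizes)
    # booster = lgb.train(params, train_set, num_boost_round=500)
```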
## 📊 Evaluation Metrics

### NDCG@10 (Normalized Discounted Cumulative Gain)
- Measures ranking quality with position-based discounting
- Considers graded relevance labels
- Primary metric for ranking evaluation
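
A minimal NDCG@k implementation for one query, using the common exponential-gain form (the notebook may use a library implementation instead):

```python
import numpy as np

def ndcg_at_k(labels_in_ranked_order, k=10):
    """NDCG@k for one query.

    `labels_in_ranked_order` are graded relevance labels ordered by the
    model's predicted score, best-scored document first.
    """
    rel = np.asarray(labels_in_ranked_order, dtype=float)[:k]
    discounts = 1.0 / np.log2(np.arange(2, rel.size + 2))   # 1/log2(rank+1)
    dcg = np.sum((2.0 ** rel - 1.0) * discounts)
    ideal = np.sort(np.asarray(labels_in_ranked_order, dtype=float))[::-1][:k]
    idcg = np.sum((2.0 ** ideal - 1.0) * discounts)
    return dcg / idcg if idcg > 0 else 0.0
```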
### MAP (Mean Average Precision)
- Evaluates precision across all relevant items
- Emphasizes finding all relevant documents
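
Average precision for a single query can be computed from binary relevance flags in ranked order; MAP is simply the mean of this over all queries:

```python
def average_precision(relevant_flags):
    """AP for one ranked list of binary relevance flags (1 = relevant)."""
    hits, score = 0, 0.0
    for rank, rel in enumerate(relevant_flags, start=1):
        if rel:
            hits += 1
            score += hits / rank   # precision at each relevant position
    return score / hits if hits else 0.0
```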
### Precision@K
- Measures fraction of relevant items in top-K results
- Simple interpretable metric
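
For completeness, Precision@K is a one-liner over the same binary flags:

```python
def precision_at_k(relevant_flags, k=10):
    """Fraction of the top-k ranked results that are relevant."""
    top = relevant_flags[:k]
    return sum(top) / len(top) if top else 0.0
```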
## 📁 Project Structure

```
learning-to-rank-from-scratch/
├── learning_to_rank.ipynb   # Main notebook with complete implementation
├── requirements.txt         # Python dependencies
├── README.md                # This file
├── .gitignore               # Git ignore rules
└── ml-100k/                 # MovieLens dataset (auto-downloaded)
```
## 📸 Visualizations
The notebook generates three key visualizations:
- Metric Comparison by Fold - Shows LambdaMART vs BM25 for each CV fold
- Average Metric Comparison - Mean performance with error bars
- Feature Importance - Top contributing features to ranking quality
## 🎓 Key Concepts

### Learning-to-Rank
Learning-to-Rank treats ranking as a supervised machine learning problem:
- Input: Query-document pairs with features
- Output: Relevance scores for ranking
- Approaches: Pointwise, Pairwise (this project), Listwise
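
The pairwise view can be made concrete: within one query, every pair of documents with different labels yields a (winner, loser) training preference. A minimal sketch:

```python
from itertools import combinations

def preference_pairs(labels):
    """Enumerate (winner, loser) index pairs for one query's graded labels.

    A pair is emitted whenever one document is strictly more relevant
    than the other; equally-labeled documents yield no preference.
    """
    pairs = []
    for i, j in combinations(range(len(labels)), 2):
        if labels[i] > labels[j]:
            pairs.append((i, j))
        elif labels[j] > labels[i]:
            pairs.append((j, i))
    return pairs
```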
### LambdaMART
LambdaMART combines:
- LambdaRank: Uses lambda gradients from pairwise preferences
- MART (Multiple Additive Regression Trees): Gradient boosted decision trees
- Direct NDCG optimization: Optimizes the actual ranking metric
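
The lambda gradient for a pair where document *i* is more relevant than document *j* is commonly written as λᵢⱼ = −σ·|ΔNDCG| / (1 + exp(σ·(sᵢ − sⱼ))), where sᵢ, sⱼ are the current model scores and |ΔNDCG| is the NDCG change from swapping the two documents. A direct transcription:

```python
import math

def lambda_ij(s_i, s_j, delta_ndcg, sigma=1.0):
    """LambdaRank gradient for a pair where doc i should rank above doc j.

    Misordered pairs (s_i < s_j) and pairs whose swap would move NDCG a lot
    (large |delta_ndcg|) receive gradients of larger magnitude.
    """
    return -sigma * abs(delta_ndcg) / (1.0 + math.exp(sigma * (s_i - s_j)))
```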
### Why Pairwise Learning?
- More data efficient than pointwise approaches
- Captures relative ordering directly
- Better suited for ranking tasks than regression
## 🔬 Expected Results

LambdaMART typically outperforms the BM25 baseline by:
- NDCG@10: 10-30% improvement
- MAP: 15-25% improvement
- Precision@10: 10-20% improvement
Results may vary based on:
- Train/test split
- Feature engineering quality
- Hyperparameter tuning
- Dataset characteristics
## 🛠️ Customization

### Adding New Features

Edit the feature engineering section in the notebook:

```python
feature_columns = [
    'your_new_feature',
    # ... existing features
]
```
### Tuning Hyperparameters

Modify the LightGBM parameters:

```python
params = {
    'objective': 'lambdarank',
    'learning_rate': 0.1,  # Adjust
    'num_leaves': 63,      # Adjust
    # ...
}
```
### Using Different Datasets
Replace the MovieLens loading code with your dataset:
- Ensure query-document-relevance triplet format
- Adapt feature engineering to your domain
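
The expected triplet layout can be sketched as a small table: one row per (query, document) pair, with a graded label and feature columns alongside. All names below (`query_id`, `doc_id`, `relevance`, `feat_tfidf`) are illustrative, not a fixed schema:

```python
import pandas as pd

# Minimal query-document-relevance layout (hypothetical column names):
triplets = pd.DataFrame({
    "query_id":   [1, 1, 2],         # e.g. user id
    "doc_id":     [10, 20, 10],      # e.g. movie id
    "relevance":  [3, 0, 2],         # graded label, 0-3
    "feat_tfidf": [0.8, 0.1, 0.5],   # one of potentially many feature columns
})
```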
## 📝 License
This project is licensed under the MIT License - see the LICENSE file for details.
## 🤝 Contributing
Contributions are welcome! Feel free to:
- Report bugs
- Suggest features
- Submit pull requests
- Improve documentation
## ⭐ Acknowledgments
- GroupLens Research for the MovieLens dataset
- Microsoft Research for LambdaMART algorithm
- LightGBM team for the excellent gradient boosting framework
## File details

Details for the file `ltr_lib-0.1.0.tar.gz`.

### File metadata

- Download URL: ltr_lib-0.1.0.tar.gz
- Upload date:
- Size: 48.0 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.13.5

### File hashes

| Algorithm | Hash digest |
|---|---|
| SHA256 | `d95a6d8dfa32be6418a6bdefdf582e70683835f9f6f26ae63050055a3b73e62a` |
| MD5 | `52a4a9d0697301ceb3636d16e1c95ff9` |
| BLAKE2b-256 | `d663afe8f945cf0a6a8f94192fd039acf3f32a72b73df6c30e523ca877e0b7d8` |
## File details

Details for the file `ltr_lib-0.1.0-py3-none-any.whl`.

### File metadata

- Download URL: ltr_lib-0.1.0-py3-none-any.whl
- Upload date:
- Size: 25.2 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.13.5

### File hashes

| Algorithm | Hash digest |
|---|---|
| SHA256 | `5af452d0bd72a3c3d7fe1b6153542903509d2f7a4bfea5aff408ac8512c0faba` |
| MD5 | `cb78a57ed504abe3d56ef752c0ae854b` |
| BLAKE2b-256 | `0aae0cec11e1b319fd9cfb4d71b6e387e6ad5ce1acc8794b7c4f3ee9b89dd4b1` |