PECOS - Predictions for Enormous and Correlated Output Spaces
PECOS is a versatile and modular machine learning (ML) framework for fast learning and inference on problems with large output spaces, such as extreme multi-label ranking (XMR) and large-scale retrieval. PECOS' design is intentionally agnostic to the specific nature of the inputs and outputs as it is envisioned to be a general-purpose framework for multiple distinct applications.
Given an input, PECOS identifies a small set (10-100) of relevant outputs from amongst an extremely large (~100MM) candidate set and ranks these outputs in terms of relevance.
Features
Extreme Multi-label Ranking and Classification
- X-Linear (pecos.xmc.xlinear): recursive linear models that learn to traverse an input from the root of a hierarchical label tree to a few leaf node clusters, and return the top-k relevant labels within those clusters as predictions. See more details in the PECOS paper (Yu et al., 2020).
  - fast real-time inference in C++
  - can handle a 100MM output space
- X-Transformer (pecos.xmc.xtransformer): a Transformer matcher that learns to traverse an input from the root of a hierarchical label tree to a few leaf node clusters, and returns the top-k relevant labels within those clusters using a linear ranker. See technical details in the X-Transformer paper (Chang et al., 2020) and the latest SOTA results in the PECOS paper (Yu et al., 2020).
  - easy to extend with many pre-trained Transformer models from huggingface transformers
  - one of the state-of-the-art deep learning based XMC methods
- text2text application (pecos.apps.text2text): an easy-to-use text classification pipeline (with X-Linear backend) that supports n-gram TFIDF vectorization, classification, and ensemble predictions.
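The tree traversal described for X-Linear can be illustrated with a small beam search. This is a minimal sketch and not the PECOS implementation: the tree, the node names, and the scores in CHILDREN and NODE_SCORE are hypothetical stand-ins for what per-node linear models might output for a single input, and the tree is assumed to be balanced so that all leaves sit at the same depth.

```python
from typing import Dict, List, Tuple

# Hypothetical toy tree: the root has two clusters, each holding two labels.
CHILDREN: Dict[str, List[str]] = {
    "root": ["c0", "c1"],
    "c0": ["label_a", "label_b"],
    "c1": ["label_c", "label_d"],
}
# Hypothetical per-node scores for one input (assumed, not learned here).
NODE_SCORE: Dict[str, float] = {
    "c0": 0.9, "c1": 0.2,
    "label_a": 0.8, "label_b": 0.3, "label_c": 0.95, "label_d": 0.1,
}


def beam_search(beam_size: int, top_k: int) -> List[Tuple[str, float]]:
    """Walk the tree level by level, expanding only the best `beam_size` nodes."""
    beam: List[Tuple[str, float]] = [("root", 1.0)]
    while True:
        candidates: List[Tuple[str, float]] = []
        for node, score in beam:
            for child in CHILDREN.get(node, []):
                # Path score: product of the scores along the root-to-child path.
                candidates.append((child, score * NODE_SCORE[child]))
        candidates.sort(key=lambda item: item[1], reverse=True)
        # Stop once every candidate is a leaf (a label); return the top-k labels.
        if all(child not in CHILDREN for child, _ in candidates):
            return candidates[:top_k]
        beam = candidates[:beam_size]


for label, score in beam_search(beam_size=1, top_k=2):
    print(label, round(score, 3))
```

With beam_size=1 only the best cluster is expanded, so label_c is never scored even though its own node score is high. That is the trade-off a narrow beam makes between speed and recall.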
Requirements and Installation
- Python (>=3.6)
- Pip (>=19.3)
See other dependencies in setup.py
You should install PECOS in a virtual environment.
If you're unfamiliar with Python virtual environments, check out the user guide.
Supported Platforms
- Ubuntu 16.04, 18.04 and 20.04
- Amazon Linux 2
Installation from Wheel
PECOS can be installed using pip as follows:
pip3 install libpecos
Installation from Source
Prerequisite builder tools
- For Ubuntu (16.04, 18.04, 20.04):
apt-get update && apt-get install -y build-essential git python3 python3-distutils python3-venv
- For Amazon Linux 2:
yum -y install python3 python3-devel python3-distutils python3-venv && yum -y groupinstall 'Development Tools'
Install and develop locally
git clone https://github.com/amzn/pecos
cd pecos
pip3 install --editable ./
Quick Tour
For a glimpse of how PECOS works, here is a quick tour of the PECOS API for the XMR problem.
Toy Example
The eXtreme Multi-label Ranking (XMR) problem is defined by two matrices:
- instance-to-feature matrix X, of shape N by D, in SciPy CSR format
- instance-to-label matrix Y, of shape N by L, in SciPy CSR format
Some toy data matrices are available in the tst-data folder.
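To make the CSR layout concrete, the following sketch builds the three CSR arrays for a tiny instance-to-label matrix in plain Python. The helper name labels_to_csr and the toy data are hypothetical and only serve as illustration. They are not part of PECOS or of the tst-data folder.

```python
from typing import List, Tuple


def labels_to_csr(rows: List[List[int]]) -> Tuple[List[float], List[int], List[int]]:
    """Convert per-instance label lists into CSR arrays (data, indices, indptr)."""
    data: List[float] = []
    indices: List[int] = []
    indptr: List[int] = [0]
    for labels in rows:
        # Store each positive label once, in increasing column order.
        for label in sorted(set(labels)):
            indices.append(label)
            data.append(1.0)
        # indptr marks where the next row starts inside `indices`.
        indptr.append(len(indices))
    return data, indices, indptr


# Hypothetical toy data: N = 3 instances, L = 4 labels.
toy_labels = [[0, 2], [3], [1, 2]]
data, indices, indptr = labels_to_csr(toy_labels)
print(indices)  # column (label) index of every nonzero entry
print(indptr)   # row i owns indices[indptr[i]:indptr[i + 1]]
```

If SciPy is installed, these arrays can be passed to scipy.sparse.csr_matrix((data, indices, indptr), shape=(3, 4)) to obtain a matrix of the kind described above.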
PECOS constructs a hierarchical label tree and learns linear models recursively (e.g., XR-Linear):
>>> from pecos.xmc.xlinear.model import XLinearModel
>>> from pecos.xmc import Indexer, LabelEmbeddingFactory
# Build a hierarchical label tree and train an XR-Linear model
>>> label_feat = LabelEmbeddingFactory.create(Y, X)
>>> cluster_chain = Indexer.gen(label_feat)
>>> model = XLinearModel.train(X, Y, C=cluster_chain)
>>> model.save("./save-models")
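One plausible way to derive label features from Y and X is to sum the feature vectors of each label's positive instances and normalize the result. The sketch below illustrates that idea in plain Python with dictionaries as sparse vectors. It is an assumption made for illustration only: the function label_embeddings is hypothetical, and the source does not state what LabelEmbeddingFactory.create(Y, X) computes.

```python
import math
from typing import Dict, List

SparseVec = Dict[int, float]  # feature index -> value


def label_embeddings(X: List[SparseVec], Y: List[List[int]], num_labels: int) -> List[SparseVec]:
    """Sum the features of each label's positive instances, then L2-normalize."""
    emb: List[SparseVec] = [{} for _ in range(num_labels)]
    for feats, labels in zip(X, Y):
        for label in labels:
            for col, val in feats.items():
                emb[label][col] = emb[label].get(col, 0.0) + val
    for vec in emb:
        norm = math.sqrt(sum(v * v for v in vec.values()))
        if norm > 0.0:
            for col in vec:
                vec[col] /= norm
    return emb


# Hypothetical toy data: 3 instances, 2 features, 2 labels.
X_toy = [{0: 1.0}, {1: 2.0}, {0: 3.0, 1: 4.0}]
Y_toy = [[0], [0, 1], [1]]
for label_id, vec in enumerate(label_embeddings(X_toy, Y_toy, num_labels=2)):
    print(label_id, {col: round(val, 3) for col, val in sorted(vec.items())})
```

Labels that share many positive instances end up with similar vectors, which is what makes such embeddings usable as input for clustering labels into a tree.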
After training the model, we run prediction and evaluation:
>>> from pecos.utils import smat_util
>>> Yt_pred = model.predict(Xt)
# print precision and recall at k=10
>>> print(smat_util.Metrics.generate(Yt, Yt_pred))
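For readers unfamiliar with the metrics, precision and recall at k for a single instance can be computed as in the sketch below. The function name and the toy inputs are hypothetical, and the exact conventions used by smat_util.Metrics (for example, how scores are averaged over instances) may differ.

```python
from typing import List, Set, Tuple


def precision_recall_at_k(ranked: List[int], relevant: Set[int], k: int) -> Tuple[float, float]:
    """Return (precision@k, recall@k) for one instance.

    `ranked` lists predicted label ids from most to least relevant;
    `relevant` holds the ground-truth label ids.
    """
    hits = sum(1 for label in ranked[:k] if label in relevant)
    precision = hits / k
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall


# Hypothetical example: four ranked predictions, three true labels.
p, r = precision_recall_at_k(ranked=[5, 2, 9, 1], relevant={1, 2, 7}, k=2)
print(f"precision@2={p:.3f} recall@2={r:.3f}")
```

A dataset-level score would typically average these per-instance values over all test instances.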
PECOS also offers an optimized C++ implementation for fast real-time inference:
>>> model = XLinearModel.load("./save-models", is_predict_only=True)
>>> for i in range(X_tst.shape[0]):
...     y_tst_pred = model.predict(X_tst[i], threads=1)
Citation
If you find PECOS useful, please consider citing our papers.
- H. Yu, K. Zhong, I. Dhillon, PECOS: Prediction for Enormous and Correlated Output Spaces, arXiv 2020.
@article{yu2020pecos,
title={PECOS: Prediction for Enormous and Correlated Output Spaces},
author={Yu, Hsiang-Fu and Zhong, Kai and Dhillon, Inderjit S},
journal={arXiv preprint arXiv:2010.05878},
year={2020}
}
- W. Chang, H. Yu, K. Zhong, Y. Yang, I. Dhillon, Taming pretrained transformers for extreme multi-label text classification, KDD 2020.
@inproceedings{chang2020taming,
title={Taming pretrained transformers for extreme multi-label text classification},
author={Chang, Wei-Cheng and Yu, Hsiang-Fu and Zhong, Kai and Yang, Yiming and Dhillon, Inderjit S},
booktitle={Proceedings of the 26th ACM SIGKDD International Conference on Knowledge Discovery \& Data Mining},
pages={3163--3171},
year={2020}
}
License
Copyright (2021) Amazon.com, Inc.
Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with the License. You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License.
Built Distributions
Hashes for libpecos-0.1.0-cp39-cp39-manylinux2014_x86_64.whl
Algorithm | Hash digest
---|---
SHA256 | 35ff545f6e03b12350033bbeda956ba9614e0b7ee40169d511e7f4e0e7378b70
MD5 | 162a2714cccf05d9ea281d095d370a22
BLAKE2b-256 | d0607bf300399529a57e46b65d65a9dcaa3b4a220ae18b4342b91c06d4c7bd7d
Hashes for libpecos-0.1.0-cp38-cp38-manylinux2014_x86_64.whl
Algorithm | Hash digest
---|---
SHA256 | 3e0245b36e23f75f9f7adfcc4155b132cba96ab708a2957f73df79dbdf4b5505
MD5 | eff447a0533d7a5fa69037589ce1670c
BLAKE2b-256 | f17d3e3e81843561312da38e9717916656f262f7ccc692bcbab18f784b335ba6
Hashes for libpecos-0.1.0-cp37-cp37m-manylinux2014_x86_64.whl
Algorithm | Hash digest
---|---
SHA256 | f23dae84b49f73f6b217e609c8dd282000f506b3cf363382a515f0a741f76f93
MD5 | c5b20e34bf08c2ac631e111dbeeffbdb
BLAKE2b-256 | 0abf1d9d0cbc80aad52021e4da5e2d5051d77c4d6ad66ad6d9f798ddb5ecc1d8
Hashes for libpecos-0.1.0-cp36-cp36m-manylinux2014_x86_64.whl
Algorithm | Hash digest
---|---
SHA256 | 606d26863aebbe5292d8719d3f2499120031475af9dc252ce55cb0c9fd33c087
MD5 | fa7ae1e68bc3dfc991c5a4669c59d621
BLAKE2b-256 | 07f11bd7290616dea346ed573b40577341619723ea8930c1f00cd9088ce062cf