
ml4ir: Machine Learning Library for Information Retrieval

Setup

Requirements

  • python3.6+
  • pip3
  • docker (version 18.09+ tested)

Using PIP

ml4ir can be installed as a pip package with the following command:

pip install 'git+https://git@github.com/salesforce/ml4ir#egg=ml4ir&subdirectory=python'

This will install ml4ir-0.0.1 (the current version). In the future, once the package is published on PyPI, installation will be as simple as pip install ml4ir

Docker (Recommended)

We have set up a docker-compose.yml file for building and using docker containers to train models.

To run unit tests

docker-compose up

To invoke ml4ir with custom arguments with docker, run

/bin/bash tools/run_docker.sh ml4ir \
	python3 ml4ir/base/pipeline.py \
	<args>

For ranking applications, specifically, use

/bin/bash tools/run_docker.sh ml4ir \
	python3 ml4ir/applications/ranking/pipeline.py \
	<args>

Refer to the Usage section below for details on how to run the ml4ir ranking application.

Check ml4ir/applications/ranking/scripts/example_run.sh for a predefined example run.

To run an example invocation of the ranking application with docker:

/bin/bash python/ml4ir/applications/ranking/scripts/example_run.sh

Virtual Environment

Install virtualenv

pip3 install virtualenv

Create a new python3 virtual environment inside your git repo (it's .gitignored, don't worry)

cd $PLACE_YOU_CALLED_GIT_CLONE/ml4ir
python3 -m venv python/env/.ml4ir_venv3

Activate virtualenv

cd python/
source env/.ml4ir_venv3/bin/activate

Install all dependencies (carefully)

pip3 install --upgrade setuptools
pip install --upgrade pip
pip3 install -r requirements.txt

Note: pip may report some incompatibilities among the AWS-related and TFX dependencies. These are known issues that still need to be fixed, but they can be ignored for now:

ERROR: botocore 1.14.9 has requirement docutils<0.16,>=0.10, but you'll have docutils 0.16 which is incompatible.
ERROR: awscli 1.17.9 has requirement docutils<0.16,>=0.10, but you'll have docutils 0.16 which is incompatible.
ERROR: awscli 1.17.9 has requirement rsa<=3.5.0,>=3.1.2, but you'll have rsa 4.0 which is incompatible.
ERROR: tensorflow-probability 0.8.0 has requirement cloudpickle==1.1.1, but you'll have cloudpickle 1.2.2 which is incompatible.
ERROR: apache-beam 2.18.0 has requirement dill<0.3.2,>=0.3.1.1, but you'll have dill 0.3.0 which is incompatible.
ERROR: apache-beam 2.18.0 has requirement httplib2<=0.12.0,>=0.8, but you'll have httplib2 0.17.0 which is incompatible.
ERROR: apache-beam 2.18.0 has requirement pyarrow<0.16.0,>=0.15.1; python_version >= "3.0" or platform_system != "Windows", but you'll have pyarrow 0.14.1 which is incompatible.
ERROR: tfx-bsl 0.15.3 has requirement absl-py<0.9,>=0.7, but you'll have absl-py 0.9.0 which is incompatible.
ERROR: tfx-bsl 0.15.3 has requirement apache-beam[gcp]<2.17,>=2.16, but you'll have apache-beam 2.18.0 which is incompatible.
ERROR: tensorflow-transform 0.15.0 has requirement absl-py<0.9,>=0.7, but you'll have absl-py 0.9.0 which is incompatible.
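Each of these errors means that an installed version falls outside a range some other installed package declares. Purely as an illustration (this helper is not part of ml4ir, and importlib.metadata requires Python 3.8+), the standard library can show the declared requirement strings that pip is validating:

```python
# Illustrative helper (not part of ml4ir): list the raw requirement
# strings an installed distribution declares -- the constraints pip is
# checking when it prints errors like the ones above.
from importlib.metadata import PackageNotFoundError, requires


def declared_requirements(package: str) -> list:
    """Return the requirement strings `package` declares, or [] if not installed."""
    try:
        return requires(package) or []
    except PackageNotFoundError:
        return []
```

For example, declared_requirements("botocore") would include an entry such as 'docutils<0.16,>=0.10', which is the constraint behind the first error above. Running pip check after installation re-surfaces the same conflicts.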

Note that pre-commit hooks are required; they are installed along with the other requirements. If they fail to install, run pre-commit install manually to set up the git hooks in your .git/ directory.

Set the PYTHONPATH environment variable

export PYTHONPATH=$PYTHONPATH:`pwd`/python
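As a quick sanity check (illustrative, not part of ml4ir), you can confirm that the package now resolves on sys.path, which is what PYTHONPATH feeds into:

```python
# Illustrative check: a package is importable iff the import machinery
# can locate a spec for it on the current sys.path.
import importlib.util


def on_path(package: str) -> bool:
    """Return True if `package` can be located by the import machinery."""
    return importlib.util.find_spec(package) is not None
```

After the export above, on_path("ml4ir") should return True from any working directory.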

Usage

The entrypoint into the training and evaluation functionality of ml4ir is ml4ir/base/pipeline.py. For application-specific overrides, look at the corresponding ml4ir/applications/<eg: ranking>/pipeline.py

ml4ir Library

To use ml4ir as a deep learning library to build relevance models, look at the walkthrough under notebooks/PointwiseRankingDemo.ipynb or notebooks/PointwiseRankingDemo.html (the latter contains architecture diagrams). The notebook walks you through the entire life cycle of a RelevanceModel, from building and training it through to saving it, from the bottom up. The HTML version additionally sheds light on the design of ml4ir and the data format used.

Applications - Ranking

Examples

Using TFRecord

python ml4ir/applications/ranking/pipeline.py \
--data_dir ml4ir/applications/ranking/tests/data/tfrecord \
--feature_config ml4ir/applications/ranking/tests/data/config/feature_config.yaml \
--run_id test \
--data_format tfrecord \
--execution_mode train_inference_evaluate

Using CSV

python ml4ir/applications/ranking/pipeline.py \
--data_dir ml4ir/applications/ranking/tests/data/csv \
--feature_config ml4ir/applications/ranking/tests/data/config/feature_config.yaml \
--run_id test \
--data_format csv \
--execution_mode train_inference_evaluate

Running in inference mode using the default serving signature

python ml4ir/applications/ranking/pipeline.py \
--data_dir ml4ir/applications/ranking/tests/data/tfrecord \
--feature_config ml4ir/applications/ranking/tests/data/config/feature_config.yaml \
--run_id test \
--data_format tfrecord \
--model_file `pwd`/models/test/final/default \
--execution_mode inference_only

NOTE: Make sure to add the right data and feature config before training models.
TODO: describe how to do this
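Until that section is written, the following is a rough, hypothetical sketch of the shape such a feature config might take. Every field name below is an assumption made for illustration only; treat the bundled test config at ml4ir/applications/ranking/tests/data/config/feature_config.yaml as the source of truth for the real schema.

```yaml
# Hypothetical sketch -- field names are illustrative, not the actual schema.
query_key:
  name: query_id        # column that groups records into queries
  dtype: string
label:
  name: clicked         # the relevance signal to train against
  dtype: int64
features:
  - name: query_text
    dtype: string
    trainable: true
  - name: popularity_score
    dtype: float
    trainable: true
```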

Running Tests

To run all the Python-based tests under ml4ir:

python3 -m pytest

To run specific tests,

python3 -m pytest /path/to/test/module

Project Organization

The following structure is a little out of date (TODO(jake) - fix it!)

├── LICENSE
├── Makefile           <- Makefile with commands like `make data` or `make train`
├── README.md          <- The top-level README for developers using this project.
├── data
│   ├── external       <- Data from third party sources.
│   ├── interim        <- Intermediate data that has been transformed.
│   ├── processed      <- The final, canonical data sets for modeling.
│   └── raw            <- The original, immutable data dump.
│
├── docs               <- A default Sphinx project; see sphinx-doc.org for details
│
├── models             <- Trained and serialized models, model predictions, or model summaries
│
├── notebooks          <- Jupyter notebooks. Naming convention is a number (for ordering),
│                         the creator's initials, and a short `-` delimited description, e.g.
│                         `1.0-jqp-initial-data-exploration`.
│
├── references         <- Data dictionaries, manuals, and all other explanatory materials.
│
├── reports            <- Generated analysis as HTML, PDF, LaTeX, etc.
│   └── figures        <- Generated graphics and figures to be used in reporting
│
├── requirements.txt   <- The requirements file for reproducing the analysis environment, e.g.
│                         generated with `pip freeze > requirements.txt`
│
├── setup.py           <- makes project pip installable (pip install -e .) so src can be imported
├── ml4ir              <- Source code for use in this project.
│   ├── __init__.py    <- Makes ml4ir a Python module
│   │
│   ├── data           <- Scripts to download or generate data
│   │   └── make_dataset.py
│   │
│   ├── features       <- Scripts to turn raw data into features for modeling
│   │   └── build_features.py
│   │
│   ├── models         <- Scripts to train models and then use trained models to make
│   │   │                 predictions
│   │   ├── predict_model.py
│   │   └── train_model.py
│   │
│   └── visualization  <- Scripts to create exploratory and results oriented visualizations
│       └── visualize.py
│
└── tox.ini            <- tox file with settings for running tox; see tox.testrun.org

Project based on the cookiecutter data science project template. #cookiecutterdatascience

