Lifestream data analysis with PyTorch

Project description

pytorch-lifestream (ptls) is a library built on PyTorch for building embeddings of discrete event sequences using self-supervision. It can process terabyte-scale volumes of raw events such as game history events, clickstream data, purchase history, or card transactions.

It supports various methods of self-supervised training, adapted for event sequences:

  • Contrastive Learning for Event Sequences (CoLES)
  • Contrastive Predictive Coding (CPC)
  • Replaced Token Detection (RTD) from ELECTRA
  • Next Sequence Prediction (NSP) from BERT
  • Sequences Order Prediction (SOP) from ALBERT
  • Masked Language Model (MLM) from RoBERTa
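
To make the contrastive idea behind CoLES concrete, here is a minimal sketch in plain Python of how positive pairs can be formed: several random contiguous sub-sequences are drawn from each entity's event history, and slices that come from the same entity are treated as positives. This is an illustration only and not the ptls API; the function names `sample_subsequences` and `make_batch` and all their parameters are assumptions introduced for this example.

```python
import random


def sample_subsequences(events, n_samples, min_len, max_len, rng=None):
    """Draw n_samples random contiguous slices from one entity's event list.

    Hypothetical helper for illustration; not part of pytorch-lifestream.
    """
    rng = rng or random.Random()
    # Clamp the requested lengths to what the sequence can provide.
    max_len = min(max_len, len(events))
    min_len = min(min_len, max_len)
    slices = []
    for _ in range(n_samples):
        length = rng.randint(min_len, max_len)
        start = rng.randint(0, len(events) - length)
        slices.append(events[start:start + length])
    return slices


def make_batch(sequences_by_entity, n_samples, min_len, max_len, seed=0):
    """Build (views, labels): views sharing a label are positive pairs."""
    rng = random.Random(seed)
    views, labels = [], []
    for entity_id, events in sequences_by_entity.items():
        for view in sample_subsequences(events, n_samples, min_len, max_len, rng):
            views.append(view)
            labels.append(entity_id)
    return views, labels


if __name__ == "__main__":
    toy = {"client_a": list(range(10)), "client_b": list(range(100, 120))}
    views, labels = make_batch(toy, n_samples=3, min_len=2, max_len=5)
    for label, view in zip(labels, views):
        print(label, view)
```

In a real pipeline each view would be passed through a sequence encoder, and the entity labels would tell the loss which embeddings to pull together and which to push apart.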

It supports several types of encoders, including Transformer and RNN. It also supports many types of self-supervised losses.
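
As a rough illustration of what a contrastive loss computes, the sketch below implements the classic margin-based pairwise formulation in plain Python: embeddings with the same label are penalized by their squared distance, and embeddings with different labels are penalized only when they are closer than a margin. It is a generic textbook form, not the library's implementation; the function names and the `margin` default are assumptions made for this example.

```python
import math


def euclidean(u, v):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def contrastive_loss(embeddings, labels, margin=1.0):
    """Mean margin-based contrastive loss over all unordered pairs.

    Hypothetical helper for illustration; not part of pytorch-lifestream.
    """
    total, n_pairs = 0.0, 0
    for i in range(len(embeddings)):
        for j in range(i + 1, len(embeddings)):
            d = euclidean(embeddings[i], embeddings[j])
            if labels[i] == labels[j]:
                # Positive pair: pull the embeddings together.
                total += d ** 2
            else:
                # Negative pair: push apart until the margin is reached.
                total += max(0.0, margin - d) ** 2
            n_pairs += 1
    return total / n_pairs if n_pairs else 0.0


if __name__ == "__main__":
    emb = [[0.0, 0.0], [0.1, 0.0], [3.0, 4.0]]
    lab = ["client_a", "client_a", "client_b"]
    print(contrastive_loss(emb, lab))
```

In practice the same computation would be vectorized over a batch of tensors so that gradients flow back into the encoder.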

The following variants of the contrastive losses are supported:

Install from PyPI

pip install pytorch-lifestream

Install from source

# Ubuntu 20.04

sudo apt install python3.8 python3-venv
pip3 install pipenv

pipenv sync --dev # install packages exactly as specified in Pipfile.lock
pipenv shell
pytest

Demo notebooks

  • Supervised model training notebook
  • Self-supervised training and embeddings for downstream task notebook
  • Self-supervised training and embeddings for clients' transactions notebook
  • Self-supervised embeddings in CatBoost notebook
  • Self-supervised training and fine-tuning notebook
  • Self-supervised TrxEncoder only training with Masked Language Model task and fine-tuning notebook
  • Pandas data preprocessing options notebook
  • PySpark and Parquet for data preprocessing notebook
  • Fast inference on large dataset notebook
  • Supervised multilabel classification notebook
  • Text features demo:
    • Using a pretrained encoder for text features notebook

Docs

Documentation

Library description index

Experiments on public datasets

Experiments using pytorch-lifestream on several public event datasets are available in a separate repo.

PyTorch-LifeStream in ML competitions

How to contribute

  1. Make your changes via fork and pull request.
  2. Write unit tests for new code in ptls_tests.
  3. Check the unit tests via pytest: Example.

Citation

We have a paper you can cite:

@inproceedings{
   Babaev_2022, series={SIGMOD/PODS ’22},
   title={CoLES: Contrastive Learning for Event Sequences with Self-Supervision},
   url={http://dx.doi.org/10.1145/3514221.3526129},
   DOI={10.1145/3514221.3526129},
   booktitle={Proceedings of the 2022 International Conference on Management of Data},
   publisher={ACM},
   author={Babaev, Dmitrii and Ovsov, Nikita and Kireev, Ivan and Ivanova, Maria and Gusev, Gleb and Nazarov, Ivan and Tuzhilin, Alexander},
   year={2022},
   month=jun, collection={SIGMOD/PODS ’22}
}
