Standardized & reproducible data management for recommender systems.

🧩 DataRec: A Python Library for Standardized and Reproducible Data Management in Recommender Systems

Requires Python 3.9+.

DataRec focuses on the data management phase of recommender systems, promoting standardization, interoperability, and best practices for data filtering, splitting, analysis, and export.

Official repository of the paper:
📄 DataRec: A Python Library for Standardized and Reproducible Data Management in Recommender Systems (SIGIR 2025)



Features ✨

  • Dataset Management: multi-format I/O with dynamic schema specification.
  • Reference Datasets: curated, versioned, and traceable datasets.
  • Filtering Strategies: widely used user/item interaction filters.
  • Splitting Strategies: temporal and random splits for reproducible evaluation.
  • Data Characteristics: compute dataset-level statistics (e.g., sparsity, popularity).
  • Interoperability: export datasets to external recommendation frameworks.
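
To make the data characteristics above concrete, here is a minimal, library-independent sketch of two of them: sparsity and item popularity. It uses only the Python standard library and is not DataRec's API. The function names sparsity and item_popularity and the toy interaction list are assumptions made for illustration, and sparsity is taken here as 1 - |interactions| / (|users| × |items|).

```python
from collections import Counter


def sparsity(interactions):
    """Fraction of empty cells in the user-item matrix (hypothetical helper)."""
    pairs = set(interactions)  # count each (user, item) pair once
    if not pairs:
        raise ValueError("sparsity is undefined for an empty dataset")
    users = {user for user, _ in pairs}
    items = {item for _, item in pairs}
    return 1.0 - len(pairs) / (len(users) * len(items))


def item_popularity(interactions):
    """Number of distinct users that interacted with each item (hypothetical helper)."""
    return Counter(item for _, item in set(interactions))


# Toy data: 3 users, 3 items, 4 observed interactions.
toy = [("u1", "a"), ("u1", "b"), ("u2", "a"), ("u3", "c")]

print(round(sparsity(toy), 3))    # 0.556, i.e. 1 - 4/9
print(item_popularity(toy)["a"])  # 2
```

DataRec computes such statistics for you; the sketch only shows what the numbers mean.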

Installation

From PyPI

pip install datarec-lib

From source (recommended for development)

git clone https://github.com/sisinflab/DataRec.git
cd DataRec
python3.9 -m venv venv
source venv/bin/activate
pip install --upgrade pip
pip install -r requirements.txt
# editable mode + optional dependency groups (defined in pyproject.toml)
pip install -e '.[dev,docs]'

Quickstart 🚀

from datarec.datasets import AmazonOffice
from datarec.processing import FilterOutDuplicatedInteractions, UserItemIterativeKCore
from datarec.splitters import RandomHoldOut

# 1️⃣ Load a reference dataset
data = AmazonOffice(version='2014').prepare_and_load()

# 2️⃣ Apply preprocessing filters
data = FilterOutDuplicatedInteractions().run(data)
data = UserItemIterativeKCore(cores=5).run(data)

# 3️⃣ Split into train/validation/test
splitter = RandomHoldOut(test_ratio=0.2, val_ratio=0.1, seed=42)
splits = splitter.run(data)

train, val, test = splits['train'], splits['val'], splits['test']
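
For readers unfamiliar with the two steps above, the following standard-library sketch illustrates the general idea behind an iterative k-core filter and a seeded random hold-out split. It is not DataRec's implementation: the function names, the tuple-based data layout, and the choice to apply the ratios globally (rather than per user) are assumptions made for illustration.

```python
import random
from collections import Counter


def iterative_k_core(interactions, k):
    """Repeatedly drop users and items with fewer than k interactions
    until nothing changes (hypothetical helper)."""
    current = set(interactions)  # a set also discards duplicated pairs
    while True:
        user_counts = Counter(user for user, _ in current)
        item_counts = Counter(item for _, item in current)
        kept = {
            (user, item)
            for user, item in current
            if user_counts[user] >= k and item_counts[item] >= k
        }
        if kept == current:
            return kept
        current = kept


def random_hold_out(interactions, test_ratio, val_ratio, seed):
    """Shuffle with a fixed seed and cut into train/val/test (hypothetical helper)."""
    rows = sorted(interactions)  # fixed starting order, so the seed alone decides the split
    random.Random(seed).shuffle(rows)
    n_test = round(len(rows) * test_ratio)
    n_val = round(len(rows) * val_ratio)
    return {
        "test": rows[:n_test],
        "val": rows[n_test:n_test + n_val],
        "train": rows[n_test + n_val:],
    }


toy = [("u1", "a"), ("u1", "b"), ("u2", "a"), ("u2", "b"), ("u3", "a"), ("u3", "c")]

# Item c has one interaction, so (u3, c) is dropped first; u3 then falls
# below k and (u3, a) is dropped in the next pass.
core = iterative_k_core(toy, k=2)
splits = random_hold_out(core, test_ratio=0.25, val_ratio=0.25, seed=42)
print(len(core), len(splits["train"]), len(splits["val"]), len(splits["test"]))  # 4 2 1 1
```

The toy example shows why the filter has to iterate: removing a sparse item can push a user below the threshold, which a single pass would miss. Fixing the seed makes the split repeatable across runs.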

Pipeline paths

When using YAML pipelines, store only filenames in the steps and pass the base folders at runtime:

from datarec.pipeline import Pipeline

pipeline = Pipeline.from_yaml("create_pipeline.yml")
pipeline.apply(input_folder="./data", output_folder="./outputs")

In the YAML, file-loader steps take filename (instead of path) and export steps take filename (instead of output_path); the base folders passed to apply supply the rest of the location.
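
The convention above keeps pipeline files portable: a step names a file, and the folder is decided when the pipeline runs. The sketch below shows that idea with pathlib only. It does not reproduce DataRec's internals or its YAML schema; the step dictionary, the file name ratings.tsv, and the helper resolve_step_path are hypothetical.

```python
from pathlib import Path


def resolve_step_path(step, base_folder):
    """Join a step's bare filename with a base folder chosen at runtime
    (hypothetical helper, not part of DataRec)."""
    filename = Path(step["filename"])
    # Reject anything that carries its own directory, such as "sub/x.tsv" or "/tmp/x.tsv".
    if filename.name != str(filename):
        raise ValueError(f"expected a bare filename, got {step['filename']!r}")
    return Path(base_folder) / filename


# Hypothetical step as it might look after parsing a pipeline file.
load_step = {"filename": "ratings.tsv"}

# The same step resolves to different locations depending on the runtime folder.
print(resolve_step_path(load_step, "./data").as_posix())     # data/ratings.tsv
print(resolve_step_path(load_step, "/mnt/shared").as_posix())  # /mnt/shared/ratings.tsv
```

Because no folder is baked into the step, the same pipeline file can be shared between machines and pointed at different input and output directories.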


Datasets 📊

The complete and up-to-date list of datasets (with metadata and statistics) is available in the documentation:

👉 Datasets Section


Documentation 📚

Full documentation is available at: https://sisinflab.github.io/DataRec/
It includes the API reference, guides, tutorials, and a dataset overview.


Contributing 🤝

Contributions are welcome!
To contribute:

  1. Create a feature/fix branch.
  2. Add tests and documentation updates as needed.
  3. Run tests before pushing.
  4. Open a pull request describing your changes clearly.

The project also receives updates from a private development repository maintained by SisInfLab.


Citation 📖

If you use DataRec in your research, please cite our SIGIR 2025 paper:

@inproceedings{DBLP:conf/sigir/MancinoBF0MPN25,
  author       = {Alberto Carlo Maria Mancino and
                  Salvatore Bufi and
                  Angela Di Fazio and
                  Antonio Ferrara and
                  Daniele Malitesta and
                  Claudio Pomo and
                  Tommaso Di Noia},
  title        = {DataRec: {A} Python Library for Standardized and Reproducible Data
                  Management in Recommender Systems},
  booktitle    = {{SIGIR}},
  pages        = {3478--3487},
  publisher    = {{ACM}},
  year         = {2025}
}

Authors and Contributors 👥

Authors

  • Alberto Carlo Maria Mancino (Politecnico di Bari)
  • Salvatore Bufi
  • Angela Di Fazio
  • Daniele Malitesta
  • Antonio Ferrara
  • Claudio Pomo
  • Tommaso Di Noia

Contributors


  • Alberto C. M. Mancino
  • Angela Di Fazio
  • Salvatore Bufi
  • Giuseppe Fasano
  • Gianluca Colonna
  • Maria L. N. De Bonis
  • Marco Valentini


License 📜

Distributed under the MIT License.
See LICENSE.


Maintained with ❤️ by SisInfLab
