Standardized & reproducible data management for recommender systems.

🧩 DataRec: A Python Library for Standardized and Reproducible Data Management in Recommender Systems

DataRec focuses on the data management phase of recommender systems, promoting standardization, interoperability, and best practices for data filtering, splitting, analysis, and export.

Official repository of the paper:
📄 DataRec: A Python Library for Standardized and Reproducible Data Management in Recommender Systems (SIGIR 2025)

Features ✨

Dataset Management: multi-format I/O with dynamic schema specification.
Reference Datasets: curated, versioned, and traceable datasets.
Filtering Strategies: widely used user/item interaction filters.
Splitting Strategies: temporal and random splits for reproducible evaluation.
Data Characteristics: compute dataset-level statistics (e.g., sparsity, popularity).
Interoperability: export datasets to external recommendation frameworks.
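
To make the "Data Characteristics" item concrete, here is a minimal sketch of how sparsity and item popularity can be computed from a list of (user, item) pairs. It uses only the Python standard library and is not DataRec's API: the function name dataset_characteristics and the toy data are assumptions made for illustration.

```python
from collections import Counter


def dataset_characteristics(interactions):
    """Compute basic statistics from an iterable of (user_id, item_id) pairs.

    Hypothetical helper for illustration; DataRec exposes its own API.
    """
    pairs = set(interactions)  # count each user-item pair once
    users = {user for user, _ in pairs}
    items = {item for _, item in pairs}

    # Density is the share of observed cells in the user-item matrix.
    density = len(pairs) / (len(users) * len(items)) if pairs else 0.0

    # Popularity of an item is the number of distinct users who interacted with it.
    popularity = Counter(item for _, item in pairs)

    return {
        "n_users": len(users),
        "n_items": len(items),
        "n_interactions": len(pairs),
        "sparsity": 1.0 - density,
        "item_popularity": dict(popularity),
    }


toy = [("u1", "a"), ("u1", "b"), ("u2", "a"), ("u3", "c"), ("u3", "a")]
stats = dataset_characteristics(toy)
print(stats["sparsity"], stats["item_popularity"])
```

With 5 observed pairs over 3 users and 3 items, the density is 5/9, so the sparsity is 4/9.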

Installation

From PyPI

pip install datarec-lib

From source (recommended for development)

git clone https://github.com/sisinflab/DataRec.git
cd DataRec
python3.9 -m venv venv
source venv/bin/activate
pip install --upgrade pip
pip install -r requirements.txt
# editable mode + optional dependency groups (defined in pyproject.toml)
pip install -e '.[dev,docs]'
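
After either installation route, one way to confirm that the package is visible to the active interpreter is to query its distribution metadata. The sketch below relies only on the standard library module importlib.metadata (Python 3.8+); the helper name installed_version is an assumption made for illustration, while datarec-lib is the distribution name used in the pip command above.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_version(distribution="datarec-lib"):
    """Return the installed version string, or None if the distribution is absent."""
    try:
        return version(distribution)
    except PackageNotFoundError:
        return None


# Prints a version string when the package is installed, otherwise None.
print(installed_version())
```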

Quickstart 🚀

from datarec.datasets import AmazonOffice
from datarec.processing import FilterOutDuplicatedInteractions, UserItemIterativeKCore
from datarec.splitters import RandomHoldOut

# 1️⃣ Load a reference dataset
data = AmazonOffice(version='2014').prepare_and_load()

# 2️⃣ Apply preprocessing filters
data = FilterOutDuplicatedInteractions().run(data)
data = UserItemIterativeKCore(cores=5).run(data)

# 3️⃣ Split into train/validation/test
splitter = RandomHoldOut(test_ratio=0.2, val_ratio=0.1, seed=42)
splits = splitter.run(data)

train, val, test = splits['train'], splits['val'], splits['test']
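
For readers unfamiliar with the two techniques used in the quickstart, the following sketch shows the general idea behind iterative k-core filtering and a seeded random hold-out split. It is plain Python, not DataRec's implementation: the function names, the rounding rule for split sizes, and the toy data are assumptions, and DataRec's own classes may handle details differently.

```python
import random
from collections import Counter


def iterative_k_core(pairs, k):
    """Drop users and items with fewer than k interactions until nothing changes."""
    pairs = set(pairs)
    while True:
        user_counts = Counter(user for user, _ in pairs)
        item_counts = Counter(item for _, item in pairs)
        kept = {
            (user, item)
            for user, item in pairs
            if user_counts[user] >= k and item_counts[item] >= k
        }
        if kept == pairs:  # fixed point reached
            return kept
        pairs = kept


def random_hold_out(pairs, test_ratio, val_ratio, seed):
    """Shuffle with a fixed seed, then slice into test, validation, and train."""
    rows = sorted(pairs)  # fixed order so the seed alone determines the result
    random.Random(seed).shuffle(rows)
    n_test = round(len(rows) * test_ratio)
    n_val = round(len(rows) * val_ratio)
    return {
        "test": rows[:n_test],
        "val": rows[n_test:n_test + n_val],
        "train": rows[n_test + n_val:],
    }


# 5 users x 4 items, fully connected, plus one weakly connected user.
toy = [(f"u{u}", f"i{i}") for u in range(5) for i in range(4)]
toy += [("u9", "i0"), ("u9", "i9")]

core = iterative_k_core(toy, k=2)
splits = random_hold_out(core, test_ratio=0.2, val_ratio=0.1, seed=42)
print(len(core), {name: len(rows) for name, rows in splits.items()})
```

The toy data shows why the filter has to iterate: item i9 is removed in the first pass, which leaves user u9 with a single interaction, so u9 is removed in the second pass. Fixing the seed makes the split repeatable across runs.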

Pipeline paths

When using YAML pipelines, store only filenames in the steps and pass the base folders at runtime:

from datarec.pipeline import Pipeline

pipeline = Pipeline.from_yaml("create_pipeline.yml")
pipeline.apply(input_folder="./data", output_folder="./outputs")

In the YAML, file-loader steps take filename (instead of path) and export steps take filename (instead of output_path). Each filename is resolved against the input or output folder passed to apply.
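
The convention above can be pictured as joining a bare filename with a base folder supplied at runtime. The sketch below illustrates that idea with pathlib; it is not DataRec's code, and the helper resolve_step_path, its validation rule, and the example filename are assumptions made for illustration.

```python
from pathlib import Path


def resolve_step_path(base_folder, filename):
    """Join a bare filename from a pipeline step with a runtime base folder.

    Hypothetical helper: it rejects anything that is not a plain filename,
    so the pipeline file stays independent of the machine it runs on.
    """
    name = Path(filename)
    if name.is_absolute() or len(name.parts) != 1:
        raise ValueError(f"expected a bare filename, got {filename!r}")
    return Path(base_folder) / name


print(resolve_step_path("./data", "ratings.tsv"))
```

Keeping folders out of the YAML means the same pipeline file can be reused on another machine by changing only the arguments passed at runtime.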

Datasets 📊

The complete and up-to-date list of datasets (with metadata and statistics) is available in the documentation:

👉 Datasets Section

Documentation 📚

The full documentation is available at: https://sisinflab.github.io/DataRec/
It includes the API reference, guides, tutorials, and a dataset overview.

Contributing 🤝

Contributions are welcome!
To contribute:

Create a feature/fix branch.
Add tests and documentation updates as needed.
Run tests before pushing.
Open a pull request describing your changes clearly.

The project also receives updates from a private development repository maintained by SisInfLab.

Citation 📖

If you use DataRec in your research, please cite our SIGIR 2025 paper:

@inproceedings{DBLP:conf/sigir/MancinoBF0MPN25,
  author       = {Alberto Carlo Maria Mancino and
                  Salvatore Bufi and
                  Angela Di Fazio and
                  Antonio Ferrara and
                  Daniele Malitesta and
                  Claudio Pomo and
                  Tommaso Di Noia},
  title        = {DataRec: {A} Python Library for Standardized and Reproducible Data
                  Management in Recommender Systems},
  booktitle    = {{SIGIR}},
  pages        = {3478--3487},
  publisher    = {{ACM}},
  year         = {2025}
}

Authors and Contributors 👥

Authors

Alberto Carlo Maria Mancino (Politecnico di Bari)
Salvatore Bufi
Angela Di Fazio
Daniele Malitesta
Antonio Ferrara
Claudio Pomo
Tommaso Di Noia

Contributors


Alberto C. M. Mancino
Angela Di Fazio
Salvatore Bufi
Giuseppe Fasano
Gianluca Colonna
Maria L. N. De Bonis
Marco Valentini

Related Projects 🧩

Ducho — library for multimodal representation learning: https://github.com/sisinflab/Ducho
D&D4Rec Tutorial (RecSys 2025) — Standard Practices for Data Processing and Multimodal Feature Extraction in Recommendation with DataRec and Ducho:
https://sites.google.com/view/dd4rec-tutorial/home

License 📜

Distributed under the MIT License.
See LICENSE.

Maintained with ❤️ by SisInfLab



