:bookmark_tabs: Post-It
A robust, extensible Python data tagging framework for dynamic processing and intelligent filtering of pretraining corpora for AI models.
Getting Started
Install from PyPI:
pip install postit
To learn more about using Post-It, please visit the documentation.
Why Data Tagging?
Data is the backbone of machine learning. With so many companies developing ML models, processing and filtering data into high-quality datasets is essential.
The popularity of continued pretraining (performing pretraining on existing LLMs for domain-adaptation) makes tools like Post-It increasingly important.
In addition, tagging data instead of filtering it directly provides flexibility: it is easy to test how removing different types of data affects the final pretraining corpus, which enables quick iteration.
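The difference between tagging and filtering can be sketched in plain Python. This is only an illustration of the idea, not Post-It's API: the `Document` class, the `tag_short` and `tag_shouting` taggers, and the `select` helper are all hypothetical names made up for this example.

```python
from dataclasses import dataclass, field


@dataclass
class Document:
    """Hypothetical record: raw text plus the tags attached to it."""
    text: str
    tags: set[str] = field(default_factory=set)


def tag_short(doc: Document, min_words: int = 5) -> None:
    """Tag documents that contain fewer than `min_words` words."""
    if len(doc.text.split()) < min_words:
        doc.tags.add("short")


def tag_shouting(doc: Document, threshold: float = 0.5) -> None:
    """Tag documents whose letters are mostly uppercase."""
    letters = [c for c in doc.text if c.isalpha()]
    if letters and sum(c.isupper() for c in letters) / len(letters) > threshold:
        doc.tags.add("shouting")


def select(docs: list[Document], exclude: set[str]) -> list[Document]:
    """Build a corpus view that drops documents carrying any excluded tag."""
    return [d for d in docs if not (d.tags & exclude)]


corpus = [
    Document("The quick brown fox jumps over the lazy dog."),
    Document("BUY NOW!!! LIMITED OFFER, CLICK HERE TODAY"),
    Document("Too short."),
]

# Tag once; nothing is deleted from the corpus.
for doc in corpus:
    tag_short(doc)
    tag_shouting(doc)

# Try different filtering policies without re-processing the data.
no_short = select(corpus, {"short"})
no_noise = select(corpus, {"short", "shouting"})
```

Because the taggers only annotate, each filtering policy is a cheap selection over the same tagged corpus, so comparing candidate policies does not require running the taggers again.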
Why Post-It?
- Extensible: Designed for easy adaptation into any number of data processing workflows.
- Fast: Built-in parallelization for processing large datasets.
- Flexible: Supports local and remote cloud storage.
- Capable: Packaged with a variety of popular taggers, ready to use out of the box.
Contributing
- Clone this repo
- Install Poetry
- Activate Poetry:
poetry shell
- Install dependencies:
poetry install
Source Distribution
File details
Details for the file postit-0.0.3.tar.gz.
File metadata
- Download URL: postit-0.0.3.tar.gz
- Size: 22.3 kB
- Tags: Source
- Uploaded using Trusted Publishing? Yes
- Uploaded via: twine/5.1.0 CPython/3.12.5
File hashes
Algorithm | Hash digest
---|---
SHA256 | 45168d5182674a619f7586a7385bc57569a3decfd00ff93a34cb15a794a783f1
MD5 | 626ec06dab5dd6e168585d0b58c9861b
BLAKE2b-256 | 1fb7f00c7ccacb6630d6c5ab5f518e80f9d848b88c30299bf2199643d502d318
Built Distribution
File details
Details for the file postit-0.0.3-py3-none-any.whl.
File metadata
- Download URL: postit-0.0.3-py3-none-any.whl
- Size: 28.2 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? Yes
- Uploaded via: twine/5.1.0 CPython/3.12.5
File hashes
Algorithm | Hash digest
---|---
SHA256 | 89db17e69c5b709b3e955d2f9d254b3c1826af7af9dabb302cf803f9d57f8958
MD5 | 40cbbeaac96239ed5cfd31eb8e9b1f2c
BLAKE2b-256 | e1b80b3aa3647ce0145eb76c49351d9e57a2dd1cfe521d3c3151d419a716bb9f