A robust, extensible Python data tagging framework for dynamic processing and intelligent filtering of pretraining corpora for AI models.

Post-It

Getting Started

Install from PyPI:

pip install postit

To learn more about using Post-It, please visit the documentation.

Why Data Tagging?

Data is the backbone of machine learning. With so many companies developing ML models, processing and filtering data into high-quality datasets is critical.

The popularity of continued pretraining (further pretraining an existing LLM for domain adaptation) makes tools like Post-It increasingly important.

In addition, tagging data instead of filtering it directly provides flexibility. Because the original data stays intact, it is easy to test how removing different types of data affects the final pretraining corpus, which enables quick iteration.
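The idea of tagging first and filtering later can be illustrated with a minimal, library-independent sketch. It does not use Post-It's API; the `Document` class, the two tagger functions, and the `select` helper are hypothetical names introduced only for this illustration. Each document keeps its text and accumulates numeric tags, and different filtering thresholds are then applied to the same tagged corpus without re-processing it.

```python
from dataclasses import dataclass, field


@dataclass
class Document:
    """Hypothetical container: raw text plus the tags attached to it."""
    text: str
    tags: dict[str, float] = field(default_factory=dict)


def tag_num_words(doc: Document) -> None:
    # Record the whitespace-delimited word count as a tag.
    doc.tags["num_words"] = float(len(doc.text.split()))


def tag_upper_ratio(doc: Document) -> None:
    # Record the share of alphabetic characters that are uppercase.
    letters = [c for c in doc.text if c.isalpha()]
    upper = sum(c.isupper() for c in letters)
    doc.tags["upper_ratio"] = upper / len(letters) if letters else 0.0


def select(docs: list[Document], min_words: int, max_upper_ratio: float) -> list[Document]:
    # Filtering reads the tags only; the corpus itself is never modified.
    return [
        d for d in docs
        if d.tags["num_words"] >= min_words
        and d.tags["upper_ratio"] <= max_upper_ratio
    ]


corpus = [
    Document("A short note."),
    Document("BUY NOW!!! LIMITED OFFER!!!"),
    Document("Tagging keeps every document and records scores for later filtering."),
]

# Tag once.
for doc in corpus:
    tag_num_words(doc)
    tag_upper_ratio(doc)

# Try several filtering policies against the same tags.
strict = select(corpus, min_words=5, max_upper_ratio=0.3)
lenient = select(corpus, min_words=2, max_upper_ratio=1.0)

print(len(corpus), len(strict), len(lenient))
```

Because the tags are stored alongside the data, changing a threshold only requires re-running `select`, not re-reading or re-scoring the corpus.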

Why Post-It?

  • Extensible: Designed for easy adaptation into any number of data processing workflows.
  • Fast: Built-in parallelization to process large datasets.
  • Flexible: Supports local and remote cloud storage.
  • Capable: Packaged with a variety of popular taggers, ready to use out of the box.

Contributing

  • Clone this repo
  • Install Poetry
  • Activate Poetry: poetry shell
  • Install dependencies: poetry install
