This library provides a set of tools for data augmentation, including text generation, data processing, and integration with multiple AI providers.

Data Augmenter

Data Augmenter was created to take advantage of foundation models by generating new data from a small sample. It lets you increase the size of a dataset while adding variability to the data. It can also extract structured, fine-tuning-ready datasets from unstructured information.

Installation

Using a conda environment to manage and install dependencies is recommended. If you prefer not to use conda, skip directly to step 3.

1. Create an environment. You can create a new environment with the conda create command. Replace myenv with your desired environment name and specify the Python version if needed.
```
conda create --name myenv python=3.11
```
2. Activate the environment. After creating the environment, activate it with the following command:
```
conda activate myenv
```
You should now be working in the activated environment.
3. Install the package. Install it from PyPI directly using pip:
```
pip install python-data-augmenter
```
At this point Data Augmenter is ready to use.
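
To confirm that the installation step worked, the installed version can be queried from Python. This is a minimal sketch using only the standard library; the helper name `installed_version` is hypothetical and is not part of Data Augmenter.

```python
from importlib import metadata


def installed_version(distribution: str):
    """Return the installed version of a distribution, or None if it is absent."""
    try:
        return metadata.version(distribution)
    except metadata.PackageNotFoundError:
        # The distribution is not installed in the active environment.
        return None


if __name__ == "__main__":
    version = installed_version("python-data-augmenter")
    print(version or "python-data-augmenter is not installed in this environment")
```

If the script reports that the package is missing, check that the intended conda environment is active before running pip again.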

Modules

This library consists of two modules, augmentation and document_chunker.

Document Chunker

This module contains the DocumentChunker class, a utility that loads supported file types (Markdown, txt, PDF and JSONL), splits them into chunks, and stores the chunks in a DataFrame.

Usage

Initialize the DocumentChunker:
```
from document_chunker import DocumentChunker

# chunk_size, chunk_overlap and separator are placeholders: set them to your own values
chunker = DocumentChunker(chunk_size, chunk_overlap, separator)
```
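
To give an intuition for what a chunk size and a chunk overlap usually control, here is a minimal character-based sketch. It is an illustration only and not DocumentChunker's implementation: the function `chunk_text` is hypothetical, it measures sizes in characters (an assumption), and it ignores the separator entirely.

```python
def chunk_text(text: str, chunk_size: int, chunk_overlap: int) -> list[str]:
    """Split text into windows of chunk_size characters that share chunk_overlap characters."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be in the range [0, chunk_size)")

    step = chunk_size - chunk_overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the window already reached the end of the text
    return chunks


if __name__ == "__main__":
    for chunk in chunk_text("abcdefghij", chunk_size=4, chunk_overlap=2):
        print(chunk)
```

A larger overlap repeats more context between neighbouring chunks at the cost of producing more chunks for the same text.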

Process a file:
```
file_path = "path/to/your/file.txt"  # Can be .txt, .md, .pdf or .jsonl
dataset = chunker.process_file(file_path)
```

The output will be an augmentation-ready DataFrame. If you prefer to prepare your own dataset for augmentation, it should be a pandas DataFrame with a column named "document":
```
import pandas as pd

# Each string is one chunk of source text
docs = [
    "It was the best of times, it was the worst of times, it was the age of wisdom, it was the age of foolishness, it was the epoch of belief, it was the epoch of incredulity.",
    "Call me Ishmael. Some years ago—never mind how long precisely—having little or no money in my purse, and nothing particular to interest me on shore, I thought I would sail about a little and see the watery part of the world.",
    "All human beings should try to learn before they die what they are running from, and to, and why.",
    "It is a truth universally acknowledged, that a single man in possession of a good fortune, must be in want of a wife.",
    "To be, or not to be, that is the question: Whether 'tis nobler in the mind to suffer the slings and arrows of outrageous fortune, or to take arms against a sea of troubles and by opposing end them."
]
dataset = pd.DataFrame({"document": docs})
```

Augmentation

This module consists of two main types of classes: Augmenters and Datasets. Augmenters interface with Large Language Models (LLMs) through specified endpoints, providing the functionality to generate new data based on input documents. Datasets handle the dataset structure and offer methods for augmenting, filtering, and storing query-answer pairs relevant to the provided document.

The input dataset should be in the form of a DataFrame with a single column named "document" that contains chunks of your source document. The output will be a .jsonl file, where each entry includes a generated question-answer pair along with the corresponding document chunk. If filtering is applied, each entry will also include the cosine similarity score between the QA pair and its source chunk.
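
Because the output is a .jsonl file (one JSON object per line), it can be inspected with the standard library alone. The sketch below is a generic JSON Lines reader; the helper `load_jsonl` is hypothetical, and no field names are assumed since the exact keys written by the library are not documented here.

```python
import json
from pathlib import Path


def load_jsonl(path):
    """Read a JSON Lines file and return its entries as a list of Python objects."""
    records = []
    with Path(path).open(encoding="utf-8") as handle:
        for line in handle:
            line = line.strip()
            if line:  # skip blank lines
                records.append(json.loads(line))
    return records


if __name__ == "__main__":
    output_file = Path("augmented_dataset.jsonl")  # default output name mentioned in this README
    if output_file.exists():
        entries = load_jsonl(output_file)
        print(f"{len(entries)} entries")
        if entries:
            print("fields in the first entry:", sorted(entries[0]))
```

Printing the fields of the first entry is a quick way to discover how the question, answer, chunk and similarity score are actually named.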

Usage

The following usage example assumes an Ollama server exposed at localhost:11434 with the tinyllama 1.1b model.

Initialize OllamaAugmenter:
```
from augmenter import OllamaAugmenter
augmenter = OllamaAugmenter("http://localhost:11434/api/generate", model='tinyllama:1.1b')
```
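
Before running an augmentation it can help to check that the endpoint answers at all. The following sketch sends a single non-streaming request to Ollama's /api/generate endpoint using only the standard library. It does not show how the augmenter talks to the server internally; the helpers `build_generate_request` and `generate` are hypothetical names introduced for this illustration.

```python
import json
import urllib.request


def build_generate_request(url: str, model: str, prompt: str) -> urllib.request.Request:
    """Build a POST request for a single, non-streaming completion."""
    payload = {"model": model, "prompt": prompt, "stream": False}
    return urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def generate(url: str, model: str, prompt: str, timeout: float = 60.0) -> str:
    """Send the request and return the generated text from the JSON reply."""
    request = build_generate_request(url, model, prompt)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        body = json.loads(response.read().decode("utf-8"))
    return body["response"]


if __name__ == "__main__":
    # Requires a running Ollama server with the model already pulled.
    print(generate("http://localhost:11434/api/generate", "tinyllama:1.1b", "Say hello."))
```

If this call fails with a connection error, the augmenter will not be able to reach the model either.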

Initialize DatasetAugmenter:
```
from augmenter import DatasetAugmenter
dataset_augmenter = DatasetAugmenter(augmenter=augmenter, dataset=dataset)
```
Generate the question and answer pairs with dataset_augmenter. After the process is finished, the dataset will be saved in the 'augmented_dataset.jsonl' file by default.

Optionally, filter the augmented dataset:
```
dataset_augmenter.filter_dataset(cosine_similarity_threshold=0.45, cross_cosine_similarity_threshold=0.85)
```

This will automatically process the embeddings and filter the dataset based on the set thresholds. Alternatively it can be done manually:

```
dataset_augmenter.get_embeddings()
dataset_augmenter.get_cosine_similarity()
dataset_augmenter.get_cross_cosine_similarity()
dataset_augmenter.filter_dataset(cosine_similarity_threshold=0.45, cross_cosine_similarity_threshold=0.85)
```
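
To make the two thresholds more concrete, here is a small pure-Python sketch of threshold-based filtering on embeddings. It rests on assumptions that are not confirmed by this README: that the first threshold is a minimum similarity between a QA pair and its source chunk, and that the cross threshold is a maximum similarity between QA pairs, used to drop near-duplicates. The function names and the keys `qa_embedding` and `chunk_embedding` are hypothetical.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def filter_entries(entries, min_similarity, max_cross_similarity):
    """Keep entries that match their chunk well and are not near-duplicates of kept entries."""
    kept = []
    for entry in entries:
        # Assumed meaning of the first threshold: QA pair vs. its source chunk.
        if cosine_similarity(entry["qa_embedding"], entry["chunk_embedding"]) < min_similarity:
            continue
        # Assumed meaning of the cross threshold: QA pair vs. already kept QA pairs.
        if any(
            cosine_similarity(entry["qa_embedding"], other["qa_embedding"]) > max_cross_similarity
            for other in kept
        ):
            continue
        kept.append(entry)
    return kept


if __name__ == "__main__":
    sample = [
        {"id": 1, "qa_embedding": [1.0, 0.0], "chunk_embedding": [1.0, 0.0]},
        {"id": 2, "qa_embedding": [1.0, 0.01], "chunk_embedding": [1.0, 0.0]},
        {"id": 3, "qa_embedding": [0.0, 1.0], "chunk_embedding": [1.0, 0.0]},
    ]
    print([e["id"] for e in filter_entries(sample, 0.45, 0.85)])
```

Under this reading, raising the first threshold keeps only pairs closely tied to their chunk, while lowering the cross threshold removes more pairs that resemble each other.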

Release history

0.0.4 (this version): Jun 13, 2025

0.0.3: Jun 13, 2025

0.0.2: Jun 13, 2025

0.0.1: Jun 12, 2025
