A Python package for preprocessing and augmenting data for large language models using quantum neural networks.

Project description

# QULLM

QULLM is a Python package for preprocessing and augmenting data for large language models using quantum neural networks.

## Features

- **Data Loading**: Load data from various file formats.
- **Data Preprocessing**: Clean, normalize, and filter text data.
- **Data Augmentation**: Enhance data with techniques like synonym replacement, random insertion, deletion, and swapping.
- **Data Analysis**: Analyze data to get word frequencies, unique words, and other statistics.

## Installation

You can install the package using pip:

```bash
pip install qullm
```

## Usage

Here is an example of how to use the qullm package:

```python
from qullm.llm_data_processor import DataLoader, DataPreprocessor, DataAugmenter, DataAnalyzer, save_data

# Specify the path to your data file
data_file_path = 'path/to/your/data.txt'
processed_data_file_path = 'path/to/save/processed_data.txt'

# Step 1: Load data
data_loader = DataLoader(data_file_path)
data = data_loader.load_data()
print("Original Data:")
print(data[:5])  # Print first 5 lines for inspection

# Step 2: Preprocess data
preprocessor = DataPreprocessor(data)
cleaned_data = preprocessor.clean_data()
normalized_data = preprocessor.normalize_data()
filtered_data = preprocessor.filter_data()

print("Cleaned Data:")
print(cleaned_data[:5])  # Print first 5 lines for inspection

print("Normalized Data:")
print(normalized_data[:5])  # Print first 5 lines for inspection

print("Filtered Data:")
print(filtered_data[:5])  # Print first 5 lines for inspection

# Step 3: Augment data
augmenter = DataAugmenter(filtered_data)
augmented_data = augmenter.augment_data()
print("Augmented Data:")
print(augmented_data[:5])  # Print first 5 lines for inspection

# Step 4: Analyze data
analyzer = DataAnalyzer(augmented_data)
word_freq = analyzer.get_word_frequency()
stats = analyzer.get_data_statistics()

print("Word Frequency:")
print(word_freq.most_common(10))  # Print top 10 most common words

print("Data Statistics:")
print(stats)

# Step 5: Save processed data
save_data(augmented_data, processed_data_file_path)
print(f"Processed data saved to {processed_data_file_path}")
```

## Modules

### DataLoader

A module for loading data from files.

#### Methods

- `load_data()`: Load data from a file and return it as a list of strings.
- `load_data_as_string()`: Load data from a file and return it as a single string.
- `load_data_as_lines()`: Load data from a file and return it as a list of lines.
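The package's reading behavior isn't documented beyond the method names, but a minimal sketch of what a line-based loader like `load_data_as_lines()` plausibly does (a hypothetical stand-in, not the package's actual implementation) might look like:

```python
from pathlib import Path

def load_lines(file_path):
    """Read a UTF-8 text file and return its non-empty lines as a list.

    Hypothetical illustration of a line-based loader; qullm's own
    DataLoader may handle encodings and blank lines differently.
    """
    text = Path(file_path).read_text(encoding="utf-8")
    return [line for line in text.splitlines() if line.strip()]
```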

### DataPreprocessor

A module for preprocessing text data.

#### Methods

- `clean_data()`: Clean the data by removing extra whitespace and newlines.
- `normalize_data()`: Normalize the data by converting text to lowercase.
- `filter_data(min_length)`: Filter out lines with fewer than `min_length` words.
- `remove_special_characters()`: Remove special characters from the data.
- `remove_stopwords(stopwords)`: Remove stopwords from the data.
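To make the clean/normalize/filter steps concrete, here is a self-contained sketch of the same pipeline in plain Python. The function names and the `min_length` default are illustrative assumptions, not qullm's implementation:

```python
import re

def clean(lines):
    """Collapse runs of whitespace and strip each line."""
    return [re.sub(r"\s+", " ", line).strip() for line in lines]

def normalize(lines):
    """Lowercase every line."""
    return [line.lower() for line in lines]

def filter_short(lines, min_length=3):
    """Keep only lines with at least min_length words."""
    return [line for line in lines if len(line.split()) >= min_length]

# The steps compose in the same order as the usage example above
lines = ["  Hello   WORLD from QULLM  ", "Too short"]
processed = filter_short(normalize(clean(lines)))
```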

### DataAugmenter

A module for augmenting text data.

#### Methods

- `augment_data(augment_factor)`: Augment the data by duplicating and shuffling.
- `synonym_replacement(word_map)`: Replace words with their synonyms.
- `random_insertion(insertion_words, insert_prob)`: Randomly insert words into the data.
- `random_deletion(delete_prob)`: Randomly delete words from the data.
- `random_swap(swap_prob)`: Randomly swap words in the data.
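Random deletion and random swap are standard text-augmentation moves; the sketch below shows one common way to implement them. The signatures (e.g. the `rng` parameter and keep-at-least-one-word rule) are assumptions for illustration, not qullm's actual API:

```python
import random

def random_deletion(words, delete_prob=0.1, rng=None):
    """Drop each word with probability delete_prob; keep at least one word."""
    rng = rng or random.Random()
    kept = [w for w in words if rng.random() >= delete_prob]
    return kept or [rng.choice(words)]

def random_swap(words, n_swaps=1, rng=None):
    """Swap two randomly chosen positions n_swaps times."""
    rng = rng or random.Random()
    words = list(words)
    for _ in range(n_swaps):
        i, j = rng.randrange(len(words)), rng.randrange(len(words))
        words[i], words[j] = words[j], words[i]
    return words
```

Passing an explicit seeded `random.Random` makes augmentation reproducible, which matters when you want to regenerate the same augmented corpus later.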

### DataAnalyzer

A module for analyzing text data.

#### Methods

- `get_word_frequency()`: Get the frequency of words in the data.
- `get_data_statistics()`: Get statistics such as the number of lines and words.
- `get_unique_words()`: Get unique words in the data.
- `get_average_line_length()`: Get the average length of lines in the data.
- `get_most_common_words(n)`: Get the top n most common words in the data.
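Word-frequency analysis of this kind is usually a thin wrapper over `collections.Counter` (the usage example's `word_freq.most_common(10)` suggests qullm returns a `Counter` too). A self-contained sketch, with the statistics keys chosen for illustration:

```python
from collections import Counter

def word_frequency(lines):
    """Count word occurrences across all lines (whitespace tokenization)."""
    return Counter(word for line in lines for word in line.split())

def data_statistics(lines):
    """Basic corpus statistics: line count, total words, unique words."""
    freq = word_frequency(lines)
    return {
        "lines": len(lines),
        "words": sum(freq.values()),
        "unique_words": len(freq),
    }
```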

### Utils

Utility functions for saving data and loading configurations.

#### Methods

- `save_data(data, file_path)`: Save data to a specified file path.
- `load_config(config_path)`: Load configuration from a JSON file.
- `save_config(config, config_path)`: Save configuration to a JSON file.
- `split_data(data, train_ratio)`: Split data into training and testing sets.
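A ratio-based train/test split is typically a one-liner over list slicing. The sketch below assumes a simple unshuffled split; whether qullm shuffles before splitting is not documented:

```python
def split_data(data, train_ratio=0.8):
    """Split a list into (train, test) portions by ratio, without shuffling."""
    cut = int(len(data) * train_ratio)
    return data[:cut], data[cut:]
```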

## License

This project is licensed under the MIT License - see the LICENSE file for details.

## Acknowledgements

Thanks to the contributors and the open-source community for their support and contributions.


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

qullm-0.3.0.tar.gz (9.9 kB)

Uploaded Source

Built Distribution

qullm-0.3.0-py3-none-any.whl (11.9 kB)

Uploaded Python 3

File details

Details for the file qullm-0.3.0.tar.gz.

File metadata

  • Download URL: qullm-0.3.0.tar.gz
  • Upload date:
  • Size: 9.9 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/5.1.0 CPython/3.12.0

File hashes

Hashes for qullm-0.3.0.tar.gz:

  • SHA256: 48b67e1e2da702777d0bfac2e08fc0c48061cf2149662b1a901c7d1e441cab47
  • MD5: b1982b024ab0d2d166ff8ac7fc43f07e
  • BLAKE2b-256: 5e73204232dd332c2561491f1939a7ad46755762fab5be02b71aaa6137d9a82b

See more details on using hashes here.

File details

Details for the file qullm-0.3.0-py3-none-any.whl.

File metadata

  • Download URL: qullm-0.3.0-py3-none-any.whl
  • Upload date:
  • Size: 11.9 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/5.1.0 CPython/3.12.0

File hashes

Hashes for qullm-0.3.0-py3-none-any.whl:

  • SHA256: 06b001da877da22efa2bcb5aa458f0afb7cabc36af983a713596ace6333a1388
  • MD5: c92f4ea7162eadc88f91eee32e319280
  • BLAKE2b-256: a9f40236b8874bb5cd80abc1cc3001fde3529083361d5be92089be0d8c0b61a9

See more details on using hashes here.