A Python package for preprocessing and augmenting data for large language models by quantum Neural networks.
Project description
Of course, here is the detailed example of the README.md
file, which describes the functionality, installation method, and usage examples of the qullm
package.
# QULLM
QUllm is a Python package for preprocessing and augmenting data for large language models using quantum neural networks.
## Features
- **Data Loading**: Load data from various file formats.
- **Data Preprocessing**: Clean, normalize, and filter text data.
- **Data Augmentation**: Enhance data with techniques like synonym replacement, random insertion, deletion, and swapping.
- **Data Analysis**: Analyze data to get word frequencies, unique words, and other statistics.
## Installation
You can install the package using pip:
```bash
pip install qullm
Usage
Here is an example of how to use the qullm
package:
from qullm.llm_data_processor import DataLoader, DataPreprocessor, DataAugmenter, DataAnalyzer, save_data
# Specify the path to your data file
data_file_path = 'path/to/your/data.txt'
processed_data_file_path = 'path/to/save/processed_data.txt'
# Step 1: Load data
data_loader = DataLoader(data_file_path)
data = data_loader.load_data()
print("Original Data:")
print(data[:5]) # Print first 5 lines for inspection
# Step 2: Preprocess data
preprocessor = DataPreprocessor(data)
cleaned_data = preprocessor.clean_data()
normalized_data = preprocessor.normalize_data()
filtered_data = preprocessor.filter_data()
print("Cleaned Data:")
print(cleaned_data[:5]) # Print first 5 lines for inspection
print("Normalized Data:")
print(normalized_data[:5]) # Print first 5 lines for inspection
print("Filtered Data:")
print(filtered_data[:5]) # Print first 5 lines for inspection
# Step 3: Augment data
augmenter = DataAugmenter(filtered_data)
augmented_data = augmenter.augment_data()
print("Augmented Data:")
print(augmented_data[:5]) # Print first 5 lines for inspection
# Step 4: Analyze data
analyzer = DataAnalyzer(augmented_data)
word_freq = analyzer.get_word_frequency()
stats = analyzer.get_data_statistics()
print("Word Frequency:")
print(word_freq.most_common(10)) # Print top 10 most common words
print("Data Statistics:")
print(stats)
# Step 5: Save processed data
save_data(augmented_data, processed_data_file_path)
print(f"Processed data saved to {processed_data_file_path}")
Modules
DataLoader
A module for loading data from files.
Methods
load_data()
: Load data from a file and return as a list of strings.load_data_as_string()
: Load data from a file and return as a single string.load_data_as_lines()
: Load data from a file and return as a list of lines.
DataPreprocessor
A module for preprocessing text data.
Methods
clean_data()
: Clean the data by removing extra whitespace and newlines.normalize_data()
: Normalize the data by converting text to lowercase.filter_data(min_length)
: Filter out lines with fewer words thanmin_length
.remove_special_characters()
: Remove special characters from the data.remove_stopwords(stopwords)
: Remove stopwords from the data.
DataAugmenter
A module for augmenting text data.
Methods
augment_data(augment_factor)
: Augment the data by duplicating and shuffling.synonym_replacement(word_map)
: Replace words with their synonyms.random_insertion(insertion_words, insert_prob)
: Randomly insert words into the data.random_deletion(delete_prob)
: Randomly delete words from the data.random_swap(swap_prob)
: Randomly swap words in the data.
DataAnalyzer
A module for analyzing text data.
Methods
get_word_frequency()
: Get the frequency of words in the data.get_data_statistics()
: Get statistics such as the number of lines and words.get_unique_words()
: Get unique words in the data.get_average_line_length()
: Get the average length of lines in the data.get_most_common_words(n)
: Get the topn
most common words in the data.
Utils
Utility functions for saving data and loading configurations.
Methods
save_data(data, file_path)
: Save data to a specified file path.load_config(config_path)
: Load configuration from a JSON file.save_config(config, config_path)
: Save configuration to a JSON file.split_data(data, train_ratio)
: Split data into training and testing sets.
License
This project is licensed under the MIT License - see the LICENSE file for details.
Acknowledgements
Thanks to the contributors and the open-source community for their support and contributions.
### Explanation
1. **Features**:
- Briefly introduces the main functionalities of the package, including data loading, preprocessing, augmentation, and analysis.
2. **Installation**:
- Provides the command to install the package.
3. **Usage**:
- Offers a complete usage example demonstrating how to load, preprocess, augment, analyze, and save data.
4. **Modules**:
- Detailed introduction of each module and its methods, including `DataLoader`, `DataPreprocessor`, `DataAugmenter`, `DataAnalyzer`, and `Utils`.
5. **License**:
- Provides information about the project's license.
6. **Acknowledgements**:
- Thanks the contributors and the open-source community for their support and contributions.
In this way, users can easily understand the functionality and usage of the `qullm` package to better preprocess and augment large language data.
Project details
Release history Release notifications | RSS feed
Download files
Download the file for your platform. If you're not sure which to choose, learn more about installing packages.
Source Distribution
Built Distribution
File details
Details for the file qullm-0.3.0.tar.gz
.
File metadata
- Download URL: qullm-0.3.0.tar.gz
- Upload date:
- Size: 9.9 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/5.1.0 CPython/3.12.0
File hashes
Algorithm | Hash digest | |
---|---|---|
SHA256 | 48b67e1e2da702777d0bfac2e08fc0c48061cf2149662b1a901c7d1e441cab47 |
|
MD5 | b1982b024ab0d2d166ff8ac7fc43f07e |
|
BLAKE2b-256 | 5e73204232dd332c2561491f1939a7ad46755762fab5be02b71aaa6137d9a82b |
File details
Details for the file qullm-0.3.0-py3-none-any.whl
.
File metadata
- Download URL: qullm-0.3.0-py3-none-any.whl
- Upload date:
- Size: 11.9 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/5.1.0 CPython/3.12.0
File hashes
Algorithm | Hash digest | |
---|---|---|
SHA256 | 06b001da877da22efa2bcb5aa458f0afb7cabc36af983a713596ace6333a1388 |
|
MD5 | c92f4ea7162eadc88f91eee32e319280 |
|
BLAKE2b-256 | a9f40236b8874bb5cd80abc1cc3001fde3529083361d5be92089be0d8c0b61a9 |