
Data augmentation for NLP


SC4001 NLarge

Purpose of Project

NLarge is a project focused on exploring and implementing various data augmentation techniques for Natural Language Processing (NLP) tasks. The primary goal is to enhance the diversity and robustness of training datasets, thereby improving the performance and generalization capabilities of NLP models. This project includes traditional data augmentation methods such as synonym replacement and random substitution, as well as advanced techniques using Large Language Models (LLMs).
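To make the traditional techniques concrete, here is a minimal sketch of synonym replacement and random substitution. The synonym table, function names, and parameters are illustrative assumptions, not NLarge's actual API; a real implementation might draw synonyms from WordNet instead of a hand-rolled dictionary.

```python
import random

# Toy synonym table for illustration; a real augmenter might use WordNet.
SYNONYMS = {
    "good": ["great", "fine"],
    "movie": ["film", "picture"],
    "bad": ["poor", "awful"],
}

def synonym_replacement(text, n=1, rng=None):
    """Replace up to n words that have an entry in SYNONYMS."""
    rng = rng or random.Random()
    words = text.split()
    candidates = [i for i, w in enumerate(words) if w in SYNONYMS]
    for i in rng.sample(candidates, min(n, len(candidates))):
        words[i] = rng.choice(SYNONYMS[words[i]])
    return " ".join(words)

def random_substitution(text, vocab, p=0.1, rng=None):
    """Replace each word with a random vocabulary word with probability p."""
    rng = rng or random.Random()
    words = [rng.choice(vocab) if rng.random() < p else w
             for w in text.split()]
    return " ".join(words)
```

Both functions accept an optional seeded `random.Random` so augmented datasets can be reproduced across runs.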

Initializing Virtual Environment

We use Poetry in this project for dependency management. To get started, you will need to install Poetry.

pip install poetry

Afterwards, you can install the project's dependencies with Poetry using the command below:

poetry install

Repository Contents

Usage

To run the models and experiments, you can use the Python notebooks in the example/ directory. The notebooks contain detailed explanations and code snippets for data augmentation and model training. Experiment results are available in the example/test/ directory.

We also refer the user to demo_attention.ipynb for a more detailed example of how to use the pipeline.py module. The notebook contains the code for training a model with an attention mechanism, using the NLarge library as a toolkit for data augmentation.
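For readers unfamiliar with the mechanism the notebook trains, here is a minimal NumPy sketch of scaled dot-product attention; it illustrates the general idea only and is not taken from pipeline.py.

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Compute softmax(Q K^T / sqrt(d_k)) V.

    Q: (n_q, d_k) queries, K: (n_k, d_k) keys, V: (n_k, d_v) values.
    Returns the attended values and the attention weights.
    """
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)               # (n_q, n_k) similarities
    scores -= scores.max(axis=-1, keepdims=True)  # stabilize the softmax
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ V, weights
```

Each row of the weight matrix is a probability distribution over the keys, so every output row is a convex combination of the value rows.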

Compute limitations

Should you face computational limitations, you can use the preprocessed datasets saved in the example/llm-dataset/ directory. Because inference with Large Language Models (LLMs) can be slow, we have run the augmentation in advance so that end users can use the preprocessed datasets directly for training and testing.

Development

While the library has been developed and tested, it can be easily extended with additional data augmentation techniques or new models by creating new modules or files in the NLarge package, supporting research into the performance of different augmentation techniques.
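As an illustration of such an extension, a new technique can be added as a small, self-contained class. The `Augmenter` base class and `augment` method below are hypothetical stand-ins for whatever interface the NLarge package actually exposes; `RandomDeletion` is one example technique a contributor might add.

```python
import random

class Augmenter:
    """Hypothetical minimal interface a new augmenter module might follow."""
    def augment(self, text: str) -> str:
        raise NotImplementedError

class RandomDeletion(Augmenter):
    """Example extension: drop each word with probability p."""
    def __init__(self, p=0.1, rng=None):
        self.p = p
        self.rng = rng or random.Random()

    def augment(self, text):
        kept = [w for w in text.split() if self.rng.random() >= self.p]
        # Never return an empty string; fall back to the original text.
        return " ".join(kept) if kept else text
```

Keeping each technique behind one small method makes it easy to benchmark new augmenters against the existing ones under identical training setups.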

Website

You can access the PyPI page of the project from the link here: pypi page

Our GitHub repository can be found here: github page

Contributing

Contributions to this project are welcome. If you have any suggestions or improvements, please create a pull request or open an issue.

