This project aims to improve EDU (elementary discourse unit) segmentation performance using Segbot. Because Segbot has an encoder-decoder architecture, its bidirectional GRU encoder can be replaced with a generative pretrained model such as BART or T5. The new model is evaluated on the RST dataset in a few-shot setting (e.g. training on 100 examples) instead of training on the full dataset.
Project description
Final Year Project on EDU Segmentation.
Segbot:
http://138.197.118.157:8000/segbot/
https://www.ijcai.org/proceedings/2018/0579.pdf
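The few-shot setting mentioned in the description (training on roughly 100 examples rather than the full dataset) can be illustrated with a small sampling helper. This is only a sketch: the function name `sample_few_shot`, the seed, and the placeholder data are assumptions made for illustration, and the project's actual data pipeline is not described here.

```python
import random


def sample_few_shot(examples, k=100, seed=0):
    """Return a reproducible random subset of at most k training examples."""
    rng = random.Random(seed)  # local RNG, so the global random state is untouched
    if k >= len(examples):
        return list(examples)
    return rng.sample(examples, k)


# Hypothetical stand-in for a list of labelled RST training sentences.
full_train = [f"sentence {i}" for i in range(1000)]
few_shot_train = sample_few_shot(full_train, k=100, seed=42)
print(len(few_shot_train))
```

Fixing the seed keeps the few-shot subset identical across runs, which makes comparisons between encoders easier to interpret.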
Installation
To use the EDUSegmentation module, follow these steps:
- Import the download module to download all models:
from edu_segmentation.download import download_models
download_models()
- Import the edu_segmentation module and its related classes:
from edu_segmentation.main import EDUSegmentation, ModelFactory, BERTUncasedModel, BERTCasedModel, BARTModel
Usage
The edu_segmentation module provides an easy-to-use interface to perform EDU segmentation using different strategies and models. Follow these steps to use it:
- Create a segmentation strategy:
You can choose between the default segmentation strategy or a conjunction-based segmentation strategy.
Conjunction-based segmentation strategy: after the text has been EDU-segmented, any conjunction at the start or end of a segment is isolated as its own segment.
Default segmentation strategy: no post-processing occurs after the text has been EDU-segmented.
from edu_segmentation.main import DefaultSegmentation, ConjunctionSegmentation
- Create a model using the ModelFactory.
Choose from BERT Uncased, BERT Cased, or BART models.
model_type = "bert_uncased" # or "bert_cased", "bart"
model = ModelFactory.create_model(model_type)
- Create an instance of EDUSegmentation using the chosen model:
edu_segmenter = EDUSegmentation(model)
- Segment the text using the chosen strategy:
text = "Your input text here."
granularity = "conjunction_words" # or "default"
conjunctions = ["and", "but", "however"] # Customize conjunctions if needed
device = 'cpu' # Choose your device, e.g., 'cuda:0'
segmented_output = edu_segmenter.run(text, granularity, conjunctions, device)
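The conjunction-based strategy described in the steps above can be pictured as a post-processing pass over already-segmented text. The sketch below is not the package's implementation: the function `isolate_conjunctions` is hypothetical, and it assumes the segments are plain strings, which the package's actual output format may not match.

```python
def isolate_conjunctions(segments, conjunctions):
    """Move a conjunction at the start or end of a segment into its own segment."""
    conj = {c.lower() for c in conjunctions}
    result = []
    for segment in segments:
        words = segment.split()
        if not words:
            continue  # skip empty segments
        leading, trailing = [], []
        # Peel a conjunction off the front, ignoring attached punctuation.
        if len(words) > 1 and words[0].lower().strip(",.;:") in conj:
            leading.append(words.pop(0))
        # Peel a conjunction off the end, if the segment still has other words.
        if len(words) > 1 and words[-1].lower().strip(",.;:") in conj:
            trailing.append(words.pop())
        result.extend(leading)
        result.append(" ".join(words))
        result.extend(trailing)
    return result


segments = ["The food is good,", "but the service is bad."]
print(isolate_conjunctions(segments, ["and", "but", "however"]))
# → ['The food is good,', 'but', 'the service is bad.']
```

With the default strategy, the same input would simply be returned unchanged, since no post-processing is applied.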
Example
Here's a simple example demonstrating how to use the edu_segmentation module:
from edu_segmentation.download import download_models
from edu_segmentation.main import ModelFactory, EDUSegmentation
download_models()
# Create a BART model
model = ModelFactory.create_model("bart") # or bert_cased or bert_uncased
# Create an instance of EDUSegmentation using the model
edu_segmenter = EDUSegmentation(model)
# Segment the text using the conjunction-based segmentation strategy
text = "The food is good, but the service is bad."
granularity = "conjunction_words" # or default
conjunctions = ["and", "but", "however"] # customise as needed
device = 'cpu' # or cuda
segmented_output = edu_segmenter.run(text, granularity, conjunctions, device)
print(segmented_output)
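The example above hard-codes device = 'cpu'. If the models run on PyTorch (an assumption; the dependency is not stated here), a small helper can fall back to the CPU when no GPU is available. The name `pick_device` is hypothetical and not part of edu_segmentation.

```python
def pick_device(preferred="cuda:0"):
    """Return `preferred` if PyTorch reports a usable GPU, otherwise 'cpu'."""
    try:
        import torch  # assumed to be installed alongside the models
    except ImportError:
        return "cpu"
    return preferred if torch.cuda.is_available() else "cpu"


device = pick_device()
print(device)
```

The resulting string can then be passed as the device argument to edu_segmenter.run in place of the hard-coded value.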