Project description
Final Year Project on EDU Segmentation:
The goal is to improve EDU segmentation performance using Segbot. As Segbot has an encoder-decoder architecture, we can replace its bidirectional GRU encoder with generative pre-trained models such as BART and T5. The new model is evaluated on the RST dataset in a few-shot setting (e.g. 100 training examples) rather than on the full dataset.
Segbot:
http://138.197.118.157:8000/segbot/
https://www.ijcai.org/proceedings/2018/0579.pdf
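The few-shot setting described above can be sketched as training on a small fixed sample of the RST training data instead of the whole corpus. In this illustrative snippet, `full_train_set` is placeholder data standing in for the loaded RST examples, not the real dataset loader:

```python
import random

# Illustrative sketch (not part of the package) of the few-shot setting:
# instead of training on the full RST training set, draw a small fixed
# sample of e.g. 100 examples. `full_train_set` is placeholder data.
full_train_set = [f"example_{i}" for i in range(1000)]

random.seed(42)  # fix the seed so the few-shot split is reproducible
few_shot_train_set = random.sample(full_train_set, k=100)

print(len(few_shot_train_set))  # 100
```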
Installation
To use the EDUSegmentation module, follow these steps:
- Import the `download` module to download all models:

  ```python
  from edu_segmentation import download
  download.download_models()
  ```
- Import the `edu_segmentation` module and its related classes:

  ```python
  from edu_segmentation.main import EDUSegmentation, ModelFactory, BERTUncasedModel, BERTCasedModel, BARTModel
  ```
Usage
The edu_segmentation module provides an easy-to-use interface to perform EDU segmentation using different strategies and models. Follow these steps to use it:
- Create a segmentation strategy:

  You can choose between the default segmentation strategy and a conjunction-based segmentation strategy.

  - Conjunction-based segmentation strategy: after the text has been EDU-segmented, any conjunction at the start or end of a segment is isolated as its own segment.
  - Default segmentation strategy: no post-processing occurs after the text has been EDU-segmented.

  ```python
  from edu_segmentation import DefaultSegmentation, ConjunctionSegmentation
  ```
- Create a model using the `ModelFactory`. Choose from the BERT Uncased, BERT Cased, or BART models:

  ```python
  model_type = "bert_uncased"  # or "bert_cased", "bart"
  model = ModelFactory.create_model(model_type)
  ```
- Create an instance of `EDUSegmentation` using the chosen model:

  ```python
  edu_segmenter = EDUSegmentation(model)
  ```
- Segment the text using the chosen strategy:

  ```python
  text = "Your input text here."
  granularity = "conjunction_words"  # or "default"
  conjunctions = ["and", "but", "however"]  # customise conjunctions if needed
  device = 'cpu'  # choose your device, e.g. 'cuda:0'
  segmented_output = edu_segmenter.run(text, granularity, conjunctions, device)
  ```
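To make the conjunction-based strategy from step 1 concrete, here is an illustrative sketch of the post-processing it describes. The function name and details are hypothetical, not the package's actual implementation:

```python
# Hypothetical sketch of conjunction-based post-processing: if an EDU segment
# starts or ends with a conjunction, that conjunction is split off as its own
# segment. This only illustrates the behaviour described in the docs.
def isolate_conjunctions(segments, conjunctions=("and", "but", "however")):
    result = []
    for seg in segments:
        words = seg.split()
        if len(words) > 1 and words[0].lower() in conjunctions:
            # conjunction at the start of the segment
            result.extend([words[0], " ".join(words[1:])])
        elif len(words) > 1 and words[-1].lower() in conjunctions:
            # conjunction at the end of the segment
            result.extend([" ".join(words[:-1]), words[-1]])
        else:
            result.append(seg)
    return result

print(isolate_conjunctions(["the food is good,", "but the service is bad."]))
# ['the food is good,', 'but', 'the service is bad.']
```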
Example
Here's a simple example demonstrating how to use the edu_segmentation module:
```python
from edu_segmentation.main import EDUSegmentation, ModelFactory

# Create a BERT Uncased model
model = ModelFactory.create_model("bert_uncased")

# Create an instance of EDUSegmentation using the model
edu_segmenter = EDUSegmentation(model)

# Segment the text using the conjunction-based segmentation strategy
text = "The food is good, but the service is bad."
granularity = "conjunction_words"
conjunctions = ["and", "but", "however"]
device = 'cpu'
segmented_output = edu_segmenter.run(text, granularity, conjunctions, device)
print(segmented_output)
```