A small seq2seq punctuator tool based on DistilBERT
Distilbert-punctuator
Introduction
Distilbert-punctuator is a Python package that provides a BERT-based punctuator (a fine-tuned model of the pretrained Hugging Face DistilBertForTokenClassification) with the following three components:
- data process: functions for processing the user's data to prepare it for training, in case the user prefers to fine-tune the model with his/her own data.
- training: a training pipeline with which the user can fine-tune his/her own punctuator.
- inference: an easy-to-use interface for running a trained punctuator. If the user doesn't want to train a punctuator himself/herself, the pre-fine-tuned model Qishuai/distilbert_punctuator_en from the Hugging Face model hub can be used when launching the inference.
Data Process
Component for pre-processing the training data. To use this component, please install it with pip install distilbert-punctuator[data_process]
The package provides a simple pipeline for generating NER-format training data.
Example
examples/data_sample.py
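The package's own pipeline lives in examples/data_sample.py; as a rough illustration of what "NER-format" training data looks like, here is a generic sketch (not the package's actual API) that turns punctuated text into (token, tag) pairs. The tag letters follow DEFAULT_TAG_ID below; their punctuation meanings here (O = none, C = comma, P = period, Q = question mark, E = exclamation mark) are an assumption.

```python
# Sketch only: map trailing punctuation of each word to an NER-style tag.
# Tag meanings (O/C/P/Q/E) are assumed, not confirmed by the package.
PUNCT2TAG = {",": "C", ".": "P", "?": "Q", "!": "E"}

def text_to_ner_pairs(text):
    pairs = []
    for word in text.split():
        tag = "O"
        # strip trailing punctuation and record the corresponding tag
        while word and word[-1] in PUNCT2TAG:
            tag = PUNCT2TAG[word[-1]]
            word = word[:-1]
        if word:
            pairs.append((word.lower(), tag))
    return pairs

print(text_to_ner_pairs("Hello, world. How are you?"))
# → [('hello', 'C'), ('world', 'P'), ('how', 'O'), ('are', 'O'), ('you', 'Q')]
```

Each token thus carries a label describing the punctuation that follows it, which is exactly the shape a token-classification model trains on.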
Train
Component providing a training pipeline for fine-tuning a pretrained DistilBertForTokenClassification model from Hugging Face.
Example
examples/train_sample.py
Training_arguments:
Arguments required for the training pipeline.
- data_file_path(str): path of the training data
- model_name(str): name or path of the pre-trained model
- tokenizer_name(str): name of the pretrained tokenizer
- split_rate(float): train/validation split rate
- sequence_length(int): sequence length of one sample
- epoch(int): number of epochs
- batch_size(int): batch size
- model_storage_path(str): storage path for the fine-tuned model
- tag2id_storage_path(str): storage path for the tag2id mapping
- addtional_model_config(Optional[Dict]): additional configuration for the model
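To make the argument list concrete, here is a hypothetical set of values collected in a plain dict; the package's actual arguments container and default values may differ (the paths and hyperparameter values below are illustrative assumptions).

```python
# Hypothetical training configuration matching the arguments listed above.
# Values (paths, epochs, batch size) are example assumptions, not defaults.
training_arguments = {
    "data_file_path": "data/train_data.txt",
    "model_name": "distilbert-base-uncased",
    "tokenizer_name": "distilbert-base-uncased",
    "split_rate": 0.2,            # 20% of samples held out for validation
    "sequence_length": 256,
    "epoch": 3,
    "batch_size": 32,
    "model_storage_path": "models/punctuator",
    "tag2id_storage_path": "models/punctuator/tag2id.json",
    "addtional_model_config": None,
}
```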
Inference
Component providing an inference interface for using the trained punctuator.
Architecture
+----------------------+              (child process)
|  user application    |          +-------------------+
|                      | <------> | punctuator server |
|  + inference object  |          +-------------------+
+----------------------+
The punctuator is deployed in a child process that communicates with the main process through a pipe connection.
The user can therefore initialize an inference object and call its punctuation
function whenever needed. The punctuator never blocks the main process except while actually punctuating.
A graceful-shutdown mechanism is built in, so the user doesn't need to worry about shutting it down.
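The architecture above can be sketched with the standard library alone: a "punctuator server" loop in a child process, a pipe for requests and responses, and a sentinel for graceful shutdown. The class and function names here are assumptions for illustration, and fake_punctuate stands in for the actual model call.

```python
# Minimal sketch of the child-process + pipe architecture (not the package's
# real classes). fake_punctuate is a stand-in for the DistilBERT model.
from multiprocessing import Pipe, Process

def fake_punctuate(text):
    return text.capitalize() + "."

def server(conn):
    while True:
        request = conn.recv()
        if request is None:           # graceful-shutdown sentinel
            break
        conn.send(fake_punctuate(request))

class Inference:
    def __init__(self):
        self.conn, child_conn = Pipe()
        self.proc = Process(target=server, args=(child_conn,))
        self.proc.start()

    def punctuation(self, text):      # blocks only while punctuating
        self.conn.send(text)
        return self.conn.recv()

    def shutdown(self):
        self.conn.send(None)
        self.proc.join()

if __name__ == "__main__":
    inf = Inference()
    print(inf.punctuation("hello world"))   # → Hello world.
    inf.shutdown()
```

Keeping the model in a separate process means the expensive model load and inference never stall the main application's event loop except during an explicit punctuation call.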
Example
examples/inference_sample.py
Inference_arguments
Arguments required for the inference pipeline.
- model_name_or_path(str): name or path of the pre-trained model
- tokenizer_name(str): name of the pretrained tokenizer
- tag2id_storage_path(Optional[str]): storage path of the tag2id mapping. If None, DEFAULT_TAG_ID will be used.
- DEFAULT_TAG_ID: {"E": 0, "O": 1, "P": 2, "C": 3, "Q": 4}
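DEFAULT_TAG_ID maps label letters to the class indices the model predicts. A small sketch of inverting it to recover tags from predicted ids follows; note that what each letter stands for (e.g. P = period, C = comma) is an assumption on our part, not stated by the package.

```python
# Invert the tag-to-id mapping so model-predicted class ids can be
# turned back into tag letters.
DEFAULT_TAG_ID = {"E": 0, "O": 1, "P": 2, "C": 3, "Q": 4}
ID2TAG = {i: tag for tag, i in DEFAULT_TAG_ID.items()}

predicted_ids = [1, 3, 1, 2]            # hypothetical model output
print([ID2TAG[i] for i in predicted_ids])  # → ['O', 'C', 'O', 'P']
```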