Project description
Chinese punctuation restoration
Training dataset: p208p2002/ZH-Wiki-Punctuation-Restore-Dataset
Supports six punctuation marks in total: , 、 。 ? ! ;
Installation
# pip install torch pytorch-lightning
pip install zhpr
Usage
from zhpr.predict import DocumentDataset, merge_stride, decode_pred
from transformers import AutoModelForTokenClassification, AutoTokenizer
from torch.utils.data import DataLoader

def predict_step(batch, model, tokenizer):
    batch_out = []
    batch_input_ids = batch
    encodings = {'input_ids': batch_input_ids}
    output = model(**encodings)
    predicted_token_class_id_batch = output['logits'].argmax(-1)
    for predicted_token_class_ids, input_ids in zip(predicted_token_class_id_batch, batch_input_ids):
        out = []
        tokens = tokenizer.convert_ids_to_tokens(input_ids)

        # Find where padding starts in input_ids so that the tokens
        # and the predictions can be truncated to the real length.
        input_ids = input_ids.tolist()
        try:
            input_id_pad_start = input_ids.index(tokenizer.pad_token_id)
        except ValueError:
            # No padding: the window is completely filled.
            input_id_pad_start = len(input_ids)
        input_ids = input_ids[:input_id_pad_start]
        tokens = tokens[:input_id_pad_start]

        # Map predicted class ids to label names, truncated to the same length.
        predicted_tokens_classes = [model.config.id2label[t.item()] for t in predicted_token_class_ids]
        predicted_tokens_classes = predicted_tokens_classes[:input_id_pad_start]

        for token, ner in zip(tokens, predicted_tokens_classes):
            out.append((token, ner))
        batch_out.append(out)
    return batch_out

if __name__ == "__main__":
    window_size = 256
    step = 200
    text = "維基百科是維基媒體基金會運營的一個多語言的百科全書目前是全球網路上最大且最受大眾歡迎的參考工具書名列全球二十大最受歡迎的網站特點是自由內容自由編輯與自由著作權"

    # Slice the document into overlapping windows and batch them.
    dataset = DocumentDataset(text, window_size=window_size, step=step)
    dataloader = DataLoader(dataset=dataset, shuffle=False, batch_size=5)

    model_name = 'p208p2002/zh-wiki-punctuation-restore'
    model = AutoModelForTokenClassification.from_pretrained(model_name)
    tokenizer = AutoTokenizer.from_pretrained(model_name)

    model_pred_out = []
    for batch in dataloader:
        batch_out = predict_step(batch, model, tokenizer)
        for out in batch_out:
            model_pred_out.append(out)

    # Merge the overlapping window predictions and decode back to text.
    merge_pred_result = merge_stride(model_pred_out, step)
    merge_pred_result_decode = decode_pred(merge_pred_result)
    merge_pred_result_decode = ''.join(merge_pred_result_decode)
    print(merge_pred_result_decode)
維基百科是維基媒體基金會運營的一個多語言的百科全書,目前是全球網路上最大且最受大眾歡迎的參考工具書,名列全球二十大最受歡迎的網站,特點是自由內容、自由編輯與自由著作權。
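Because the document is sliced into windows of `window_size` characters advanced by `step`, consecutive windows overlap by `window_size - step` positions, and `merge_stride` has to reconcile that overlap when stitching predictions back together. The following is an illustrative sketch of that idea, not the library's actual implementation: `split_windows` and `merge_windows` are hypothetical helpers, and the small `window_size`/`step` values are chosen only for demonstration (here the merge simply keeps the earlier window's value for overlapping positions).

```python
def split_windows(tokens, window_size, step):
    """Slice a token list into overlapping windows of window_size, advancing by step."""
    windows = []
    for start in range(0, len(tokens), step):
        windows.append(tokens[start:start + window_size])
        if start + window_size >= len(tokens):
            break
    return windows

def merge_windows(windows, step):
    """Merge overlapping windows back into one sequence.

    For positions covered by more than one window, keep the value
    from the earlier window (a simple tie-breaking policy).
    """
    merged = list(windows[0])
    for i, win in enumerate(windows[1:], start=1):
        offset = i * step          # where this window starts in the document
        overlap = len(merged) - offset  # positions already covered
        merged.extend(win[overlap:])
    return merged

tokens = list("abcdefghij")
wins = split_windows(tokens, window_size=4, step=3)
# wins == [['a','b','c','d'], ['d','e','f','g'], ['g','h','i','j']]
merged = merge_windows(wins, step=3)
assert merged == tokens
print(''.join(merged))
```

In the real pipeline each position carries a (token, label) pair rather than a bare character, but the round-trip property is the same: splitting and merging with matching `step` reconstructs the original sequence.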
Project details
Download files
Download the file for your platform.
Source Distribution
zhpr-0.1.3.tar.gz (4.5 kB)
Built Distribution
zhpr-0.1.3-py3-none-any.whl (4.8 kB)
File details
Details for the file zhpr-0.1.3.tar.gz
File metadata
- Download URL: zhpr-0.1.3.tar.gz
- Upload date:
- Size: 4.5 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: poetry/1.3.2 CPython/3.9.13 Linux/5.15.0-43-generic
File hashes
Algorithm | Hash digest
---|---
SHA256 | cb0a2fe68df25fad80091b1b99c1735eb482ec74b0815eebe45976ec858623d5
MD5 | d32cec2625f76b6ba802ee47e171a53c
BLAKE2b-256 | 6806721617f9d7bd9707a46ec9cb21b730297b828f05cee104c0656a32a9a7fb
File details
Details for the file zhpr-0.1.3-py3-none-any.whl
File metadata
- Download URL: zhpr-0.1.3-py3-none-any.whl
- Upload date:
- Size: 4.8 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: poetry/1.3.2 CPython/3.9.13 Linux/5.15.0-43-generic
File hashes
Algorithm | Hash digest
---|---
SHA256 | 35305b39043b4b27600c983648dcb3dbb2dc17fa2dcb7e4424c1c3886d0e140f
MD5 | cdd8343ef1dc60cdea96906ffe1666a4
BLAKE2b-256 | ba49eb877bb066a328d8bd1934c4657d9cc2aa8e3dea1c2f3b86912a0b826453