Project description

Chinese Punctuation Restoration

Training dataset: p208p2002/ZH-Wiki-Punctuation-Restore-Dataset

Six punctuation marks are supported: ，、。？！；
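The usage example further down treats punctuation restoration as per-token labeling: each token is paired with a predicted label. The sketch below illustrates that idea only. The label scheme ("O" for no punctuation, otherwise the mark itself) and the helper name `apply_labels` are assumptions made for illustration; the real model's label names may differ, so check `model.config.id2label`.

```python
# Hypothetical label scheme (an assumption, not taken from zhpr):
# "O" means no punctuation follows the character; any other label
# is the punctuation mark to insert after it.
SUPPORTED_MARKS = set("，、。？！；")

def apply_labels(pairs):
    """Join (character, label) pairs into punctuated text."""
    pieces = []
    for char, label in pairs:
        pieces.append(char)
        if label != "O":
            if label not in SUPPORTED_MARKS:
                raise ValueError(f"unsupported label: {label!r}")
            pieces.append(label)
    return "".join(pieces)

pairs = [("自", "O"), ("由", "O"), ("內", "O"), ("容", "、"),
         ("自", "O"), ("由", "O"), ("編", "O"), ("輯", "。")]
result = apply_labels(pairs)
print(result)  # 自由內容、自由編輯。
```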

Installation

# pip install torch pytorch-lightning
pip install zhpr

Usage

import torch
from torch.utils.data import DataLoader
from transformers import AutoModelForTokenClassification, AutoTokenizer
from zhpr.predict import DocumentDataset, merge_stride, decode_pred

def predict_step(batch, model, tokenizer):
    """Return one list of (token, label) pairs per window in the batch."""
    batch_out = []
    batch_input_ids = batch

    # Inference only, so skip gradient tracking.
    with torch.no_grad():
        output = model(input_ids=batch_input_ids)

    predicted_token_class_id_batch = output['logits'].argmax(-1)
    for predicted_token_class_ids, input_ids in zip(predicted_token_class_id_batch, batch_input_ids):
        input_ids = input_ids.tolist()
        tokens = tokenizer.convert_ids_to_tokens(input_ids)

        # Find where padding starts so that tokens and predictions
        # can be truncated to the real input length.
        try:
            input_id_pad_start = input_ids.index(tokenizer.pad_token_id)
        except ValueError:
            # No padding in this window.
            input_id_pad_start = len(input_ids)
        tokens = tokens[:input_id_pad_start]

        # Map predicted class ids to label names and drop padded positions.
        predicted_tokens_classes = [model.config.id2label[t.item()] for t in predicted_token_class_ids]
        predicted_tokens_classes = predicted_tokens_classes[:input_id_pad_start]

        batch_out.append(list(zip(tokens, predicted_tokens_classes)))
    return batch_out

if __name__ == "__main__":
    window_size = 256
    step = 200
    text = "維基百科是維基媒體基金會運營的一個多語言的百科全書目前是全球網路上最大且最受大眾歡迎的參考工具書名列全球二十大最受歡迎的網站特點是自由內容自由編輯與自由著作權"
    dataset = DocumentDataset(text,window_size=window_size,step=step)
    dataloader = DataLoader(dataset=dataset,shuffle=False,batch_size=5)

    model_name = 'p208p2002/zh-wiki-punctuation-restore'
    model = AutoModelForTokenClassification.from_pretrained(model_name)
    tokenizer = AutoTokenizer.from_pretrained(model_name)

    model_pred_out = []
    for batch in dataloader:
        batch_out = predict_step(batch,model,tokenizer)
        for out in batch_out:
            model_pred_out.append(out)
        
    # Merge the overlapping windows, then decode the labels into punctuated text.
    merge_pred_result = merge_stride(model_pred_out,step)
    merge_pred_result_decode = decode_pred(merge_pred_result)
    merge_pred_result_decode = ''.join(merge_pred_result_decode)
    print(merge_pred_result_decode)

維基百科是維基媒體基金會運營的一個多語言的百科全書，目前是全球網路上最大且最受大眾歡迎的參考工具書，名列全球二十大最受歡迎的網站，特點是自由內容、自由編輯與自由著作權。
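The example above reads the document in overlapping windows (`window_size = 256`, `step = 200`) and merges the per-window predictions afterwards. The following is a minimal pure-Python sketch of one way such windowing and merging can work. It is not the implementation of `DocumentDataset` or `merge_stride`; the helper names `make_windows` and `merge_windows`, and the rule of keeping the first `step` items of every window except the last, are assumptions made for illustration. It also assumes `step <= window_size`.

```python
def make_windows(tokens, window_size, step):
    """Split tokens into overlapping windows that start every `step` tokens."""
    windows = []
    start = 0
    while True:
        windows.append(tokens[start:start + window_size])
        if start + window_size >= len(tokens):
            break  # this window already reaches the end of the input
        start += step
    return windows

def merge_windows(windows, step):
    """Keep the first `step` items of each window; keep the last window whole."""
    merged = []
    for i, window in enumerate(windows):
        is_last = i == len(windows) - 1
        merged.extend(window if is_last else window[:step])
    return merged

tokens = list(range(10))
windows = make_windows(tokens, window_size=4, step=3)
print(windows)  # [[0, 1, 2, 3], [3, 4, 5, 6], [6, 7, 8, 9]]
merged = merge_windows(windows, step=3)
print(merged == tokens)  # True
```

Each window overlaps the next by `window_size - step` items, so every token away from the very start of the document is seen with some preceding context before the duplicated positions are dropped in the merge.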

Release history

0.1.3 (this version): Jan 31, 2023
0.1.2: Jan 31, 2023
0.1.1: Jan 31, 2023
0.1.0: Jan 31, 2023

Download files

Source Distribution

zhpr-0.1.3.tar.gz (4.5 kB)

Uploaded Jan 31, 2023 Source

Built Distribution

zhpr-0.1.3-py3-none-any.whl (4.8 kB)

Uploaded Jan 31, 2023 Python 3

File details

Details for the file zhpr-0.1.3.tar.gz.

File metadata

Download URL: zhpr-0.1.3.tar.gz
Upload date: Jan 31, 2023
Size: 4.5 kB
Tags: Source
Uploaded using Trusted Publishing? No
Uploaded via: poetry/1.3.2 CPython/3.9.13 Linux/5.15.0-43-generic

File hashes

Hashes for zhpr-0.1.3.tar.gz
Algorithm	Hash digest
SHA256	`cb0a2fe68df25fad80091b1b99c1735eb482ec74b0815eebe45976ec858623d5`
MD5	`d32cec2625f76b6ba802ee47e171a53c`
BLAKE2b-256	`6806721617f9d7bd9707a46ec9cb21b730297b828f05cee104c0656a32a9a7fb`

File details

Details for the file zhpr-0.1.3-py3-none-any.whl.

File metadata

Download URL: zhpr-0.1.3-py3-none-any.whl
Upload date: Jan 31, 2023
Size: 4.8 kB
Tags: Python 3
Uploaded using Trusted Publishing? No
Uploaded via: poetry/1.3.2 CPython/3.9.13 Linux/5.15.0-43-generic

File hashes

Hashes for zhpr-0.1.3-py3-none-any.whl
Algorithm	Hash digest
SHA256	`35305b39043b4b27600c983648dcb3dbb2dc17fa2dcb7e4424c1c3886d0e140f`
MD5	`cdd8343ef1dc60cdea96906ffe1666a4`
BLAKE2b-256	`ba49eb877bb066a328d8bd1934c4657d9cc2aa8e3dea1c2f3b86912a0b826453`

