

Project description

Chinese Punctuation Restoration

Training dataset: p208p2002/ZH-Wiki-Punctuation-Restore-Dataset

Supports 6 punctuation marks in total: , 、 。 ? ! ;
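
The training data can be inspected directly with the datasets library (a minimal sketch, assuming the dataset is hosted on the Hugging Face Hub under this id and that `pip install datasets` has been run; the split and field names are illustrative):

from datasets import load_dataset

dataset = load_dataset("p208p2002/ZH-Wiki-Punctuation-Restore-Dataset")
print(dataset)              # available splits and column names
print(dataset["train"][0])  # first example ("train" split is an assumption)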

Installation

# pip install torch pytorch-lightning
pip install zhpr

Usage

from zhpr.predict import DocumentDataset, merge_stride, decode_pred
from transformers import AutoModelForTokenClassification, AutoTokenizer
from torch.utils.data import DataLoader

def predict_step(batch, model, tokenizer):
    batch_out = []
    batch_input_ids = batch

    encodings = {'input_ids': batch_input_ids}
    output = model(**encodings)

    predicted_token_class_id_batch = output['logits'].argmax(-1)
    for predicted_token_class_ids, input_ids in zip(predicted_token_class_id_batch, batch_input_ids):
        out = []
        input_ids = input_ids.tolist()
        tokens = tokenizer.convert_ids_to_tokens(input_ids)

        # Find where padding starts in input_ids and truncate
        # both the tokens and the predictions to the real length.
        try:
            input_id_pad_start = input_ids.index(tokenizer.pad_token_id)
        except ValueError:  # no padding in this window
            input_id_pad_start = len(input_ids)
        input_ids = input_ids[:input_id_pad_start]
        tokens = tokens[:input_id_pad_start]

        # Map predicted class ids to their punctuation labels.
        predicted_tokens_classes = [model.config.id2label[t.item()] for t in predicted_token_class_ids]
        predicted_tokens_classes = predicted_tokens_classes[:input_id_pad_start]

        for token, ner in zip(tokens, predicted_tokens_classes):
            out.append((token, ner))
        batch_out.append(out)
    return batch_out

if __name__ == "__main__":
    window_size = 256
    step = 200
    text = "維基百科是維基媒體基金會運營的一個多語言的百科全書目前是全球網路上最大且最受大眾歡迎的參考工具書名列全球二十大最受歡迎的網站特點是自由內容自由編輯與自由著作權"
    # Slide a window over the document; consecutive windows overlap
    # by (window_size - step) tokens.
    dataset = DocumentDataset(text, window_size=window_size, step=step)
    dataloader = DataLoader(dataset=dataset, shuffle=False, batch_size=5)

    model_name = 'p208p2002/zh-wiki-punctuation-restore'
    model = AutoModelForTokenClassification.from_pretrained(model_name)
    tokenizer = AutoTokenizer.from_pretrained(model_name)

    model_pred_out = []
    for batch in dataloader:
        batch_out = predict_step(batch, model, tokenizer)
        for out in batch_out:
            model_pred_out.append(out)

    # Merge the overlapping window predictions, then decode back to text.
    merge_pred_result = merge_stride(model_pred_out, step)
    merge_pred_result_decode = decode_pred(merge_pred_result)
    merge_pred_result_decode = ''.join(merge_pred_result_decode)
    print(merge_pred_result_decode)
Output:

維基百科是維基媒體基金會運營的一個多語言的百科全書,目前是全球網路上最大且最受大眾歡迎的參考工具書,名列全球二十大最受歡迎的網站,特點是自由內容、自由編輯與自由著作權。
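
To see which labels the model can emit (the `model.config.id2label` mapping used in `predict_step` above), a minimal sketch; the exact label strings are model-specific:

from transformers import AutoModelForTokenClassification

model = AutoModelForTokenClassification.from_pretrained('p208p2002/zh-wiki-punctuation-restore')
# id -> label mapping; expected to cover the six punctuation marks listed
# above plus a no-punctuation class (exact names depend on the model config).
print(model.config.id2label)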

Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

zhpr-0.1.3.tar.gz (4.5 kB)

Uploaded Source

Built Distribution

zhpr-0.1.3-py3-none-any.whl (4.8 kB)

Uploaded Python 3

File details

Details for the file zhpr-0.1.3.tar.gz.

File metadata

  • Download URL: zhpr-0.1.3.tar.gz
  • Upload date:
  • Size: 4.5 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: poetry/1.3.2 CPython/3.9.13 Linux/5.15.0-43-generic

File hashes

Hashes for zhpr-0.1.3.tar.gz

  • SHA256: cb0a2fe68df25fad80091b1b99c1735eb482ec74b0815eebe45976ec858623d5
  • MD5: d32cec2625f76b6ba802ee47e171a53c
  • BLAKE2b-256: 6806721617f9d7bd9707a46ec9cb21b730297b828f05cee104c0656a32a9a7fb


File details

Details for the file zhpr-0.1.3-py3-none-any.whl.

File metadata

  • Download URL: zhpr-0.1.3-py3-none-any.whl
  • Upload date:
  • Size: 4.8 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: poetry/1.3.2 CPython/3.9.13 Linux/5.15.0-43-generic

File hashes

Hashes for zhpr-0.1.3-py3-none-any.whl

  • SHA256: 35305b39043b4b27600c983648dcb3dbb2dc17fa2dcb7e4424c1c3886d0e140f
  • MD5: cdd8343ef1dc60cdea96906ffe1666a4
  • BLAKE2b-256: ba49eb877bb066a328d8bd1934c4657d9cc2aa8e3dea1c2f3b86912a0b826453

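These hashes can be used to pin the release in pip's hash-checking mode (a minimal sketch; note that --require-hashes also demands hashes for every dependency, so a real requirements file would pin those too):

# requirements.txt — pin the exact release and the SHA256 digests listed above
zhpr==0.1.3 \
    --hash=sha256:cb0a2fe68df25fad80091b1b99c1735eb482ec74b0815eebe45976ec858623d5 \
    --hash=sha256:35305b39043b4b27600c983648dcb3dbb2dc17fa2dcb7e4424c1c3886d0e140f

pip install --require-hashes -r requirements.txt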
