Chinese Punctuation Restoration

Training dataset: p208p2002/ZH-Wiki-Punctuation-Restore-Dataset

Six punctuation marks are supported: , 、 。 ? ! ;

Installation

# pip install torch pytorch-lightning
pip install zhpr

Usage

from zhpr.predict import DocumentDataset, merge_stride, decode_pred
from transformers import AutoModelForTokenClassification, AutoTokenizer
from torch.utils.data import DataLoader

def predict_step(batch, model, tokenizer):
    """Return a list of (token, predicted_label) pairs for a batch of size 1."""
    assert batch.shape[0] == 1
    out = []
    output = model(input_ids=batch)

    # pick the highest-scoring label for every token
    predicted_class_id_batch = output['logits'].argmax(-1)
    for predicted_class_ids, input_ids in zip(predicted_class_id_batch, batch):
        input_ids = input_ids.tolist()
        tokens = tokenizer.convert_ids_to_tokens(input_ids)

        # find where padding starts in input_ids so that
        # tokens and predictions can be truncated to the real content
        try:
            pad_start = input_ids.index(tokenizer.pad_token_id)
        except ValueError:  # no padding in this window
            pad_start = len(input_ids)
        tokens = tokens[:pad_start]

        # map predicted class ids to label names
        predicted_classes = [model.config.id2label[t.item()] for t in predicted_class_ids]
        predicted_classes = predicted_classes[:pad_start]

        out.extend(zip(tokens, predicted_classes))
    return out

if __name__ == "__main__":
    window_size = 100
    step = 75
    text = "維基百科是維基媒體基金會運營的一個多語言的線上百科全書並以建立和維護作為開放式協同合作專案特點是自由內容自由編輯自由著作權目前是全球網路上最大且最受大眾歡迎的參考工具書名列全球二十大最受歡迎的網站其在搜尋引擎中排名亦較為靠前維基百科目前由非營利組織維基媒體基金會負責營運"
    dataset = DocumentDataset(text, window_size=window_size, step=step)
    dataloader = DataLoader(dataset=dataset, shuffle=False, batch_size=1)

    model_name = 'p208p2002/zh-wiki-punctuation-restore'
    model = AutoModelForTokenClassification.from_pretrained(model_name)
    tokenizer = AutoTokenizer.from_pretrained(model_name)

    # run the model on each window of the document
    model_pred_out = []
    for batch in dataloader:
        model_pred_out.append(predict_step(batch, model, tokenizer))

    # merge the overlapping windows, then turn the labels back into text
    merge_pred_result = merge_stride(model_pred_out, step)
    merge_pred_result_decode = decode_pred(merge_pred_result)
    merge_pred_result_decode = ''.join(merge_pred_result_decode)
    print(merge_pred_result_decode)

Output:

維基百科是維基媒體基金會運營的一個多語言的線上百科全書,並以建立和維護作為開放式協同合作。專案特點是自由內容、自由編輯、自由著作權。目前是全球網路上最大且最受大眾歡迎的參考工具書,名列全球二十大最受歡迎的網站,其在搜尋引擎中排名亦較為靠前。維基百科目前由非營利組織維基媒體基金會負責營運。
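
The `window_size` and `step` values in the usage code suggest a sliding-window scheme: the document is cut into windows of 100 tokens that start every 75 tokens, so neighbouring windows share 25 tokens. The following is a minimal sketch of that idea in plain Python, not zhpr's actual code. `make_windows` and `merge_windows` are hypothetical helpers, and the merge rule used here (keep the first `step` items of every window except the last) is only an assumption about what `merge_stride` does.

```python
def make_windows(tokens, window_size, step):
    """Split tokens into overlapping windows that start every `step` items.

    Hypothetical helper for illustration; not part of zhpr.
    """
    windows = []
    for start in range(0, len(tokens), step):
        windows.append(tokens[start:start + window_size])
        # stop once a window has reached the end of the sequence
        if start + window_size >= len(tokens):
            break
    return windows


def merge_windows(windows, step):
    """Undo the overlap: keep the first `step` items of each window,
    and keep the final window whole (assumed merge rule).
    """
    merged = []
    for i, window in enumerate(windows):
        is_last = i == len(windows) - 1
        merged.extend(window if is_last else window[:step])
    return merged


# integers stand in for tokens so the overlap is easy to inspect
tokens = list(range(230))
windows = make_windows(tokens, window_size=100, step=75)
print([len(w) for w in windows])                  # window lengths
print(merge_windows(windows, step=75) == tokens)  # round trip restores the input
```

With 230 items the sketch produces three windows, and merging them gives back the original sequence. A smaller `step` increases the overlap, which gives each token more surrounding context at the cost of more model calls.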
