
Chinese Punctuation Restoration

Training dataset: p208p2002/ZH-Wiki-Punctuation-Restore-Dataset

Six punctuation marks are supported in total: , 、 。 ? ! ;
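The model treats punctuation restoration as per-token classification: each character in the unpunctuated input is tagged with the punctuation mark that should follow it, or with a "no punctuation" label. A minimal sketch of the decoding idea (the label names here, including "O" for no punctuation, are illustrative assumptions, not the model's actual label set):

```python
def apply_tags(chars, tags):
    """Re-insert punctuation from per-character tags.

    Each tag is either "O" (append nothing) or a punctuation
    mark to append after the corresponding character.
    """
    out = []
    for ch, tag in zip(chars, tags):
        out.append(ch)
        if tag != "O":
            out.append(tag)
    return "".join(out)

chars = list("你好嗎我很好")
tags = ["O", "O", "?", "O", "O", "。"]
print(apply_tags(chars, tags))  # 你好嗎?我很好。
```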

Installation

# pip install torch pytorch-lightning
pip install zhpr

Usage

import torch
from zhpr.predict import DocumentDataset, merge_stride, decode_pred
from transformers import AutoModelForTokenClassification, AutoTokenizer
from torch.utils.data import DataLoader

def predict_step(batch, model, tokenizer):
    out = []
    input_ids = batch
    encodings = {'input_ids': input_ids}
    with torch.no_grad():  # inference only; no gradients needed
        output = model(**encodings)

    predicted_token_class_id_batch = output['logits'].argmax(-1)
    for predicted_token_class_ids, ids in zip(predicted_token_class_id_batch, input_ids):
        # Find where padding starts so that both the tokens and the
        # predictions can be truncated to the real sequence length.
        ids = ids.tolist()
        try:
            pad_start = ids.index(tokenizer.pad_token_id)
        except ValueError:
            pad_start = len(ids)
        tokens = tokenizer.convert_ids_to_tokens(ids[:pad_start])

        # Map predicted class ids to their label names.
        predicted_tokens_classes = [model.config.id2label[t.item()] for t in predicted_token_class_ids]
        predicted_tokens_classes = predicted_tokens_classes[:pad_start]

        for token, ner in zip(tokens, predicted_tokens_classes):
            out.append((token, ner))
    return out

if __name__ == "__main__":
    window_size = 100
    step = 75
    text = "維基百科是維基媒體基金會運營的一個多語言的線上百科全書並以建立和維護作為開放式協同合作專案特點是自由內容自由編輯自由著作權目前是全球網路上最大且最受大眾歡迎的參考工具書名列全球二十大最受歡迎的網站其在搜尋引擎中排名亦較為靠前維基百科目前由非營利組織維基媒體基金會負責營運"
    dataset = DocumentDataset(text, window_size=window_size, step=step)
    dataloader = DataLoader(dataset=dataset, shuffle=False, batch_size=3)

    model_name = 'p208p2002/zh-wiki-punctuation-restore'
    model = AutoModelForTokenClassification.from_pretrained(model_name)
    tokenizer = AutoTokenizer.from_pretrained(model_name)

    model_pred_out = []
    for batch in dataloader:
        model_pred_out.append(predict_step(batch, model, tokenizer))

    merge_pred_result = merge_stride(model_pred_out, step)
    merge_pred_result_decode = decode_pred(merge_pred_result)
    merge_pred_result_decode = ''.join(merge_pred_result_decode)
    print(merge_pred_result_decode)
維基百科是維基媒體基金會運營的一個多語言的線上百科全書,並以建立和維護作為開放式協同合作。專案特點是自由內容、自由編輯、自由著作權。目前是全球網路上最大且最受大眾歡迎的參考工具書,名列全球二十大最受歡迎的網站,其在搜尋引擎中排名亦較為靠前。維基百科目前由非營利組織維基媒體基金會負責營運。
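The `window_size` and `step` parameters above chunk the input into overlapping character windows, so each character near a chunk boundary is predicted more than once; `merge_stride` then reconciles the overlapping predictions. A minimal sketch of the chunking (an illustrative assumption about what `DocumentDataset` does internally, not the library's actual implementation):

```python
def sliding_windows(text, window_size, step):
    """Split text into overlapping character windows of length
    window_size, advancing by step characters each time."""
    windows = []
    for start in range(0, len(text), step):
        windows.append(text[start:start + window_size])
        if start + window_size >= len(text):
            break
    return windows

text = "0123456789" * 3  # 30 characters
chunks = sliding_windows(text, window_size=10, step=7)
# Consecutive windows overlap by window_size - step = 3 characters,
# which is why the overlapping predictions must later be merged.
print(chunks)
```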
