The PubTator Parsing Tool

A Python package for loading and manipulating PubTator files as Python objects.

Installation

This package is on the Python Package Index. You can install it using pip install pubtatortool.

Usage

For basic word tokenization and simple operations:

from pubtatortool import PubTatorCorpus
train_corpus = PubTatorCorpus(['train_corpus_part_1.txt',
                               'train_corpus_part_2.txt'])
dev_corpus = PubTatorCorpus(['dev_corpus.txt'])
test_corpus = PubTatorCorpus(['test_corpus.txt'])

For wordpiece tokenization and the full ability to encode and decode text for use with machine learning models:

from pubtatortool import PubTatorCorpus
from pubtatortool.tokenization import get_tokenizer
tokenizer = get_tokenizer(tokenization='wordpiece', vocab='bert-base-cased')
train_corpus = PubTatorCorpus(['train_corpus_part_1.txt',
                               'train_corpus_part_2.txt'], tokenizer)
dev_corpus = PubTatorCorpus(['dev_corpus.txt'], tokenizer)
test_corpus = PubTatorCorpus(['test_corpus.txt'], tokenizer)

You can then serialize a corpus with pickle, iterate over its documents through corpus.document_list, and operate on documents under any tokenization policy, even a lossy one, without mentions becoming decoupled from the text they annotate.
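As a minimal sketch of the serialization step, the helpers below wrap the standard library's pickle module. The names save_corpus and load_corpus are hypothetical and not part of pubtatortool; they accept any picklable object, which the text above says a corpus is.

```python
import pickle
from pathlib import Path


def save_corpus(corpus, path):
    """Serialize a corpus object to `path` using pickle (hypothetical helper)."""
    with Path(path).open("wb") as handle:
        pickle.dump(corpus, handle, protocol=pickle.HIGHEST_PROTOCOL)


def load_corpus(path):
    """Load an object previously written by save_corpus (hypothetical helper).

    Only unpickle files you created yourself: pickle can execute
    arbitrary code when loading untrusted data.
    """
    with Path(path).open("rb") as handle:
        return pickle.load(handle)
```

With the corpora built above, this would be called as save_corpus(train_corpus, 'train_corpus.pkl') and later load_corpus('train_corpus.pkl'). Loading a pickled corpus requires pubtatortool to be importable in the loading process, since pickle stores references to classes rather than the classes themselves.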

As an example of operating on documents, the following converts a PubTator file to a TSV-formatted file in about 10 lines of code:

from pubtatortool import PubTatorCorpus
from pubtatortool.tokenization import get_tokenizer

# Tokenize with the 'bert-base-cased' wordpiece vocabulary.
tokenizer = get_tokenizer(tokenization='wordpiece', vocab='bert-base-cased')
corpus = PubTatorCorpus(['mycorpus.txt'], tokenizer)
with open('outfile.txt', 'w') as outfile:
    for doc in corpus.document_list:
        # Pair each tokenized sentence with its per-token labels.
        for sentence, targets in zip(doc.sentences, doc.sentence_targets()):
            for token, label in zip(sentence, targets):
                # One "token<TAB>label" row per token.
                print("{tok}\t{lab}".format(tok=token, lab=label),
                      file=outfile)
            # A blank line marks the end of a sentence.
            print('', file=outfile)
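To read such a file back, a sketch along the following lines may help. It assumes exactly the layout written above: one token<TAB>label row per token and a blank line after each sentence. The function name read_token_label_file is hypothetical and not part of pubtatortool, and the label values depend on what doc.sentence_targets() returns, which is not specified here.

```python
def read_token_label_file(path, encoding=None):
    """Parse a token<TAB>label file into a list of sentences.

    Each sentence is a list of (token, label) string pairs. Blank lines
    separate sentences. `encoding=None` uses the platform default, which
    matches a file written with a plain open(path, 'w').
    """
    sentences = []
    current = []
    with open(path, encoding=encoding) as infile:
        for line in infile:
            line = line.rstrip("\n")
            if not line:
                # A blank line closes the current sentence, if any.
                if current:
                    sentences.append(current)
                    current = []
                continue
            # Split on the last tab so the final field is the label.
            token, label = line.rsplit("\t", 1)
            current.append((token, label))
    # Keep a trailing sentence that was not followed by a blank line.
    if current:
        sentences.append(current)
    return sentences
```

Calling read_token_label_file('outfile.txt') would return the sentences in file order. Labels come back as strings, so any non-string targets would need to be converted by the caller.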
