The PubTator Parsing Tool

A Python package for loading and manipulating PubTator files as Python objects.
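
For readers unfamiliar with the format, the sketch below shows roughly what a PubTator file contains: a `PMID|t|` title line, a `PMID|a|` abstract line, and tab-separated annotation lines whose character offsets index into the title and abstract joined by a single character. It is not pubtatortool's implementation. `Mention`, `Document`, and `parse_pubtator` are hypothetical names, and the sample record and its concept IDs are invented for illustration.

```python
from dataclasses import dataclass, field

# Invented sample record; "ID1" and "ID2" are placeholder concept IDs.
SAMPLE = (
    "1|t|Aspirin reduces headache.\n"
    "1|a|We studied aspirin in adults.\n"
    "1\t0\t7\tAspirin\tChemical\tID1\n"
    "1\t16\t24\theadache\tDisease\tID2\n"
    "1\t37\t44\taspirin\tChemical\tID1\n"
)


@dataclass
class Mention:
    start: int          # character offset into title + " " + abstract
    end: int            # exclusive end offset
    text: str
    semantic_type: str
    concept_id: str


@dataclass
class Document:
    pmid: str
    title: str = ""
    abstract: str = ""
    mentions: list = field(default_factory=list)

    @property
    def text(self):
        # Assumption: offsets count one separator character between
        # the title and the abstract.
        return self.title + " " + self.abstract


def parse_pubtator(raw):
    """Parse PubTator-formatted text into a list of Document objects."""
    docs = {}
    for line in raw.splitlines():
        if not line.strip():
            continue  # blank lines separate documents
        parts = line.split("|", 2)
        if len(parts) == 3 and parts[1] in ("t", "a"):
            doc = docs.setdefault(parts[0], Document(parts[0]))
            if parts[1] == "t":
                doc.title = parts[2]
            else:
                doc.abstract = parts[2]
            continue
        fields = line.split("\t")
        if len(fields) >= 6:
            doc = docs.setdefault(fields[0], Document(fields[0]))
            doc.mentions.append(
                Mention(int(fields[1]), int(fields[2]),
                        fields[3], fields[4], fields[5])
            )
    return list(docs.values())


docs = parse_pubtator(SAMPLE)
for mention in docs[0].mentions:
    # Each offset pair should slice the mention text back out of the document.
    print(mention.text, "|", docs[0].text[mention.start:mention.end])
```

The offsets are the reason tokenization matters: once text is split into tokens, the character positions no longer point at the mentions unless the alignment is tracked.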

## Usage

For basic word tokenization and simple operations:

from pubtatortool import PubTatorCorpus
train_corpus = PubTatorCorpus(['train_corpus_part_1.txt',
                               'train_corpus_part_2.txt'])
dev_corpus = PubTatorCorpus(['dev_corpus.txt'])
test_corpus = PubTatorCorpus(['test_corpus.txt'])

For wordpiece tokenization, with full support for encoding and decoding text for use with machine learning models:

from pubtatortool import PubTatorCorpus
from pubtatortool.tokenization import get_tokenizer
tokenizer = get_tokenizer(tokenization='wordpiece', vocab='bert-base-cased')
train_corpus = PubTatorCorpus(['train_corpus_part_1.txt',
                               'train_corpus_part_2.txt'], tokenizer)
dev_corpus = PubTatorCorpus(['dev_corpus.txt'], tokenizer)
test_corpus = PubTatorCorpus(['test_corpus.txt'], tokenizer)

You can then serialize a corpus with pickle, iterate over its documents through corpus.document_list, and operate on documents under any tokenization policy, even a lossy one, without the mentions becoming decoupled from the text.
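
As a minimal sketch of the serialization step, the helpers below round-trip an object through the standard `pickle` module. `save_corpus` and `load_corpus` are hypothetical names, not part of pubtatortool, and a plain dictionary stands in for a `PubTatorCorpus` so the snippet runs without the package installed.

```python
import os
import pickle
import tempfile


def save_corpus(corpus, path):
    """Write any picklable object (for example, a loaded corpus) to disk."""
    with open(path, "wb") as handle:
        pickle.dump(corpus, handle, protocol=pickle.HIGHEST_PROTOCOL)


def load_corpus(path):
    """Read a pickled object back. Only unpickle files you trust."""
    with open(path, "rb") as handle:
        return pickle.load(handle)


# Stand-in object; with the package installed, a PubTatorCorpus
# instance would take its place.
stand_in = {"document_list": ["doc-1", "doc-2"]}

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "corpus.pkl")
    save_corpus(stand_in, path)
    restored = load_corpus(path)

print(restored == stand_in)
```

Caching the parsed corpus this way avoids re-tokenizing the source files on every run, provided the same package version is used to load the pickle.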

For example, you can create a TSV-formatted file from a PubTator file in 10 lines of code:

from pubtatortool import PubTatorCorpus
from pubtatortool.tokenization import get_tokenizer
tokenizer = get_tokenizer(tokenization='wordpiece', vocab='bert-base-cased')
corpus = PubTatorCorpus(['mycorpus.txt'], tokenizer)
with open('outfile.txt', 'w') as outfile:
    for doc in corpus.document_list:
        for sentence, targets in zip(doc.sentences, doc.sentence_targets()):
            for token, label in zip(sentence, targets):
                print("{tok}\t{lab}".format(tok=token, lab=label),
                      file=outfile)
            print('', file=outfile)
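
The file written above holds one `token<TAB>label` pair per line, with a blank line after each sentence. A possible way to read such a file back into sentences is sketched here. `read_token_label_file` is a hypothetical helper, not part of pubtatortool, and the labels in the sample are placeholders, since the actual label values depend on what `doc.sentence_targets()` returns.

```python
import io


def read_token_label_file(lines):
    """Group token<TAB>label lines into sentences split on blank lines."""
    sentences, current = [], []
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            # A blank line closes the current sentence.
            if current:
                sentences.append(current)
                current = []
            continue
        token, label = line.split("\t", 1)
        current.append((token, label))
    if current:
        # Keep a final sentence that lacks a trailing blank line.
        sentences.append(current)
    return sentences


# Placeholder content in the same layout as the file written above.
SAMPLE_TSV = "Aspirin\tB\nhelps\tO\n\nHeadache\tB\n\n"

sentences = read_token_label_file(io.StringIO(SAMPLE_TSV))
print(len(sentences), "sentences")
```

With a real output file, pass the open file object instead of the `io.StringIO` wrapper, for example `read_token_label_file(open('outfile.txt'))`.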
