A package for loading and manipulating PubTator files as Python objects.

The PubTator Parsing Tool

A Python package for loading and manipulating PubTator files as Python objects.

Usage

For basic word tokenization and simple operations:

from pubtatortool import PubTatorCorpus
train_corpus = PubTatorCorpus(['train_corpus_part_1.txt',
                               'train_corpus_part_2.txt'])
dev_corpus = PubTatorCorpus(['dev_corpus.txt'])
test_corpus = PubTatorCorpus(['test_corpus.txt'])

For wordpiece tokenization, with full support for encoding and decoding text for use with machine learning models:

from pubtatortool import PubTatorCorpus
from pubtatortool.tokenization import get_tokenizer
tokenizer = get_tokenizer(tokenization='wordpiece', vocab='bert-base-cased')
train_corpus = PubTatorCorpus(['train_corpus_part_1.txt',
                               'train_corpus_part_2.txt'], tokenizer)
dev_corpus = PubTatorCorpus(['dev_corpus.txt'], tokenizer)
test_corpus = PubTatorCorpus(['test_corpus.txt'], tokenizer)

You can then serialize a corpus with pickle, iterate over its documents through corpus.document_list, and operate on documents under any tokenization policy, even a lossy one, without worrying about mentions becoming decoupled from the text.
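As a minimal sketch of the serialization step, the helpers below round-trip an object through Python's standard pickle module. The names save_corpus and load_corpus are hypothetical and not part of pubtatortool; the sketch only assumes that the corpus object is picklable, as the paragraph above states.

```python
import pickle
from pathlib import Path


def save_corpus(corpus, path):
    """Write a picklable corpus object to `path` (hypothetical helper)."""
    with Path(path).open("wb") as handle:
        pickle.dump(corpus, handle, protocol=pickle.HIGHEST_PROTOCOL)


def load_corpus(path):
    """Read back an object previously written by save_corpus."""
    with Path(path).open("rb") as handle:
        return pickle.load(handle)
```

With a corpus built as shown earlier, this would be used as save_corpus(train_corpus, 'train_corpus.pkl') followed later by load_corpus('train_corpus.pkl'), which avoids re-tokenizing the source files on every run. Only unpickle files you trust, since pickle can execute arbitrary code on load.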

For example, you can create a TSV-formatted file from a PubTator file in about 10 lines of code:

from pubtatortool import PubTatorCorpus
from pubtatortool.tokenization import get_tokenizer
tokenizer = get_tokenizer(tokenization='wordpiece', vocab='bert-base-cased')
corpus = PubTatorCorpus(['mycorpus.txt'], tokenizer)
with open('outfile.txt', 'w') as outfile:
    for doc in corpus.document_list:
        for sentence, targets in zip(doc.sentences, doc.sentence_targets()):
            for token, label in zip(sentence, targets):
                print("{tok}\t{lab}".format(tok=token, lab=label),
                      file=outfile)
            print('', file=outfile)
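The file written above holds one token and its label per line, separated by a tab, with a blank line after each sentence. The following sketch reads such a file back into Python lists; read_token_label_file is a hypothetical helper, not part of pubtatortool, and it assumes that tokens contain no tab characters.

```python
def read_token_label_file(path):
    """Return a list of sentences, each a list of (token, label) pairs."""
    sentences, current = [], []
    # Opened with the platform default encoding, mirroring the writer above.
    with open(path) as infile:
        for line in infile:
            line = line.rstrip("\n")
            if not line:
                # A blank line closes the current sentence.
                if current:
                    sentences.append(current)
                    current = []
                continue
            # Split on the first tab only, so the label is kept whole.
            token, label = line.split("\t", 1)
            current.append((token, label))
    # Keep a trailing sentence that is not followed by a blank line.
    if current:
        sentences.append(current)
    return sentences
```

Calling read_token_label_file('outfile.txt') would give back the sentences in file order. Document boundaries are not recoverable, because the writer does not mark them.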
