The PubTator Parsing Tool
A Python package for loading and manipulating PubTator files as Python objects.
Installation
This package is on the Python Package Index. You can install it using pip install pubtatortool.
Usage
For basic word tokenization and simple operations:
from pubtatortool import PubTatorCorpus
train_corpus = PubTatorCorpus(['train_corpus_part_1.txt',
                               'train_corpus_part_2.txt'])
dev_corpus = PubTatorCorpus(['dev_corpus.txt'])
test_corpus = PubTatorCorpus(['test_corpus.txt'])
For wordpiece tokenization and full ability to encode and decode text for use with machine learning models:
from pubtatortool import PubTatorCorpus
from pubtatortool.tokenization import get_tokenizer
tokenizer = get_tokenizer(tokenization='wordpiece', vocab='bert-base-cased')
train_corpus = PubTatorCorpus(['train_corpus_part_1.txt',
                               'train_corpus_part_2.txt'], tokenizer)
dev_corpus = PubTatorCorpus(['dev_corpus.txt'], tokenizer)
test_corpus = PubTatorCorpus(['test_corpus.txt'], tokenizer)
You can then serialize a corpus using pickle, iterate over documents using corpus.document_list, and perform various operations on documents regardless of tokenization policy, even if it is lossy, without worrying about mentions becoming decoupled from the text.
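The serialization step mentioned above can be sketched with the standard library alone. The helper names save_corpus and load_corpus are hypothetical (they are not part of pubtatortool), and a plain dictionary stands in for a PubTatorCorpus so that the sketch runs without any corpus files.

```python
import os
import pickle
import tempfile


def save_corpus(corpus, path):
    """Write any picklable corpus object to disk (hypothetical helper)."""
    with open(path, 'wb') as handle:
        pickle.dump(corpus, handle, protocol=pickle.HIGHEST_PROTOCOL)


def load_corpus(path):
    """Read a corpus object back from a pickle file (hypothetical helper)."""
    with open(path, 'rb') as handle:
        return pickle.load(handle)


# Stand-in object, NOT a real PubTatorCorpus; used only to show the round trip.
stand_in_corpus = {'document_list': [['An', 'example', 'sentence']]}

with tempfile.TemporaryDirectory() as tmp_dir:
    pickle_path = os.path.join(tmp_dir, 'train_corpus.pkl')
    save_corpus(stand_in_corpus, pickle_path)
    restored = load_corpus(pickle_path)
```

With a real corpus you would pass the PubTatorCorpus instance instead of the stand-in. Loading the pickle later requires pubtatortool to be importable in that environment, and pickle files should only be loaded from sources you trust.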
For example, you can create a TSV-formatted file from a PubTator file in 10 lines of code:
from pubtatortool import PubTatorCorpus
from pubtatortool.tokenization import get_tokenizer
tokenizer = get_tokenizer(tokenization='wordpiece', vocab='bert-base-cased')
corpus = PubTatorCorpus(['mycorpus.txt'], tokenizer)
with open('outfile.txt', 'w') as outfile:
    for doc in corpus.document_list:
        for sentence, targets in zip(doc.sentences, doc.sentence_targets()):
            # One "token<TAB>label" line per token
            for token, label in zip(sentence, targets):
                print("{tok}\t{lab}".format(tok=token, lab=label),
                      file=outfile)
            # Blank line marks the end of each sentence
            print('', file=outfile)
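To consume such a file later, a reader along the following lines may be useful. It is a minimal sketch, assuming one token and label per line separated by a tab and a blank line after each sentence. The function name read_token_label_file and the sample labels B and O are illustrative assumptions, not part of pubtatortool or its output.

```python
import os
import tempfile


def read_token_label_file(path):
    """Return a list of sentences, each a list of (token, label) pairs."""
    sentences, current = [], []
    with open(path, encoding='utf-8') as infile:
        for line in infile:
            line = line.rstrip('\n')
            if not line:
                # A blank line closes the current sentence, if there is one.
                if current:
                    sentences.append(current)
                    current = []
                continue
            token, label = line.split('\t', 1)
            current.append((token, label))
    # Keep a trailing sentence that is not followed by a blank line.
    if current:
        sentences.append(current)
    return sentences


# Hypothetical sample content; the real labels depend on your corpus.
sample = "Aspirin\tB\nhelps\tO\n\nIt\tO\nworks\tO\n\n"

with tempfile.TemporaryDirectory() as tmp_dir:
    sample_path = os.path.join(tmp_dir, 'outfile.txt')
    with open(sample_path, 'w', encoding='utf-8') as handle:
        handle.write(sample)
    parsed = read_token_label_file(sample_path)
```

Here parsed holds two sentences of two (token, label) pairs each. A line without a tab raises a ValueError, which is a deliberate choice to surface malformed input early.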
Source Distribution
File details
Details for the file pubtatortool-0.6.4.tar.gz.
File metadata
- Download URL: pubtatortool-0.6.4.tar.gz
- Upload date:
- Size: 9.0 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/0.0.0 pkginfo/1.6.1 requests/2.24.0 setuptools/50.3.1.post20201107 requests-toolbelt/0.9.1 tqdm/4.50.2 CPython/3.8.5
File hashes
Algorithm | Hash digest
---|---
SHA256 | 638a36f6c6a027b2f399d79afe621b9571aa9762cf731f02312fb44237351018
MD5 | fb49ad2d76bc84c1fb45dc9f17235dc3
BLAKE2b-256 | f91141378b680ef8423a2da4c4e0eee126603660249d0e5224cd07162f478730