Use BERT to Fill in the Blanks

FitBERT

buff bert

FitBert ((F)ill (i)n (t)he blanks, (BERT)) is a library for using BERT to fill in the blank(s) in a section of text from a list of options. Here is the envisioned use case for FitBert:

  1. A service (statistical model or something simpler) suggests replacements/corrections for a segment of text
  2. That service is specialized to a domain, and isn't good at the big picture, e.g. grammar
  3. That service passes the segment of text, with the words to be replaced identified, and the list of suggestions
  4. FitBert crushes all but the best suggestion :muscle:

Blog post walkthrough

Installation

License

This software is distributed under the Apache 2.0 license, except for the WordNet lemma data used for delemmatization, which is distributed under its original license, located in ./fitbert/data/LICENSE.

From PyPI

pip install fitbert

Usage

A Jupyter notebook with a short introduction is available here.

FitBert automatically uses the GPU if torch.cuda.is_available() returns True. To force CPU, pass disable_gpu=True when you instantiate it, for example FitBert(model_name="distilbert-base-uncased", disable_gpu=True). The fastest individual batches come from distilbert on CPU with a batch size of one; maximum throughput comes from a GPU with larger batches.

Usage as a library / in a server

from fitbert import FitBert


# currently supported models: bert-large-uncased and distilbert-base-uncased
# this takes a while and loads a whole big BERT into memory
fb = FitBert()

masked_string = "Why Bert, you're looking ***mask*** today!"
options = ['buff', 'handsome', 'strong']

ranked_options = fb.rank(masked_string, options=options)
# >>> ['handsome', 'strong', 'buff']
# or
filled_in = fb.fitb(masked_string, options=options)
# >>> "Why Bert, you're looking handsome today!"
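To make the relationship between rank and fitb concrete, here is a minimal sketch that runs without loading a model. It assumes that fitb amounts to substituting the top-ranked option for the mask token, which is how the two outputs above relate but is not confirmed by this README. The helper name fill_with_best is hypothetical, and the ranking is copied from the example output above rather than computed.

```python
MASK_TOKEN = "***mask***"  # mask token as it appears in the examples above


def fill_with_best(masked_string, ranked_options):
    """Substitute the first (best) ranked option for the mask token.

    Hypothetical helper for illustration; not part of FitBert's API.
    """
    if MASK_TOKEN not in masked_string:
        raise ValueError("masked_string contains no mask token")
    return masked_string.replace(MASK_TOKEN, ranked_options[0], 1)


masked_string = "Why Bert, you're looking ***mask*** today!"
# Ranking copied from the example output above, not computed here.
ranked_options = ["handsome", "strong", "buff"]
filled_in = fill_with_best(masked_string, ranked_options)
print(filled_in)
```

In real use, ranked_options would come from fb.rank(masked_string, options=options).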

We commonly find ourselves knowing what verb to suggest, but not what conjugation:

from fitbert import FitBert


fb = FitBert()

masked_string = "Why Bert, you're ***mask*** handsome today!"
options = ['looks']

filled_in = fb.fitb(masked_string, options=options)
# >>> "Why Bert, you're looking handsome today!"

# under the hood, we notice there is only one suggestion and act as if
# fitb was called with delemmatize=True:
filled_in = fb.fitb(masked_string, options=options, delemmatize=True)

If you are already using pytorch_pretrained_bert.BertForMaskedLM or transformers.BertForMaskedLM and have an instance already instantiated, you can pass it in to reuse it:

BLM = pytorch_pretrained_bert.BertForMaskedLM.from_pretrained(model_name)
# or
BLM = transformers.BertForMaskedLM.from_pretrained(model_name)
fb = FitBert(model=BLM)

You can also have FitBert mask the string for you:

from fitbert import FitBert


fb = FitBert()

unmasked_string = "Why Bert, you're looks handsome today!"
span_to_mask = (17, 22)
masked_string, masked = fb.mask(unmasked_string, span_to_mask)
# >>> "Why Bert, you're ***mask*** handsome today!", 'looks'

# you can set options = [masked] or use any List[str]
options = [masked]

filled_in = fb.fitb(masked_string, options=options)
# >>> "Why Bert, you're looking handsome today!"

There is also a convenience method that masks and fills in one call:

unmasked_string = "Why Bert, you're looks handsome today!"
span_to_mask = (17, 22)

filled_in = fb.mask_fitb(unmasked_string, span_to_mask)
# >>> "Why Bert, you're looking handsome today!"
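Judging by the examples above, a span is a pair of character offsets with an exclusive end, so (17, 22) covers 'looks'. The sketch below derives such a span from a substring and checks it with plain slicing. Both span_of and apply_mask are hypothetical helpers, not FitBert functions; apply_mask only imitates the output shown above and is not the library's implementation.

```python
def span_of(text, substring):
    """Return (start, end) offsets of the first occurrence, end exclusive."""
    start = text.index(substring)  # raises ValueError if substring is absent
    return (start, start + len(substring))


def apply_mask(text, span, mask_token="***mask***"):
    """Imitate the masking shown above: return (masked_text, removed_substring)."""
    start, end = span
    return text[:start] + mask_token + text[end:], text[start:end]


unmasked_string = "Why Bert, you're looks handsome today!"
span_to_mask = span_of(unmasked_string, "looks")
masked_string, masked = apply_mask(unmasked_string, span_to_mask)
print(span_to_mask, masked_string, masked)
```

Note that span_of finds only the first occurrence, so a word that repeats earlier in the string needs its offsets supplied by whatever service flagged it.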

Client

If you are sending strings to a FitBert server, you need to either mask the string yourself, or identify the span you want masked:

from fitbert import FitBert

s = "This might be justified as a means of signalling the connection between drunken driving and fatal accidents."

better_string, span_to_change = MyRuleBasedNLPModel.remove_overly_fancy_language(s)

assert better_string == "This might be justified to signalling the connection between drunken driving and fatal accidents.", "Notice 'as a means of' became 'to', but we didn't re-conjugate signalling, or fix the spelling mistake"

assert span_to_change == (27, 37), "This span is the start and stop of the characters for the substring 'signalling'."

masked_string, replaced_substring = FitBert.mask(better_string, span_to_change)

assert masked_string == "This might be justified to ***mask*** the connection between drunken driving and fatal accidents."

assert replaced_substring == "signalling"

FitBertServer.fitb(masked_string, options=[replaced_substring])

The benefit of doing this over masking the string yourself is that you don't have to know if the internally used masking token changes. Also, you don't need to create an instance of FitBert, so you don't incur the cost of downloading a pretrained BERT model.

However, you could also write your CallFitBertServer function to take an unmasked string and a span, something like:

FitBertServer.mask_fitb(better_string, span_to_change)

Then you would not need to install FitBert in your client at all.
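If the client reaches the server over HTTP, the request only needs to carry the unmasked string and the span. The sketch below builds such a request body with the standard library. The JSON shape and the field names "text" and "span" are assumptions for illustration, since this README does not define a wire format; build_mask_fitb_payload is a hypothetical helper.

```python
import json


def build_mask_fitb_payload(text, span):
    """Serialize an unmasked string and a (start, end) span as JSON.

    The field names "text" and "span" are hypothetical; match them to
    whatever your own server expects.
    """
    start, end = span
    if not 0 <= start < end <= len(text):
        raise ValueError("span must lie within the text")
    return json.dumps({"text": text, "span": [start, end]})


better_string = (
    "This might be justified to signalling the connection "
    "between drunken driving and fatal accidents."
)
payload = build_mask_fitb_payload(better_string, (27, 37))
decoded = json.loads(payload)
print(decoded["text"][27:37])
```

Validating the span on the client catches off-by-one offsets before they reach the server.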

Development

Run tests with python -m pytest, or use python -m pytest -m "not slow" to skip the tests that spend about 20 seconds loading pretrained BERT.

Acknowledgement

Thanks to NodoBird for letting us use the awesome portrait of Bert depicted above.

Citing

If you use FitBERT in your research, please cite it with the following BibTeX:

@misc{havens2019fitbert,
    title  = {Use BERT to Fill in the Blanks},
    author = {Sam Havens and Aneta Stal},
    url    = {https://github.com/Qordobacode/fitbert},
    year   = {2019}
}
