A package for extracting word representations from BERT/XLNet

Project description

Embedding4BERT



This is a Python library for extracting word embeddings from pre-trained language models.

User Guide

Installation

pip install --upgrade embedding4bert

Usage

Extract word embeddings from pre-trained BERT and XLNet models. The library:

  • Sums the representations of the last four hidden layers.
  • Takes the mean of the representations of a word's subword pieces as the word representation (a from-scratch sketch of this procedure follows the examples below).
  1. Extract BERT word embeddings.
from embedding4bert import Embedding4BERT
emb4bert = Embedding4BERT("bert-base-cased") # bert-base-uncased
tokens, embeddings = emb4bert.extract_word_embeddings('This is a python library for extracting word representations from BERT.')
print(tokens)
print(embeddings.shape)

Expected output:

14 tokens: [CLS] This is a python library for extracting word representations from BERT. [SEP], 19 word-tokens: ['[CLS]', 'This', 'is', 'a', 'p', '##yt', '##hon', 'library', 'for', 'extract', '##ing', 'word', 'representations', 'from', 'B', '##ER', '##T', '.', '[SEP]']
['[CLS]', 'This', 'is', 'a', 'python', 'library', 'for', 'extracting', 'word', 'representations', 'from', 'BERT', '.', '[SEP]']
(14, 768)
  2. Extract XLNet word embeddings.
from embedding4bert import Embedding4BERT
emb4bert = Embedding4BERT("xlnet-base-cased")
tokens, embeddings = emb4bert.extract_word_embeddings('This is a python library for extracting word representations from BERT.')
print(tokens)
print(embeddings.shape)

Expected output:

11 tokens: This is a python library for extracting word representations from BERT., 16 word-tokens: ['▁This', '▁is', '▁a', '▁', 'py', 'thon', '▁library', '▁for', '▁extract', 'ing', '▁word', '▁representations', '▁from', '▁B', 'ERT', '.']
['▁This', '▁is', '▁a', '▁python', '▁library', '▁for', '▁extracting', '▁word', '▁representations', '▁from', '▁BERT.']
(11, 768)
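
For reference, the following is a minimal sketch of the procedure described above, written directly against the Hugging Face transformers API: sum the last four hidden layers, then average the subword pieces back into whole words. The merging loop and variable names here are illustrative assumptions, not the library's internal implementation.

import torch
from transformers import BertModel, BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-cased")
model = BertModel.from_pretrained("bert-base-cased", output_hidden_states=True)
model.eval()

text = "This is a python library for extracting word representations from BERT."
encoded = tokenizer(text, return_tensors="pt")
with torch.no_grad():
    outputs = model(**encoded)

# outputs.hidden_states is a tuple of (num_layers + 1) tensors,
# each of shape (1, seq_len, 768). Sum the last four layers.
subword_vecs = torch.stack(outputs.hidden_states[-4:]).sum(dim=0).squeeze(0)

# Merge WordPiece pieces ('p', '##yt', '##hon' -> 'python') by averaging
# the vectors of the pieces that make up each word.
subwords = tokenizer.convert_ids_to_tokens(encoded["input_ids"][0])
words, word_vecs, pieces = [], [], []
for tok, vec in zip(subwords, subword_vecs):
    if tok.startswith("##") and pieces:
        pieces.append((tok[2:], vec))
        continue
    if pieces:
        words.append("".join(p for p, _ in pieces))
        word_vecs.append(torch.stack([v for _, v in pieces]).mean(dim=0))
    pieces = [(tok, vec)]
words.append("".join(p for p, _ in pieces))
word_vecs.append(torch.stack([v for _, v in pieces]).mean(dim=0))

embeddings = torch.stack(word_vecs)
print(words)             # 14 whole-word tokens, including [CLS] and [SEP]
print(embeddings.shape)  # torch.Size([14, 768])

XLNet uses SentencePiece rather than WordPiece, so the analogous merge for the second example keys on the '▁' word-boundary marker instead of the '##' continuation prefix.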

Citation

For attribution in academic contexts, please cite this work as:

@misc{chai2020-embedding4bert,
  author = {Chai, Yekun},
  title = {embedding4bert: A python library for extracting word embeddings from pre-trained language models},
  year = {2020},
  publisher = {GitHub},
  journal = {GitHub repository},
  howpublished = {\url{https://github.com/cyk1337/embedding4bert}}
}

References

  1. Devlin et al. (2019). BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding.
  2. Yang et al. (2019). XLNet: Generalized Autoregressive Pretraining for Language Understanding.

Download files

Source distribution: embedding4bert-0.0.4.tar.gz (4.6 kB)

Built distribution: embedding4bert-0.0.4-py3-none-any.whl (8.2 kB, Python 3)
