A package for extracting word representations from BERT/XLNet
Embedding4BERT
This is a Python library for extracting word embeddings from pre-trained language models.
User Guide
Installation
pip install --upgrade embedding4bert
Usage
Extract word embeddings from pretrained BERT models:
- Sum the representations of the last four hidden layers.
- Take the mean of the subword-piece representations as the word representation.
- Extract BERT word embeddings.
from embedding4bert import Embedding4BERT
emb4bert = Embedding4BERT("bert-base-cased") # bert-base-uncased
tokens, embeddings = emb4bert.extract_word_embeddings('This is a python library for extracting word representations from BERT.')
print(tokens)
print(embeddings.shape)
Expected output:
14 tokens: [CLS] This is a python library for extracting word representations from BERT. [SEP], 19 word-tokens: ['[CLS]', 'This', 'is', 'a', 'p', '##yt', '##hon', 'library', 'for', 'extract', '##ing', 'word', 'representations', 'from', 'B', '##ER', '##T', '.', '[SEP]']
['[CLS]', 'This', 'is', 'a', 'python', 'library', 'for', 'extracting', 'word', 'representations', 'from', 'BERT', '.', '[SEP]']
(14, 768)
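The pooling scheme described above (sum the last four layers, then average subword pieces per word) can be sketched with NumPy. The function name, the `(num_layers, num_pieces, hidden_size)` layout, and the word-to-piece mapping below are illustrative assumptions, not this library's internal API:

```python
import numpy as np

def pool_word_embeddings(hidden_states, word_to_pieces):
    """Sum the last four layers, then mean-pool subword pieces per word.

    hidden_states: array of shape (num_layers, num_pieces, hidden_size)
    word_to_pieces: list mapping each word to the indices of its pieces
    """
    # Sum the representations of the last four hidden layers.
    summed = hidden_states[-4:].sum(axis=0)  # (num_pieces, hidden_size)
    # Average the pieces belonging to each word.
    return np.stack([summed[idx].mean(axis=0) for idx in word_to_pieces])

# Toy example: 6 layers, 5 pieces, hidden size 8; pieces 1-3
# (e.g. 'p', '##yt', '##hon') form a single word.
rng = np.random.default_rng(0)
states = rng.normal(size=(6, 5, 8))
words = [[0], [1, 2, 3], [4]]
emb = pool_word_embeddings(states, words)
print(emb.shape)  # (3, 8): one vector per word
```

With a real model you would obtain `hidden_states` from the encoder's per-layer outputs; the pooling step itself is model-agnostic.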
- Extract XLNet word embeddings.
from embedding4bert import Embedding4BERT
emb4bert = Embedding4BERT("xlnet-base-cased")
tokens, embeddings = emb4bert.extract_word_embeddings('This is a python library for extracting word representations from BERT.')
print(tokens)
print(embeddings.shape)
Expected output:
11 tokens: This is a python library for extracting word representations from BERT., 16 word-tokens: ['▁This', '▁is', '▁a', '▁', 'py', 'thon', '▁library', '▁for', '▁extract', 'ing', '▁word', '▁representations', '▁from', '▁B', 'ERT', '.']
['▁This', '▁is', '▁a', '▁python', '▁library', '▁for', '▁extracting', '▁word', '▁representations', '▁from', '▁BERT.']
(11, 768)
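In the XLNet output above, the subword merge follows the SentencePiece convention: a piece beginning with the '▁' marker starts a new word, and any other piece continues the previous one. A minimal sketch of that grouping rule (the helper name is hypothetical, not part of this library):

```python
def group_sentencepiece_tokens(pieces):
    """Group SentencePiece subword tokens into words.

    A piece starting with the '\u2581' (lower one eighth block) marker
    begins a new word; any other piece is appended to the previous word.
    """
    words = []
    for piece in pieces:
        if piece.startswith("\u2581") or not words:
            words.append(piece)
        else:
            words[-1] += piece
    return words

pieces = ['▁This', '▁is', '▁a', '▁', 'py', 'thon', '▁library']
print(group_sentencepiece_tokens(pieces))
# ['▁This', '▁is', '▁a', '▁python', '▁library']
```

This reproduces how '▁', 'py', 'thon' collapse into '▁python' in the expected output shown above.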
Citation
For attribution in academic contexts, please cite this work as:
@misc{chai2020-embedding4bert,
  author       = {Chai, Yekun},
  title        = {embedding4bert: A python library for extracting word embeddings from pre-trained language models},
  year         = {2020},
  publisher    = {GitHub},
  journal      = {GitHub repository},
  howpublished = {\url{https://github.com/cyk1337/embedding4bert}}
}