Module for generating aligned contextualized bert embeddings using different strategies
Project description
get_aligned_BERT_emb
Get the aligned BERT embedding for sequence labeling tasks
Installing as a dependency
pip install aligned-bert-embedder
Installing dependencies
conda env create -f environment.yml
Example of usage from cmd (not recommended):
python -m aligned_bert_embedder embed aligned_bert_embedder/configs/snip.yml aligned_bert_embedder/texts/triple.txt
Example of usage from code (preferred):
from aligned_bert_embedder import AlignedBertEmbedder
embeddings = AlignedBertEmbedder(config).embed(
    (
        ('First', 'sentence', 'or', 'other', 'context', 'chunk'),
        ('Second', 'sentence'),
    )
)
The following is the content of the original README.md
file from the developer repo.
Why this repo?
In the original extract_features.py script in BERT, tokens may be split into word pieces as follows:
orig_tokens = ["John", "Johanson", "'s", "house"]
bert_tokens = ["[CLS]", "john", "johan", "##son", "'", "s", "house", "[SEP]"]
orig_to_tok_map = [1, 2, 4, 6]
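The mapping above can be reconstructed by tokenizing each original token separately and recording where its first piece lands. A minimal sketch, in which a toy lookup table stands in for BERT's WordPiece tokenizer:

```python
def build_orig_to_tok_map(orig_tokens, tokenize):
    """Tokenize each original token and record the index of its first piece."""
    bert_tokens = ["[CLS]"]
    orig_to_tok_map = []
    for token in orig_tokens:
        orig_to_tok_map.append(len(bert_tokens))  # index of this token's first piece
        bert_tokens.extend(tokenize(token))
    bert_tokens.append("[SEP]")
    return bert_tokens, orig_to_tok_map

# Toy word-piece lookup standing in for the real WordPiece tokenizer.
toy_pieces = {
    "John": ["john"],
    "Johanson": ["johan", "##son"],
    "'s": ["'", "s"],
    "house": ["house"],
}
bert_tokens, orig_to_tok_map = build_orig_to_tok_map(
    ["John", "Johanson", "'s", "house"], toy_pieces.__getitem__
)
# orig_to_tok_map == [1, 2, 4, 6], matching the example above
```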
We investigate 3 alignment strategies (first, mean, and max) to maintain an original-to-tokenized alignment. Take "Johanson -> johan, ##son" as an example:

- first: take the representation of johan as the whole word Johanson
- mean: take the reduce_mean value of the representations of johan and ##son as the whole word Johanson
- max: take the reduce_max value of the representations of johan and ##son as the whole word Johanson
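The three strategies reduce to simple element-wise operations over the piece vectors. A sketch with hypothetical 4-dimensional vectors for johan and ##son (real BERT-base vectors would be 768-dimensional):

```python
import numpy as np

# Hypothetical contextual vectors for the pieces "johan" and "##son".
johan = np.array([1.0, 2.0, 3.0, 4.0])
son = np.array([3.0, 0.0, 5.0, 2.0])
pieces = np.stack([johan, son])

aligned = {
    "first": pieces[0],            # vector of the first piece, "johan"
    "mean": pieces.mean(axis=0),   # element-wise average over the pieces
    "max": pieces.max(axis=0),     # element-wise maximum over the pieces
}
# aligned["mean"] -> [2., 1., 4., 3.], aligned["max"] -> [3., 2., 5., 4.]
```

Whichever strategy is chosen, the result is a single vector standing for the whole word Johanson.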
How to use this repo?
sh run.sh input_file output_file BERT_BASE_DIR
# For example:
sh run.sh your_data your_data.bert path/to/bert/uncased_L-12_H-768_A-12
You can modify layers and align_strategies in run.sh.
How to load the output embeddings?
After the above procedure, you are expected to get an output file of contextual embeddings (e.g., your_data_6_mean). Then you can load this file like conventional word embeddings, for example in a Python script:
with open("your_data_6_mean", "r", encoding="utf-8") as bert_f:
    for line in bert_f:
        bert_vec = [[float(value) for value in token.split()]
                    for token in line.strip().split("|||")]
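Each parsed line is then a list of per-token vectors; for downstream use it is often convenient to turn it into a 2-D array, one row per original token. A sketch with a toy "|||"-delimited line of 3-dimensional vectors:

```python
import numpy as np

# Toy line in the output format: tokens separated by "|||",
# vector components separated by spaces.
line = "0.1 0.2 0.3|||0.4 0.5 0.6"
sent_mat = np.array([[float(value) for value in token.split()]
                     for token in line.strip().split("|||")])
# sent_mat.shape == (2, 3): 2 original tokens, 3 dimensions each
```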
File details
Details for the file aligned_bert_embedder-0.81-py3-none-any.whl
.
File metadata
- Download URL: aligned_bert_embedder-0.81-py3-none-any.whl
- Upload date:
- Size: 27.7 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/2.0.0 pkginfo/1.5.0.1 requests/2.22.0 setuptools/45.2.0.post20200210 requests-toolbelt/0.9.1 tqdm/4.42.1 CPython/3.7.6
File hashes
Algorithm | Hash digest
---|---
SHA256 | 756edce37afd31d3275c103fe639b4872a41ff115b66a61dc083a9ef43c420ef
MD5 | 7da1d677937c47caed04f0c90d45a096
BLAKE2b-256 | eb736044b174146aa8e522cd6729604cd0e7191d8cbbafadbdf4e6bf2e2af629