Module for generating aligned contextualized BERT embeddings using different strategies
Project description
get_aligned_BERT_emb
Get the aligned BERT embedding for sequence labeling tasks
Installing as a dependency
pip install aligned-bert-embedder
Installing dependencies
conda env create -f environment.yml
Example of usage from the command line (not recommended):
python -m aligned_bert_embedder embed aligned_bert_embedder/configs/snip.yml aligned_bert_embedder/texts/triple.txt
Example of usage from code (preferable)
from aligned_bert_embedder import AlignedBertEmbedder
embeddings = AlignedBertEmbedder(config).embed(
    (
        ('First', 'sentence', 'or', 'other', 'context', 'chunk'),
        ('Second', 'sentence'),
    )
)
The following is the content of the original README.md file from the developer repo.
Why this repo?
In the original extract_features.py script in BERT, tokens may be split into word pieces, as follows:
orig_tokens = ["John", "Johanson", "'s", "house"]
bert_tokens = ["[CLS]", "john", "johan", "##son", "'", "s", "house", "[SEP]"]
orig_to_tok_map = [1, 2, 4, 6]
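The mapping above can be built while tokenizing: record the index of each word's first piece before appending its pieces. A minimal sketch, using a toy word-piece function in place of BERT's actual WordPiece tokenizer:

```python
# Toy subword splitter standing in for BERT's WordPiece tokenizer
# (illustration only; real code would call the tokenizer instead).
def toy_wordpiece(token):
    pieces = {"Johanson": ["johan", "##son"], "'s": ["'", "s"]}
    return pieces.get(token, [token.lower()])

orig_tokens = ["John", "Johanson", "'s", "house"]
bert_tokens = ["[CLS]"]
orig_to_tok_map = []
for token in orig_tokens:
    orig_to_tok_map.append(len(bert_tokens))  # index of the word's first piece
    bert_tokens.extend(toy_wordpiece(token))
bert_tokens.append("[SEP]")

print(bert_tokens)
# ['[CLS]', 'john', 'johan', '##son', "'", 's', 'house', '[SEP]']
print(orig_to_tok_map)
# [1, 2, 4, 6]
```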
We investigate three alignment strategies (first, mean, and max) to maintain an original-to-tokenized alignment. Take "Johanson -> johan, ##son" as an example:
- first: take the representation of `johan` as the whole word `Johanson`
- mean: take the reduce_mean value of the representations of `johan` and `##son` as the whole word `Johanson`
- max: take the reduce_max value of the representations of `johan` and `##son` as the whole word `Johanson`
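In vector terms, each strategy reduces the piece vectors to a single word vector. A minimal sketch over plain Python lists (toy 3-dimensional vectors, not real BERT outputs):

```python
# Toy piece representations for "johan" and "##son" (illustration only;
# real BERT vectors would be 768-dimensional).
piece_vecs = [
    [1.0, 4.0, 3.0],  # johan
    [3.0, 2.0, 5.0],  # ##son
]

def align_first(vecs):
    # Use the first piece's vector for the whole word.
    return vecs[0]

def align_mean(vecs):
    # Element-wise mean over all piece vectors.
    return [sum(dim) / len(vecs) for dim in zip(*vecs)]

def align_max(vecs):
    # Element-wise max over all piece vectors.
    return [max(dim) for dim in zip(*vecs)]

print(align_first(piece_vecs))  # [1.0, 4.0, 3.0]
print(align_mean(piece_vecs))   # [2.0, 3.0, 4.0]
print(align_max(piece_vecs))    # [3.0, 4.0, 5.0]
```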
How to use this repo?
sh run.sh input_file output_file BERT_BASE_DIR
# For example:
sh run.sh your_data your_data.bert path/to/bert/uncased_L-12_H-768_A-12
You can modify layers and align_strategies in run.sh.
How to load the output embeddings?
After the above procedure, you are expected to get an output file of contextual embeddings (e.g., your_data_6_mean). You can then load this file like conventional word embeddings. For example, in a Python script:
with open("your_data_6_mean", "r", encoding="utf-8") as bert_f:
    for line in bert_f:
        bert_vec = [[float(value) for value in token.split()]
                    for token in line.strip().split("|||")]
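A single line of that output can be parsed without a file on disk. The sketch below assumes the format described above: token vectors separated by "|||", values separated by spaces (toy 3-dimensional vectors for illustration):

```python
# A toy output line holding two token vectors
# (real lines would hold 768-dimensional BERT vectors).
line = "0.1 0.2 0.3 ||| 0.4 0.5 0.6\n"

bert_vec = [[float(value) for value in token.split()]
            for token in line.strip().split("|||")]

print(bert_vec)  # [[0.1, 0.2, 0.3], [0.4, 0.5, 0.6]]
```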
Project details
File details
Details for the file aligned_bert_embedder-0.81-py3-none-any.whl.
File metadata
- Download URL: aligned_bert_embedder-0.81-py3-none-any.whl
- Size: 27.7 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/2.0.0 pkginfo/1.5.0.1 requests/2.22.0 setuptools/45.2.0.post20200210 requests-toolbelt/0.9.1 tqdm/4.42.1 CPython/3.7.6
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | `756edce37afd31d3275c103fe639b4872a41ff115b66a61dc083a9ef43c420ef` |
| MD5 | `7da1d677937c47caed04f0c90d45a096` |
| BLAKE2b-256 | `eb736044b174146aa8e522cd6729604cd0e7191d8cbbafadbdf4e6bf2e2af629` |