
Module for generating aligned contextualized BERT embeddings using different strategies

Project description

get_aligned_BERT_emb

Get the aligned BERT embedding for sequence labeling tasks

Installing as a dependency

pip install aligned-bert-embedder

Installing dependencies

conda env create -f environment.yml

Example of usage from the command line (not recommended):

python -m aligned_bert_embedder embed aligned_bert_embedder/configs/snip.yml aligned_bert_embedder/texts/triple.txt

Example of usage from code (preferred):

from aligned_bert_embedder import AlignedBertEmbedder

embeddings = AlignedBertEmbedder(config).embed(
  (
    (
      'First', 'sentence', 'or', 'other', 'context', 'chunk'
    ),
    (
      'Second', 'sentence'
    )
  )
)
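Here, config is the embedder configuration. A minimal sketch of obtaining it, assuming the constructor accepts a parsed YAML config such as the snip.yml shipped with the package (the loading code below is an assumption, not the documented API):

import yaml  # assumption: the config files are plain YAML

with open("aligned_bert_embedder/configs/snip.yml", "r", encoding="utf-8") as f:
    config = yaml.safe_load(f)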

The following is the content of the original README.md file from the developer repo.

Why this repo?

In the original extract_features.py script in BERT, tokens may be split into word pieces as follows:

orig_tokens = ["John", "Johanson", "'s",  "house"]
bert_tokens = ["[CLS]", "john", "johan", "##son", "'", "s", "house", "[SEP]"]
orig_to_tok_map = [1, 2, 4, 6]
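The map records, for each original token, the index of its first word piece. A minimal sketch of how such a map is built, assuming a BERT-style WordPiece tokenizer object (e.g., FullTokenizer from the original BERT repo):

# Assumes tokenizer is e.g. bert.tokenization.FullTokenizer(vocab_file=..., do_lower_case=True)
orig_tokens = ["John", "Johanson", "'s", "house"]
bert_tokens = ["[CLS]"]
orig_to_tok_map = []
for orig_token in orig_tokens:
    orig_to_tok_map.append(len(bert_tokens))   # index of the first word piece
    bert_tokens.extend(tokenizer.tokenize(orig_token))
bert_tokens.append("[SEP]")
# bert_tokens == ["[CLS]", "john", "johan", "##son", "'", "s", "house", "[SEP]"]
# orig_to_tok_map == [1, 2, 4, 6]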

We investigate three alignment strategies (first, mean, and max) to maintain the original-to-tokenized alignment. Take "Johanson -> johan, ##son" as an example (a sketch implementing all three strategies follows the list):

  • first: take the representation of johan as the representation of the whole word Johanson
  • mean: take the reduce_mean of the representations of johan and ##son as the representation of the whole word Johanson
  • max: take the reduce_max of the representations of johan and ##son as the representation of the whole word Johanson
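A minimal NumPy sketch of the three pooling strategies (the function name and signature are illustrative, not part of the package API):

import numpy as np

def pool_word_pieces(piece_vectors, strategy="mean"):
    """Pool an (n_pieces, hidden_size) array of word-piece vectors
    into a single vector for the original word."""
    pieces = np.asarray(piece_vectors)
    if strategy == "first":
        return pieces[0]            # e.g., the vector of "johan"
    if strategy == "mean":
        return pieces.mean(axis=0)  # reduce_mean over "johan" and "##son"
    if strategy == "max":
        return pieces.max(axis=0)   # element-wise reduce_max
    raise ValueError(f"unknown strategy: {strategy!r}")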

How to use this repo?

sh run.sh input_file output_file BERT_BASE_DIR
# For example:
sh run.sh your_data your_data.bert path/to/bert/uncased_L-12_H-768_A-12

You can modify layers and align_strategies in run.sh.

How to load the output embeddings?

After the above procedure, you should get an output file of contextual embeddings (e.g., your_data_6_mean). You can then load this file like conventional word embeddings. For example, in a Python script:

with open("your_data_6_mean", "r", encoding="utf-8") as bert_f:
    for line in bert_f:
        # One line per sentence; token vectors are separated by "|||",
        # and values within a vector by whitespace.
        bert_vec = [[float(value) for value in token.split()] for token in line.strip().split("|||")]
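Each bert_vec then holds one vector per original token of the sentence; for downstream use it can be turned into a matrix (NumPy here is an illustrative choice, not a package requirement):

import numpy as np

sentence_matrix = np.array(bert_vec)  # shape: (num_tokens, hidden_size)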

Project details


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distributions

No source distribution files available for this release. See tutorial on generating distribution archives.

Built Distribution

aligned_bert_embedder-0.81-py3-none-any.whl (27.7 kB)


File details

Details for the file aligned_bert_embedder-0.81-py3-none-any.whl.

File metadata

  • Download URL: aligned_bert_embedder-0.81-py3-none-any.whl
  • Upload date:
  • Size: 27.7 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/2.0.0 pkginfo/1.5.0.1 requests/2.22.0 setuptools/45.2.0.post20200210 requests-toolbelt/0.9.1 tqdm/4.42.1 CPython/3.7.6

File hashes

Hashes for aligned_bert_embedder-0.81-py3-none-any.whl
Algorithm    Hash digest
SHA256       756edce37afd31d3275c103fe639b4872a41ff115b66a61dc083a9ef43c420ef
MD5          7da1d677937c47caed04f0c90d45a096
BLAKE2b-256  eb736044b174146aa8e522cd6729604cd0e7191d8cbbafadbdf4e6bf2e2af629

See more details on using hashes here.
