Pretrained BERT models for encoding clinical trial documents to compact embeddings.
Trial2Vec
Wang, Zifeng and Sun, Jimeng. (2022). Trial2Vec: Zero-Shot Clinical Trial Document Similarity Search using Self-Supervision. Findings of EMNLP'22.
News
- 12/8/2022: Support `download_embedding`, which obtains the pretrained embedding only. It saves a lot of GPU/CPU memory! Please refer to this example for detailed use cases.
from trial2vec import download_embedding
t2v_emb = download_embedding()
- 10/27/2022: Support `word_vector` and `sentence_vector`!
# sentence vectors
inputs = ['I am a sentence', 'I am another sentence']
outputs = model.sentence_vector(inputs)
# torch.tensor w/ shape [2, 128]
# word vectors
inputs = ['I am a sentence', 'I am another sentence abcdefg xyz']
outputs = model.word_vector(inputs)
# {'word_embs': torch.tensor w/ shape [2, max_token, 128], 'mask': torch.tensor w/ shape [2, max_token]}
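The `mask` returned next to `word_embs` matters when word vectors are pooled, because padded positions should not contribute to the result. The following is a minimal sketch of masked mean pooling. It uses plain nested lists as a stand-in for the tensors and assumes that a mask value of 1 marks a real token and 0 marks padding, which the text above does not confirm. `masked_mean` and the toy values are hypothetical and are not part of Trial2Vec.

```python
def masked_mean(word_embs, mask):
    """Average token vectors per sentence, skipping padded positions.

    word_embs: nested list shaped [batch, max_token, dim]
    mask:      nested list shaped [batch, max_token]
               (assumed: 1 = real token, 0 = padding)
    """
    pooled = []
    for sent_vecs, sent_mask in zip(word_embs, mask):
        dim = len(sent_vecs[0])
        total = [0.0] * dim
        count = 0
        for vec, keep in zip(sent_vecs, sent_mask):
            if keep:
                total = [t + v for t, v in zip(total, vec)]
                count += 1
        # Guard against an all-padding row to avoid dividing by zero.
        pooled.append([t / max(count, 1) for t in total])
    return pooled


# Toy stand-in for model.word_vector(...) output: 2 sentences, 3 tokens, dim 2.
toy_embs = [
    [[1.0, 2.0], [3.0, 4.0], [0.0, 0.0]],
    [[2.0, 2.0], [4.0, 4.0], [6.0, 0.0]],
]
toy_mask = [[1, 1, 0], [1, 1, 1]]
pooled = masked_mean(toy_embs, toy_mask)
print(pooled)
```

With real tensors the same idea is usually written as a multiply by the mask followed by a sum over the token axis and a division by the mask sum.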
Usage
Get pretrained Trial2Vec model in three lines:
from trial2vec import Trial2Vec
model = Trial2Vec()
model.from_pretrained()
A Jupyter notebook example is available at https://github.com/RyanWangZf/Trial2Vec/blob/main/example/demo_trial2vec.ipynb.
How to install
Install the correct PyTorch version by referring to https://pytorch.org/get-started/locally/.
Then install Trial2Vec with
# Recommended because it is up to date and keeps receiving small bug fixes
pip install git+https://github.com/RyanWangZf/Trial2Vec.git
or
pip install trial2vec
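After either install command, it can help to confirm which version ended up in the environment. The sketch below relies only on the standard library (`importlib.metadata`, Python 3.8 or later). The helper name `installed_version` is hypothetical, and the distribution name `trial2vec` is assumed from the `pip install` line above.

```python
from importlib import metadata


def installed_version(dist_name):
    """Return the installed version string, or None if it is not installed."""
    try:
        return metadata.version(dist_name)
    except metadata.PackageNotFoundError:
        return None


# Prints a version string if the package is installed, otherwise None.
print(installed_version("trial2vec"))
```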
Search similar trials
Use Trial2Vec to search similar clinical trials:
# load demo data
from trial2vec import load_demo_data
data = load_demo_data()
# contains trial documents
test_data = {'x': data['x']}
# make prediction
pred = model.predict(test_data)
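For intuition, similarity search over embeddings often comes down to ranking candidates by cosine similarity to a query vector. The sketch below illustrates that general idea on toy 2-dimensional vectors. It is not taken from the Trial2Vec source and does not claim to describe what `model.predict` does internally. The function names and the trial IDs are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def rank_similar(query_id, embeddings, top_k=2):
    """Return the top_k (id, score) pairs most similar to the query, best first."""
    query = embeddings[query_id]
    scored = [
        (other_id, cosine(query, vec))
        for other_id, vec in embeddings.items()
        if other_id != query_id  # do not return the query itself
    ]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_k]


# Toy embeddings keyed by made-up trial IDs.
toy = {
    "TRIAL_A": [1.0, 0.0],
    "TRIAL_B": [0.9, 0.1],
    "TRIAL_C": [0.0, 1.0],
    "TRIAL_D": [-1.0, 0.0],
}
ranked = rank_similar("TRIAL_A", toy)
print(ranked)
```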
Encode trials
Use Trial2Vec to encode clinical trial documents:
test_data = {'x': df} # df: dataframe that contains the trial documents
emb = model.encode(test_data) # make inference
# or just look up the pre-encoded trial documents
emb = [model[nct_id] for nct_id in test_data['x']['nct_id']]
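When looking up pre-encoded trials by ID, some IDs may not be present in the store. The sketch below separates found and missing IDs so that the missing ones can be encoded afterwards. It uses a plain `dict` as a stand-in and assumes that an unknown ID raises `KeyError`. Whether `model[nct_id]` behaves this way is not confirmed here. `lookup_embeddings` and the IDs are hypothetical.

```python
def lookup_embeddings(store, nct_ids):
    """Split IDs into those with a stored embedding and those without.

    store: any mapping-like object indexed by trial ID
           (assumed to raise KeyError for unknown IDs)
    """
    found, missing = {}, []
    for nct_id in nct_ids:
        try:
            found[nct_id] = store[nct_id]
        except KeyError:
            missing.append(nct_id)
    return found, missing


# Toy store standing in for the pre-encoded embeddings.
toy_store = {"ID_1": [0.1, 0.2], "ID_2": [0.3, 0.4]}
found, missing = lookup_embeddings(toy_store, ["ID_1", "ID_3", "ID_2"])
print(found, missing)
```

The IDs collected in `missing` could then be passed through `model.encode` as shown above.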
Continue training
You can continue training the pretrained model on new trials as follows:
# just formulate trial documents as the format of `data`
data = load_demo_data()
model.fit(
    {
        'x': data['x'], # document dataframe
        'fields': data['fields'], # attribute field columns
        'ctx_fields': data['ctx_fields'], # context field columns
        'tag': data['tag'], # nct_id is the unique tag for each trial
    },
    valid_data={
        'x': data['x_val'],
        'y': data['y_val'],
    },
)
# save
model.save_model('./finetuned-trial2vec')
File details
Details for the file Trial2Vec-0.1.0.tar.gz.
File metadata
- Download URL: Trial2Vec-0.1.0.tar.gz
- Upload date:
- Size: 24.3 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/4.0.2 CPython/3.7.15
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | c1341bd852c598a6a02aefa43e38f3e3f80a17299469da7c9c57174979b927e3 |
| MD5 | 0ea10528fa8ca855c326ee5664bbcd72 |
| BLAKE2b-256 | d8773917d553ab6d02b09d8f415977e8f124e8e89b4c6062184fb8bbbead56e6 |
File details
Details for the file Trial2Vec-0.1.0-py3-none-any.whl.
File metadata
- Download URL: Trial2Vec-0.1.0-py3-none-any.whl
- Upload date:
- Size: 28.0 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/4.0.2 CPython/3.7.15
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 361c88f5a2fc7e74a5bcafceb409b109ce0ddce615721d412d960e684009a815 |
| MD5 | 66e74cc9a6ae84878cc68e7d9ac3f733 |
| BLAKE2b-256 | 0a0d15db1ee739865367d5bf4ea02426becaa489cb0188ffb33606db5fcdb9de |