Utilities for managing NLP models and for processing text-related data at Wellcome Data Labs

WellcomeML utils

This package contains utility functions for common tasks at Wellcome Data Labs, in particular for processing, embedding and classifying text data. It includes:

  • An intuitive sklearn-like API wrapping text vectorizers, such as Doc2vec, Bert, Scibert
  • A common API for off-the-shelf classifiers to allow quick iteration (e.g. Frequency Vectorizer, Bert, Scibert, basic CNN, BiLSTM, SemanticSimilarity)
  • Utils to download and convert academic text datasets for benchmarking

For more information read the official docs.
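
To make "sklearn-like" concrete, here is a minimal sketch of the fit/transform pattern that such vectorizers follow. ToyCountVectorizer is a hypothetical stand-in written for this illustration, not a class from wellcomeml; see the official docs for the real class names and signatures.

```python
from collections import Counter


class ToyCountVectorizer:
    """Hypothetical stand-in illustrating the fit/transform pattern."""

    def fit(self, texts):
        # Learn a fixed, alphabetically sorted vocabulary from the training texts.
        tokens = sorted({token for text in texts for token in text.lower().split()})
        self.vocabulary_ = {token: index for index, token in enumerate(tokens)}
        return self  # returning self allows chaining, as in scikit-learn

    def transform(self, texts):
        # Map each text to a list of token counts; unseen tokens are ignored.
        rows = []
        for text in texts:
            counts = Counter(text.lower().split())
            rows.append([counts.get(token, 0) for token in self.vocabulary_])
        return rows

    def fit_transform(self, texts):
        return self.fit(texts).transform(texts)


corpus = ["malaria vaccine trial", "vaccine funding"]
vectors = ToyCountVectorizer().fit_transform(corpus)
print(vectors)
```

Because every vectorizer exposes the same three methods, one can be swapped for another without changing the surrounding code, which is the main benefit of following this convention.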

1. Quickstart

Installing from PyPI

pip install wellcomeml

This installs the "vanilla" package. To also install the deep-learning functionality (torch, transformers, spacy transformers), list the extras without spaces and quote the argument so the shell passes it to pip unchanged:

pip install "wellcomeml[spacy,tensorflow,torch]"

For a list of functionalities/classes and the dependencies on "extras", see extras.
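
After installing, a quick way to see which optional dependencies are present is to ask Python whether each package can be found. This is a minimal sketch using only the standard library; it assumes each extra is importable under the same name used in the install command (spacy, tensorflow, torch), and available_packages is a helper name made up for this example.

```python
import importlib.util

# Extras named in the install command above; each is assumed to be
# importable under the same name.
OPTIONAL_PACKAGES = ("spacy", "tensorflow", "torch")


def available_packages(names=OPTIONAL_PACKAGES):
    """Return {name: bool}; True when the package can be located without importing it."""
    return {name: importlib.util.find_spec(name) is not None for name in names}


for name, found in available_packages().items():
    print(f"{name}: {'installed' if found else 'missing'}")
```

Using find_spec avoids actually importing heavy libraries such as tensorflow just to check that they exist.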

Installing from a release wheel

Download the wheel from AWS and pip install it:

pip install wellcomeml-2020.1.0-py3-none-any.whl
pip install wellcomeml-2020.1.0-py3-none-any.whl[deep-learning]

1.1 Installing wellcomeml[deep-learning] on Windows

Torch has a different installation procedure on Windows, so it is not installed automatically with wellcomeml[deep-learning] and must be installed first. The command below is for machines without the CUDA parallel computing platform; for machines with CUDA, see https://pytorch.org/ for the correct installation command:

pip install torch==1.5.1+cpu torchvision==0.6.1+cpu -f https://download.pytorch.org/whl/torch_stable.html

Then install wellcomeml[deep-learning]:

pip install wellcomeml[deep-learning]

2. Development

2.1 Build local virtualenv

make

2.2 Contributing to the docs

Make changes to the .rst files in /docs (please do not change the files whose names start with wellcomeml, as those are generated automatically).

Navigate to the repository root and run

make update-docs

Verify that _build/html/index.html has been generated correctly and submit a PR.

2.3 Release a new version (and upload to AWS S3/PyPI/GitHub)

First create a GitHub token with artifact write access, if you don't already have one, and export it as an environment variable:

export GITHUB_TOKEN=...

The checklist for a new release is:

  • Change wellcomeml/__version__.py
  • Add changelog
  • make dist
  • Verify new package was generated correctly on the pip registry and GitHub releases

2.4 (Optional) Installing from other locations

pip3 install <relative path to this folder>

2.5 Transformers

On macOS, if you get an error message about the Rust compiler, install and initialise it with:

brew install rustup
rustup-init

3. Example usage of some modules

Examples can be found in the subfolder examples.

4. Troubleshooting

If you experience a problem installing or using WellcomeML, please open an issue. It is often worth setting the logging level to DEBUG first, with export LOGGING_LEVEL=DEBUG, as this usually exposes more information that can help resolve the issue.
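
For reference, an environment variable such as LOGGING_LEVEL is typically turned into a logger level along the following lines. This is a sketch of the general pattern, not the package's actual code; configure_logging and the INFO fallback are assumptions made for this example.

```python
import logging
import os


def configure_logging(default="INFO"):
    """Set the root logger level from LOGGING_LEVEL, falling back to `default`."""
    level_name = os.environ.get("LOGGING_LEVEL", default).upper()
    # Look the name up on the logging module; unknown names fall back to INFO.
    level = getattr(logging, level_name, None)
    if not isinstance(level, int):
        level = logging.INFO
    logging.getLogger().setLevel(level)
    return level


level = configure_logging()
print(logging.getLevelName(level))
```

When reporting an issue, include the DEBUG output produced after setting the variable.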

5. Extras

| Module | Description | Extras needed |
|---|---|---|
| wellcomeml.ml.attention | Classes that implement keras layers for attention/self-attention | tensorflow |
| wellcomeml.ml.bert_classifier | Classifier to facilitate fine-tuning bert/scibert | tensorflow |
| wellcomeml.ml.bert_semantic_equivalence | Classifier to learn semantic equivalence between pairs of documents | tensorflow |
| wellcomeml.ml.bert_vectorizer | Text vectorizer based on bert/scibert | torch |
| wellcomeml.ml.bilstm | BiLSTM text classifier | tensorflow |
| wellcomeml.ml.clustering | Text clustering pipeline | NA |
| wellcomeml.ml.cnn | CNN text classifier | tensorflow |
| wellcomeml.ml.doc2vec_vectorizer | Text vectorizer based on doc2vec | NA |
| wellcomeml.ml.frequency_vectorizer | Text vectorizer based on TF-IDF | NA |
| wellcomeml.ml.keras_utils | Utils for computing metrics during training | tensorflow |
| wellcomeml.ml.keras_vectorizer | Text vectorizer based on Keras | tensorflow |
| wellcomeml.ml.sent2vec_vectorizer | Text vectorizer based on Sent2Vec | Requires sent2vec, a non-PyPI package |
| wellcomeml.ml.similarity_entity_linking | A class to find most similar documents to a sentence in a corpus | tensorflow |
| wellcomeml.ml.spacy_classifier | A text classifier based on spacy | spacy, torch |
| wellcomeml.ml.spacy_entity_linking | Similar to similarity_entity_linking, but uses spacy | spacy |
| wellcomeml.ml.spacy_knowledge_base | Creates a knowledge base of entities, based on spacy | spacy |
| wellcomeml.ml.spacy_ner | Named entity recognition classifier based on spacy | spacy |
| wellcomeml.ml.transformers_tokenizer | Bespoke tokenizer based on transformers | transformers |
| wellcomeml.ml.vectorizer | Abstract class for vectorizers | NA |
| wellcomeml.ml.voting_classifier | Meta-classifier based on majority voting | NA |
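
As an illustration of the last entry, majority voting combines the label predictions of several classifiers by picking, for each document, the label that most classifiers chose. The sketch below shows the idea with plain lists; majority_vote is a hypothetical helper for this example and says nothing about how wellcomeml.ml.voting_classifier is implemented.

```python
from collections import Counter


def majority_vote(predictions_per_classifier):
    """Combine label predictions from several classifiers by majority vote.

    `predictions_per_classifier` holds one list of labels per classifier,
    all of the same length. Ties go to the label that was seen first.
    """
    combined = []
    for votes in zip(*predictions_per_classifier):
        # most_common orders equal counts by first occurrence.
        combined.append(Counter(votes).most_common(1)[0][0])
    return combined


predictions = [
    [1, 0, 1],  # classifier A
    [1, 0, 0],  # classifier B
    [0, 1, 0],  # classifier C
]
print(majority_vote(predictions))
```

Using an odd number of classifiers avoids ties in binary problems.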
