Harmony Tool for Retrospective Data Harmonisation

Harmony Python library

You can also join our Discord server! If you found Harmony helpful, you can leave us a review!

What does Harmony do?

  • Psychologists and social scientists often have to match items in different questionnaires, such as "I often feel anxious" and "Feeling nervous, anxious or afraid".
  • This is called harmonisation.
  • Harmonisation is a time-consuming and subjective process.
  • Going through long PDFs of questionnaires and putting the questions into Excel is no fun.
  • Enter Harmony, a tool that uses natural language processing and generative AI models to help researchers harmonise questionnaire items, even in different languages.

Quick start with the code

Read our guide to contributing to Harmony in CONTRIBUTING.md.

You can run the walkthrough Python notebook in Google Colab with a single click.

You can also download an R Markdown notebook to run in RStudio.

You can run the walkthrough R notebook in Google Colab with a single click.

The Harmony Project

Harmony is a tool using AI which allows you to compare items from questionnaires and identify similar content. You can try Harmony at https://harmonydata.ac.uk/app and you can read our blog at https://harmonydata.ac.uk/blog/.

Who to contact?

You can contact the Harmony team at https://harmonydata.ac.uk/, or Thomas Wood at https://fastdatascience.com/.

Installing Harmony

🖱 Looking to try Harmony in the browser?

Visit: https://harmonydata.ac.uk/app/

You can also visit our blog at https://harmonydata.ac.uk/blog/

✅ You need Tika if you want to extract instruments from PDFs

Download and install Java if you don't have it already. Then download Apache Tika from https://tika.apache.org/download.html and run it on your computer:

java -jar tika-server-standard-2.3.0.jar
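Before extracting items from a PDF, it can help to confirm that the Tika server is actually reachable. The following is a minimal sketch, not part of Harmony: the helper name tika_is_running is hypothetical, and it assumes Tika is listening on its default port, 9998, on the same machine.

```python
import urllib.error
import urllib.request

# Assumption: the Tika server was started with its default settings (port 9998).
TIKA_URL = "http://localhost:9998/tika"


def tika_is_running(url=TIKA_URL, timeout=2.0):
    """Return True if an HTTP GET to `url` succeeds with status 200."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return response.status == 200
    except (urllib.error.URLError, OSError):
        # Connection refused, timeout, or an HTTP error status.
        return False


if __name__ == "__main__":
    print("Tika reachable:", tika_is_running())
```

If this prints False, check that the java -jar command above is still running in another terminal.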

Requirements

You need a Windows, Linux or Mac system with

  • Python 3.8 or above
  • the requirements in requirements.txt
  • Java (if you want to extract items from PDFs)
  • Apache Tika (if you want to extract items from PDFs)

🖥 Installing Harmony Python package

You can install from PyPI.

pip install harmonydata

Loading all models

Harmony uses spaCy to help with text extraction from PDFs. spaCy models can be downloaded with the following command in Python:

import harmony
harmony.download_models()

Matching example instruments

import harmony
instruments = harmony.example_instruments["CES_D English"], harmony.example_instruments["GAD-7 Portuguese"]
questions, similarity, query_similarity, new_vectors_dict = harmony.match_instruments(instruments)

How to load a PDF, Excel or Word file into an instrument

harmony.load_instruments_from_local_file("gad-7.pdf")

Optional environment variables

As an alternative to downloading models, you can set environment variables so that Harmony calls spaCy on a remote server. This is only necessary if you are making a server deployment of Harmony.

  • HARMONY_SPACY_PATH - determines where model files are stored. Defaults to HOME DIRECTORY/harmony
  • HARMONY_DATA_PATH - determines where data files are stored. Defaults to HOME DIRECTORY/harmony
  • HARMONY_NO_PARSING - set to 1 to import a lightweight variant of Harmony which doesn't support PDF parsing.
  • HARMONY_NO_MATCHING - set to 1 to import a lightweight variant of Harmony which doesn't support matching.
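The defaults described above can be illustrated with a short sketch. This is not Harmony's own code: resolve_harmony_dir and is_enabled are hypothetical helpers that only mirror the documented behaviour (fall back to a harmony folder in the home directory, and treat the value 1 as "on").

```python
import os
from pathlib import Path


def resolve_harmony_dir(variable_name, environ=None):
    """Return the directory named by `variable_name`, or ~/harmony if it is unset."""
    environ = os.environ if environ is None else environ
    default_dir = Path.home() / "harmony"
    return Path(environ.get(variable_name, default_dir))


def is_enabled(variable_name, environ=None):
    """Return True only if the variable is set to the string "1"."""
    environ = os.environ if environ is None else environ
    return environ.get(variable_name) == "1"


print("Data path: ", resolve_harmony_dir("HARMONY_DATA_PATH"))
print("spaCy path:", resolve_harmony_dir("HARMONY_SPACY_PATH"))
print("Parsing disabled: ", is_enabled("HARMONY_NO_PARSING"))
print("Matching disabled:", is_enabled("HARMONY_NO_MATCHING"))
```

Running this in the environment where you plan to deploy Harmony shows which values would be in effect.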

Loading instruments from PDFs

If you have a local file, you can load it into a list of Instrument instances:

from harmony import load_instruments_from_local_file
instruments = load_instruments_from_local_file("gad-7.pdf")
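If you have several files, one possible approach is to load each in turn and combine the results into a single list. The sketch below relies only on the fact that the loader returns a list. The helper load_many and the file name phq-9.pdf are illustrative and not part of Harmony.

```python
def load_many(paths, loader):
    """Load every file in `paths` and return one flat list of instruments."""
    instruments = []
    for path in paths:
        # Each call returns a list of instruments, so extend rather than append.
        instruments.extend(loader(path))
    return instruments


# Usage with Harmony installed (file names are placeholders):
# from harmony import load_instruments_from_local_file
# instruments = load_many(["gad-7.pdf", "phq-9.pdf"], load_instruments_from_local_file)
```

Passing the loader as an argument keeps the helper easy to try out with a stand-in function before pointing it at real files.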

Matching instruments

Once you have some instruments, you can match them with each other with a call to match_instruments.

from harmony import match_instruments
all_questions, similarity, query_similarity, new_vectors_dict = match_instruments(instruments)

  • all_questions is a list of the questions passed to Harmony, in order.
  • similarity is the similarity matrix returned by Harmony.
  • query_similarity is the degree of similarity of each item to an optional query passed as an argument to match_instruments.
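One way to read the similarity matrix is to list the highest-scoring pairs of items. The sketch below uses a small invented matrix rather than real Harmony output, and assumes the matrix is square with rows and columns in the same order as all_questions. The helper strongest_matches is hypothetical.

```python
def strongest_matches(labels, similarity, top_n=3):
    """Return the top_n (score, label_a, label_b) triples, highest score first."""
    pairs = []
    n = len(labels)
    for i in range(n):
        # Upper triangle only: skips self-matches and mirrored duplicates.
        for j in range(i + 1, n):
            pairs.append((float(similarity[i][j]), labels[i], labels[j]))
    pairs.sort(key=lambda pair: pair[0], reverse=True)
    return pairs[:top_n]


# Toy data for illustration only; the scores are invented.
labels = [
    "I often feel anxious",
    "Feeling nervous, anxious or afraid",
    "I sleep well",
]
similarity = [
    [1.0, 0.8, 0.1],
    [0.8, 1.0, 0.2],
    [0.1, 0.2, 1.0],
]

for score, item_a, item_b in strongest_matches(labels, similarity, top_n=2):
    print(f"{score:.2f}  {item_a}  <->  {item_b}")
```

With real results, you would pass labels derived from all_questions and the similarity matrix returned by match_instruments.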

Using a different vectorisation function

Harmony defaults to sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2. However, you can use other sentence transformers from HuggingFace by setting the environment variable HARMONY_SENTENCE_TRANSFORMER_PATH before importing Harmony:

export HARMONY_SENTENCE_TRANSFORMER_PATH=sentence-transformers/distiluse-base-multilingual-cased-v2
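In a notebook, where a shell export is less convenient, the same variable can be set from Python. This is a minimal sketch. The only requirement taken from the text above is that the assignment runs before Harmony is imported.

```python
import os

# Set the variable first; Harmony must be imported only after this line.
os.environ["HARMONY_SENTENCE_TRANSFORMER_PATH"] = (
    "sentence-transformers/distiluse-base-multilingual-cased-v2"
)

# With Harmony installed, import it only now:
# import harmony
```

If Harmony has already been imported in the current session, restart the interpreter or notebook kernel before setting the variable.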

Using OpenAI or other LLMs for vectorisation

Any word vector representation can be used by Harmony. The example below works for OpenAI's text-embedding-ada-002 model as of July 2023, provided you have created a paid OpenAI account. However, since LLMs are progressing rapidly, we have chosen not to integrate Harmony directly with the OpenAI client libraries, but instead allow you to pass Harmony any vectorisation function of your choice.

import numpy as np
import openai

from harmony import match_instruments_with_function, example_instruments

# Note: openai.Embedding.create belongs to the pre-1.0 openai client (current as of July 2023).
model_name = "text-embedding-ada-002"

def convert_texts_to_vector(texts):
    # Request one embedding per input text and stack them into a 2-D array.
    vectors = openai.Embedding.create(input=texts, model=model_name)["data"]
    return np.asarray([item["embedding"] for item in vectors])

instruments = example_instruments["CES_D English"], example_instruments["GAD-7 Portuguese"]
all_questions, similarity, query_similarity, new_vectors_dict = match_instruments_with_function(instruments, None, convert_texts_to_vector)
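To try the custom-function route without an OpenAI account, you can pass a stand-in vectoriser. The sketch below is a toy and is not suitable for real harmonisation: hashed_bag_of_words is a hypothetical helper that counts hashed tokens, and the only contract it takes from the example above is that the function receives a list of texts and returns a 2-D array with one row per text.

```python
import zlib


def hashed_bag_of_words(text, dimensions=32):
    """Map a text to a fixed-length list of token counts (a toy embedding)."""
    vector = [0.0] * dimensions
    for token in text.lower().split():
        # crc32 is stable across runs, unlike Python's built-in hash() for str.
        vector[zlib.crc32(token.encode("utf-8")) % dimensions] += 1.0
    return vector


def convert_texts_to_vector(texts):
    """Return a 2-D NumPy array with one row per input text."""
    import numpy as np  # imported here so the helper above has no dependencies

    return np.asarray([hashed_bag_of_words(text) for text in texts])


# Usage with Harmony installed:
# from harmony import match_instruments_with_function, example_instruments
# instruments = example_instruments["CES_D English"], example_instruments["GAD-7 Portuguese"]
# results = match_instruments_with_function(instruments, None, convert_texts_to_vector)
```

Because the toy vectors only count words, they will not match items across languages. Swap in a real embedding model once the plumbing works.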

💻 Do you want to run Harmony in your browser locally?

Download and install Docker.

Open a Terminal and run

docker run -p 8000:8000 -p 3000:3000 harmonydata/harmonylocal

Then go to http://localhost:3000 in your browser.

Looking for the Harmony API?

Visit: https://github.com/harmonydata/harmonyapi

Docker images

If you are a Docker user, you can run Harmony from a pre-built Docker image.

Contributing to Harmony

If you'd like to contribute to this project, you can contact us at https://harmonydata.ac.uk/ or make a pull request on our GitHub repository. You can also raise an issue.

Developing Harmony

🧪 Automated tests

Test code is in the tests/ folder and uses unittest.

The testing tool tox is used in the automation with GitHub Actions CI/CD. Since the PDF extraction also needs Java and Tika installed, you cannot run the unit tests without first installing Java and Tika. See above for instructions.

🧪 Use tox locally

Install tox and run it:

pip install tox
tox

In our configuration, tox checks the source distribution using check-manifest, runs setuptools's check, and runs the unit tests using pytest. check-manifest requires your repository to be initialised with git (git init) and to have files added (git add .) at least. You don't need to install check-manifest or pytest yourself, because tox installs them in a separate environment.

The automated tests are run against several Python versions, but on your machine you might be using only one. If that is Python 3.9, run:

tox -e py39

Thanks to GitHub Actions' automated process, you don't need to generate distribution files locally.

⚙️ Continuous integration/deployment to PyPI

This package is based on the template https://pypi.org/project/example-pypi-package/

This package

  • uses GitHub Actions for both testing and publishing
  • is tested when pushing to the master or main branch, and is published when a release is created
  • includes test files in the source distribution
  • uses setup.cfg for version single-sourcing (setuptools 46.4.0+)

⚙️ Re-releasing the package manually

The code to re-release Harmony on PyPI is as follows:

source activate py311
pip install twine
rm -rf dist
python setup.py sdist
twine upload dist/*

😃💁 Who worked on Harmony?

Harmony is a collaboration project between Ulster University, University College London, the Universidade Federal de Santa Maria, and Fast Data Science. Harmony is funded by Wellcome as part of the Wellcome Data Prize in Mental Health.

📜 License

MIT License. Copyright (c) 2023 Ulster University (https://www.ulster.ac.uk)

📜 How do I cite Harmony?

McElroy, E., Moltrecht, B., Ploubidis, G.B., Scopel Hoffman, M., Wood, T.A., Harmony [Computer software], Version 1.0, accessed at https://harmonydata.ac.uk/app. Ulster University (2023)
