Toolkit for ML-based survey quality control

Project description

The ml4qc Python package offers a toolkit for employing machine learning technologies in survey data quality control. Among other things, it helps to extend the surveydata package, advance SurveyCTO’s machine learning roadmap, and contribute to research like the following:

Can machine learning aid survey data quality-control efforts, even without access to actual survey data?

A robust quality-control process with some degree of human review is often crucial for survey data quality, but resources for human review are almost always limited and therefore rationed. While traditional statistical methods of directing quality-control efforts often rely on field-by-field analysis to check for outliers, enumerator effects, and unexpected patterns, newer machine-learning-based methods allow for a more holistic evaluation of interviews. ML methods also allow for human review to train models that can then predict the results of review, increasingly focusing review time on potential problems. In this paper, we present the results of a collaboration between research and practice that explored the potential of ML-based methods to direct and supplement QC efforts in several international settings. In particular, we look at the potential for privacy-protecting approaches that allow ML models to be trained and utilized without ever exposing personally identifiable data (or, indeed, any survey data at all) to ML systems or analysts. Specifically, metadata and paradata, including rich but non-identifying data from mobile device sensors, are used in lieu of potentially sensitive survey data.

Installation

To install the latest version with pip:

pip install ml4qc

Overview

The ml4qc package builds on the scikit-learn toolset. It includes the following utility classes for working with survey data:

  • SurveyML provides core functionality, including preprocessing and outlier detection

  • SurveyMLClassifier builds on SurveyML, adding support for running classification models and reporting out results

While SurveyMLClassifier supports a variety of approaches, the currently recommended approach to binary classification is as follows:

  1. Do not reweight for class imbalances; use SurveyMLClassifier.cv_for_best_hyperparameters() to find the optimal hyperparameters for a given dataset, with neg_log_loss, neg_brier_score, or roc_auc as the CV metric to optimize. This will optimize for an unbiased distribution of estimated probabilities.

  2. Use a calibration_method (isotonic or sigmoid) to calibrate the estimated probability distribution.

  3. Almost always (and especially when classes are imbalanced), specify a non-default option for the classification threshold (and possibly threshold_value), as the default threshold of 0.5 is unlikely to be optimal. When in doubt, use threshold='optimal_f' to choose the threshold that maximizes the F1 score.

This is essentially the approach used in the examples linked below.
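
As a rough illustration of why step 3 matters, the following sketch picks the threshold that maximizes the F1 score over a set of already-calibrated probabilities. It is plain Python and is not ml4qc's own implementation of threshold='optimal_f'; the function names (f1_score, best_f1_threshold) and the toy data are hypothetical and chosen only for this example.

```python
def f1_score(y_true, y_pred):
    """F1 = 2*TP / (2*TP + FP + FN); returns 0.0 when undefined."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def best_f1_threshold(y_true, probs):
    """Try each observed probability as a threshold; return (threshold, f1)."""
    best_threshold, best_f1 = 0.5, -1.0
    for threshold in sorted(set(probs)):
        # Predict the positive class when the probability meets the threshold.
        y_pred = [1 if p >= threshold else 0 for p in probs]
        score = f1_score(y_true, y_pred)
        if score > best_f1:
            best_threshold, best_f1 = threshold, score
    return best_threshold, best_f1


# Hypothetical, imbalanced toy data: 2 flagged interviews out of 10.
y_true = [0, 0, 0, 0, 0, 0, 0, 0, 1, 1]
probs = [0.02, 0.05, 0.05, 0.08, 0.10, 0.12, 0.15, 0.30, 0.25, 0.40]

threshold, f1 = best_f1_threshold(y_true, probs)

# With the default 0.5 threshold, no interview is flagged at all.
default_f1 = f1_score(y_true, [1 if p >= 0.5 else 0 for p in probs])

print(f"best threshold={threshold}, F1={f1:.2f}, F1 at 0.5={default_f1:.2f}")
```

In this toy data no estimated probability reaches 0.5, so the default threshold flags nothing, while a lower threshold recovers both positive cases at the cost of one false positive. This is the kind of situation, common with imbalanced classes, that the non-default threshold options are meant to address.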

Examples

This package is best illustrated by way of example. The following example analyses are available:

Documentation

See the full reference documentation here:

https://ml4qc.readthedocs.io/

Project support

Dobility has generously provided financial and other support for v1 of the ml4qc package, including support for early testing and piloting.

Development

To develop locally:

  1. git clone https://github.com/orangechairlabs/ml4qc.git

  2. cd ml4qc

  3. python -m venv venv

  4. source venv/bin/activate

  5. pip install -r requirements.txt

For convenience, the repo includes .idea project files for PyCharm.

To rebuild the documentation:

  1. Update version number in /docs/source/conf.py

  2. Update layout or options as needed in /docs/source/index.rst

  3. In a terminal window, from the project directory:
    1. cd docs

    2. SPHINX_APIDOC_OPTIONS=members,show-inheritance sphinx-apidoc -o source ../src/ml4qc --separate --force

    3. make clean html

To rebuild the distribution packages:

  1. For the PyPI package:
    1. Update version number (and any build options) in /setup.py

    2. Confirm credentials and settings in ~/.pypirc

    3. Run /setup.py for bdist_wheel build type (Tools… Run setup.py task… in PyCharm)

    4. Delete old builds from /dist

    5. In a terminal window:
      1. twine upload dist/* --verbose

  2. For GitHub:
    1. Commit everything to GitHub and merge to main branch

    2. Add new release, linking to new tag like v#.#.# in main branch

  3. For readthedocs.io:
    1. Go to https://readthedocs.org/projects/ml4qc/, log in, and click to rebuild from GitHub (only if it doesn’t automatically trigger)

