A toolkit for survey evaluation

This repository contains a toolkit for AI-powered survey instrument evaluation. It’s still in early development, but is ready to support piloting and experimentation. To learn more about the overall project, see this blog post.

Installation

Install the latest version with pip:

pip install surveyeval

Overview

Here are the basics:

  1. This toolkit includes code to read, parse, and evaluate survey instruments.

  2. The file-evaluation-example.ipynb Jupyter workbook provides a working example for evaluating a single survey instrument file. It includes details on how to install, configure, and run.

  3. The evaluation engine itself lives in the evaluation_engine module. It provides a pretty basic framework for applying different evaluation lenses to a survey instrument.

  4. The core_evaluation_lenses module contains an initial set of evaluation lenses that can be applied to survey instruments. These are the ones applied in the example workbook. They are:

    1. PhrasingEvaluationLens: Cases where phrasing might be adjusted to improve respondent understanding and reduce measurement error (i.e., the kinds of phrasing issues that would be identified through rigorous cognitive interviewing or other forms of validation)

    2. TranslationEvaluationLens: Cases where translations are inaccurate or phrased such that they might lead to differing response patterns

    3. BiasEvaluationLens: Cases where phrasing might be improved to remove implicit bias or stigmatizing language (inspired by this very helpful post on the subject of using ChatGPT to identify bias)

    4. ValidatedInstrumentEvaluationLens: Cases where a validated instrument might be adapted to better measure an inferred construct of interest

  5. The code for reading and parsing files is in the survey_parser module. There’s much to improve in how different file formats are read into raw text and then parsed into questions, modules, and so on. In particular, one might improve the range of examples provided to the LLM.

You can run the file-evaluation-example.ipynb workbook as-is, but you might also consider customizing the core evaluation lenses to better meet your needs and/or adding your own evaluation lenses to the workbook. When adding new lenses, you can just use any of the initial lenses as a template.

If you make use of this toolkit, we’d love to hear from you — and help to share your results with the community. Please email us at info@higherbar.ai.

Technical notes

Reading input files

The survey_parser module contains code for reading input files. It currently supports the following file formats:

  1. .docx: Word files are read in the read_docx() function, using LangChain’s UnstructuredFileLoader class. Note that text within “content controls” is effectively invisible, so you might need to open your file, select all, and select “Remove content controls” to render the text visible (see here for more on content controls).

  2. .pdf: PDF files are read in the read_pdf_combined() function, which tries to read the text and tables in a PDF (separately), combine them together, and then fall back to using an OCR reader if that process didn’t find much text. There is a ton of room for improvement here.

  3. .xlsx: Excel files are read in the parse_xlsx() function, in two stages. First, it assumes that the file is in XLSForm format and uses the pyxform library to read the survey. If it encounters an error, it falls back to using LangChain’s UnstructuredExcelLoader() to load the workbook in HTML format, then uses that HTML as the raw text for parsing. There is much that can be improved, particularly in how XLSForms are handled (e.g., the current approach doesn’t handle translations well).

  4. .csv: CSV files are read in the parse_csv() function, also using two stages. First, it assumes that the file is a REDCap data dictionary and parses the columns accordingly. If it encounters an error, it falls back to just reading the file as raw text. There is much that can be improved here, particularly in how REDCap data dictionaries are handled (e.g., the current approach doesn’t handle modules or translations).

  5. .html: HTML files are read in the read_html() function, then converted into markdown for parsing.
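The two-stage pattern used for .xlsx and .csv files (try a structured format first, then fall back to raw text) can be sketched as follows. This is a minimal illustration using only the standard library, not the toolkit's actual parse_csv() code: the function name read_csv_with_fallback() and the subset of REDCap column headers it checks for are assumptions made for the example.

```python
import csv
import io

# Assumed subset of REDCap data dictionary headers (illustrative only);
# adjust to match the columns in your own export.
REDCAP_REQUIRED_COLUMNS = ("Variable / Field Name", "Field Type", "Field Label")


def read_csv_with_fallback(text, required_columns=REDCAP_REQUIRED_COLUMNS):
    """Return (kind, content).

    Stage 1: treat the CSV as a data dictionary and return structured rows.
    Stage 2: if that fails, return the unmodified raw text instead.
    """
    try:
        reader = csv.DictReader(io.StringIO(text))
        headers = reader.fieldnames or []
        missing = [c for c in required_columns if c not in headers]
        if missing:
            raise ValueError(f"missing columns: {missing}")
        # Keep only the columns we know how to interpret.
        rows = [{c: row[c] for c in required_columns} for row in reader]
        return "data_dictionary", rows
    except (ValueError, csv.Error):
        # Hypothetical fallback: hand the raw text to the LLM-based parser.
        return "raw_text", text
```

A caller would branch on the first element of the returned tuple: structured rows can be mapped to questions directly, while raw text goes through the same chunk-and-parse path as the other formats.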

All of the raw content is split into 6,000-character chunks with 500 characters of overlap, before being passed on for parsing. This is necessary to both (a) avoid overflowing LLM context windows, and (b) allow the LLM to focus on a tractable amount of text in any given request (with the latter becoming more important as the constraints on context windows are relaxed).
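As a rough sketch of what this chunking looks like, the function below splits text on plain character offsets using the sizes quoted above. The name split_into_chunks() is hypothetical, and the toolkit itself may split differently (for example, on paragraph or sentence boundaries via a text-splitter library), so treat this as an illustration of the overlap arithmetic only.

```python
def split_into_chunks(text, chunk_size=6000, overlap=500):
    """Split text into chunks of at most chunk_size characters, where each
    chunk starts `overlap` characters before the previous chunk ended."""
    if overlap >= chunk_size:
        raise ValueError("overlap must be smaller than chunk_size")
    step = chunk_size - overlap  # how far the window advances each time
    chunks = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # this chunk already reached the end of the text
        start += step
    return chunks
```

With these defaults, each new chunk begins 5,500 characters after the previous one, so a question that straddles a chunk boundary still appears whole in at least one chunk as long as it is shorter than the overlap.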

Overall, the code for reading files performs pretty poorly for all but the simplest formats. There’s much work to do here to improve quality.

Parsing input files

The actual parsing happens with LLM assistance, via the kor library. All of that code also lives in the survey_parser module, with the core parsing instructions and examples in create_schema().

Here too, performance can be quite poor, depending on the complexity of the source file’s organization and formatting. There’s much to improve here, and it’s worth trying the new LangChain approaches to extraction instead of Kor.

Tracking and reporting costs

API usage, including cost estimates, is currently logged to the console at INFO level, but only for the parsing stage. The evaluation stage doesn’t currently track or report costs.
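Because these messages are INFO-level, they only appear if Python logging is configured to show that level. A minimal sketch, assuming the toolkit logs through the standard logging module and that nothing else in your script or notebook has already configured logging differently:

```python
import logging

# Show INFO-level (and higher) messages on the console.
# force=True (Python 3.8+) replaces any handlers already attached to the
# root logger, which helps in notebooks that pre-configure logging.
logging.basicConfig(
    level=logging.INFO,
    format="%(asctime)s %(name)s %(levelname)s: %(message)s",
    force=True,
)
```

Run this before parsing a file; the format string is only a suggestion.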

Roadmap

There’s much that can be improved here. For example:

  • We should track and report costs for the evaluation stage of the process.

  • We should generally overhaul the survey_parser module to better ingest different file formats into raw text that works consistently well for parsing. Better PDF, XLSForm, and REDCap support, in particular, would be nice.

  • We should try replacing Kor with the latest LangChain approaches to extraction.

  • We should add an LLM cache that avoids calling out to the LLM for responses that it already has from prior requests. After all, it’s common to evaluate the same instrument multiple times, and it’s incredibly wasteful to keep going back to the LLM for the same responses every time (for requests that haven’t changed in any way).

  • We should improve how findings are scored and filtered, to avoid giving overwhelming numbers of minor recommendations.

  • We should improve the output format to be more user-friendly. (For example, a direct Word output with comments and tracked changes would be very nice.)

  • We should add more evaluation lenses. For example:

    • Double-barreled questions: Does any question ask about two things at once?

    • Leading questions: Are questions neutral, or do they lead the respondent towards a particular answer?

    • Response options: Are the response options exhaustive and mutually exclusive?

    • Question order effects: The order in which questions appear can influence how respondents interpret and answer subsequent items. It’s essential to evaluate whether any questions might lead or prime respondents in a way that could bias their subsequent answers.

    • Consistency: Are scales used consistently throughout the survey?

    • Reliability and validity: If established scales are used, have they been validated for the target population?

    • Length and respondent burden: Is the survey too long? Long surveys can lead to respondent fatigue, which in turn might lead to decreased accuracy or increased drop-out rates.

  • Ideally, we would parse modules into logical sub-modules that appear to measure a single construct, so that we can better evaluate whether to recommend adaptation of validated instruments. Right now, an entire module is evaluated at once, but modules often contain measurement of multiple constructs.
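To make the LLM cache idea from the list above more concrete, here is one possible shape for it. This is a sketch only: the ResponseCache class and the call_llm callable are hypothetical names, not part of the toolkit, and the cache here lives in memory. Reusing responses across separate runs would require persisting the store (for example, to a file or database).

```python
import hashlib
import json


class ResponseCache:
    """In-memory cache keyed by a hash of the model name and prompt."""

    def __init__(self):
        self._store = {}
        self.hits = 0
        self.misses = 0

    @staticmethod
    def _key(model, prompt):
        # Hash a canonical JSON encoding so identical requests map to one key.
        payload = json.dumps({"model": model, "prompt": prompt}, sort_keys=True)
        return hashlib.sha256(payload.encode("utf-8")).hexdigest()

    def get_or_call(self, model, prompt, call_llm):
        """Return the cached response, or call call_llm(model, prompt) once
        and remember the result for identical future requests."""
        key = self._key(model, prompt)
        if key in self._store:
            self.hits += 1
            return self._store[key]
        self.misses += 1
        response = call_llm(model, prompt)
        self._store[key] = response
        return response
```

Keying on both the model and the exact prompt text means any change to the instrument, the lens instructions, or the model invalidates the cached entry automatically, which matches the "requests that haven’t changed in any way" condition.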

Credits

This toolkit was originally developed by Higher Bar AI, a public benefit corporation, with generous support from Dobility, the makers of SurveyCTO.

Full documentation

See the full reference documentation here:

https://surveyeval.readthedocs.io/

Local development

To develop locally:

  1. git clone https://github.com/higherbar-ai/survey-eval

  2. cd survey-eval

  3. python -m venv venv

  4. source venv/bin/activate

  5. pip install -r requirements.txt

For convenience, the repo includes .idea project files for PyCharm.

To rebuild the documentation:

  1. Update version number in /docs/source/conf.py

  2. Update layout or options as needed in /docs/source/index.rst

  3. In a terminal window, from the project directory:
    1. cd docs

    2. SPHINX_APIDOC_OPTIONS=members,show-inheritance sphinx-apidoc -o source ../src/surveyeval --separate --force

    3. make clean html

To rebuild the distribution packages:

  1. For the PyPI package:
    1. Update version number (and any build options) in /setup.py

    2. Confirm credentials and settings in ~/.pypirc

    3. Run /setup.py for the bdist_wheel and sdist build types (Tools… Run setup.py task… in PyCharm)

    4. Delete old builds from /dist

    5. In a terminal window:
      1. twine upload dist/* --verbose

  2. For GitHub:
    1. Commit everything to GitHub and merge to main branch

    2. Add new release, linking to new tag like v#.#.# in main branch

  3. For readthedocs.io:
    1. Go to https://readthedocs.org/projects/surveyeval/, log in, and click to rebuild from GitHub (only if it doesn’t automatically trigger)
