A toolkit for survey evaluation

Project description

This repository contains a toolkit for AI-powered survey instrument evaluation. It’s still in early development, but is ready to support piloting and experimentation. To learn more about the overall project, see this blog post.

Installation

To install the latest version with pip:

pip install surveyeval

Note that you might need to install additional requirements to use the survey_parser module; pip automatically installs only the requirements for the core evaluation engine. To install all requirements, use the requirements file from the full repo:

pip install -r requirements.txt

Overview

Here are the basics:

  1. This toolkit includes code to read, parse, and evaluate survey instruments.

  2. The file-evaluation-example.ipynb Jupyter notebook provides a working example for evaluating a single survey instrument file. It includes details on how to install, configure, and run.

  3. The evaluation engine itself lives in the evaluation_engine module. It provides a pretty basic framework for applying different evaluation lenses to a survey instrument.

  4. The core_evaluation_lenses module contains an initial set of evaluation lenses that can be applied to survey instruments. These are the ones applied in the example notebook. They are:

    1. PhrasingEvaluationLens: Cases where phrasing might be adjusted to improve respondent understanding and reduce measurement error (i.e., the kinds of phrasing issues that would be identified through rigorous cognitive interviewing or other forms of validation)

    2. TranslationEvaluationLens: Cases where translations are inaccurate or phrased such that they might lead to differing response patterns

    3. BiasEvaluationLens: Cases where phrasing might be improved to remove implicit bias or stigmatizing language (inspired by this very helpful post on the subject of using ChatGPT to identify bias)

    4. ValidatedInstrumentEvaluationLens: Cases where a validated instrument might be adapted to better measure an inferred construct of interest

  5. The code for reading and parsing files is in the survey_parser module. There’s much that could be improved in how different file formats are read into raw text and then parsed into questions, modules, and so on. In particular, one might expand the range of examples provided to the LLM.

You can run the file-evaluation-example.ipynb notebook as-is, but you might also consider customizing the core evaluation lenses to better meet your needs and/or adding your own evaluation lenses to the notebook. When adding new lenses, you can use any of the initial lenses as a template.
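To give a feel for the pattern, here is a conceptual sketch of a custom lens. This is not the toolkit’s actual lens API; the class name, method signature, and return structure are illustrative assumptions, and a real lens would build a prompt and call the LLM the way the core lenses do.

    # Conceptual sketch only: not the toolkit's actual lens API. The class name,
    # method signature, and return structure below are illustrative assumptions.
    class ReadingLevelEvaluationLens:
        """Flag questions whose wording may exceed respondents' reading level."""

        def evaluate(self, question_text: str) -> dict:
            # A real lens would build a prompt and call the LLM here, following the
            # pattern used by PhrasingEvaluationLens and the other core lenses.
            long_words = [word for word in question_text.split() if len(word) > 12]
            return {
                "applies": bool(long_words),
                "finding": "Consider simpler wording for: " + ", ".join(long_words),
            }

    lens = ReadingLevelEvaluationLens()
    print(lens.evaluate("To what extent do you experience intergenerational disadvantage?"))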

If you make use of this toolkit, we’d love to hear from you — and help to share your results with the community. Please email us at info@higherbar.ai.

Technical notes

Reading input files

The survey_parser module contains code for reading input files. It currently supports the following file formats:

  1. .docx: Word files are read in the read_docx() function, using LangChain’s UnstructuredFileLoader() (a standalone sketch of these loader calls follows this list). Note that text within “content controls” is effectively invisible, so you might need to open your file, select all, and select “Remove content controls” to render the text visible (see here for more on content controls).

  2. .pdf: PDF files are read in the read_pdf_combined() function, which tries to read the text and tables in a PDF (separately), combine them, and then fall back to using an OCR reader if that process didn’t find much text. There is a ton of room for improvement here.

  3. .xlsx: Excel files are read in the parse_xlsx() function, in two ways. If the file looks like it’s in XLSForm format, it parses it accordingly; this parsing should be completely lossless and requires no additional parsing at later stages. If the file does not appear to be an XLSForm, the reader falls back to using LangChain’s UnstructuredExcelLoader() to load the workbook in HTML format, then uses that HTML as the raw text for parsing. XLSForm handling should be robust, but there is much that can be improved in how other formats are handled.

  4. .csv: CSV files are read in the parse_csv() function, also in two ways. If the file looks like a REDCap data dictionary, it will parse the columns accordingly (requiring little to no later processing). Otherwise, it falls back to just reading the file as raw text. There is much that can be improved here, particularly in how REDCap data dictionaries are handled (e.g., the current approach doesn’t handle modules or translations).

  5. .html: HTML files are read in the read_html() function, then converted into markdown for parsing.
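For reference, here is roughly what the underlying LangChain loader calls look like on their own. This is a standalone sketch with placeholder file paths, independent of the toolkit’s wrappers (which add format detection and cleanup on top); depending on your LangChain version, these loaders may be importable from langchain.document_loaders instead.

    # Standalone sketch of the LangChain loaders mentioned above, with placeholder
    # file paths; the toolkit's own wrappers add format detection and cleanup.
    from langchain_community.document_loaders import (
        UnstructuredExcelLoader,
        UnstructuredFileLoader,
    )

    # .docx: load the document as unstructured text (content-control text stays invisible)
    docx_docs = UnstructuredFileLoader("instrument.docx").load()
    raw_docx_text = "\n".join(doc.page_content for doc in docx_docs)

    # .xlsx (non-XLSForm fallback): in "elements" mode, an HTML rendering of each
    # sheet is available in the document metadata
    xlsx_docs = UnstructuredExcelLoader("instrument.xlsx", mode="elements").load()
    raw_xlsx_html = "\n".join(
        doc.metadata.get("text_as_html", doc.page_content) for doc in xlsx_docs
    )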

All of the raw content is split into 3,000-character chunks with 500 characters of overlap, before being passed on for parsing. This is necessary to (a) avoid overflowing LLM context windows, (b) avoid overflowing output token limits, and (c) allow the LLM to focus on a tractable amount of text in any given request (with the latter becoming more important as the constraints on context windows are relaxed).
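As an illustration of that chunking scheme (not the toolkit’s exact implementation), a simple character-based splitter with those parameters looks like this:

    # Illustration of character-based chunking with overlap, using the 3,000/500
    # figures described above (not the toolkit's exact implementation).
    def chunk_text(text: str, chunk_size: int = 3000, overlap: int = 500) -> list[str]:
        chunks = []
        start = 0
        while start < len(text):
            chunks.append(text[start:start + chunk_size])
            if start + chunk_size >= len(text):
                break  # the final chunk has been captured
            start += chunk_size - overlap  # step forward, carrying `overlap` characters of context
        return chunks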

Overall, the code for reading files performs pretty poorly for all but the simplest formats. There’s much work to do here to improve quality.

Parsing input files

The actual parsing happens in the survey_parser module, with LLM assistance via the LangChain approach to extraction.

If performance is poor for your file, you can try giving the parser some examples drawn from the raw data read from your file. Search for examples in the survey_parser module to see the baseline examples, then create your own and pass them in via the replacement_examples or additional_examples parameters to the extract_data() function. This helps the LLM better understand your file format.
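As a rough sketch (the structure of each example and the surrounding call are assumptions; mirror the baseline examples in survey_parser and the way extract_data() is invoked in the example notebook):

    # Rough sketch only: the structure of each example is an assumption; mirror
    # the baseline examples defined in the survey_parser module.
    my_examples = [
        {
            "raw_text": "Q3. In the past 7 days, how many days did you work? ___ days",
            "parsed": {
                "questions": [
                    {
                        "id": "Q3",
                        "text": "In the past 7 days, how many days did you work?",
                        "type": "integer",
                    }
                ]
            },
        },
    ]

    # Then supplement the baseline examples:
    #   extract_data(..., additional_examples=my_examples)
    # or replace them entirely:
    #   extract_data(..., replacement_examples=my_examples)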

Tracking and reporting costs

The API usage — including cost estimates — is currently output to the console as INFO-level logs, but only for the parsing stage. The evaluation stage doesn’t currently track or report costs.
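In the meantime, if you want a rough cost figure for your own LLM calls, LangChain’s OpenAI callback is one option. This is a standalone sketch, independent of this toolkit; it assumes an OpenAI-backed model and an OPENAI_API_KEY in your environment, and the model name is just an example.

    # Standalone sketch (independent of this toolkit): report token usage and
    # estimated cost for OpenAI-backed LangChain calls.
    from langchain_community.callbacks import get_openai_callback
    from langchain_openai import ChatOpenAI

    llm = ChatOpenAI(model="gpt-4o-mini")  # example model; requires OPENAI_API_KEY

    with get_openai_callback() as cb:
        llm.invoke("In one sentence, assess this survey question for clarity: 'How old are you?'")

    print(f"Tokens used: {cb.total_tokens}; estimated cost: ${cb.total_cost:.4f}")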

Roadmap

There’s much that can be improved here. For example:

  • We should track and report costs for the evaluation stage of the process.

  • We should generally overhaul the survey_parser module to better ingest different file formats into raw text that works consistently well for parsing. Better PDF and REDCap support, in particular, would be nice.

  • We should add an LLM cache that avoids calling out to the LLM for responses it already has from prior requests. After all, it’s common to evaluate the same instrument multiple times, and it’s incredibly wasteful to keep going back to the LLM for requests that haven’t changed in any way. (See the caching sketch after this list.)

  • We should improve how findings are scored and filtered, to avoid giving overwhelming numbers of minor recommendations.

  • We should improve the output format to be more user-friendly. (For example, a direct Word output with comments and tracked changes would be very nice).

  • We should add more evaluation lenses. For example:

    * Double-barreled questions: Does any question ask about two things at once?

    * Leading questions: Are questions neutral, without leading the respondent towards a particular answer?

    * Response options: Are the response options exhaustive and mutually exclusive?

    * Question order effects: The order in which questions appear can influence how respondents interpret and answer subsequent items. It’s worth evaluating whether any questions might be leading or priming respondents in a way that could bias their subsequent answers.

    * Consistency: Are scales used consistently throughout the survey?

    * Reliability and validity: If established scales are used, have they been validated for the target population?

    * Length and respondent burden: Is the survey too long? Long surveys can lead to respondent fatigue, which in turn can decrease accuracy or increase drop-out rates.

  • Ideally, we would parse modules into logical sub-modules that appear to measure a single construct, so that we can better evaluate whether to recommend adaptation of validated instruments. Right now, an entire module is evaluated at once, but modules often contain measurement of multiple constructs.
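On the caching point above, LangChain’s built-in LLM cache is one likely building block. A minimal sketch (not yet wired into this toolkit; the database filename is just an example):

    # Minimal sketch of a persistent LangChain LLM cache (not yet wired into this
    # toolkit): identical requests are served from a local SQLite file instead of
    # a fresh API call.
    from langchain_community.cache import SQLiteCache
    from langchain_core.globals import set_llm_cache

    set_llm_cache(SQLiteCache(database_path=".surveyeval_llm_cache.db"))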

Credits

This toolkit was originally developed by Higher Bar AI, a public benefit corporation, with generous support from Dobility, the makers of SurveyCTO.

Full documentation

See the full reference documentation here:

https://surveyeval.readthedocs.io/

Local development

To develop locally:

  1. git clone https://github.com/higherbar-ai/survey-eval

  2. cd survey-eval

  3. python -m venv venv

  4. source venv/bin/activate

  5. pip install -r requirements.txt

For convenience, the repo includes .idea project files for PyCharm.

To rebuild the documentation:

  1. Update version number in /docs/source/conf.py

  2. Update layout or options as needed in /docs/source/index.rst

  3. In a terminal window, from the project directory:
    1. cd docs

    2. SPHINX_APIDOC_OPTIONS=members,show-inheritance sphinx-apidoc -o source ../src/surveyeval --separate --force

    3. make clean html

To rebuild the distribution packages:

  1. For the PyPI package:
    1. Update version number (and any build options) in /setup.py

    2. Confirm credentials and settings in ~/.pypirc

    3. Run /setup.py for the bdist_wheel and sdist build types (Tools… Run setup.py task… in PyCharm)

    4. Delete old builds from /dist

    5. In a terminal window:
      1. twine upload dist/* --verbose

  2. For GitHub:
    1. Commit everything to GitHub and merge to main branch

    2. Add a new release, linking to a new tag like v#.#.# on the main branch

  3. For readthedocs.io:
    1. Go to https://readthedocs.org/projects/surveyeval/, log in, and click to rebuild from GitHub (only if it doesn’t automatically trigger)
