A toolkit for survey evaluation
This repository contains a toolkit for AI-powered survey instrument evaluation. It’s still in early development, but is ready to support piloting and experimentation. To learn more about the overall project, see this blog post.
Installation
Install the latest version with pip:
pip install surveyeval
Overview
Here are the basics:
This toolkit includes code to read, parse, and evaluate survey instruments.
The file-evaluation-example.ipynb Jupyter notebook provides a working example for evaluating a single survey instrument file. It includes details on how to install, configure, and run the toolkit.
The evaluation engine itself lives in the evaluation_engine module. It provides a pretty basic framework for applying different evaluation lenses to a survey instrument.
The core_evaluation_lenses module contains an initial set of evaluation lenses that can be applied to survey instruments. These are the ones applied in the example notebook. They are:
PhrasingEvaluationLens: Cases where phrasing might be adjusted to improve respondent understanding and reduce measurement error (i.e., the kinds of phrasing issues that would be identified through rigorous cognitive interviewing or other forms of validation)
TranslationEvaluationLens: Cases where translations are inaccurate or phrased such that they might lead to differing response patterns
BiasEvaluationLens: Cases where phrasing might be improved to remove implicit bias or stigmatizing language (inspired by this very helpful post on the subject of using ChatGPT to identify bias)
ValidatedInstrumentEvaluationLens: Cases where a validated instrument might be adapted to better measure an inferred construct of interest
The code for reading and parsing files is in the survey_parser module. There is considerable room to improve both how different file formats are read into raw text and how that text is then parsed into questions, modules, and so on. In particular, one might broaden the range of examples provided to the LLM.
You can run the file-evaluation-example.ipynb notebook as-is, but you might also consider customizing the core evaluation lenses to better meet your needs, adding your own evaluation lenses to the notebook, or both. When adding new lenses, you can use any of the initial lenses as a template.
If you make use of this toolkit, we’d love to hear from you and help share your results with the community. Please email us at info@higherbar.ai.
Technical notes
Reading input files
The survey_parser module contains code for reading input files. It currently supports the following file formats:
.docx: Word files are read in the read_docx() function, using LangChain’s UnstructuredFileLoader. Note that text within “content controls” is effectively invisible, so you might need to open your file, select all, and select “Remove content controls” to render the text visible (see here for more on content controls).
.pdf: PDF files are read in the read_pdf_combined() function, which tries to read the text and tables in a PDF (separately), combines them, and then falls back to an OCR reader if that process didn’t find much text. There is a ton of room for improvement here.
.xlsx: Excel files are read in the parse_xlsx() function, in two stages. First, it assumes that the file is in XLSForm format and uses the pyxform library to read the survey. If it encounters an error, it falls back to using LangChain’s UnstructuredExcelLoader() to load the workbook in HTML format, then uses that HTML as the raw text for parsing. There is much that can be improved, particularly in how XLSForms are handled (e.g., the current approach doesn’t handle translations well).
.csv: CSV files are read in the parse_csv() function, also using two stages. First, it assumes that the file is a REDCap data dictionary and parses the columns accordingly. If it encounters an error, it falls back to just reading the file as raw text. There is much that can be improved here, particularly in how REDCap data dictionaries are handled (e.g., the current approach doesn’t handle modules or translations).
.html: HTML files are read in the read_html() function, then converted into markdown for parsing.
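The two-stage pattern described for .csv files (try a structured format first, then fall back to raw text) can be illustrated with a minimal standalone sketch. This is not the toolkit’s parse_csv() implementation: the function names, the returned dictionary shape, and the choice of which columns to require are all assumptions made for illustration. The column headers themselves are those of a standard REDCap data dictionary export.

```python
import csv
import io

# Headers from a standard REDCap data dictionary export. Treating exactly
# these four as required is an assumption made for this sketch.
REQUIRED_COLUMNS = ("Variable / Field Name", "Form Name", "Field Type", "Field Label")
CHOICES_COLUMN = "Choices, Calculations, OR Slider Labels"


def parse_redcap_dictionary(csv_text: str) -> list:
    """Parse REDCap data dictionary text; raise ValueError if it doesn't look like one."""
    reader = csv.DictReader(io.StringIO(csv_text))
    headers = reader.fieldnames or []
    missing = [col for col in REQUIRED_COLUMNS if col not in headers]
    if missing:
        raise ValueError(f"not a REDCap data dictionary; missing columns: {missing}")
    return [
        {
            "name": row["Variable / Field Name"],
            "form": row["Form Name"],
            "type": row["Field Type"],
            "label": row["Field Label"],
            # The choices column is optional here; default to an empty string.
            "choices": row.get(CHOICES_COLUMN) or "",
        }
        for row in reader
    ]


def read_csv_with_fallback(csv_text: str) -> dict:
    """Stage 1: structured REDCap parse. Stage 2: fall back to the raw text."""
    try:
        return {"format": "redcap", "questions": parse_redcap_dictionary(csv_text)}
    except (ValueError, csv.Error):
        return {"format": "raw", "text": csv_text}
```

In this sketch the fallback result carries the untouched text, which would then go through the same LLM-assisted parsing as any other raw input.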
All of the raw content is split into 6,000-character chunks with 500 characters of overlap, before being passed on for parsing. This is necessary to both (a) avoid overflowing LLM context windows, and (b) allow the LLM to focus on a tractable amount of text in any given request (with the latter becoming more important as the constraints on context windows are relaxed).
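As a rough illustration of the chunking arithmetic, here is a minimal sketch of fixed-size splitting with overlap. It assumes plain character offsets; the toolkit’s own splitter may behave differently (for example, by preferring to break at paragraph or sentence boundaries), and the function name split_into_chunks is hypothetical.

```python
def split_into_chunks(text: str, chunk_size: int = 6000, overlap: int = 500) -> list:
    """Split text into chunks of up to chunk_size characters, where each chunk
    starts `overlap` characters before the end of the previous one."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be non-negative and smaller than chunk_size")

    step = chunk_size - overlap  # distance between consecutive chunk starts
    chunks = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # this chunk already reaches the end of the text
        start += step
    return chunks
```

With the defaults, a 13,000-character text yields three chunks starting at offsets 0, 5,500, and 11,000, so a question that straddles a chunk boundary is likely to appear whole in at least one chunk.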
Overall, the code for reading files performs pretty poorly for all but the simplest formats. There’s much work to do here to improve quality.
Parsing input files
The actual parsing happens with LLM assistance, via the Kor library. That code also lives in the survey_parser module, with the core parsing instructions and examples in create_schema().
Here too, performance can be quite poor, depending on the complexity of the source file’s organization and formatting. There’s much to improve here, and it’s worth trying the new LangChain approaches to extraction instead of Kor.
Tracking and reporting costs
API usage, including cost estimates, is currently output to the console as INFO-level logs, but only for the parsing stage. The evaluation stage doesn’t currently track or report costs.
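Because these messages are logged at INFO level, they only appear if logging is configured to show that level. A minimal sketch using Python’s standard logging module follows. It configures the root logger, since the logger names the toolkit uses are not documented here; that choice is an assumption, and it also surfaces INFO messages from other libraries.

```python
import logging

# Show INFO-level messages from all loggers on the console.
# force=True (Python 3.8+) replaces any handlers that were configured
# earlier, which is often the case inside a Jupyter session.
logging.basicConfig(
    level=logging.INFO,
    format="%(asctime)s %(name)s %(levelname)s: %(message)s",
    force=True,
)
```

Run this before parsing a file so that the usage and cost messages are not filtered out.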
Roadmap
There’s much that can be improved here. For example:
We should track and report costs for the evaluation stage of the process.
We should generally overhaul the survey_parser module to better ingest different file formats into raw text that works consistently well for parsing. Better PDF, XLSForm, and REDCap support, in particular, would be nice.
We should try replacing Kor with the latest LangChain approaches to extraction.
We should add an LLM cache that avoids calling out to the LLM for responses that it already has from prior requests. After all, it’s common to evaluate the same instrument multiple times, and it’s incredibly wasteful to keep going back to the LLM for the same responses every time (for requests that haven’t changed in any way).
We should improve how findings are scored and filtered, to avoid giving overwhelming numbers of minor recommendations.
We should improve the output format to be more user-friendly. (For example, a direct Word output with comments and tracked changes would be very nice.)
We should add more evaluation lenses. For example:
* Double-barreled questions: Does any question ask about two things at once?
* Leading questions: Are questions neutral, or do any lead the respondent toward a particular answer?
* Response options: Are the response options exhaustive and mutually exclusive?
* Question order effects: The order in which questions appear can influence how respondents interpret and answer subsequent items. It’s essential to evaluate whether any questions might lead or prime respondents in a way that could bias their subsequent answers.
* Consistency: Are scales used consistently throughout the survey?
* Reliability and validity: If established scales are used, have they been validated for the target population?
* Length and respondent burden: Is the survey too long? Long surveys can lead to respondent fatigue, which in turn might lead to decreased accuracy or increased drop-out rates.
Ideally, we would parse modules into logical sub-modules that appear to measure a single construct, so that we can better evaluate whether to recommend adaptation of validated instruments. Right now, an entire module is evaluated at once, but modules often contain measurement of multiple constructs.
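The LLM cache proposed in the roadmap above could start from something like the following sketch. Nothing here exists in the toolkit today: the PromptCache class, its get_or_call() method, and the call_llm callback are hypothetical names, and the design assumes that an identical model and prompt should always return the previously stored response.

```python
import hashlib
import json
from pathlib import Path
from typing import Callable


class PromptCache:
    """Disk-backed cache keyed on a hash of the model name and prompt (hypothetical)."""

    def __init__(self, cache_dir: str):
        self.cache_dir = Path(cache_dir)
        self.cache_dir.mkdir(parents=True, exist_ok=True)

    def _path_for(self, model: str, prompt: str) -> Path:
        # Hash model and prompt together so that any change to either
        # produces a different cache entry.
        key = hashlib.sha256(json.dumps([model, prompt]).encode("utf-8")).hexdigest()
        return self.cache_dir / f"{key}.json"

    def get_or_call(self, model: str, prompt: str,
                    call_llm: Callable[[str, str], str]) -> str:
        """Return the cached response, or call the LLM once and store the result."""
        path = self._path_for(model, prompt)
        if path.exists():
            return json.loads(path.read_text(encoding="utf-8"))["response"]
        response = call_llm(model, prompt)
        path.write_text(json.dumps({"response": response}), encoding="utf-8")
        return response
```

Re-evaluating an unchanged instrument would then hit the cache for every request, while edits to the instrument or to a lens’s instructions change the prompt and so trigger fresh calls. A real implementation would likely also include other request settings (such as temperature) in the key.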
Credits
This toolkit was originally developed by Higher Bar AI, a public benefit corporation, with generous support from Dobility, the makers of SurveyCTO.
Full documentation
See the full reference documentation here:
Local development
To develop locally:
git clone https://github.com/higherbar-ai/survey-eval
cd survey-eval
python -m venv venv
source venv/bin/activate
pip install -r requirements.txt
For convenience, the repo includes .idea project files for PyCharm.
To rebuild the documentation:
- Update version number in /docs/source/conf.py
- Update layout or options as needed in /docs/source/index.rst
- In a terminal window, from the project directory:
cd docs
SPHINX_APIDOC_OPTIONS=members,show-inheritance sphinx-apidoc -o source ../src/surveyeval --separate --force
make clean html
To rebuild the distribution packages:
- For the PyPI package:
  - Update version number (and any build options) in /setup.py
  - Confirm credentials and settings in ~/.pypirc
  - Run /setup.py for the bdist_wheel and sdist build types (Tools… Run setup.py task… in PyCharm)
  - Delete old builds from /dist
  - In a terminal window:
    twine upload dist/* --verbose
- For GitHub:
  - Commit everything to GitHub and merge to main branch
  - Add new release, linking to new tag like v#.#.# in main branch
- For readthedocs.io:
  - Go to https://readthedocs.org/projects/surveyeval/, log in, and click to rebuild from GitHub (only if it doesn’t automatically trigger)