crowdnalysis
Library to help analyze crowdsourcing results
Crowdsourcing Citizen Science projects usually require citizens to classify items (images, pdfs, songs, etc.) into one of a finite set of categories. Once an item is annotated by contributing citizens, we need to aggregate these annotations to obtain a consensus classification. Usually, the consensus for an item is obtained by selecting its most voted category. crowdnalysis allows computing consensus using more advanced techniques beyond standard majority voting. In particular, it provides consensus methods that model the quality of each citizen scientist involved in the project. This more advanced consensus results in higher quality information for the Crowdsourcing Citizen Science project, an essential requirement as citizens are increasingly willing and able to contribute to science.
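For reference, the snippet below sketches the baseline majority-voting rule in plain Python (the item IDs and labels are invented for illustration); the consensus models listed next replace this rule with probabilistic ones that also estimate annotator quality.

```python
# Baseline aggregation: pick the most voted category per item.
from collections import Counter

# annotations[item_id] -> list of category labels reported by citizens (made-up data)
annotations = {
    "img_001": ["flood", "flood", "no_flood"],
    "img_002": ["no_flood", "no_flood", "flood", "no_flood"],
}

def majority_vote(labels):
    """Return the most voted category (ties broken arbitrarily)."""
    return Counter(labels).most_common(1)[0][0]

consensus = {item: majority_vote(labels) for item, labels in annotations.items()}
print(consensus)  # {'img_001': 'flood', 'img_002': 'no_flood'}
```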
Implemented consensus algorithms
- Majority Voting
- Probabilistic
- Multinomial
- Dawid-Skene
In addition to the pure Python implementations above, the following models are implemented in the probabilistic programming language Stan and used via the CmdStanPy interface:
- Multinomial
- Multinomial Eta
- Dawid-Skene
- Dawid-Skene Eta Hierarchical
Note: Eta models impose that, for each real class, the probability of reporting the correct label is the largest entry in that row of the error-rate (a.k.a. confusion) matrix.
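As an illustration of that constraint (not library code), the following NumPy snippet builds an error-rate matrix whose diagonal entry dominates each row, which is what the Eta models impose; the numbers are made up.

```python
import numpy as np

# pi[k, :] = P(reported label | real class k)
pi = np.array([
    [0.80, 0.15, 0.05],   # real class 0: the correct label is most probable
    [0.10, 0.75, 0.15],   # real class 1
    [0.05, 0.20, 0.75],   # real class 2
])

assert np.allclose(pi.sum(axis=1), 1.0)               # each row is a distribution
assert np.all(np.argmax(pi, axis=1) == np.arange(3))  # Eta: diagonal dominates each row
```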
Features
- Import annotation data from a CSV file, with a preprocessing option
- Set inter-dependencies between questions to filter out irrelevant annotations
- Distinguish real classes for answers from reported labels (e.g., "Not answered")
- Calculate inter-rater reliability with different measures
- Fit selected model to annotation data and compute the consensus
- Compute the consensus with a fixed pre-determined set of parameters
- Fit the model parameters provided that the consensus is already known
- Generate the confusion matrix between a consensus and the ground truth
- Given the parameters of a generative model (Multinomial, Dawid-Skene), sample annotations, tasks, and workers (i.e., annotators); see the sketch after this list
- Conduct prospective analysis of the 'accuracy vs. number of annotations' for a given set of models
- Visualize the error-rate matrix for annotators
- Visualize the consensus on annotated images in HTML format
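The sketch below illustrates the sampling feature mentioned above for a Dawid-Skene-style generative model, written in plain NumPy; it is not the library's own sampler, and all parameter values are made up for illustration.

```python
# Generative process: draw a true class per task, then let each worker report
# a label according to their own error-rate (confusion) matrix.
import numpy as np

rng = np.random.default_rng(0)

K = 3                                    # number of real classes
tau = np.array([0.5, 0.3, 0.2])          # prior over real classes
n_workers, n_tasks = 4, 10

# pi[w, k, :] = P(reported label | real class k) for worker w (fairly reliable workers)
pi = np.full((n_workers, K, K), 0.1) + 0.7 * np.eye(K)

true_class = rng.choice(K, size=n_tasks, p=tau)
annotations = np.array([
    [rng.choice(K, p=pi[w, true_class[t]]) for t in range(n_tasks)]
    for w in range(n_workers)
])                                        # annotations[w, t] = label reported by worker w on task t

print(true_class)
print(annotations)
```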
Quick start
crowdnalysis is distributed via PyPI: https://pypi.org/project/crowdnalysis/
You can easily install it just like any other PyPI package:
pip install crowdnalysis
CmdStanPy will be installed automatically as a dependency. However, this package also requires the CmdStan command-line interface, which can be installed by running the install_cmdstan utility that comes with CmdStanPy. See the related docs for more information.
install_cmdstan
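If you prefer to stay inside Python, CmdStanPy also exposes the installer as a function:

```python
# Equivalent to running the install_cmdstan shell utility.
import cmdstanpy

cmdstanpy.install_cmdstan()  # downloads and builds CmdStan
```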
Use the package in code:
>>> import crowdnalysis
Check available consensus models:
>>> crowdnalysis.factory.Factory.list_registered_algorithms()
See the TUTORIAL notebook for the usage of the main features.
Unit tests
We use pytest as the testing framework. Tests can be run from the cloned repository directory with:
pytest
If you want to get the logs of the execution, run:
pytest --log-cli-level 0
Logging
We use the standard logging library.
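Client code can therefore control the log output with the usual logging configuration; the sketch below assumes the loggers are named after the package, which is the standard convention.

```python
import logging

# Show INFO messages globally, and DEBUG messages from the package's loggers
# (assumes they follow the "crowdnalysis" naming convention).
logging.basicConfig(level=logging.INFO)
logging.getLogger("crowdnalysis").setLevel(logging.DEBUG)
```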
Deployment to PyPI
Note for contributors
Follow these simple steps to have a new release automatically deployed to PyPI by the CD workflow. The example is given for version v1.0.2:
- Update the version in src/crowdnalysis/_version.py:
__version__ = "1.0.2"  # Note no "v" prefix here.
- git push the changes to origin and make sure the remote master branch is up-to-date;
- Create a new tag, preferably with a (multiline) annotation:
git tag -a v1.0.2 -m "
. Upgrade to CmdStanPy v1.0.1"
- Push the tag to origin:
git push origin v1.0.2
Shortly after, the new version will be available on PyPI.
License
This project is licensed under the GNU Affero General Public License v3.0 - see the LICENSE file for details.
Citation
If you find our software useful for your research, kindly consider citing it using the following biblatex entry with the DOI attached to all versions:
@software{crowdnalysis2022,
author = {Cerquides, Jesus and M{\"{u}}l{\^{a}}yim, Mehmet O{\u{g}}uz},
title = {crowdnalysis: A software library to help analyze crowdsourcing results},
month = jan,
year = 2022,
publisher = {Zenodo},
doi = {10.5281/zenodo.5898579},
url = {https://doi.org/10.5281/zenodo.5898579}
}
Acknowledgements
crowdnalysis is being developed within the Crowd4SDG and Humane-AI-net projects funded by the European Union’s Horizon 2020 research and innovation programme under grant agreements No. 872944 and No. 952026.
Reference
For the details of the conceptual and mathematical model of crowdnalysis, see:
[1] Cerquides, J.; Mülâyim, M.O.; Hernández-González, J.; Ravi Shankar, A.; Fernandez-Marquez, J.L. A Conceptual Probabilistic Framework for Annotation Aggregation of Citizen Science Data. Mathematics 2021, 9, 875. https://doi.org/10.3390/math9080875