Privacy-friendly data collection of the libraries your users are using
Project description
python-popularity-contest
In interactive computing installations, knowing which Python libraries are in use is extremely helpful for managing environments for users.
python-popularity-contest collects pre-aggregated, anonymized data on which installed libraries are being actively used by your users.
It is named after the Debian popularity contest.
What data is collected?
We want to collect just enough data to help with the following tasks:
- Remove unused libraries that have never been imported. These can probably be removed without a lot of breakage for individual users.
- Provide aggregate statistics about the 'popularity' of a library, to add a data point for understanding how important a particular library is to a group of users. This can help with funding requests, better training recommendations, etc.
To collect the smallest amount of data possible, we aggregate it at the source. Only overall global counts are stored, without any individual record of each source. This exposes far less than storing per-user or per-process records would.
The data we have will be a time series for each library, representing the cumulative count of processes where any module from this library was imported. This is designed as a prometheus counter, which is how queries are eventually written.
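Because each series is a cumulative counter, usage over a period is the difference between samples rather than the raw value. The following is a minimal sketch of that arithmetic, not part of popularity_contest. The sample values are hypothetical, and it assumes that any decrease means the counter was reset (for example because the aggregating server restarted), which is how prometheus-style counters are usually interpreted.

```python
def counter_increase(samples):
    """Total increase across cumulative counter samples, tolerating resets."""
    total = 0
    previous = None
    for value in samples:
        if previous is not None:
            if value >= previous:
                total += value - previous
            else:
                # Assumed reset: the counter restarted from zero, so the
                # whole current value counts as new usage.
                total += value
        previous = value
    return total


# Hypothetical cumulative counts for one library, sampled over time.
numpy_samples = [3, 7, 7, 12, 2, 5]
print(counter_increase(numpy_samples))
```

In a real deployment this calculation would normally be done by the query layer rather than by hand.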
Collection infrastructure
popularity_contest emits metrics over the statsd protocol, so you need a statsd server running to collect and aggregate this information. Since statsd only stores global aggregate counts, we never collect data beyond what we need.
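To make the wire format concrete, here is a hedged sketch of what a single counter increment looks like in the plain-text statsd protocol. It is an illustration only: the helper names are hypothetical, UDP transport is an assumption, and the client code inside popularity_contest may be structured differently.

```python
import socket


def format_counter(metric, value=1):
    """Build a statsd counter line of the form <metric>:<value>|c."""
    return f"{metric}:{value}|c".encode("ascii")


def send_counter(metric, host, port, value=1):
    """Fire-and-forget a counter increment over UDP (assumed transport)."""
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.sendto(format_counter(metric, value), (host, port))


# The packet carries only a metric name and an increment, nothing that
# identifies the user or process that sent it.
packet = format_counter("python_popcon.library_used.numpy")
print(packet)
```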
The recommended collection pipeline is:
- prometheus_statsd as the statsd server that metrics are sent to.
- A mapping rule to convert the statsd metrics into usable prometheus metrics, with helpful labels for library names. Instead of many metrics named like python_popcon_library_used_<library-name>, we can get a better python_popcon_library_used{library="<library-name>"}. A mapping rule that works with the default statsd metric name structure would look like:

      mappings:
        - match: "python_popcon.library_used.*"
          name: "python_popcon_library_used"
          labels:
            library: "$1"

  You can add additional labels here if you would like.
- A prometheus server that scrapes the metrics from prometheus_statsd and stores them in a queryable form. A tool like grafana can then be used to visualize the results.
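The effect of the mapping rule can be emulated in a few lines of Python. This is a simplified sketch for building intuition, not the statsd server's implementation. It assumes that each `*` in the match pattern captures exactly one dot-separated component of the statsd name, and that `$1`, `$2`, and so on refer to those captures in order. The function name `apply_mapping` is hypothetical.

```python
import re


def apply_mapping(statsd_name, match, name, labels):
    """Return (metric_name, labels) if statsd_name fits the glob, else None."""
    # Assumption: each '*' matches exactly one dot-separated component.
    pattern = "^" + r"\.".join(
        "([^.]+)" if part == "*" else re.escape(part)
        for part in match.split(".")
    ) + "$"
    found = re.match(pattern, statsd_name)
    if found is None:
        return None
    resolved = {}
    for key, template in labels.items():
        # Replace $1, $2, ... with the corresponding captured component.
        resolved[key] = re.sub(
            r"\$(\d+)", lambda m: found.group(int(m.group(1))), template
        )
    return name, resolved


result = apply_mapping(
    "python_popcon.library_used.numpy",
    match="python_popcon.library_used.*",
    name="python_popcon_library_used",
    labels={"library": "$1"},
)
print(result)
```

One metric name with a `library` label keeps queries simple: a single expression can rank or filter all libraries, instead of needing one query per metric name.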
Kubernetes setup
If you are running a kubernetes cluster of some sort, you probably already have prometheus running for metrics collection. prometheus_statsd has a helm chart that can be deployed easily on the cluster. Here is a sample helm config:
service:
  # Tell prometheus server we want metrics scraped from port 9102
  annotations:
    prometheus.io/scrape: "true"
    prometheus.io/port: "9102"
statsd:
  mappingConfig: |-
    mappings:
      - match: "python_popcon.library_used.*"
        name: "python_popcon_library_used"
        labels:
          library: "$1"
The prometheus-statsd chart has a bug where mappingConfig does not take effect until you restart the prometheus-statsd pod.
Installing
popularity-contest is available from PyPI, and can be installed with pip.
python3 -m pip install popularity-contest
It must be installed in the environment we want instrumented.
Usage
Activation
After installation, the popularity_contest reporter must be explicitly set up. You can enable reporting for all IPython sessions (and hence Jupyter Notebook sessions) with an IPython startup script.
The startup script just needs two lines:
import popularity_contest.reporter
popularity_contest.reporter.setup_reporter()
Since the instrumentation is usually set up by an admin and not the user, the preferred path for the script is inside sys.prefix, the location of your virtual environment. For example, if you have a conda environment installed in /opt/conda, you can put the file in /opt/conda/etc/ipython/startup/000-popularity-contest.py. This way, it also gets loaded before any user-specific IPython startup scripts.
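An admin could script this step. The sketch below derives the startup script path from sys.prefix and writes the two-line script there. The helper names are hypothetical, and it assumes the process has write access to the environment directory.

```python
import os
import sys

# The startup script contents described above.
STARTUP_CODE = (
    "import popularity_contest.reporter\n"
    "popularity_contest.reporter.setup_reporter()\n"
)


def startup_script_path(prefix=sys.prefix):
    """Path of the environment-wide IPython startup script."""
    return os.path.join(
        prefix, "etc", "ipython", "startup", "000-popularity-contest.py"
    )


def install_startup_script(prefix=sys.prefix):
    """Write the startup script, creating the directory if needed."""
    path = startup_script_path(prefix)
    os.makedirs(os.path.dirname(path), exist_ok=True)
    with open(path, "w") as f:
        f.write(STARTUP_CODE)
    return path


# Show where the script would go for a conda environment in /opt/conda.
print(startup_script_path("/opt/conda"))
```

Run `install_startup_script()` with the interpreter of the environment you want instrumented, so that sys.prefix points at the right place.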
Only modules imported after the reporter is set up with popularity_contest.reporter.setup_reporter() will be counted. This reduces noise from baseline libraries (like IPython or six) that are used invisibly by everyone.
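One way to picture this "only count what comes after setup" behaviour is as a before/after comparison of sys.modules. The sketch below illustrates that idea only. It is not the reporter's actual implementation, the function names are hypothetical, and it treats a top-level package name as the library name, which is a simplification.

```python
import sys


def top_level(module_names):
    """Reduce dotted module names to their top-level package names."""
    return {name.split(".")[0] for name in module_names}


def newly_imported(before, after):
    """Top-level packages present in `after` but not in `before`."""
    return top_level(after) - top_level(before)


# Snapshot taken at the moment a (hypothetical) reporter is set up.
before = set(sys.modules)

import colorsys  # stands in for a library the user imports later

after = set(sys.modules)
print(sorted(newly_imported(before, after)))
```

Anything already loaded when the snapshot is taken, such as the interpreter's own baseline modules, never shows up in the difference.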
Statsd server connection info
popularity_contest expects the following environment variables to be set.
- PYTHON_POPCONTEST_STATSD_HOST - the hostname or IP address of the server statsd packets will be sent to.
- PYTHON_POPCONTEST_STATSD_PORT - the port to send statsd packets to. With the recommended prometheus_statsd setup, this will be 9125.
- PYTHON_POPCONTEST_STATSD_PREFIX - the prefix each statsd metric will have, defaults to python_popcon.library_used. So each metric in statsd will be of the form python_popcon.library_used.<library-name>.

  You can put additional information in this prefix, and use that to extract more labels in prometheus. For example, in a zero-to-jupyterhub on k8s setup, you can add information about the current hub namespace like this:

      hub:
        extraConfig:
          07-popularity-contest: |
            import os
            pod_namespace = os.environ['POD_NAMESPACE']
            c.KubeSpawner.environment.update({
                'PYTHON_POPCONTEST_STATSD_PREFIX': f'python_popcon.namespace.{pod_namespace}.library_used'
            })
A mapping rule can be added to prometheus_statsd to extract the namespace.

    mappings:
      - match: "python_popcon.namespace.*.library_used.*"
        name: "python_popcon_library_used"
        labels:
          namespace: "$1"
          library: "$2"
The prometheus metrics produced out of this will be of the form
python_popcon_library_used{library="<library-name>", namespace="<namespace>"}
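To see how the three environment variables combine into a statsd metric name, here is a small sketch. The helper names, the example hostname, and the `prod` namespace are hypothetical. Only the variable names and the default prefix come from the description above, and the reporter itself may read its configuration differently.

```python
import os

# Default prefix as documented above.
DEFAULT_PREFIX = "python_popcon.library_used"


def statsd_settings(environ=os.environ):
    """Read connection settings from the documented environment variables."""
    return {
        "host": environ["PYTHON_POPCONTEST_STATSD_HOST"],
        "port": int(environ["PYTHON_POPCONTEST_STATSD_PORT"]),
        "prefix": environ.get("PYTHON_POPCONTEST_STATSD_PREFIX", DEFAULT_PREFIX),
    }


def metric_name(prefix, library):
    """Full statsd metric name for one library."""
    return f"{prefix}.{library}"


# Hypothetical environment, as a hub spawner might set it for a user pod.
example_env = {
    "PYTHON_POPCONTEST_STATSD_HOST": "statsd.example.internal",
    "PYTHON_POPCONTEST_STATSD_PORT": "9125",
    "PYTHON_POPCONTEST_STATSD_PREFIX": "python_popcon.namespace.prod.library_used",
}
settings = statsd_settings(example_env)
print(metric_name(settings["prefix"], "numpy"))
```

The resulting name has the shape the namespace mapping rule expects, with the namespace and the library each occupying one dot-separated component.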
Privacy
Collecting limited, pre-aggregated data helps preserve privacy as much as possible, and might be sufficient in cases where other data with more private information (like usernames tied to activity times) would otherwise be collected.
However, side channel attacks are still possible if the entire set of time series data is available. Individual users might have specific patterns of modules they use, and these might be discernible with enough analysis. If some libraries are used only by particular users, this analysis becomes easier. Further aggregation of the data, redaction of information about modules that don't have a lot of use, and similar methods can be used to further anonymize this dataset, based on your needs.
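As one illustration of the redaction idea, the sketch below folds rarely used libraries into a single bucket before the counts are shared. The function name, the threshold of 5, the bucket label, and the sample counts are all hypothetical choices, and this is not a feature of popularity_contest itself.

```python
def redact_rare(counts, minimum=5, bucket="other"):
    """Fold libraries with fewer than `minimum` uses into a single bucket."""
    redacted = {}
    for library, count in counts.items():
        key = library if count >= minimum else bucket
        redacted[key] = redacted.get(key, 0) + count
    return redacted


# Hypothetical usage counts exported from the metrics store.
usage = {"numpy": 120, "pandas": 85, "obscure_tool": 2, "niche_lib": 1}
print(redact_rare(usage))
```

A higher threshold hides more of the long tail, at the cost of losing exactly the "never or rarely used" signal that helps with cleanup, so the right value depends on who will see the data.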
Project details
Download files
Hashes for popularity_contest-0.4.1-py3-none-any.whl
Algorithm | Hash digest
---|---
SHA256 | c8e6f1546620cdb7170761f3fc1a3f90f837fb74d267f1d623ae41956526f5c8
MD5 | 49f16b27d9052f61cb78c97dae7f6317
BLAKE2b-256 | 634af038c414a564fc431e9ea895c2abc0d5ad81f5180f83ba3b08da58585362