A package for generating impact charts.
impactchart
impactchart is a Python library for generating impact charts.
What are Impact Charts?
(See also the ACM FAccT paper for more background and details.)
Impact charts make it easy to take a data set and visualize the impact of one variable on another in ways that techniques like scatter plots and linear regression can't, especially when there are other variables involved.
For example, consider this impact chart, which looks at the impact of x_3 on y in a particular data set:
The green dots represent our best estimate of the impact. The gray dots around them represent the estimates of the impact based on many (in this case 50) different machine learning models. When they are close to the green dots, as on the left side of the chart, it means there is strong agreement among the models as to the impact. When they are farther apart, as on the right side of the chart, there is less agreement.
The general shape of the curve of green dots, and the fact that the gray dots remain rather tightly grouped around it, suggest that the impact of x_3 on y is very limited when x_3 is negative. But as it becomes increasingly positive, its impact grows more and more rapidly. It might even be exponential. (Spoiler alert: this is synthetic data and the impact is exponential.)
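The idea of reading agreement from the spread of the gray dots can be sketched with simulated numbers. This is only an illustration: the per-model estimates below are random draws rather than the output of real models, the helper `simulated_estimates` is hypothetical, and summarizing the ensemble with its mean is an assumption made here, not a statement of how impactchart computes the green dots.

```python
import random
import statistics

random.seed(1)

N_MODELS = 50  # ensemble size mentioned in the text


def simulated_estimates(center, noise):
    """Hypothetical stand-in for the impact each model assigns to one data point."""
    return [random.gauss(center, noise) for _ in range(N_MODELS)]


# Two made-up data points: the models agree closely on the first and less on the second.
tight = simulated_estimates(center=0.2, noise=0.05)
loose = simulated_estimates(center=4.0, noise=1.0)

tight_spread = statistics.stdev(tight)
loose_spread = statistics.stdev(loose)

for label, estimates in (("tight", tight), ("loose", loose)):
    # One summary value per point (assumed here to be the mean) plus the spread around it.
    print(label, round(statistics.mean(estimates), 3), round(statistics.stdev(estimates), 3))
```

A small spread corresponds to gray dots hugging the green dot; a large spread corresponds to gray dots scattered around it.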
Now let's compare what we see in the impact chart to a traditional scatter plot, which is a popular way of taking a first look at the relationship between variables. Using the same data set, here is a scatter plot of y vs. x_3:
In the scatter plot, we also added regression curves for linear regression and quadratic regression, but they did not tell us much. The reason is that in the data set we are looking at, x_3 isn't the only feature that impacts y. There are other x_i whose cumulative impact hides that of x_3.
In the impact chart, on the other hand, we see the impact of x_3 on y independent of the effect of any other x_i.
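The way other features can hide the impact of x_3 in a scatter plot can be reproduced with a small simulation. The sketch below is not the synth1 data set; the coefficients, distributions, and the helper `linear_r_squared` are assumptions chosen only to make the effect visible. It compares how much of the variance a straight-line fit on x_3 explains when the target is y (with the other features mixed in) versus when the target is the isolated exponential impact of x_3. Only the linear case is shown.

```python
import math
import random

random.seed(0)
N = 2000

# Hypothetical features; only x3 has the exponential impact discussed in the text.
x1 = [random.gauss(0.0, 1.0) for _ in range(N)]
x2 = [random.gauss(0.0, 1.0) for _ in range(N)]
x3 = [random.uniform(-2.0, 2.0) for _ in range(N)]

impact_of_x3 = [math.exp(v) for v in x3]
impact_of_others = [5.0 * a + 5.0 * math.sin(3.0 * b) for a, b in zip(x1, x2)]
y = [i + o for i, o in zip(impact_of_x3, impact_of_others)]


def linear_r_squared(x, t):
    """R^2 of a one-variable least-squares line (the squared Pearson correlation)."""
    n = len(x)
    mean_x = sum(x) / n
    mean_t = sum(t) / n
    cov = sum((a - mean_x) * (b - mean_t) for a, b in zip(x, t))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_t = sum((b - mean_t) ** 2 for b in t)
    return cov * cov / (var_x * var_t)


r2_scatter = linear_r_squared(x3, y)              # what a fit on the scatter plot sees
r2_isolated = linear_r_squared(x3, impact_of_x3)  # the impact of x3 on its own

print(f"R^2 of y vs. x3:             {r2_scatter:.3f}")
print(f"R^2 of x3's impact vs. x3:   {r2_isolated:.3f}")
```

In this made-up data the fit against y explains only a small share of the variance, even though the relationship between x_3 and its own contribution is strong.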
The reason impact charts like the one we are looking at are so powerful is that they very clearly and directly show us the impact of one feature in a data set on a target of interest.
Impact charts can find impact even though, unlike parametric regression techniques, they don't have any a priori knowledge of the shape of the impacts they are looking for. For example, in the data set we have been looking at, there is another feature x_2 whose impact on y is sinusoidal. And the impact chart for it shows this clearly:
In general, no matter what the shape of the impact of one variable on another, whether it is polynomial, exponential, logarithmic, polylogarithmic, has seasonality or other repetitive structure, if the impact is there you are going to get a good visualization with an impact chart.
And what if there is no impact? What if the two variables are not correlated at all? We can see an example of that case in the following impact chart:
In this case, the impact chart is essentially flat at zero. There is a little bit of random noise resulting from the way we created and trained the underlying machine learning models, but the fact that x_4 had negligible impact on y, at least relative to the impacts x_2 and x_3 had, is quite clear.
What do Impact Charts on Real Data Look Like?
The data set we looked at above to introduce impact charts is synthetic. We deliberately constructed it so some of the variables had known impacts and others had none. It was part of a test to show that the ideas behind impact charts worked and that the code that we wrote to implement them was working.
But of course the real reason for impact charts is to look for impacts that we did not deliberately put there for the code to find. So what do impact charts look like on real data? One of the first data sets we tried them on consists of data about the median income of renters, the racial and ethnic makeup of the population of renters, and the rate of eviction at the census tract level in hundreds of communities around the United States. We made impact charts at the county level to show the impact of income, race, and ethnicity on eviction.
Here is an example:
This chart looks at the impact of the median household income of renters (at the census tract level) on the eviction rate in DeKalb County, Georgia in the years between 2009 and 2018. The impact is measured in units of eviction cases filed per 100 renters.
We might hypothesize that the lowest income renters would have the highest eviction rates. But this impact chart tells us otherwise. In very low income tracts, there is an impact around -5, meaning that these tracts have an eviction rate five points lower than they otherwise would. Why would that be? In some cases, those at the very bottom of the income scale live in public housing or voucher-supported housing (commonly known as Section 8). In these kinds of settings, they have more protection against eviction than renters do on the open market.
Between $25,000 and $40,000, the impact is almost entirely above zero. These tracts experience higher rates of eviction because of their income level. This is consistent with a population of people who have enough income to pay rent under normal circumstances but not enough to absorb the kind of shock (a major car repair or medical bill) that those with higher income might have the savings or credit to survive without being evicted. Above $50,000 in household income, the curve is mostly flat at zero impact. This doesn't mean none of these people ever get evicted. It just means that it is not their income that is driving the eviction.
But between $50,000 and $100,000, the green dots are bifurcated into a zero impact group and a group around -3 to -4. Usually this means there is some other variable that is causing the difference between the two, but it was not present in our data set, so the models did the best they could in assigning that impact. If we could add the right variable to our data set, the effect would go away.
In just one chart, we can start to build a very interesting picture of the dynamics of the social and economic processes behind the data. And we can compare the impact of variables. Without going into all of the details we discuss elsewhere, here is another impact chart from the same data set. But instead of looking at the impact of median renter income, we look at the impact of the fraction of renters in a census tract who identify as Black alone. The chart is:
Census tracts where very few Black renters live have a negative impact, meaning their eviction rates are lower. The data set includes income, whose impact we already looked at, so this impact chart shows the impact that can be attributed to race independent of income, even if the two are correlated to some degree. Note also that the two impact charts share the same vertical scale, so that the magnitudes of the impacts can be compared.
If you would like to see more impact charts generated from this data set, please visit http://evl.datapinions.com/.
How Do I Make an Impact Chart?
The first step is to install the impactchart code in a virtual environment using
pip install impactchart
From there, the simplest way to make your first impact chart is to replicate the code that generated the first impact charts we saw above, using synthetic data. It is as follows:
from impactchart.model import XGBoostImpactModel
from impactchartdemo.dataset import synth1
# Generate the data set:
N = 200
X, y = synth1.get_data(N)
# Construct and fit the impact chart model:
impact_model = XGBoostImpactModel()
impact_model.fit(X, y)
# Plot the charts. The return value is a dictionary
# with one chart per column of X.
impact_charts = impact_model.impact_charts(X)
There are other options to control and format the charts, but the code above is sufficient to get the job done.
For more details, please see the Synthetic Data.ipynb notebook, from which the code above is derived.
How Do Impact Charts Work?
Impact charts are built on top of shap, which uses Shapley values to interpret predictions made by machine learning models. For more details on how this is done, please see this paper and/or this blog post.
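To give a feel for what a Shapley value measures, here is a minimal sketch that computes exact Shapley values for a tiny toy model by enumerating every subset of features. It is not how shap or impactchart work internally: the toy `model`, the `baseline` point, and the choice to represent an "absent" feature by its baseline value are all simplifying assumptions made for illustration, and brute-force enumeration is only practical for a handful of features.

```python
from itertools import combinations
from math import exp, factorial, pi, sin


def model(x):
    """Hypothetical toy model: sinusoidal in x2, exponential in x3, no effect from x4."""
    x2, x3, x4 = x
    return sin(x2) + exp(x3) + 0.0 * x4


def shapley_values(f, x, baseline):
    """Exact Shapley values of f at x, with absent features set to their baseline value."""
    n = len(x)
    phi = [0.0] * n
    for i in range(n):
        others = [j for j in range(n) if j != i]
        for size in range(n):
            # Weight of every coalition of this size that excludes feature i.
            weight = factorial(size) * factorial(n - size - 1) / factorial(n)
            for subset in combinations(others, size):
                with_i = [x[j] if (j in subset or j == i) else baseline[j] for j in range(n)]
                without_i = [x[j] if j in subset else baseline[j] for j in range(n)]
                phi[i] += weight * (f(with_i) - f(without_i))
    return phi


x = [pi / 2, 1.0, 3.0]
baseline = [0.0, 0.0, 0.0]
phi = shapley_values(model, x, baseline)

for name, value in zip(("x2", "x3", "x4"), phi):
    print(name, round(value, 4))
```

Because this toy model is additive, each feature's Shapley value is just the change in its own term relative to the baseline, the feature with no effect gets zero, and the values sum to the difference between the model's output at `x` and at `baseline`.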
More on Impact Charts
Applications built on top of impactchart can be found in the projects evlcharts and rihcharts.
An earlier version of the code that led to what is now here produced the impact charts available at http://rih.datapinions.com/impact.html. This work, and the motivation for the impact chart approach, is discussed at length in the blog post Using Interpretable Machine Learning to Analyze Racial and Ethnic Disparities in Home Values.