Skip to main content

Python module to imputate IPEDS unitids from non-matching institution names

Project description

get_unitids by Adam Hearn

A Python Module to Imputate IPEDS UnitID Numbers from Non-Matching Institution Names

Have you ever worked with institutional data from multiple sources? If so, one of them is likmayely IPEDS which of course involves the infamous unitid variable. The secondary source, on the other hand, may only have the institution's name and no unitid. In this case, to join the datasets, you would need to merge on institution name and fill in the rest of the unitids manually to retrieve the IPEDS data.

Anyone who has worked with IPEDS data would know that not all institution names perfectly line up across multiple sources. For example, Tulane University is named as Tulane University of Louisiana in the IPEDS universe. In this case of conflicting names, there would be an imperfect merge requiring you to manually enter Tulane's unitid number.

In my background, I've run into this issue several times and has gotten to the point where it would be a better use of time to create a module to automate this step rather than filling out unitids manually. That said, I've developed the Python module unitids. I'm making this process open-source so other higher-ed researchers can benefit, too.

This module, available in the pip library, uses a cosine similarity text-analyis metric to merge partial or "non-matching" institution names with an IPEDS master file including all institutions in the IPEDS universe since 2004 and their unitid numbers.

The process works by passing a DataFrame of institutions of which you want to get their unitids into the get_unitids function. From there, the function populates a sparse matrix and generates a cosine-similarity metric for each insitution you passed and each institution in the IPEDS universe.

It will then return two DataFrame objects: the first DataFrame will include your original data and the unitid numbers of the institutions in the dataset, along with a "match score" (displayed on a scale of 0-100).

The second DataFrame includes information on the institutions that were not a perfect match, alongside their top-5 closes matches so you can make adjustments as necessary.

Example

Take, for example, the data present in this article. The cleaned data is available on my Github here.

Suppose we want to see the relationshpi between Forbes' Financial GPA and endowment, as reported to IPEDS. I've included a sample dataset of FY2018 endowment for all institutions in the IPEDS universe here.

# Import necessary modules (no unitid, yet)
import pandas as pd
import numpy as np

# First we read in the Forbes data:
url = 'https://raw.githubusercontent.com/ahearn15/get_unitids/master/example_dta.csv'
forbes = pd.read_csv(url)
forbes.head(5)
<style scoped> .dataframe tbody tr th:only-of-type { vertical-align: middle; }
.dataframe tbody tr th {
    vertical-align: top;
}

.dataframe thead th {
    text-align: right;
}
</style>
College State Financial GPA Financial Grade
0 Stanford University CA 4.5 A+
1 Massachusetts Institute of Technology MA 4.5 A+
2 University of Notre Dame IN 4.5 A+
3 Princeton University NJ 4.5 A+
4 University of Pennsylvania PA 4.5 A+
# Now we read in the IPEDS data
url = 'https://raw.githubusercontent.com/ahearn15/get_unitids/master/ipeds_example.csv'
ipeds = pd.read_csv(url).drop(columns = 'Unnamed: 0')
ipeds.head(5)
<style scoped> .dataframe tbody tr th:only-of-type { vertical-align: middle; }
.dataframe tbody tr th {
    vertical-align: top;
}

.dataframe thead th {
    text-align: right;
}
</style>
unitid institution endowment
0 177834 A T Still University of Health Sciences 32420.0
1 180203 Aaniiih Nakoda College 9212.0
2 491464 ABC Adult School NaN
3 459523 ABC Beauty Academy NaN
4 485500 ABCO Technology NaN
# Now we merge together
forbes = forbes.rename(columns = {'College' : 'institution'}) # need to rename for merge
merged = pd.merge(forbes, ipeds, on = 'institution', how = "left")
merged.head(5)
<style scoped> .dataframe tbody tr th:only-of-type { vertical-align: middle; }
.dataframe tbody tr th {
    vertical-align: top;
}

.dataframe thead th {
    text-align: right;
}
</style>
institution State Financial GPA Financial Grade unitid endowment
0 Stanford University CA 4.5 A+ 243744.0 1496715.0
1 Massachusetts Institute of Technology MA 4.5 A+ 166683.0 1454419.0
2 University of Notre Dame IN 4.5 A+ 152080.0 837184.0
3 Princeton University NJ 4.5 A+ 186131.0 3053449.0
4 University of Pennsylvania PA 4.5 A+ 215062.0 562092.0
# How many did not merge?
merged['unitid'].isna().sum()
40

If we merge our Forbes data with this 2018 list of IPEDS institutions, we get a successful merge rate of 95.7% (892 of 932 institutions). However, we still have 40 unitids we need to manually encode, taking up 40 lines of code and tedious trips to the IPEDS Data Center. What if we used the new get_unitids function though?

get_unitids

# Installing the module
!pip install unitids==0.0.92

# Import required functions
import numpy as np
import pandas as pd
from unitids import unitids

#For viewing nonmatches
pd.set_option('display.max_rows', 100)
Requirement already satisfied: unitids==0.0.92 in /Users/adamhearn/anaconda3/lib/python3.7/site-packages (0.0.92)
Requirement already satisfied: numpy in /Users/adamhearn/anaconda3/lib/python3.7/site-packages (from unitids==0.0.92) (1.18.1)
Requirement already satisfied: pandas in /Users/adamhearn/anaconda3/lib/python3.7/site-packages (from unitids==0.0.92) (1.0.1)
Requirement already satisfied: nltk in /Users/adamhearn/anaconda3/lib/python3.7/site-packages (from unitids==0.0.92) (3.4.5)
Requirement already satisfied: textblob in /Users/adamhearn/anaconda3/lib/python3.7/site-packages (from unitids==0.0.92) (0.15.3)
Requirement already satisfied: pytz>=2017.2 in /Users/adamhearn/anaconda3/lib/python3.7/site-packages (from pandas->unitids==0.0.92) (2019.3)
Requirement already satisfied: python-dateutil>=2.6.1 in /Users/adamhearn/anaconda3/lib/python3.7/site-packages (from pandas->unitids==0.0.92) (2.8.1)
Requirement already satisfied: six in /Users/adamhearn/anaconda3/lib/python3.7/site-packages (from nltk->unitids==0.0.92) (1.14.0)

Running the algorithm

The first argument we pass to the function, forbes, is our original dataset with the institutions of which we want to retreive unitids. The second argument, stateFlag, is an indicator of whether or not we have state abbreviations in our data. This makes the merge much faster and much cleaner, which I'll get to shortly.

The funciton returns two DataFrames: merged is our original dataset with the fancy new unitids. The second dataframe returned, nonmatches, allows us to investigate the institutions that did not perfectly merge and make adjustments as necessary. Now we're ready to call the function!

Sidenote: For the algorithm to run error-free, the institution name variable must be listed the first column and the state variable (if available) must be in the second column.

# Calling the function
forbes_unitids, nonmatches = unitids.get_unitids(forbes, stateFlag = True)

#viewing the new data
forbes_unitids.head(5)
<style scoped> .dataframe tbody tr th:only-of-type { vertical-align: middle; }
.dataframe tbody tr th {
    vertical-align: top;
}

.dataframe thead th {
    text-align: right;
}
</style>
institution State Financial GPA Financial Grade unitid match
0 Stanford University CA 4.5 A+ 243744.0 100.0
1 Massachusetts Institute of Technology MA 4.5 A+ 166683.0 100.0
2 University of Notre Dame IN 4.5 A+ 152080.0 100.0
3 Princeton University NJ 4.5 A+ 186131.0 100.0
4 University of Pennsylvania PA 4.5 A+ 215062.0 100.0
# How many did not merge?
forbes_unitids['unitid'].isna().sum()
0

100% of the institutions merged! We have significantly fewer unitids we need to fill in ourselves. We can investigate the institutions that were not perfect merges by viewing the second DataFrame, nonmatches:

nonmatches
<style scoped> .dataframe tbody tr th:only-of-type { vertical-align: middle; }
.dataframe tbody tr th {
    vertical-align: top;
}

.dataframe thead th {
    text-align: right;
}
</style>
similarity unitid
institution match Top 5
Baptist College of Health Sciences (TN) 1.0 Baptist Memorial College of Health Sciences (TN) 0.925820 219639
2.0 Huntington College of Health Sciences (TN) 0.833333 488068
3.0 Huntington University of Health Sciences (TN) 0.666667 488068
4.0 American Baptist College (TN) 0.612372 219505
5.0 Fayetteville College of Cosmetology Arts and Sciences (TN) 0.577350 220163
Calumet College of St. Joseph (IN) 1.0 Calumet College of Saint Joseph (IN) 0.833333 150172
2.0 American College of Education (IN) 0.547723 449889
Lincoln College of Technology (IN) 0.547723 151661
3.0 College of Court Reporting Inc (IN) 0.500000 150251
PJ's College of Cosmetology (IN) 0.500000 152150
Franklin & Marshall College (PA) 1.0 Franklin and Marshall College (PA) 0.894427 212577
2.0 Harcum College (PA) 0.577350 212869
Juniata College (PA) 0.577350 213251
Rosemont College (PA) 0.577350 215691
Wilson College (PA) 0.577350 217013
Golden Gate University (CA) 1.0 Golden Gate University-San Francisco (CA) 0.816497 115083
2.0 Ashford University (CA) 0.577350 154022
Saybrook University (CA) 0.577350 123095
Simpson University (CA) 0.577350 123457
Stanbridge University (CA) 0.577350 446561
Hobart and William Smith Colleges (NY) 1.0 Hobart William Smith Colleges (NY) 0.912871 191630
2.0 The Sage Colleges (NY) 0.408248 195128
3.0 Berk Trade and Business School (NY) 0.333333 189219
Bryant and Stratton College-Lackawanna (NY) 0.333333 374972
Elim Bible Institute and College (NY) 0.333333 488305
Jefferson (Philadelphia University + Thomas Jefferson University) (PA) 1.0 Thomas Jefferson University (PA) 0.904534 216366
2.0 Philadelphia Biblical University-Langhorne (PA) 0.539360 215114
3.0 Bucknell University (PA) 0.522233 211291
Lehigh University (PA) 0.522233 213543
Widener University (PA) 0.522233 216852
LeMoyne-Owen College (TN) 1.0 Le Moyne-Owen College (TN) 0.670820 220604
2.0 Bethel College (TN) 0.577350 219718
Maryville College (TN) 0.577350 220710
Rhodes College (TN) 0.577350 221351
Welch College (TN) 0.577350 220206
Mount St Joseph University (OH) 1.0 Mount Saint Joseph University (OH) 0.800000 204200
2.0 College of Mount St Joseph (OH) 0.730297 204200
College of Mount St. Joseph (OH) 0.730297 204200
3.0 Mount Vernon Nazarene University (OH) 0.600000 204194
University of Mount Union (OH) 0.600000 204185
Paul Smith's College of Arts and Science (NY) 1.0 Paul Smiths College of Arts and Science (NY) 0.824958 194392
2.0 SUNY College of Environmental Science and Forestry (NY) 0.589256 196103
3.0 St. Joseph's College of Nursing (NY) 0.503953 195191
Vaughn College of Aeronautics and Technology (NY) 0.503953 188340
4.0 St. Peter's Hospital College of Nursing (NY) 0.471405 192961
Saint John's University (MN) 1.0 Saint Mary's University of Minnesota (MN) 0.676123 174817
2.0 Saint Johns University (MN) 0.670820 174792
3.0 Concordia University-Saint Paul (MN) 0.600000 173328
Saint Cloud State University (MN) 0.600000 174783
4.0 Herzing University (MN) 0.516398 174154
St. Ambrose University (IA) 1.0 Saint Ambrose University (IA) 0.750000 154235
2.0 Dordt University (IA) 0.577350 153250
Drake University (IA) 0.577350 153269
Shiloh University (IA) 0.577350 480499
Waldorf University (IA) 0.577350 154518
St. Edward's University (TX) 1.0 Saint Edward's University (TX) 0.800000 227845
St Mary's University (TX) 0.800000 228149
St. Mary's University (TX) 0.800000 228149
2.0 King's University (TX) 0.670820 439701
St Marys University (TX) 0.670820 228149
St. John Fisher College (NY) 1.0 Saint John Fisher College (NY) 0.800000 195720
2.0 St Francis College (NY) 0.670820 195173
St. Francis College (NY) 0.670820 195173
3.0 St Thomas Aquinas College (NY) 0.600000 195243
St. Thomas Aquinas College (NY) 0.600000 195243
St. Norbert College (WI) 1.0 Saint Norbert College (WI) 0.750000 239716
2.0 Carroll College (WI) 0.577350 238458
Edgewood College (WI) 0.577350 238661
Lakeland College (WI) 0.577350 238980
Shepherds College (WI) 0.577350 481137
Tulane University (LA) 1.0 Tulane University of Louisiana (LA) 0.774597 160755
2.0 Dillard University (LA) 0.666667 158802
Herzing University (LA) 0.666667 433536
3.0 McNeese State University (LA) 0.577350 159717
Nicholls State University (LA) 0.577350 159966

For example, we can see that Franklin & Marshall College merged successfully with its official name in the IPEDS universe, "Franklin and Marshall College". Also, Tulane University merged successfully with "Tulane University of Louisiana", Hobart and William Smith Colleges merged with "Hobart William Smith Colleges", etc.

Further, the algorithm accounts for institutions with changed names. For example, Dordt did not merge in the original dataset (since Forbes called it "Dordt College" and it's official IPEDS name is "Dordt University". The same issue is raised with Calvin College (now Calvin Univeristy). The IPEDS dictionary nested in the algorithm accounts for these historical name-changes. That is why there are only 15 non-perfect matches with this algorithm as opposed to 40 merging the "old fashioned way".

However, this wasn't perfect: Saint John's University (MN) merged with Saint Mary's University of Minnesota instead, so we will need to correct that ourselves. We can find the correct unitid in the nonmatches dataframe above. Still, one line of code compared to 40 is a big time-saver!

# Replaces unitid for Saint John's University (MN) only
# For Stata folks, same as replace unitid = 174792 if institution == "Saint John's University (MN)"
forbes_unitids['untid'] = np.where(forbes_unitids['institution'] == "Saint John's University (MN)", 174792, forbes_unitids['unitid'])

Running this algorithm on this dataset as opposed to merging on institution-name gives us an accuracy of 99.9% (931 of 932 institutions), up from 95.7% earlier (892 of 932 institutions). It's a marginal improvement, but a big time-saver.

To answer our original research question of how endowment impacts Forbes' "Financial GPA" measure, we can merge in our IPEDS data cleanly here.

dta = pd.merge(forbes_unitids, ipeds, on = 'unitid', how = 'left')
dta = dta.drop(columns = "institution_y") # no need for duplicate column
dta = dta.rename(columns = {"institution_x" : "institution"}) # renaming
dta.head(5)
<style scoped> .dataframe tbody tr th:only-of-type { vertical-align: middle; }
.dataframe tbody tr th {
    vertical-align: top;
}

.dataframe thead th {
    text-align: right;
}
</style>
institution State Financial GPA Financial Grade unitid match untid endowment
0 Stanford University CA 4.5 A+ 243744.0 100.0 243744.0 1496715.0
1 Massachusetts Institute of Technology MA 4.5 A+ 166683.0 100.0 166683.0 1454419.0
2 University of Notre Dame IN 4.5 A+ 152080.0 100.0 152080.0 837184.0
3 Princeton University NJ 4.5 A+ 186131.0 100.0 186131.0 3053449.0
4 University of Pennsylvania PA 4.5 A+ 215062.0 100.0 215062.0 562092.0
%matplotlib inline
import matplotlib.pyplot as plt
import seaborn as sns

#using logged endowment, as any sane econometrician would
sns.scatterplot(x=np.log(dta["endowment"]), y=dta["Financial GPA"]) 
<matplotlib.axes._subplots.AxesSubplot at 0x1a19878950>

png

Seems like endowment plays a pretty significant role in Forbes' grading of Financial GPA!

What if we have no state data?

Note that the original merge went very well mostly due to us having access to state codes in our secondary dataset. Suppose our Forbes data did not have state codes, in which case the merge would have gone like this:

forbes_unitids, nonmatches = unitids.get_unitids(forbes, stateFlag = False)

Running the algorithm with stateFlag = False takes significantly longer, considering we can no longer "throw out" institutions that do not match the same state. Instead, the algorithm must cross-check institutions across all states, not just the ones within states like it did when stateFlag was set to True.

nonmatches
<style scoped> .dataframe tbody tr th:only-of-type { vertical-align: middle; }
.dataframe tbody tr th {
    vertical-align: top;
}

.dataframe thead th {
    text-align: right;
}
</style>
similarity unitid
institution match Top 5
Baptist College of Health Sciences 1.0 Baptist Memorial College of Health Sciences (TN) 0.912871 219639
2.0 Australasian College of Health Sciences (OR) 0.800000 443599
Bryan College of Health Sciences (NE) 0.800000 180878
Jefferson College of Health Sciences (VA) 0.800000 231837
Sentara College of Health Sciences (VA) 0.800000 232885
Calumet College of St. Joseph 1.0 College of St Joseph (VT) 0.894427 231077
2.0 Calumet College of Saint Joseph (IN) 0.800000 150172
College of Mount St Joseph (OH) 0.800000 204200
College of Mount St. Joseph (OH) 0.800000 204200
3.0 St. Joseph's College of Nursing (NY) 0.730297 195191
Franklin & Marshall College 1.0 Franklin and Marshall College (PA) 0.866025 212577
2.0 Franklin College (IN) 0.816497 150604
Franklin College (TX) 0.816497 225788
3.0 Franklin Pierce College (NH) 0.666667 182795
4.0 Marshall Community and Technical College (WV) 0.516398 444954
Golden Gate University 1.0 Golden Gate University-San Francisco (CA) 0.774597 115083
2.0 Chapman University-University College (CA) 0.471405 262086
3.0 Indiana University-Purdue University-Indianapolis (IN) 0.436436 151111
University of Illinois University Administration (IL) 0.436436 149587
University of Maryland-University College (MD) 0.436436 163204
Hobart and William Smith Colleges 1.0 Hobart William Smith Colleges (NY) 0.894427 191630
2.0 College of William and Mary (VA) 0.400000 231624
3.0 Texas Barber Colleges and Hairstyling Schools (TX) 0.365148 440989
4.0 Richard Bland College of William and Mary (VA) 0.338062 233338
State Board for Community and Junior Colleges (MS) 0.338062 247737
Jefferson (Philadelphia University + Thomas Jefferson University) 1.0 Thomas Jefferson University (PA) 0.912871 216366
2.0 Thomas University (GA) 0.670820 141167
3.0 Saint Thomas University (FL) 0.547723 137476
St Thomas University (FL) 0.547723 137476
St. Thomas University (FL) 0.547723 137476
LeMoyne-Owen College 1.0 Le Moyne-Owen College (TN) 0.577350 220604
2.0 Hussian College-Daymar College Clarksville (TN) 0.436436 368443
Hussian College-Daymar College Columbus (OH) 0.436436 205559
Hussian College-Daymar College Murfreesboro (TN) 0.436436 444255
Hussian College-Daymar College Nashville (TN) 0.436436 220002
Mount St Joseph University 1.0 Mount Saint Joseph University (OH) 0.750000 204200
2.0 College of Mount St Joseph (OH) 0.670820 204200
College of Mount St. Joseph (OH) 0.670820 204200
Mount St Mary's University (MD) 0.670820 163462
Mount St. Mary's University (MD) 0.670820 163462
Paul Smith's College of Arts and Science 1.0 Paul Smiths College of Arts and Science (NY) 0.801784 194392
2.0 University of Science and Arts of Oklahoma (OK) 0.589256 207722
3.0 Hope College of Arts and Sciences (FL) 0.577350 488332
4.0 Baker University College of Arts and Sciences (KS) 0.534522 154688
Mayo Clinic College of Medicine and Science (MN) 0.534522 173957
Saint John's University 1.0 Saint Edward's University (TX) 0.750000 227845
Saint John's Seminary (MA) 0.750000 167677
Saint Joseph's University (PA) 0.750000 215770
Saint Martin's University (WA) 0.750000 236452
Saint Peter's University (NJ) 0.750000 186432
St. Ambrose University 1.0 St Bonaventure University (NY) 0.666667 195164
St Lawrence University (NY) 0.666667 195216
St. Andrews University (NC) 0.666667 199698
St. Catherine University (MN) 0.666667 175005
St. Thomas University (FL) 0.666667 137476
St. Edward's University 1.0 Saint Edward's University (TX) 0.750000 227845
St Mary's University (TX) 0.750000 228149
St. Mary's University (TX) 0.750000 228149
2.0 Mount St Mary's University (MD) 0.670820 163462
Mount St. Mary's University (MD) 0.670820 163462
St. John Fisher College 1.0 Saint John Fisher College (NY) 0.750000 195720
St John's College (MD) 0.750000 163976
St John's College (NM) 0.750000 245652
St. John's College (MD) 0.750000 163976
St. John's College (NM) 0.750000 245652
St. Norbert College 1.0 Saint Norbert College (WI) 0.666667 239716
St Johns College (IL) 0.666667 148593
St Petersburg College (FL) 0.666667 137078
St Philips College (TX) 0.666667 227854
St. Olaf College (MN) 0.666667 174844
Tulane University 1.0 Tulane University of Louisiana (LA) 0.707107 160755
2.0 Chapman University-University College (CA) 0.577350 262086
3.0 Indiana University-Purdue University-Indianapolis (IN) 0.534522 151111
University of Illinois University Administration (IL) 0.534522 149587
University of Maryland-University College (MD) 0.534522 163204
Union College 1.0 Union College (KY) 1.000000 157863
Union College (NE) 1.000000 181738
Union College (NY) 1.000000 196866
2.0 Columbia Union College (MD) 0.816497 162210
Union County College (NJ) 0.816497 187198
University of St. Thomas 1.0 University of St Thomas (MN) 1.000000 174914
University of St Thomas (TX) 1.000000 227863
University of St. Thomas (MN) 1.000000 174914
2.0 St Thomas University (FL) 0.866025 137476
St. Thomas University (FL) 0.866025 137476
Westminster College 1.0 Westminster College (MO) 1.000000 179946
Westminster College (PA) 1.000000 216807
Westminster College (UT) 1.000000 230807
2.0 Utah College of Massage Therapy-Westminster (CO) 0.577350 469151
3.0 Hussian College-Daymar College Nashville (TN) 0.534522 220002
Wheaton College 1.0 Wheaton College (IL) 1.000000 149781
Wheaton College (MA) 1.000000 168281
2.0 Hussian College-Daymar College Clarksville (TN) 0.534522 368443
Hussian College-Daymar College Columbus (OH) 0.534522 205559
Hussian College-Daymar College Murfreesboro (TN) 0.534522 444255

There were still 913 perfect merges, but the accuracy on imputing the remaining 19 unitids is not as high: we only got 13 of these (accuracy of 68% within the nonperfect matches, 99.4% for the whole dataset).

For example, Calumet College of St. Joseph merged with College of St. Joseph (VT) when it should have merged with Calumet College of Saint Joseph (IN). Plus, we must now account for institutions with the same name, such as Wheaton College (IL and MA). Still, not terrible, but it is alwasy a good idea to check for what unitids were imputed. Luckily this module makes it easy to do that.

Conclusion

I hope this can be a valuable tool for higher ed researchers out there. While this is currently only available in Python, I hope to get around to making it available on Stata and R soon. Please leave any feedback or bug reports on my GitHub repository here.

Project details


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

unitids-1.0.5.tar.gz (128.5 kB view hashes)

Uploaded Source

Built Distribution

unitids-1.0.5-py3-none-any.whl (115.1 kB view hashes)

Uploaded Python 3

Supported by

AWS AWS Cloud computing and Security Sponsor Datadog Datadog Monitoring Fastly Fastly CDN Google Google Download Analytics Microsoft Microsoft PSF Sponsor Pingdom Pingdom Monitoring Sentry Sentry Error logging StatusPage StatusPage Status page