Python module to imputate IPEDS unitids from non-matching institution names
Project description
get_unitids by Adam Hearn
A Python Module to Imputate IPEDS UnitID Numbers from Non-Matching Institution Names
Have you ever worked with institutional data from multiple sources? If so, one of them is likmayely IPEDS which of course involves the infamous unitid
variable. The secondary source, on the other hand, may only have the institution's name and no unitid
. In this case, to join the datasets, you would need to merge on institution name and fill in the rest of the unitids manually to retrieve the IPEDS data.
Anyone who has worked with IPEDS data would know that not all institution names perfectly line up across multiple sources. For example, Tulane University is named as Tulane University of Louisiana in the IPEDS universe. In this case of conflicting names, there would be an imperfect merge requiring you to manually enter Tulane's unitid number.
In my background, I've run into this issue several times and has gotten to the point where it would be a better use of time to create a module to automate this step rather than filling out unitids manually. That said, I've developed the Python module unitids
. I'm making this process open-source so other higher-ed researchers can benefit, too.
This module, available in the pip
library, uses a cosine similarity text-analyis metric to merge partial or "non-matching" institution names with an IPEDS master file including all institutions in the IPEDS universe since 2004 and their unitid numbers.
The process works by passing a DataFrame of institutions of which you want to get their unitids into the get_unitids
function. From there, the function populates a sparse matrix and generates a cosine-similarity metric for each insitution you passed and each institution in the IPEDS universe.
It will then return two DataFrame objects: the first DataFrame will include your original data and the unitid numbers of the institutions in the dataset, along with a "match score" (displayed on a scale of 0-100).
The second DataFrame includes information on the institutions that were not a perfect match, alongside their top-5 closes matches so you can make adjustments as necessary.
Example
Take, for example, the data present in this article. The cleaned data is available on my Github here.
Suppose we want to see the relationshpi between Forbes' Financial GPA and endowment, as reported to IPEDS. I've included a sample dataset of FY2018 endowment for all institutions in the IPEDS universe here.
# Import necessary modules (no unitid, yet)
import pandas as pd
import numpy as np
# First we read in the Forbes data:
url = 'https://raw.githubusercontent.com/ahearn15/get_unitids/master/example_dta.csv'
forbes = pd.read_csv(url)
forbes.head(5)
.dataframe tbody tr th {
vertical-align: top;
}
.dataframe thead th {
text-align: right;
}
</style>
College | State | Financial GPA | Financial Grade | |
---|---|---|---|---|
0 | Stanford University | CA | 4.5 | A+ |
1 | Massachusetts Institute of Technology | MA | 4.5 | A+ |
2 | University of Notre Dame | IN | 4.5 | A+ |
3 | Princeton University | NJ | 4.5 | A+ |
4 | University of Pennsylvania | PA | 4.5 | A+ |
# Now we read in the IPEDS data
url = 'https://raw.githubusercontent.com/ahearn15/get_unitids/master/ipeds_example.csv'
ipeds = pd.read_csv(url).drop(columns = 'Unnamed: 0')
ipeds.head(5)
.dataframe tbody tr th {
vertical-align: top;
}
.dataframe thead th {
text-align: right;
}
</style>
unitid | institution | endowment | |
---|---|---|---|
0 | 177834 | A T Still University of Health Sciences | 32420.0 |
1 | 180203 | Aaniiih Nakoda College | 9212.0 |
2 | 491464 | ABC Adult School | NaN |
3 | 459523 | ABC Beauty Academy | NaN |
4 | 485500 | ABCO Technology | NaN |
# Now we merge together
forbes = forbes.rename(columns = {'College' : 'institution'}) # need to rename for merge
merged = pd.merge(forbes, ipeds, on = 'institution', how = "left")
merged.head(5)
.dataframe tbody tr th {
vertical-align: top;
}
.dataframe thead th {
text-align: right;
}
</style>
institution | State | Financial GPA | Financial Grade | unitid | endowment | |
---|---|---|---|---|---|---|
0 | Stanford University | CA | 4.5 | A+ | 243744.0 | 1496715.0 |
1 | Massachusetts Institute of Technology | MA | 4.5 | A+ | 166683.0 | 1454419.0 |
2 | University of Notre Dame | IN | 4.5 | A+ | 152080.0 | 837184.0 |
3 | Princeton University | NJ | 4.5 | A+ | 186131.0 | 3053449.0 |
4 | University of Pennsylvania | PA | 4.5 | A+ | 215062.0 | 562092.0 |
# How many did not merge?
merged['unitid'].isna().sum()
40
If we merge our Forbes data with this 2018 list of IPEDS institutions, we get a successful merge rate of 95.7% (892 of 932 institutions). However, we still have 40 unitids we need to manually encode, taking up 40 lines of code and tedious trips to the IPEDS Data Center. What if we used the new get_unitids
function though?
get_unitids
# Installing the module
!pip install unitids==0.0.92
# Import required functions
import numpy as np
import pandas as pd
from unitids import unitids
#For viewing nonmatches
pd.set_option('display.max_rows', 100)
Requirement already satisfied: unitids==0.0.92 in /Users/adamhearn/anaconda3/lib/python3.7/site-packages (0.0.92)
Requirement already satisfied: numpy in /Users/adamhearn/anaconda3/lib/python3.7/site-packages (from unitids==0.0.92) (1.18.1)
Requirement already satisfied: pandas in /Users/adamhearn/anaconda3/lib/python3.7/site-packages (from unitids==0.0.92) (1.0.1)
Requirement already satisfied: nltk in /Users/adamhearn/anaconda3/lib/python3.7/site-packages (from unitids==0.0.92) (3.4.5)
Requirement already satisfied: textblob in /Users/adamhearn/anaconda3/lib/python3.7/site-packages (from unitids==0.0.92) (0.15.3)
Requirement already satisfied: pytz>=2017.2 in /Users/adamhearn/anaconda3/lib/python3.7/site-packages (from pandas->unitids==0.0.92) (2019.3)
Requirement already satisfied: python-dateutil>=2.6.1 in /Users/adamhearn/anaconda3/lib/python3.7/site-packages (from pandas->unitids==0.0.92) (2.8.1)
Requirement already satisfied: six in /Users/adamhearn/anaconda3/lib/python3.7/site-packages (from nltk->unitids==0.0.92) (1.14.0)
Running the algorithm
The first argument we pass to the function, forbes
, is our original dataset with the institutions of which we want to retreive unitids. The second argument, stateFlag
, is an indicator of whether or not we have state abbreviations in our data. This makes the merge much faster and much cleaner, which I'll get to shortly.
The funciton returns two DataFrames: merged
is our original dataset with the fancy new unitids. The second dataframe returned, nonmatches
, allows us to investigate the institutions that did not perfectly merge and make adjustments as necessary. Now we're ready to call the function!
Sidenote: For the algorithm to run error-free, the institution name variable must be listed the first column and the state variable (if available) must be in the second column.
# Calling the function
forbes_unitids, nonmatches = unitids.get_unitids(forbes, stateFlag = True)
#viewing the new data
forbes_unitids.head(5)
.dataframe tbody tr th {
vertical-align: top;
}
.dataframe thead th {
text-align: right;
}
</style>
institution | State | Financial GPA | Financial Grade | unitid | match | |
---|---|---|---|---|---|---|
0 | Stanford University | CA | 4.5 | A+ | 243744.0 | 100.0 |
1 | Massachusetts Institute of Technology | MA | 4.5 | A+ | 166683.0 | 100.0 |
2 | University of Notre Dame | IN | 4.5 | A+ | 152080.0 | 100.0 |
3 | Princeton University | NJ | 4.5 | A+ | 186131.0 | 100.0 |
4 | University of Pennsylvania | PA | 4.5 | A+ | 215062.0 | 100.0 |
# How many did not merge?
forbes_unitids['unitid'].isna().sum()
0
100% of the institutions merged! We have significantly fewer unitids we need to fill in ourselves. We can investigate the institutions that were not perfect merges by viewing the second DataFrame, nonmatches
:
nonmatches
.dataframe tbody tr th {
vertical-align: top;
}
.dataframe thead th {
text-align: right;
}
</style>
similarity | unitid | |||
---|---|---|---|---|
institution | match | Top 5 | ||
Baptist College of Health Sciences (TN) | 1.0 | Baptist Memorial College of Health Sciences (TN) | 0.925820 | 219639 |
2.0 | Huntington College of Health Sciences (TN) | 0.833333 | 488068 | |
3.0 | Huntington University of Health Sciences (TN) | 0.666667 | 488068 | |
4.0 | American Baptist College (TN) | 0.612372 | 219505 | |
5.0 | Fayetteville College of Cosmetology Arts and Sciences (TN) | 0.577350 | 220163 | |
Calumet College of St. Joseph (IN) | 1.0 | Calumet College of Saint Joseph (IN) | 0.833333 | 150172 |
2.0 | American College of Education (IN) | 0.547723 | 449889 | |
Lincoln College of Technology (IN) | 0.547723 | 151661 | ||
3.0 | College of Court Reporting Inc (IN) | 0.500000 | 150251 | |
PJ's College of Cosmetology (IN) | 0.500000 | 152150 | ||
Franklin & Marshall College (PA) | 1.0 | Franklin and Marshall College (PA) | 0.894427 | 212577 |
2.0 | Harcum College (PA) | 0.577350 | 212869 | |
Juniata College (PA) | 0.577350 | 213251 | ||
Rosemont College (PA) | 0.577350 | 215691 | ||
Wilson College (PA) | 0.577350 | 217013 | ||
Golden Gate University (CA) | 1.0 | Golden Gate University-San Francisco (CA) | 0.816497 | 115083 |
2.0 | Ashford University (CA) | 0.577350 | 154022 | |
Saybrook University (CA) | 0.577350 | 123095 | ||
Simpson University (CA) | 0.577350 | 123457 | ||
Stanbridge University (CA) | 0.577350 | 446561 | ||
Hobart and William Smith Colleges (NY) | 1.0 | Hobart William Smith Colleges (NY) | 0.912871 | 191630 |
2.0 | The Sage Colleges (NY) | 0.408248 | 195128 | |
3.0 | Berk Trade and Business School (NY) | 0.333333 | 189219 | |
Bryant and Stratton College-Lackawanna (NY) | 0.333333 | 374972 | ||
Elim Bible Institute and College (NY) | 0.333333 | 488305 | ||
Jefferson (Philadelphia University + Thomas Jefferson University) (PA) | 1.0 | Thomas Jefferson University (PA) | 0.904534 | 216366 |
2.0 | Philadelphia Biblical University-Langhorne (PA) | 0.539360 | 215114 | |
3.0 | Bucknell University (PA) | 0.522233 | 211291 | |
Lehigh University (PA) | 0.522233 | 213543 | ||
Widener University (PA) | 0.522233 | 216852 | ||
LeMoyne-Owen College (TN) | 1.0 | Le Moyne-Owen College (TN) | 0.670820 | 220604 |
2.0 | Bethel College (TN) | 0.577350 | 219718 | |
Maryville College (TN) | 0.577350 | 220710 | ||
Rhodes College (TN) | 0.577350 | 221351 | ||
Welch College (TN) | 0.577350 | 220206 | ||
Mount St Joseph University (OH) | 1.0 | Mount Saint Joseph University (OH) | 0.800000 | 204200 |
2.0 | College of Mount St Joseph (OH) | 0.730297 | 204200 | |
College of Mount St. Joseph (OH) | 0.730297 | 204200 | ||
3.0 | Mount Vernon Nazarene University (OH) | 0.600000 | 204194 | |
University of Mount Union (OH) | 0.600000 | 204185 | ||
Paul Smith's College of Arts and Science (NY) | 1.0 | Paul Smiths College of Arts and Science (NY) | 0.824958 | 194392 |
2.0 | SUNY College of Environmental Science and Forestry (NY) | 0.589256 | 196103 | |
3.0 | St. Joseph's College of Nursing (NY) | 0.503953 | 195191 | |
Vaughn College of Aeronautics and Technology (NY) | 0.503953 | 188340 | ||
4.0 | St. Peter's Hospital College of Nursing (NY) | 0.471405 | 192961 | |
Saint John's University (MN) | 1.0 | Saint Mary's University of Minnesota (MN) | 0.676123 | 174817 |
2.0 | Saint Johns University (MN) | 0.670820 | 174792 | |
3.0 | Concordia University-Saint Paul (MN) | 0.600000 | 173328 | |
Saint Cloud State University (MN) | 0.600000 | 174783 | ||
4.0 | Herzing University (MN) | 0.516398 | 174154 | |
St. Ambrose University (IA) | 1.0 | Saint Ambrose University (IA) | 0.750000 | 154235 |
2.0 | Dordt University (IA) | 0.577350 | 153250 | |
Drake University (IA) | 0.577350 | 153269 | ||
Shiloh University (IA) | 0.577350 | 480499 | ||
Waldorf University (IA) | 0.577350 | 154518 | ||
St. Edward's University (TX) | 1.0 | Saint Edward's University (TX) | 0.800000 | 227845 |
St Mary's University (TX) | 0.800000 | 228149 | ||
St. Mary's University (TX) | 0.800000 | 228149 | ||
2.0 | King's University (TX) | 0.670820 | 439701 | |
St Marys University (TX) | 0.670820 | 228149 | ||
St. John Fisher College (NY) | 1.0 | Saint John Fisher College (NY) | 0.800000 | 195720 |
2.0 | St Francis College (NY) | 0.670820 | 195173 | |
St. Francis College (NY) | 0.670820 | 195173 | ||
3.0 | St Thomas Aquinas College (NY) | 0.600000 | 195243 | |
St. Thomas Aquinas College (NY) | 0.600000 | 195243 | ||
St. Norbert College (WI) | 1.0 | Saint Norbert College (WI) | 0.750000 | 239716 |
2.0 | Carroll College (WI) | 0.577350 | 238458 | |
Edgewood College (WI) | 0.577350 | 238661 | ||
Lakeland College (WI) | 0.577350 | 238980 | ||
Shepherds College (WI) | 0.577350 | 481137 | ||
Tulane University (LA) | 1.0 | Tulane University of Louisiana (LA) | 0.774597 | 160755 |
2.0 | Dillard University (LA) | 0.666667 | 158802 | |
Herzing University (LA) | 0.666667 | 433536 | ||
3.0 | McNeese State University (LA) | 0.577350 | 159717 | |
Nicholls State University (LA) | 0.577350 | 159966 |
For example, we can see that Franklin & Marshall College merged successfully with its official name in the IPEDS universe, "Franklin and Marshall College". Also, Tulane University merged successfully with "Tulane University of Louisiana", Hobart and William Smith Colleges merged with "Hobart William Smith Colleges", etc.
Further, the algorithm accounts for institutions with changed names. For example, Dordt did not merge in the original dataset (since Forbes called it "Dordt College" and it's official IPEDS name is "Dordt University". The same issue is raised with Calvin College (now Calvin Univeristy). The IPEDS dictionary nested in the algorithm accounts for these historical name-changes. That is why there are only 15 non-perfect matches with this algorithm as opposed to 40 merging the "old fashioned way".
However, this wasn't perfect: Saint John's University (MN) merged with Saint Mary's University of Minnesota instead, so we will need to correct that ourselves. We can find the correct unitid in the nonmatches
dataframe above. Still, one line of code compared to 40 is a big time-saver!
# Replaces unitid for Saint John's University (MN) only
# For Stata folks, same as replace unitid = 174792 if institution == "Saint John's University (MN)"
forbes_unitids['untid'] = np.where(forbes_unitids['institution'] == "Saint John's University (MN)", 174792, forbes_unitids['unitid'])
Running this algorithm on this dataset as opposed to merging on institution-name gives us an accuracy of 99.9% (931 of 932 institutions), up from 95.7% earlier (892 of 932 institutions). It's a marginal improvement, but a big time-saver.
To answer our original research question of how endowment impacts Forbes' "Financial GPA" measure, we can merge in our IPEDS data cleanly here.
dta = pd.merge(forbes_unitids, ipeds, on = 'unitid', how = 'left')
dta = dta.drop(columns = "institution_y") # no need for duplicate column
dta = dta.rename(columns = {"institution_x" : "institution"}) # renaming
dta.head(5)
.dataframe tbody tr th {
vertical-align: top;
}
.dataframe thead th {
text-align: right;
}
</style>
institution | State | Financial GPA | Financial Grade | unitid | match | untid | endowment | |
---|---|---|---|---|---|---|---|---|
0 | Stanford University | CA | 4.5 | A+ | 243744.0 | 100.0 | 243744.0 | 1496715.0 |
1 | Massachusetts Institute of Technology | MA | 4.5 | A+ | 166683.0 | 100.0 | 166683.0 | 1454419.0 |
2 | University of Notre Dame | IN | 4.5 | A+ | 152080.0 | 100.0 | 152080.0 | 837184.0 |
3 | Princeton University | NJ | 4.5 | A+ | 186131.0 | 100.0 | 186131.0 | 3053449.0 |
4 | University of Pennsylvania | PA | 4.5 | A+ | 215062.0 | 100.0 | 215062.0 | 562092.0 |
%matplotlib inline
import matplotlib.pyplot as plt
import seaborn as sns
#using logged endowment, as any sane econometrician would
sns.scatterplot(x=np.log(dta["endowment"]), y=dta["Financial GPA"])
<matplotlib.axes._subplots.AxesSubplot at 0x1a19878950>
Seems like endowment plays a pretty significant role in Forbes' grading of Financial GPA!
What if we have no state data?
Note that the original merge went very well mostly due to us having access to state codes in our secondary dataset. Suppose our Forbes data did not have state codes, in which case the merge would have gone like this:
forbes_unitids, nonmatches = unitids.get_unitids(forbes, stateFlag = False)
Running the algorithm with stateFlag = False
takes significantly longer, considering we can no longer "throw out" institutions that do not match the same state. Instead, the algorithm must cross-check institutions across all states, not just the ones within states like it did when stateFlag
was set to True
.
nonmatches
.dataframe tbody tr th {
vertical-align: top;
}
.dataframe thead th {
text-align: right;
}
</style>
similarity | unitid | |||
---|---|---|---|---|
institution | match | Top 5 | ||
Baptist College of Health Sciences | 1.0 | Baptist Memorial College of Health Sciences (TN) | 0.912871 | 219639 |
2.0 | Australasian College of Health Sciences (OR) | 0.800000 | 443599 | |
Bryan College of Health Sciences (NE) | 0.800000 | 180878 | ||
Jefferson College of Health Sciences (VA) | 0.800000 | 231837 | ||
Sentara College of Health Sciences (VA) | 0.800000 | 232885 | ||
Calumet College of St. Joseph | 1.0 | College of St Joseph (VT) | 0.894427 | 231077 |
2.0 | Calumet College of Saint Joseph (IN) | 0.800000 | 150172 | |
College of Mount St Joseph (OH) | 0.800000 | 204200 | ||
College of Mount St. Joseph (OH) | 0.800000 | 204200 | ||
3.0 | St. Joseph's College of Nursing (NY) | 0.730297 | 195191 | |
Franklin & Marshall College | 1.0 | Franklin and Marshall College (PA) | 0.866025 | 212577 |
2.0 | Franklin College (IN) | 0.816497 | 150604 | |
Franklin College (TX) | 0.816497 | 225788 | ||
3.0 | Franklin Pierce College (NH) | 0.666667 | 182795 | |
4.0 | Marshall Community and Technical College (WV) | 0.516398 | 444954 | |
Golden Gate University | 1.0 | Golden Gate University-San Francisco (CA) | 0.774597 | 115083 |
2.0 | Chapman University-University College (CA) | 0.471405 | 262086 | |
3.0 | Indiana University-Purdue University-Indianapolis (IN) | 0.436436 | 151111 | |
University of Illinois University Administration (IL) | 0.436436 | 149587 | ||
University of Maryland-University College (MD) | 0.436436 | 163204 | ||
Hobart and William Smith Colleges | 1.0 | Hobart William Smith Colleges (NY) | 0.894427 | 191630 |
2.0 | College of William and Mary (VA) | 0.400000 | 231624 | |
3.0 | Texas Barber Colleges and Hairstyling Schools (TX) | 0.365148 | 440989 | |
4.0 | Richard Bland College of William and Mary (VA) | 0.338062 | 233338 | |
State Board for Community and Junior Colleges (MS) | 0.338062 | 247737 | ||
Jefferson (Philadelphia University + Thomas Jefferson University) | 1.0 | Thomas Jefferson University (PA) | 0.912871 | 216366 |
2.0 | Thomas University (GA) | 0.670820 | 141167 | |
3.0 | Saint Thomas University (FL) | 0.547723 | 137476 | |
St Thomas University (FL) | 0.547723 | 137476 | ||
St. Thomas University (FL) | 0.547723 | 137476 | ||
LeMoyne-Owen College | 1.0 | Le Moyne-Owen College (TN) | 0.577350 | 220604 |
2.0 | Hussian College-Daymar College Clarksville (TN) | 0.436436 | 368443 | |
Hussian College-Daymar College Columbus (OH) | 0.436436 | 205559 | ||
Hussian College-Daymar College Murfreesboro (TN) | 0.436436 | 444255 | ||
Hussian College-Daymar College Nashville (TN) | 0.436436 | 220002 | ||
Mount St Joseph University | 1.0 | Mount Saint Joseph University (OH) | 0.750000 | 204200 |
2.0 | College of Mount St Joseph (OH) | 0.670820 | 204200 | |
College of Mount St. Joseph (OH) | 0.670820 | 204200 | ||
Mount St Mary's University (MD) | 0.670820 | 163462 | ||
Mount St. Mary's University (MD) | 0.670820 | 163462 | ||
Paul Smith's College of Arts and Science | 1.0 | Paul Smiths College of Arts and Science (NY) | 0.801784 | 194392 |
2.0 | University of Science and Arts of Oklahoma (OK) | 0.589256 | 207722 | |
3.0 | Hope College of Arts and Sciences (FL) | 0.577350 | 488332 | |
4.0 | Baker University College of Arts and Sciences (KS) | 0.534522 | 154688 | |
Mayo Clinic College of Medicine and Science (MN) | 0.534522 | 173957 | ||
Saint John's University | 1.0 | Saint Edward's University (TX) | 0.750000 | 227845 |
Saint John's Seminary (MA) | 0.750000 | 167677 | ||
Saint Joseph's University (PA) | 0.750000 | 215770 | ||
Saint Martin's University (WA) | 0.750000 | 236452 | ||
Saint Peter's University (NJ) | 0.750000 | 186432 | ||
St. Ambrose University | 1.0 | St Bonaventure University (NY) | 0.666667 | 195164 |
St Lawrence University (NY) | 0.666667 | 195216 | ||
St. Andrews University (NC) | 0.666667 | 199698 | ||
St. Catherine University (MN) | 0.666667 | 175005 | ||
St. Thomas University (FL) | 0.666667 | 137476 | ||
St. Edward's University | 1.0 | Saint Edward's University (TX) | 0.750000 | 227845 |
St Mary's University (TX) | 0.750000 | 228149 | ||
St. Mary's University (TX) | 0.750000 | 228149 | ||
2.0 | Mount St Mary's University (MD) | 0.670820 | 163462 | |
Mount St. Mary's University (MD) | 0.670820 | 163462 | ||
St. John Fisher College | 1.0 | Saint John Fisher College (NY) | 0.750000 | 195720 |
St John's College (MD) | 0.750000 | 163976 | ||
St John's College (NM) | 0.750000 | 245652 | ||
St. John's College (MD) | 0.750000 | 163976 | ||
St. John's College (NM) | 0.750000 | 245652 | ||
St. Norbert College | 1.0 | Saint Norbert College (WI) | 0.666667 | 239716 |
St Johns College (IL) | 0.666667 | 148593 | ||
St Petersburg College (FL) | 0.666667 | 137078 | ||
St Philips College (TX) | 0.666667 | 227854 | ||
St. Olaf College (MN) | 0.666667 | 174844 | ||
Tulane University | 1.0 | Tulane University of Louisiana (LA) | 0.707107 | 160755 |
2.0 | Chapman University-University College (CA) | 0.577350 | 262086 | |
3.0 | Indiana University-Purdue University-Indianapolis (IN) | 0.534522 | 151111 | |
University of Illinois University Administration (IL) | 0.534522 | 149587 | ||
University of Maryland-University College (MD) | 0.534522 | 163204 | ||
Union College | 1.0 | Union College (KY) | 1.000000 | 157863 |
Union College (NE) | 1.000000 | 181738 | ||
Union College (NY) | 1.000000 | 196866 | ||
2.0 | Columbia Union College (MD) | 0.816497 | 162210 | |
Union County College (NJ) | 0.816497 | 187198 | ||
University of St. Thomas | 1.0 | University of St Thomas (MN) | 1.000000 | 174914 |
University of St Thomas (TX) | 1.000000 | 227863 | ||
University of St. Thomas (MN) | 1.000000 | 174914 | ||
2.0 | St Thomas University (FL) | 0.866025 | 137476 | |
St. Thomas University (FL) | 0.866025 | 137476 | ||
Westminster College | 1.0 | Westminster College (MO) | 1.000000 | 179946 |
Westminster College (PA) | 1.000000 | 216807 | ||
Westminster College (UT) | 1.000000 | 230807 | ||
2.0 | Utah College of Massage Therapy-Westminster (CO) | 0.577350 | 469151 | |
3.0 | Hussian College-Daymar College Nashville (TN) | 0.534522 | 220002 | |
Wheaton College | 1.0 | Wheaton College (IL) | 1.000000 | 149781 |
Wheaton College (MA) | 1.000000 | 168281 | ||
2.0 | Hussian College-Daymar College Clarksville (TN) | 0.534522 | 368443 | |
Hussian College-Daymar College Columbus (OH) | 0.534522 | 205559 | ||
Hussian College-Daymar College Murfreesboro (TN) | 0.534522 | 444255 |
There were still 913 perfect merges, but the accuracy on imputing the remaining 19 unitids is not as high: we only got 13 of these (accuracy of 68% within the nonperfect matches, 99.4% for the whole dataset).
For example, Calumet College of St. Joseph merged with College of St. Joseph (VT) when it should have merged with Calumet College of Saint Joseph (IN). Plus, we must now account for institutions with the same name, such as Wheaton College (IL and MA). Still, not terrible, but it is alwasy a good idea to check for what unitids were imputed. Luckily this module makes it easy to do that.
Conclusion
I hope this can be a valuable tool for higher ed researchers out there. While this is currently only available in Python, I hope to get around to making it available on Stata and R soon. Please leave any feedback or bug reports on my GitHub repository here.
Project details
Release history Release notifications | RSS feed
Download files
Download the file for your platform. If you're not sure which to choose, learn more about installing packages.
Source Distribution
Built Distribution
File details
Details for the file unitids-1.0.1.tar.gz
.
File metadata
- Download URL: unitids-1.0.1.tar.gz
- Upload date:
- Size: 128.4 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/3.2.0 pkginfo/1.5.0.1 requests/2.22.0 setuptools/46.0.0.post20200309 requests-toolbelt/0.9.1 tqdm/4.42.1 CPython/3.7.6
File hashes
Algorithm | Hash digest | |
---|---|---|
SHA256 | 6917b5cce07b43244ef61d8fc700b261376aff265c40f2d90d5d0b54514cb2cf |
|
MD5 | f198c819844a4c865c903419f7731e10 |
|
BLAKE2b-256 | 0f8f2913a0da8051e43131e030ee5a7811077c22b9b937ac19612a18c8991313 |
File details
Details for the file unitids-1.0.1-py3-none-any.whl
.
File metadata
- Download URL: unitids-1.0.1-py3-none-any.whl
- Upload date:
- Size: 114.5 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/3.2.0 pkginfo/1.5.0.1 requests/2.22.0 setuptools/46.0.0.post20200309 requests-toolbelt/0.9.1 tqdm/4.42.1 CPython/3.7.6
File hashes
Algorithm | Hash digest | |
---|---|---|
SHA256 | 8d239a2cc3ec53a049c0bd5d8afef6aa880ffdc4ac0ac9f8a2d5e932e68fd49c |
|
MD5 | 935d9ec2032c554b1099037040d02f81 |
|
BLAKE2b-256 | a4fe1e56aad551e275be2adda9225a2258aa6f323efe4bfebea940f58c9a272d |