Street Address Cleaning Utility
This program is free software: you can redistribute it and/or modify it under the terms of the GNU General Public License as published by the Free Software Foundation, either version 3 of the License, or (at your option) any later version.
This program is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for more details.
You should have received a copy of the GNU General Public License along with this program. If not, see <http://www.gnu.org/licenses/>.
Requirements
- python-Levenshtein 0.10.2
- fuzzy 1.0
- GDAL 1.6 or higher
Ri-PASS was born in a GIS environment focused on public health. Researchers were constantly geocoding patient records from bad data: addresses that suffered from poor data entry, rampant spelling errors, or the occasional “data entry person fell asleep at the keyboard” string (e.g., 123 $%#*Johnson St.). Multiply these problems across millions of records and you’ve got serious geocoding problems.
Ri-PASS follows a few specific ideas:
- Don’t just hand your street centerline file or data file (SHP or CSV) to the geocoder and hope your default software package will catch the address errors. Use that street centerline file to help clean the addresses first!
- Local knowledge should be rewarded. If your data contains place names that are obvious to you, then make them obvious to the computer!
- String matching, NLP, and phonetic matching are all fuzzy components, so one method usually won’t fit every situation. Every method should therefore be available.
How to Use (Quick Start)
Ri-PASS starts by loading a street centerline file and/or a CSV to give Ri-PASS local street knowledge and/or local place name knowledge. Below we create a class instance to load and read a shapefile. We then point the method loadAddressList to the shapefile and provide the name of the field that holds the full street name (ideally including direction prefixes and suffixes, e.g., “E Smith St”).
>>> import ripass
>>> roads = ripass.LoadSHP()
>>> roads = roads.loadAddressList(<path to shp>, <full street name field>)
Ri-PASS reads the streets and calculates the frequency of street name occurrence, grouping the names into a quantile distribution. The “roads” instance returns four lists:
- roads[0] = Most commonly occurring roads by distribution
- roads[1] = Second most common
- roads[2] = Third most common
- roads[3] = Fourth most common
The RiPASS function will read the list of lists to compare all distributions.
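The grouping described above can be pictured with a small sketch. This is only one plausible reading of "quantile distribution", not the code Ri-PASS actually runs: the helper name `frequency_tiers`, the rank-based split, and the sample street names are all assumptions made for illustration.

```python
from collections import Counter


def frequency_tiers(street_names, tiers=4):
    """Split unique street names into `tiers` lists, most common first.

    Hypothetical helper: Ri-PASS may compute its quantiles differently.
    """
    counts = Counter(
        name.strip().upper() for name in street_names if name.strip()
    )
    # Rank by descending count; break ties alphabetically so the
    # result is deterministic.
    ranked = sorted(counts, key=lambda name: (-counts[name], name))
    # Ceiling division so every name lands in some tier.
    size = max(1, -(-len(ranked) // tiers))
    return [ranked[i * size:(i + 1) * size] for i in range(tiers)]


# Made-up centerline segments: one entry per road segment.
segments = ["E Smith St", "E Smith St", "E Smith St",
            "Main St", "Main St", "Johnson St", "Oak Ave"]
roads = frequency_tiers(segments)
print(roads)
```

Ordering the tiers by frequency means a matcher can look at the most likely street names first and fall back to rarer ones only when needed.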
Next, you can optionally give Ri-PASS a CSV of place names and their matching addresses:
>>> places = ripass.LoadCSV()
>>> places = places.loadPlaceNameDict(<csv path>, <place name field>, <address field>)
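To make the idea of a place name lookup concrete, here is a minimal sketch using only the standard library. It assumes the lookup is a simple name-to-address dictionary; the function `load_place_names`, the column names, and the sample rows are hypothetical and are not taken from Ri-PASS itself.

```python
import csv
import io


def load_place_names(csv_file, name_field, address_field):
    """Return {PLACE NAME: address} from an open CSV file object.

    Hypothetical stand-in for a place-name loader; keys are upper-cased
    so lookups are case-insensitive.
    """
    places = {}
    for row in csv.DictReader(csv_file):
        name = (row.get(name_field) or "").strip().upper()
        address = (row.get(address_field) or "").strip()
        if name and address:  # skip incomplete rows
            places[name] = address
    return places


# Made-up sample data standing in for a real CSV on disk.
sample = io.StringIO(
    "place,address\n"
    "City Hall,1 Main St\n"
    "General Hospital,50 Oak Ave\n"
)
place_lookup = load_place_names(sample, "place", "address")
print(place_lookup.get("CITY HALL"))
```

With a lookup like this, a record that contains only "City Hall" can be swapped for a street address before any fuzzy matching is attempted.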
Now you are ready to run Ri-PASS (RiPASS function) on an address:
>>> ripass.RiPASS("123 ,Johnson Apt 45", roads, places)
"123 Johnson St"
The RiPASS function has the following parameters:
- address = The address you want to examine
- addressList = The potential candidates created above
- placeNameList (optional) = The list of place names and addresses. Default = None
- method = The string/phonetic matching method you want to use. The available methods are:
  - Levenshtein distance (default)
  - Phonetic matching using NYSIIS and Levenshtein distance
  - Jaro-Winkler distance
  - Phonetic matching using NYSIIS and Jaro-Winkler distance
  - difflib.SequenceMatcher
  - Phonetic matching using NYSIIS and difflib.SequenceMatcher
- base = The base score you are willing to accept for a match candidate. Default = 0.7
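The role of the base score can be sketched with the standard library's difflib.SequenceMatcher, one of the methods listed above. This is an illustration under stated assumptions, not the RiPASS function: the name `best_match`, the tier-by-tier loop, and the sample street names are hypothetical, and the score is SequenceMatcher's similarity ratio between 0 and 1.

```python
from difflib import SequenceMatcher


def best_match(street, tiers, base=0.7):
    """Return the best-scoring candidate at or above `base`, else None.

    `tiers` is a list of lists of street names, as in the four
    frequency lists described earlier. Hypothetical helper.
    """
    target = street.strip().upper()
    best_name, best_score = None, 0.0
    for tier in tiers:
        for candidate in tier:
            score = SequenceMatcher(None, target, candidate.upper()).ratio()
            # Strict ">" keeps the earlier (more common) name on ties.
            if score > best_score:
                best_name, best_score = candidate, score
    # Reject the match entirely if it falls below the accepted base score.
    return best_name if best_score >= base else None


# Made-up candidate tiers, most common first.
tiers = [["JOHNSON ST"], ["MAIN ST"], ["JOHNSTON AVE"], []]
matched = best_match("Jonson St", tiers)
unmatched = best_match("Zzzz", tiers)
print(matched, unmatched)
```

Raising `base` makes the matcher stricter and returns fewer corrections; lowering it accepts looser matches at the risk of picking the wrong street.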
All functions used by the RiPASS function are also available individually, so users can redesign how addresses are cleaned. Please read the docstring for each function to understand its use.
Download the file for your platform.
|File Name|Version|File Type|Upload Date|
|Ri-PASS-1.1.win32.exe (203.2 kB)|any|Windows Installer|Jan 25, 2013|
|Ri-PASS-1.1.zip (4.6 kB)|–|Source|Jan 25, 2013|