A script to extract US-style street addresses from a text file.
Project description
$ address_extractor
1600 Pennsylvania Ave NW, Washington, DC 20500
^D
1 lines in input
,1600 Pennsylvania Ave NW,Washington DC 20500
$ address_extractor -o output.csv input.csv
4361 lines in input
*snip*
11 lines unable to be parsed
$ ls output.csv
address_extractor takes text or a text file containing address-like data, one address per line, and parses it into a uniform format with the usaddress package.
Installation
This package is available from PyPI via pip:
pip install address_extractor
This installs the Python module and a command-line script, both named address_extractor.
Command-line Usage
address_extractor [-h] [-o OUTPUT] [--remove-post-zip] [input]

positional arguments:
  input                 the input file. Defaults to stdin.

optional arguments:
  -h, --help            show this help message and exit
  -o OUTPUT, --output OUTPUT
                        the output file. Defaults to stdout.
  --remove-post-zip, -r
                        when scanning the input lines, remove everything after
                        a sequence of 5 digits followed by a comma. The
                        parsing library used by this script chokes on
                        addresses containing this kind of information, often a
                        county name.
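The effect of --remove-post-zip can be approximated with a short regular expression. This is only a sketch of the behaviour the help text describes, not the script's actual code: the names POST_ZIP and strip_post_zip are hypothetical, and it assumes the five digits are kept while the comma and everything after it are dropped.

```python
import re

# Assumed pattern, based on the help text: five digits followed by a comma,
# then everything up to the end of the line.
POST_ZIP = re.compile(r"(\d{5}),.*$")


def strip_post_zip(line):
    """Keep the 5-digit run; drop the comma after it and the rest of the line."""
    return POST_ZIP.sub(r"\1", line)


print(strip_post_zip(
    "1600 Pennsylvania Ave NW, Washington, DC 20500, District of Columbia"
))
# prints: 1600 Pennsylvania Ave NW, Washington, DC 20500
```

Lines without a comma after a five-digit run pass through unchanged, so under these assumptions the clean-up is safe to apply to every input line.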
Lines that could not be parsed are printed to stderr. They can be saved to a file with standard shell redirection:
$ address_extractor -o good_addresses.csv has_some_bad_addresses.txt 2> bad_addresses.txt
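The split between parsed output and unparsable lines can be sketched as follows. This is a minimal illustration, not the package's actual code: split_parsed and the parse callable are hypothetical names, and the convention that parse raises ValueError on a line it cannot handle is an assumption made for the sketch.

```python
import csv


def split_parsed(lines, parse, good, bad):
    """Write parsed rows to `good` as CSV and unparsable lines to `bad`.

    `parse` is any callable that returns a sequence of fields for one line
    and raises ValueError when it cannot parse that line (an assumption).
    Returns the number of lines that failed to parse.
    """
    writer = csv.writer(good, lineterminator="\n")
    failed = 0
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        try:
            writer.writerow(parse(line))
        except ValueError:
            failed += 1
            print(line, file=bad)
    return failed
```

With this shape, passing sys.stdout as good and sys.stderr as bad gives the behaviour described above, and a 2> redirection then captures only the failed lines.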
Usage as a Module
address_extractor can be used as a Python module:
>>> import address_extractor
>>> address_extractor.main(input=input_file_object, output=output_file_object, remove_post_zip=a_bool)
There are some small issues with this implementation:
- If using sys.stdin or sys.stdout for input or output, respectively, the file objects will still be closed. This causes problems if they are needed again later.
- Errored lines are still printed to sys.stderr, which may not be expected.
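One possible way to work around both issues is to hand the function in-memory buffers instead of the real standard streams. The sketch below is an assumption-laden illustration: KeepOpenStringIO and run_in_memory are hypothetical helpers, it assumes main accepts any file-like object, and capturing errors this way only works if the errored lines are written through sys.stderr looked up at call time.

```python
import contextlib
import io


class KeepOpenStringIO(io.StringIO):
    """In-memory text buffer whose contents survive a call to close()."""

    def close(self):
        # Deliberately a no-op: a normal StringIO discards its buffer on close.
        pass


def run_in_memory(func, text, **kwargs):
    """Call func(input=..., output=..., **kwargs); return (output, errors)."""
    source = io.StringIO(text)
    sink = KeepOpenStringIO()
    errors = io.StringIO()
    # Temporarily point sys.stderr at a buffer so errored lines are captured.
    with contextlib.redirect_stderr(errors):
        func(input=source, output=sink, **kwargs)
    return sink.getvalue(), errors.getvalue()
```

Under those assumptions, a call might look like run_in_memory(address_extractor.main, text, remove_post_zip=False), leaving sys.stdin and sys.stdout untouched.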
Versions and Stability
This package is uploaded as a 0.1.0 release. There are no tests and little error checking: it originated as a quick-'n'-dirty script, and I decided to release it as a package to gain familiarity with that process.
Issues, comments, and pull requests are welcome at the GitHub page!
File details
Details for the file address_extractor-0.1.0.post1.tar.gz.
File metadata
- Download URL: address_extractor-0.1.0.post1.tar.gz
- Upload date:
- Size: 4.9 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | fd62961905c9bae63223b4624116571953a4a83b7c2b473ef3d4a51eccd40a9d |
| MD5 | c77704784852d775587cf51c100f104d |
| BLAKE2b-256 | 198aa0b8e6676e7fe5a7d553a44d90608236236aeac7d72505b717b2049179b1 |
File details
Details for the file address_extractor-0.1.0.post1-py3-none-any.whl.
File metadata
- Download URL: address_extractor-0.1.0.post1-py3-none-any.whl
- Upload date:
- Size: 4.7 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | b5983231c123a92d0b6fb8c45ffb763d102b6692fc007fe34e2ac486902afe6b |
| MD5 | 0aabfd9ff75d960de3ca5ad311e67132 |
| BLAKE2b-256 | 40a96e20672d4bbaa5e7c3871ca29429d1703acea6312159225f564657da23ad |