A script to extract US-style street addresses from a text file.

$ address_extractor
1600 Pennsylvania Ave NW, Washington, DC 20500 ^D
1 lines in input
,1600 Pennsylvania Ave NW,Washington DC 20500
$ address_extractor -o output.csv input.csv
4361 lines in input
*snip*
11 lines unable to be parsed
$ ls
output.csv

address_extractor takes text or a text file containing address-like data, one address per line, and parses it into a uniform format with the usaddress package.

Installation

This package is available from PyPI via pip:

pip install address_extractor

This installs the Python module along with a command-line script named address_extractor.

Command-line Usage

address_extractor [-h] [-o OUTPUT] [--remove-post-zip] [input]

positional arguments:
  input                 the input file. Defaults to stdin.

optional arguments:
  -h, --help            show this help message and exit
  -o OUTPUT, --output OUTPUT
                        the output file. Defaults to stdout.
  --remove-post-zip, -r
                        when scanning the input lines, remove everything after
                        a sequence of 5 digits followed by a comma. The
                        parsing library used by this script chokes on
                        addresses containing this kind of information, often a
                        county name.
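
To illustrate the transformation that --remove-post-zip is described as performing, here is a minimal sketch. The function name strip_post_zip and the regular expression are assumptions made for illustration; the script's actual implementation may differ.

```python
import re

# Hypothetical pattern: five digits, then a comma, then the rest of the line.
# Only the digits are kept; the comma and everything after it are dropped.
_POST_ZIP = re.compile(r"(\d{5}),.*")


def strip_post_zip(line: str) -> str:
    """Drop everything after the first 5-digit run that is followed by a comma."""
    return _POST_ZIP.sub(r"\1", line, count=1)


print(strip_post_zip(
    "1600 Pennsylvania Ave NW, Washington, DC 20500, District of Columbia"
))
# prints: 1600 Pennsylvania Ave NW, Washington, DC 20500
```

A pattern this simple would also truncate a line that has a five-digit house number followed by a comma, so it is best applied only to input known to carry trailing county-style data.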

Lines that could not be parsed are printed to stderr. They can be saved to a file with standard shell redirection:

$ address_extractor -o good_addresses.csv has_some_bad_addresses.txt 2> bad_addresses.txt

Usage as a Module

address_extractor can be used as a Python module:

>>> import address_extractor
>>> address_extractor.main(input=input_file_object, output=output_file_object, remove_post_zip=a_bool)

There are some small issues with this implementation:

  • If sys.stdin or sys.stdout is passed as input or output, respectively, the file object will still be closed, so it cannot be used again afterwards.

  • Lines that fail to parse are still printed to sys.stderr, which may not be expected.
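
One possible way to work around both issues from calling code is sketched below. NonClosing and run_captured are hypothetical helpers, not part of address_extractor. The sketch assumes that main accepts any file-like object and that it looks up sys.stderr at call time; neither has been verified against the package.

```python
import contextlib
import io
import sys


class NonClosing:
    """Proxy a stream but turn close() into a flush, so the real stream stays open."""

    def __init__(self, stream):
        self._stream = stream

    def close(self):
        # Flush instead of closing; the wrapped stream remains usable.
        if self._stream.writable():
            self._stream.flush()

    def __getattr__(self, name):
        # Forward everything else (read, write, readline, ...) to the real stream.
        return getattr(self._stream, name)

    def __iter__(self):
        return iter(self._stream)

    def __enter__(self):
        return self

    def __exit__(self, *exc_info):
        self.close()
        return False


def run_captured(func, source, sink, **kwargs):
    """Call func(input=..., output=...) and return the text it wrote to sys.stderr."""
    errors = io.StringIO()
    with contextlib.redirect_stderr(errors):
        func(input=NonClosing(source), output=NonClosing(sink), **kwargs)
    return errors.getvalue()
```

Under those assumptions, a call such as run_captured(address_extractor.main, sys.stdin, sys.stdout, remove_post_zip=True) would leave both standard streams open and return the unparsed lines as a string instead of printing them.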

Versions and Stability

This package is uploaded as a 0.1.0 release. There are no tests and little error checking; it originated as a quick-'n'-dirty script, and I decided to release it as a package to gain familiarity with that process.

Issues, comments, and pull requests are welcome at the GitHub page!
