Big Local News Tools

Big Local News Tools for Journalists

Harmonizer: attempts to standardize data.
Labeler: machine learning assisted data labeling / categorization.
PowerBI: tools for scraping PowerBI dashboards.

Harmonizer

The harmonizer attempts to standardize data. For instance, a column of data like this:

Apple Inc.
APPLE Inc.
APPLE INC
APPLE

would be standardized to "Apple Inc.", so that all four entries have the same value. The methodology, usage, and examples are below. Understanding the methodology makes the usage easier to follow.

Methodology:

  • Harmonizing a column consists of two phases:
    1. OPTIONAL: Identify "stop words." Stop words are commonly occurring words that carry little semantic value; in normal language, these are words like "a", "the", "of", etc., but in the context of something like corporate names, they may be words like "LLC", "CO", "INTERNATIONAL", "GROUP", "DBA", etc. Identifying and removing these reduces the similarity between unrelated companies, e.g. "ACME INTERNATIONAL" and "APPLE INTERNATIONAL" might be ~50% similar, but once "INTERNATIONAL" is stripped from the names, they are 0% similar, which is most often what is desired.
    2. Standardize the names; this consists of several steps:
      • clean the target column:
        • uppercase all tokens (words)
        • remove punctuation
        • remove stop words (loaded from the stop_words.csv generated in optional step 1; if this file doesn't exist, no stop words are removed)
      • sort the target column (similar values end up next to each other, so each value only needs to be compared with its predecessor in a single pass, rather than with every other value)
      • compare the current value to the previous value and calculate their similarity (this program uses the harmonic mean of the partial ratio and the sorted token ratio; see the Python library fuzzywuzzy for the meaning of these)
      • if the similarity is above the given threshold, assign the same harmonizer_id to the value; otherwise create a new ID
      • identify the longest cleaned name in each ID group and assign its original name to the whole group
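
The steps above can be sketched in a few lines of Python. This is a minimal illustration, not the harmonizer's actual code: the function names, the hard-coded STOP_WORDS set, and the default threshold are assumptions, and difflib from the standard library stands in for fuzzywuzzy (a plain ratio replaces the partial ratio), so scores will differ from the real tool.

```python
import string
from difflib import SequenceMatcher

# Hypothetical stop words; the real tool loads them from stop_words.csv.
STOP_WORDS = {"INC", "LLC", "CO", "GROUP"}


def clean(name, stop_words=STOP_WORDS):
    """Uppercase, strip punctuation, and drop stop words."""
    no_punct = name.upper().translate(str.maketrans("", "", string.punctuation))
    return " ".join(t for t in no_punct.split() if t not in stop_words)


def similarity(a, b):
    """Harmonic mean of a plain ratio and a sorted-token ratio (difflib stand-ins)."""
    plain = SequenceMatcher(None, a, b).ratio()
    sorted_a = " ".join(sorted(a.split()))
    sorted_b = " ".join(sorted(b.split()))
    by_token = SequenceMatcher(None, sorted_a, sorted_b).ratio()
    if plain + by_token == 0:
        return 0.0
    return 2 * plain * by_token / (plain + by_token)


def harmonize(names, threshold=0.85):
    """Return {original name: standardized name} using one pass over sorted values."""
    pairs = sorted((clean(n), n) for n in names)
    groups = []  # each group is a list of (cleaned, original) pairs sharing one ID
    previous = None
    for cleaned, original in pairs:
        if previous is not None and similarity(previous, cleaned) >= threshold:
            groups[-1].append((cleaned, original))  # same ID as the previous row
        else:
            groups.append([(cleaned, original)])    # start a new ID
        previous = cleaned
    result = {}
    for group in groups:
        # The longest cleaned name decides which original name represents the group.
        _, representative = max(group, key=lambda p: len(p[0]))
        for _, original in group:
            result[original] = representative
    return result
```

Calling harmonize(["Apple Inc.", "APPLE Inc.", "APPLE INC", "APPLE", "ACME LLC"]) puts the four Apple variants in one group with a shared standardized value, while "ACME LLC" keeps its own.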

Use:

  1. Create a stop_words.csv file: harmonizer stop_words <csv_name> <csv_column>
  2. Harmonize the desired field: harmonizer harmonize <csv_name> <csv_column> -t 0.85
  • the -t 0.85 is optional and specifies a threshold between 0 and 1, with values closer to 1 requiring a stricter match in order to assign the same ID
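
How step 1 decides which words are stop words is not spelled out here beyond "commonly occurring words". One plausible reading, shown as a rough sketch only, is a frequency cutoff: a token becomes a stop word if it appears in a large enough share of the names. The function name find_stop_words and the min_share parameter are hypothetical and are not options of the harmonizer command.

```python
import string
from collections import Counter


def find_stop_words(names, min_share=0.05):
    """Return tokens that appear in at least `min_share` of the names."""
    table = str.maketrans("", "", string.punctuation)
    counts = Counter()
    for name in names:
        # Count each token once per name so repeated words do not inflate it.
        counts.update(set(name.upper().translate(table).split()))
    cutoff = min_share * len(names)
    return sorted(tok for tok, n in counts.items() if n >= cutoff)
```

With ["Acme LLC", "Apple LLC", "Zenith LLC", "Borealis Co."] and min_share=0.5, only "LLC" clears the cutoff.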

Help:

  • General help: harmonizer -h
  • Stop words help: harmonizer stop_words -h
  • Harmonize help: harmonizer harmonize -h

Examples:

  • cd <package-dir>/harmonizer/examples # change into examples directory
  • H1B data:
    • harmonizer stop_words h1b_datahubexport-2019.csv Employer
      • outputs: stop_words.csv
    • harmonizer harmonize h1b_datahubexport-2019.csv Employer
      • time: about 18s on a normal laptop
      • uses: stop_words.csv
      • outputs: h1b_datahubexport-2019_harmonized.csv
  • WARN data:
    • harmonizer stop_words Alaska_warn_raw.csv 'Company Name'
      • outputs: stop_words.csv
    • harmonizer harmonize Alaska_warn_raw.csv 'Company Name'
      • time: about 1s on a normal laptop
      • uses: stop_words.csv
      • outputs: Alaska_warn_raw_harmonized.csv

Tuning:

  • This script outputs the original file with the following columns added:
    • <column>_harmonizer_cleaned: contains the cleaned version of the target column
    • <column>_harmonizer_score: contains the similarity score that compares the current row to the previous row
    • <column>_harmonizer_id: contains the assigned harmonizer ID
    • <column>_harmonizer_standardized: contains the standardized value
  • Look at the <column>_harmonizer_score, which represents the similarity between the current and previous rows' values; you can raise or lower the threshold with the -t <value> argument, i.e. raise it if two values were matched that shouldn't have been, and lower it if two values that should match were not
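
When tuning, it can help to list only the rows whose score sits close to the threshold, since those are the matches most likely to flip. The sketch below is illustrative: the sample data is made up, the borderline_rows helper and its margin parameter are hypothetical, and it assumes the score is on the same 0 to 1 scale as the threshold and that the output rows are in the sorted order in which they were compared.

```python
import csv
import io

# Made-up excerpt of a harmonized file for a column named "Employer".
SAMPLE = """Employer,Employer_harmonizer_cleaned,Employer_harmonizer_score,Employer_harmonizer_id
ACME,ACME,0.0,1
ACME CORP,ACME CORP,0.88,1
ACORN,ACORN,0.83,2
ZENITH,ZENITH,0.2,3
"""


def borderline_rows(csv_file, column, threshold=0.85, margin=0.05):
    """Return (previous value, current value, score) for scores near the threshold."""
    score_col = f"{column}_harmonizer_score"
    hits = []
    previous = None
    for row in csv.DictReader(csv_file):
        raw = row[score_col]
        # Skip the first row (nothing to compare with) and blank scores.
        if raw and previous is not None:
            score = float(raw)
            if abs(score - threshold) <= margin:
                hits.append((previous, row[column], score))
        previous = row[column]
    return hits


for prev, cur, score in borderline_rows(io.StringIO(SAMPLE), "Employer"):
    print(f"{score:.2f}  {prev!r} vs {cur!r}")
```

For a real file, pass open("h1b_datahubexport-2019_harmonized.csv", newline="") instead of the in-memory sample.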

Caveats:

  • This measure is not perfect; for instance, these companies probably will not be identified as the same (although this doesn't appear to happen often in H1B data):
    • ACME GROUP / SPECIAL DIVISION X
    • ACME GROUP / REAL ESTATE
    • ACME GROUP / AGRICULTURE

Alternatives:

  • An approach based on a similarity matrix and affinity propagation was attempted, but it only works for very small datasets (e.g. H1B data with ~51,000 rows crashes a pretty decent computer); that algorithm needs O(N^2) space and time, while the one implemented here needs only a sort followed by a single O(N) pass over the sorted values
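
A quick back-of-the-envelope calculation shows why the pairwise approach breaks down at this size. The row count is the approximate H1B figure mentioned above; the 8 bytes per entry assumes a dense matrix of 64-bit floats, which is an assumption about how such a matrix would typically be stored.

```python
n = 51_000  # approximate H1B row count

all_pairs = n * (n - 1) // 2   # comparisons needed for a full similarity matrix
adjacent_pairs = n - 1         # comparisons needed after sorting
matrix_bytes = n * n * 8       # dense float64 matrix, 8 bytes per entry

print(f"all pairs:      {all_pairs:,}")                 # 1,300,474,500
print(f"adjacent pairs: {adjacent_pairs:,}")            # 50,999
print(f"dense matrix:   {matrix_bytes / 1e9:.1f} GB")   # 20.8 GB
```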

Labeler

This tool helps label free-form text. It takes a CSV with raw text and a list of labels, periodically trains a model on the labels the user enters, and then uses that model to predict labels for the unlabeled texts.

Help

  • labeler -h

Use

  • labeler start <csv> <text-column-name> <label-1> <label-2>...[OPTIONS]
  • labeler resume <checkpoint-pkl-path>

Examples

  • cd <package-dir>/labeler/examples # change into examples directory
  • labeler start examples/contraband.csv contraband alcohol drugs other weapons
    • runs labeler on contraband.csv with labels alcohol, drugs, other, and weapons
  • labeler start examples/contraband.csv contraband alcohol drugs other weapons -xor
    • runs labeler on contraband.csv with labels alcohol, drugs, other, and weapons; however, this time, each record can only have 1 label, i.e. the labels are mutually exclusive
  • follow the menu to label text or use the model to automatically label the remaining texts, and then to save the labeled texts as a csv
  • this program also permits saving checkpoints, so labeling can be resumed at a later point; this option can be accessed from the main menu
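
The label, train, predict cycle can be pictured with a toy example. This is not the labeler's model or code: the token-count scorer, the function names, and the n_manual parameter are all hypothetical, and the sketch only covers the mutually exclusive case (one label per text, as with -xor).

```python
from collections import Counter, defaultdict


def train(labeled):
    """Count how often each token appears under each label."""
    model = defaultdict(Counter)
    for text, label in labeled:
        model[label].update(text.lower().split())
    return model


def predict(model, text):
    """Pick the label whose training tokens overlap the text most (None if no overlap)."""
    tokens = text.lower().split()
    scores = {label: sum(counts[t] for t in tokens) for label, counts in model.items()}
    best = max(scores, key=scores.get, default=None)
    return best if best is not None and scores[best] > 0 else None


def label_texts(texts, ask, n_manual):
    """Ask a human for the first n_manual labels, then predict the rest."""
    labeled = [(t, ask(t)) for t in texts[:n_manual]]
    model = train(labeled)
    predicted = [(t, predict(model, t)) for t in texts[n_manual:]]
    return labeled, predicted


# A dict lookup stands in for the interactive prompt.
answers = {"two beers": "alcohol", "a knife": "weapons"}
texts = ["two beers", "a knife", "case of beers", "hunting knife"]
labeled, predicted = label_texts(texts, answers.get, n_manual=2)
print(predicted)
```

In an interactive session, ask would prompt the user instead of reading from a dict, and the model would be retrained as more labels come in.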
