Skip to main content

automatically tag news articles with justice-related categories and extract location information

Project description

Build Status

Automatically classify news articles with type-of-crime tags? Neat! Also automatically extract location strings from the article? Even cooler!

>>> import tagnews
>>> crimetags = tagnews.CrimeTags()
>>> article_text = 'The homicide occurred at the 1700 block of S. Halsted Ave. It happened just after midnight. Another person was killed at the intersection of 55th and Woodlawn, where a lone gunman'
>>> crimetags.tagtext_proba(article_text)
HOMI     0.739159
VIOL     0.146943
GUNV     0.134798
...
>>> crimetags.tagtext(article_text, prob_thresh=0.5)
['HOMI']
>>> geoextractor = tagnews.GeoCoder()
>>> prob_out = geoextractor.extract_geostring_probs(article_text)
>>> list(zip(*prob_out))
[..., ('at', 0.0044685714), ('the', 0.005466637), ('1700', 0.7173856), ('block', 0.81395197), ('of', 0.82227415), ('S.', 0.7940061), ('Halsted', 0.70529455), ('Ave.', 0.60538065), ...]
>>> geostrings = geoextractor.extract_geostrings(article_text, prob_thresh=0.5)
>>> geostrings
[['1700', 'block', 'of', 'S.', 'Halsted', 'Ave.'], ['55th', 'and', 'Woodlawn,']]
>>> lat_longs, scores, num_found = geoextractor.lat_longs_from_geostring_lists(geostrings)
>>> lat_longs
[[41.49612808227539, -87.63743591308594], [41.79513222479058, -87.58843505219843]]
>>> scores # our best attempt at giving a confidence in the lat_longs, higher is better
array([0.5913217, 0.       ], dtype=float32)
>>> num_found # how many results gisgraphy found for the (post-processed) geostring
[8, 10]

The documentation for this project is a work in progress. If something is unclear, or worse yet, incorrect, please report that as an issue.

Installation

If you are wanting to install this to use as a package that can deliver NLP results out of the box, then please see INSTALLATION.md. If you are wanting to roll up your sleeves and do some data science, please see CONTRIBUTING.md.

Usage

Below are sample usages when you want to just use this as a library to make predictions.

From python

The main classes are tagnews.CrimeTags and tagnews.GeoCoder:

>>> import tagnews
>>> crimetags = tagnews.CrimeTags()
>>> article_text = 'The homicide occurred at the 1700 block of S. Halsted Ave. It happened just after midnight. Another person was killed at the intersection of 55th and Woodlawn, where a lone gunman'
>>> crimetags.tagtext_proba(article_text)
HOMI     0.739159
VIOL     0.146943
GUNV     0.134798
...
>>> crimetags.tagtext(article_text, prob_thresh=0.5)
['HOMI']
>>> geoextractor = tagnews.GeoCoder()
>>> prob_out = geoextractor.extract_geostring_probs(article_text)
>>> list(zip(*prob_out))
[..., ('at', 0.0044685714), ('the', 0.005466637), ('1700', 0.7173856), ('block', 0.81395197), ('of', 0.82227415), ('S.', 0.7940061), ('Halsted', 0.70529455), ('Ave.', 0.60538065), ...]
>>> geostrings = geoextractor.extract_geostrings(article_text, prob_thresh=0.5)
>>> geostrings
[['1700', 'block', 'of', 'S.', 'Halsted', 'Ave.'], ['55th', 'and', 'Woodlawn,']]
>>> lat_longs, scores, num_found = geoextractor.lat_longs_from_geostring_lists(geostrings)
>>> lat_longs
[[41.49612808227539, -87.63743591308594], [41.79513222479058, -87.58843505219843]]
>>> scores # our best attempt at giving a confidence in the lat_longs, higher is better
array([0.5913217, 0.       ], dtype=float32)
>>> num_found # how many results gisgraphy found for the (post-processed) geostring
[8, 10]

From the command line

The installation comes with a very rudimentary command line interface, which without any arguments defaults to reading from the stdin.

$ python -m tagnews.crimetype.cli
Go ahead and start typing. Hit ctrl-d when done.
<type here>

Or you can provide a list of articles to tag, a CSV of the probability of each tag is output to <article name>.tagged.

$ python -m tagnews.crimetype.cli sample-article-1.txt sample-article-2.txt
$ cat sample-article-1.txt.tagged
  CPD, 0.912382307
UNSPC, 0.051873838
 SEXA, 0.031065436
 BEAT, 0.023119570
 DRUG, 0.017140532
...

Note that the -m flag is required.

Background

We want to compare the amount different types of crimes are reported in certain areas vs. the actual occurrence amount in those areas. Are some crimes under-represented in certain areas but over-represented in others? To accomplish this, we'll need to be able to extract a type-of-crime tag and geospatial data from news articles.

We meet every Tuesday at Chi Hack Night, and you can find out more about this specific project here.

The Chicago Justice Project has been scraping RSS feeds of articles written by Chicago area news outlets for several years, allowing them to collect over 400,000 articles. At the same time, an amazing group of volunteers have helped them tag these articles. The tags include crime categories like "Gun Violence", "Drugs", "Sexual Assault", but also organizations such as "Cook County State's Attorney's Office", "Illinois State Police", "Chicago Police Department", and other miscellaneous categories such as "LGBTQ", "Immigration". The volunteer UI was also recently updated to allow highlighting of geographic information.

Contributing

You want to contribute? Great! Check out the CONTRIBUTING.md file for more info.

Areas of research

Type-of-Crime Article Tagging

This part of this project aims to automate the category tagging using a specific branch of Machine Learning known as Natural Language Processing.

Possible models to use (some of which we have tried!) include

It might be useful to have an additional corpus of news articles that we can use for unsupervised feature learning without having to worry about over-fitting.

Automated Geolocation

We also need to automatically find the geographic area of the crime the article is talking about. We have just recently updated the tagging interface to also allow highlighting geospatial information inside of articles and are collecting ground truth data. Once we have collected this data, we need to automate the process of detecting location information inside articles. An important note, we are relying on the power of current geocoders to take unstructured location information and output a latitude/longitude pair.

One possible path forward appeared to involve an approach developed by Everyblock. They got funding from the Knight Foundation to geolocate news articles and were required to open source their code. A brief investigation seems to show that their geolocating is actually just a giant Regular Expression. Investigation showed that it was not accurate enough on its own for our purposes.

Things to checkout:

Things to consider

Some articles may discuss multiple crimes. Some crimes may occur in multiple areas, whereas others may not be associated with any geographic information (e.g. some kinds of fraud).

See Also

Project details


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

tagnews-1.1.0.tar.gz (84.5 MB view details)

Uploaded Source

File details

Details for the file tagnews-1.1.0.tar.gz.

File metadata

  • Download URL: tagnews-1.1.0.tar.gz
  • Upload date:
  • Size: 84.5 MB
  • Tags: Source
  • Uploaded using Trusted Publishing? No

File hashes

Hashes for tagnews-1.1.0.tar.gz
Algorithm Hash digest
SHA256 13125f9cab51f547b0722c4b2f2257225a38f69eca25d0a70e5998ee7b7c1968
MD5 0d6ac223b23cc0339c69ce8cc514fabf
BLAKE2b-256 52ff60246d477b234a509ed5fdfc175a36725d9149d9274c73c536ff735d6cc6

See more details on using hashes here.

Supported by

AWS Cloud computing and Security Sponsor Datadog Monitoring Depot Continuous Integration Fastly CDN Google Download Analytics Pingdom Monitoring Sentry Error logging StatusPage Status page