Automatically tag news articles with justice-related categories and extract location information.

tagnews

tagnews is a Python library that can:

  • Automatically categorize the text of news articles with type-of-crime tags such as homicide, arson, and gun violence.
  • Automatically extract the locations discussed in the news article text, e.g. "55th and Woodlawn" and "1700 block of S. Halsted".
  • Retrieve the latitude/longitude pairs for those locations using an instance of the pelias geocoder hosted by the Chicago Justice Project (CJP).
  • Get the community areas those lat/long pairs belong to, using a shapefile downloaded from the city data portal and parsed with the shapely Python library.
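
To give a feel for the last step, a community-area lookup boils down to a point-in-polygon test. The sketch below is not how tagnews does it: the library relies on shapely and the city's shapefile, while this version swaps in a pure-Python ray-casting test so it runs without extra dependencies. The names `point_in_polygon`, `community_area_for`, and `TOY_AREAS` are hypothetical, and the boundaries are made-up rectangles rather than real community area shapes.

```python
def point_in_polygon(lon, lat, polygon):
    """Ray-casting test: True if (lon, lat) lies inside polygon.

    polygon is a list of (lon, lat) vertices; the last vertex is
    implicitly joined back to the first.
    """
    inside = False
    n = len(polygon)
    for i in range(n):
        x1, y1 = polygon[i]
        x2, y2 = polygon[(i + 1) % n]
        # Does this edge straddle the horizontal line through the point?
        if (y1 > lat) != (y2 > lat):
            # Longitude at which the edge crosses that horizontal line.
            x_cross = x1 + (lat - y1) * (x2 - x1) / (y2 - y1)
            if lon < x_cross:
                inside = not inside
    return inside


# Hypothetical boundaries: rough rectangles, NOT real community area shapes.
TOY_AREAS = {
    "TOY AREA A": [(-87.70, 41.84), (-87.63, 41.84),
                   (-87.63, 41.87), (-87.70, 41.87)],
    "TOY AREA B": [(-87.61, 41.78), (-87.57, 41.78),
                   (-87.57, 41.81), (-87.61, 41.81)],
}


def community_area_for(lat, lon, areas=TOY_AREAS):
    """Return the name of the first area containing the point, or None."""
    for name, polygon in areas.items():
        if point_in_polygon(lon, lat, polygon):
            return name
    return None
```

With real data you would load the polygons from the shapefile instead of hard-coding them, and a library such as shapely handles holes, multi-part shapes, and boundary cases that this sketch ignores.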

Sound interesting? There's example usage below!

You can find the source code on GitHub.

Installation

You can install tagnews with pip:

pip install tagnews

NOTE: You will need to install some NLTK packages as well:

>>> import nltk
>>> nltk.download('punkt_tab')
>>> nltk.download('wordnet')

Note that tagnews requires Python >= 3.9.
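
If you want to fail early with a clear message on an older interpreter, a small guard like the one below can go at the top of your own script. This is only a sketch: `MIN_PYTHON`, `meets_minimum`, and `require_minimum` are hypothetical names and are not part of tagnews.

```python
import sys

MIN_PYTHON = (3, 9)  # minimum version stated in this README


def meets_minimum(version_info=None, minimum=MIN_PYTHON):
    """Return True if the (major, minor, ...) tuple is new enough."""
    if version_info is None:
        version_info = sys.version_info
    return tuple(version_info[:2]) >= minimum


def require_minimum():
    """Raise RuntimeError if the running interpreter is too old."""
    if not meets_minimum():
        found = ".".join(str(part) for part in sys.version_info[:3])
        raise RuntimeError("tagnews requires Python >= 3.9, found " + found)
```

Call `require_minimum()` before `import tagnews` so the error appears before any heavier imports run.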

Example

The main classes are tagnews.CrimeTags and tagnews.GeoCoder.

>>> import tagnews
>>> crimetags = tagnews.CrimeTags()
>>> article_text = ('The homicide occurred at the 1700 block of S. Halsted Ave.'
...   ' It happened just after midnight. Another person was killed at the'
...   ' intersection of 55th and Woodlawn, where a lone gunman')
>>> crimetags.tagtext_proba(article_text)
HOMI     0.739159
VIOL     0.146943
GUNV     0.134798
...
>>> crimetags.tagtext(article_text, prob_thresh=0.5)
['HOMI']
>>> geoextractor = tagnews.GeoCoder()
>>> prob_out = geoextractor.extract_geostring_probs(article_text)
>>> list(zip(*prob_out))
[..., ('at', 0.0044685714), ('the', 0.005466637), ('1700', 0.7173856),
 ('block', 0.81395197), ('of', 0.82227415), ('S.', 0.7940061),
 ('Halsted', 0.70529455), ('Ave.', 0.60538065), ...]
>>> geostrings = geoextractor.extract_geostrings(article_text, prob_thresh=0.5)
>>> geostrings
[['1700', 'block', 'of', 'S.', 'Halsted', 'Ave.'], ['55th', 'and', 'Woodlawn,']]
>>> coords, scores = geoextractor.lat_longs_from_geostring_lists(geostrings)
>>> coords
         lat       long
0  41.859021 -87.646934
1  41.794816 -87.597422
>>> scores # confidence in the lat/longs as returned by pelias, higher is better
array([0.878, 1.   ])
>>> geoextractor.community_area_from_coords(coords)
['LOWER WEST SIDE', 'HYDE PARK']
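
To make the role of prob_thresh concrete, here is a minimal sketch of how a threshold can turn per-tag and per-word probabilities into the lists shown above. It is an illustration under assumptions, not the tagnews implementation: `tags_above_threshold` and `group_geostrings` are hypothetical helpers, and whether the library compares with `>=` or `>` is a guess.

```python
def tags_above_threshold(tag_probs, prob_thresh=0.5):
    """Return the tags whose probability is at least prob_thresh."""
    return [tag for tag, prob in tag_probs.items() if prob >= prob_thresh]


def group_geostrings(words, probs, prob_thresh=0.5):
    """Group runs of consecutive words whose probability >= prob_thresh."""
    geostrings, current = [], []
    for word, prob in zip(words, probs):
        if prob >= prob_thresh:
            current.append(word)
        elif current:
            # A low-probability word ends the current run.
            geostrings.append(current)
            current = []
    if current:
        geostrings.append(current)
    return geostrings


# Probabilities copied from the example output above.
tag_probs = {"HOMI": 0.739159, "VIOL": 0.146943, "GUNV": 0.134798}
words = ["at", "the", "1700", "block", "of", "S.", "Halsted", "Ave."]
probs = [0.0044685714, 0.005466637, 0.7173856, 0.81395197,
         0.82227415, 0.7940061, 0.70529455, 0.60538065]

print(tags_above_threshold(tag_probs))  # ['HOMI']
print(group_geostrings(words, probs))
# [['1700', 'block', 'of', 'S.', 'Halsted', 'Ave.']]
```

Lowering prob_thresh trades precision for recall: more tags and longer location strings come back, but more of them will be wrong.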

Limitations

This project uses machine learning to automate data cleaning and preparation tasks that would be cost- and time-prohibitive to perform by hand. As with all machine learning projects, the results are not perfect, and in some cases may look just plain bad.

We strove to build the best models possible, but perfect accuracy is rarely achievable. If you have thoughts on how to do better, please consider reporting an issue, or better yet, contributing.

How can I contribute?

Great question! Please see CONTRIBUTING.md.

Problems?

If you have problems, please report an issue. Anything that behaves unexpectedly is an issue and should be reported, and that includes bad or unexpected results. We may not be able to do anything about it, but more data rarely degrades performance.

Background

We want to compare how often different types of crime are reported in certain areas with how often they actually occur in those areas. In essence, are some crimes under-represented in certain areas but over-represented in others? This is the main question driving the analysis.
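
One possible way to quantify that question, shown purely as a sketch, is to compare each area's share of news coverage with its share of actual incidents for a given crime type. The measure, the function name `representation_ratios`, and the counts below are all assumptions made up for illustration; they are not results or methods from this project.

```python
def representation_ratios(article_counts, incident_counts):
    """Ratio of each area's share of coverage to its share of incidents.

    A ratio above 1 suggests the area is over-represented in the news,
    below 1 under-represented. Areas with no incidents are skipped.
    """
    total_articles = sum(article_counts.values())
    total_incidents = sum(incident_counts.values())
    if total_articles == 0 or total_incidents == 0:
        return {}
    ratios = {}
    for area, incidents in incident_counts.items():
        if incidents == 0:
            continue
        coverage_share = article_counts.get(area, 0) / total_articles
        incident_share = incidents / total_incidents
        ratios[area] = coverage_share / incident_share
    return ratios


# Made-up counts for a single crime type, for illustration only.
articles = {"AREA A": 30, "AREA B": 10}
incidents = {"AREA A": 20, "AREA B": 20}
print(representation_ratios(articles, incidents))
# {'AREA A': 1.5, 'AREA B': 0.5}
```

In this toy example both areas have the same number of incidents, but AREA A receives three times the coverage, so it comes out over-represented and AREA B under-represented.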

This question came from the Chicago Justice Project. They have been interested in answering it for quite a while, and have been collecting the data needed for a data-backed answer. Their efforts include:

  1. Scraping RSS feeds of articles written by Chicago-area news outlets for several years, allowing them to collect almost half a million articles.
  2. Organizing an amazing group of volunteers who have helped them tag these articles with crime categories like "Gun Violence" and "Drugs", organizations such as "Cook County State's Attorney's Office", "Illinois State Police", and "Chicago Police Department", and other miscellaneous categories such as "LGBTQ" and "Immigration".
  3. Updating the web UI used for this tagging to allow highlighting of geographic information, resulting in several hundred articles with labeled location sub-strings.

Most of the code for those components can be found here.

A group actively working on this project meets every Tuesday at Chi Hack Night.
