tldrstory: AI-powered understanding of headlines and story text

tldrstory is a framework for AI-powered understanding of headlines and text content related to stories. tldrstory applies zero-shot labeling over text, which allows dynamically categorizing content. The framework also builds a txtai index that enables text similarity search. A customizable Streamlit application and a FastAPI backend service allow users to review and analyze the processed data.

tldrstory has a corresponding Medium article that covers the concepts in this README and more. Check it out!

Examples

Example applications built with tldrstory can be found at https://tldrstory.com

Installation

The easiest way to install is via pip and PyPI.

pip install tldrstory

You can also install tldrstory directly from GitHub. Using a Python Virtual Environment is recommended.

pip install git+https://github.com/neuml/tldrstory

Python 3.6+ is supported.

See this link to help resolve environment-specific install issues.

Configuring an application

Once installed, an application must be configured to run. A tldrstory application consists of three separate processes:

  • Indexing
  • API backend
  • Streamlit application

This section will show how to start the "Sports News" application.

  1. Download the “Sports News” application configuration.

mkdir sports

wget https://raw.githubusercontent.com/neuml/tldrstory/master/apps/sports/app.yml -O sports/app.yml

wget https://raw.githubusercontent.com/neuml/tldrstory/master/apps/sports/api.yml -O sports/api.yml

wget https://raw.githubusercontent.com/neuml/tldrstory/master/apps/sports/index-simple.yml -O sports/index.yml

wget https://raw.githubusercontent.com/neuml/tldrstory/master/src/python/tldrstory/app.py -O sports/app.py

  2. Start the indexing process.

python -m tldrstory.index sports/index.yml

  3. Start the API process.

CONFIG=sports/api.yml API_CLASS=tldrstory.api.API uvicorn "txtai.api:app" &

  4. Start Streamlit.

streamlit run sports/app.py sports/app.yml "Sports" "🏆"

  5. Open a web browser and go to http://localhost:8501

Custom Sources

Out of the box, tldrstory supports reading data from RSS and the Reddit API. Additional data sources can be defined and configured.

The following shows an example custom data source definition. neuspo is a real-time sports event and news application. This data source loads 4 pre-defined entries into the articles database.

from tldrstory.source.source import Source

class Neuspo(Source):
    """
    Articles have the following schema:
        uid - unique id
        source - source name
        date - article date
        title - article title
        url - reference url for data
        entry - entry date
    """

    def run(self):
        # List of articles created
        articles = []

        articles.append(self.article("0", "Neuspo", self.now(), "Eagles defeat the Giants 22 - 21", 
                                     "https://neuspo.com/stream/34952e3919d685982c17735018b0197f", self.now()))

        articles.append(self.article("1", "Neuspo", self.now(), "Giants lose to the Eagles 22 - 21", 
                                     "https://t.co/e9FFgo0wgR?amp=1", self.now()))

        articles.append(self.article("2", "Neuspo", self.now(), "Rays beat Dodgers 6 to 4", 
                                     "https://neuspo.com/stream/6cb820b3ebadc086aa36b5cc4a0f575d", self.now()))

        articles.append(self.article("3", "Neuspo", self.now(), "Dodgers drop Game 2, 6-4", 
                                     "https://t.co/1hEQAShVnP?amp=1", self.now()))

        return articles

Let’s re-run the steps above using neuspo as the data source. First, remove the sports/data directory to ensure a fresh database is created. Then download the gist above into the sports directory.

# Delete the sports/data directory before running
rm -rf sports/data

wget https://gist.githubusercontent.com/davidmezzetti/9a6064d9a741acb89bd46eba9f906c26/raw/7058e97da82571005b2654b4ab908f25b9a04fe2/neuspo.py -O sports/neuspo.py

Edit sports/index.yml and remove the rss section. Replace it with the following.

# Custom data source for neuspo
source: sports.neuspo.Neuspo

Now re-run steps 2–4 from the instructions above.

Parameter Reference

The following sections define configuration parameters for each process that is part of a tldrstory application.

Indexing

Configures the indexing of content. Currently supports pulling data via the Reddit API, RSS and custom user-defined sources.

name

name: string

Application name

schedule

schedule: string

Cron-style string that enables scheduled running of the indexing job. See this link for more information on cron strings.
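
For example, a hypothetical setting that runs the indexing job every 15 minutes:

# Illustrative value; any cron-style string works
schedule: "*/15 * * * *"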

sources

Data source configuration.

reddit

reddit.subreddit: name of subreddit to pull from 
reddit.sort: sort type
reddit.time: time range
reddit.queries: list of text queries to run

Runs a series of Reddit API queries. A Reddit API key will need to be created and configured for this method to work. Authentication parameters can be set within the environment or in a praw.ini file. See this link for more information on setting up a Reddit API account; read-only access is all that is needed.

See PRAW documentation for more details on how to configure the query settings.
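
A minimal sketch of a reddit source configuration; the subreddit and query values below are placeholders, not required settings:

sources:
  reddit:
    # Placeholder values for illustration
    subreddit: sports
    sort: top
    time: day
    queries: ["*"]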

rss

rss: list of RSS urls

Reads a series of RSS feeds and builds articles for each article link found.
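
For example, a hypothetical rss configuration with two feeds:

sources:
  # Placeholder feed URLs
  rss:
    - https://example.com/feed1.xml
    - https://example.com/feed2.xml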

source

source: string

Configures a custom source. This parameter takes a full class path as a string, for example "tldrstory.source.rss.RSS".

Custom sources can use any data that has a date, a text string and a reference url. See the documentation in source.py for information on how to create a custom source. rss.py and reddit.py are example implementations.

ignore

ignore: list of url patterns

List of url patterns to ignore. Supports strings and regular expressions.
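
For example, a hypothetical ignore list mixing a plain string and a regular expression:

ignore:
  # Placeholder patterns
  - example.com/ads
  - .*/sponsored/.*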

labels

labels: dict

Label configuration for zero-shot classifier. This configuration sets a category along with a list of topic values.

Example:

labels:
  topic:
    values: [Label 1, Label 2]

The example above configures the category "Topic" with two possible labels, "Label 1" and "Label 2". Any label can be set here and a large-scale NLP model will be used to categorize input text into those labels.

path

path: string

Where to store model output. The path will be created if it doesn't already exist.

embeddings

embeddings: dict

Configures a txtai index used for searching topics. See txtai configuration for more details on this.
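
A minimal sketch, assuming a sentence-transformers model; any txtai-supported model path works here:

embeddings:
  # Assumed model name for illustration
  path: sentence-transformers/nli-mpnet-base-v2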

API

Configures a FastAPI-backed interface for pulling indexed data.

path

path: string

Path to a model index.
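
For example, if the indexing process wrote its output to sports/model (a hypothetical location), api.yml would contain:

path: sports/model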

Application

The default application is powered by Streamlit and driven by a YAML configuration file. The configuration file sets the application name, API endpoint for pulling content, and component configuration. A custom Streamlit application or any other application can be used in place of this to pull content from the API endpoint directly.

name

name: string

Application name

api

api: url

API endpoint for pulling content.

layout

description: string

Markdown string that is used to build a sidebar description.
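
A hypothetical example with placeholder sidebar text:

description: >
  Sports news stories, labeled and indexed with tldrstory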

queries

queries.name: Queries drop down header
queries.values: List of values to use for queries drop down

Configures the query drop down box. This should be a list of pre-canned queries to use. If a value of "Latest" is present, it will query for the last N articles. If a value of "--Search--" is present, it will present another text box to allow entering custom queries.
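
A sketch combining the two special values with a canned query; the header and query names are placeholders:

queries:
  name: Topic
  values: [Latest, --Search--, NBA]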

filters

filters: list

List of slider filters. This should map to the zero-shot labels configured in the indexing section.
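
For example, using the "Topic" category from the labels example in the indexing section:

filters: [Topic]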

chart

chart.name: Chart name
chart.x: Chart x-axis column
chart.y: Chart y-axis column
chart.scale: Color scale for list of colors
chart.colors: List of colors

Allows configuration of a scatter plot that graphs two label points. This chart can be used to plot and apply coloring to applied labels.
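
A sketch following the field names listed above; all values are placeholders, so consult the example applications for working settings:

chart:
  name: Stories
  x: Label 1
  y: Label 2
  scale: [0, 5.0, 10.0]
  colors: ["#F00", "#0F0"]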

table

"column name": dynamic range of coloring

Data table that shows result details. In addition to default columns, this section allows adding additional columns based on the zero-shot labels applied. The default mode is to show the numeric value of the label but a range of text labels can also be applied.

For example:

  • [0, 5.0, Label 1, "color: #F00"]
  • [5.0, 10.0, Label 2, "color: #0F0"]

The above would output the text "Label 1" in red for values between 0 and 5. Values between 5 and 10 would output the text "Label 2" in green.
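
Putting the ranges above together, a sketch of a table section for a hypothetical column named "Label 1":

table:
  "Label 1":
    - [0, 5.0, Label 1, "color: #F00"]
    - [5.0, 10.0, Label 2, "color: #0F0"]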
