
Algorithms for Pillar. Currently includes "mini" algorithms, nothing too sophisticated.


Table of Contents

  1. Build and Publish
  2. Background
    1. Algorithms
    2. Datasets
  3. Current Goal
  4. Long Term Goal

Build and Publish

To build and publish this package we use the Poetry Python packaging tool. It takes care of packaging details that led to mistakes in the past when done by hand.

Folder structure:

|-- pypi
    |-- pillaralgos
        |-- helpers
            |-- __init__.py
            |-- data_handler.py
            |-- graph_helpers.py
            |-- sanity_checks.py
        |-- __init__.py  # must include version number
        |-- algoXX.py
    |-- LICENSE
    |-- README.md
    |-- pyproject.toml  # must include version number

To publish, update the version numbers as needed (in pyproject.toml and pillaralgos/__init__.py), then run poetry publish --build.

Background

Pillar is creating an innovative way to automatically select and splice clips from Twitch videos for streamers. This repo focuses on the algorithm side. Three main algorithms are being tested.

Algorithms

  1. Algorithm 1: Finds the best moments in clips based on where the most users participated. "Most" is defined as the ratio of unique users during a 2-minute section to unique users across the entire session.
  2. Algorithm 2: Finds the best moments in clips based on when the rate of messages per user peaked. This answers the question "at which 2-minute segment do the most users send the most messages?" If users X, Y, and Z each send 60% of their messages within timestamp range delta, that range might qualify as a "best moment".
    1. NOTE: Currently answers the question "at which 2-minute segment do users send their messages fastest?"
  3. Algorithm 3 (WIP): Weighs each user by their chat rate, account age, etc. Heavier users are predicted to chat more often at "best moment" timestamps.
    1. STATUS: current weight determined by (num_words_of_user/num_words_of_top_user)
    2. Algorithm 3.5: Finds the best moments in clips based on the most words/emojis/both used in chat
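As a concrete illustration of Algorithm 1's metric, the sketch below computes the unique-user ratio per 2-minute window with pandas. The function name unique_user_ratio and the column names timestamp/user are illustrative assumptions, not identifiers from this repo:

```python
import pandas as pd

def unique_user_ratio(df, window="2min"):
    # Ratio of unique chatters in each `window` to unique chatters
    # across the whole session (Algorithm 1's "most users" metric).
    total_users = df["user"].nunique()
    per_window = df.resample(window, on="timestamp")["user"].nunique()
    return (per_window / total_users).sort_values(ascending=False)

# Toy chat log: three users chat early on, then only one returns
chat = pd.DataFrame({
    "timestamp": pd.to_datetime([
        "2021-01-01 00:00:30", "2021-01-01 00:01:10",
        "2021-01-01 00:01:50", "2021-01-01 00:02:40",
    ]),
    "user": ["ann", "bob", "cat", "ann"],
})
ratios = unique_user_ratio(chat)  # first window: 3/3, second window: 1/3
```

Windows whose ratio approaches 1.0 mean nearly every session participant chatted there, making them candidate "best moments".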

Datasets

  1. Preliminary data prelim_df: 545 rows representing one 3 hour, 35 minute, 26 second Twitch stream chat of Hearthstone by LiiHS
    • Used to create the initial json import and the resulting df clean/merge function organize_twitch_chat
  2. Big data big_df: 2409 rows representing one 7 hour, 37 minute Twitch stream chat of Hearthstone by LiiHS
    • Used to create all algorithms
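The repo's organize_twitch_chat function is referenced but not shown here; below is a minimal sketch of what such a clean/merge step might look like, assuming the raw export is a list of message dicts. The field names "created_at", "commenter", and "message" are guesses for illustration, not the confirmed Twitch export schema:

```python
import pandas as pd

def organize_twitch_chat(raw_messages):
    # Flatten a raw chat export into a tidy DataFrame.
    # NOTE: the keys "created_at", "commenter", "message" are assumed
    # field names for illustration, not the actual export schema.
    df = pd.DataFrame(
        {
            "timestamp": msg["created_at"],
            "user": msg["commenter"],
            "body": msg["message"],
        }
        for msg in raw_messages
    )
    df["timestamp"] = pd.to_datetime(df["timestamp"])
    return df.sort_values("timestamp").reset_index(drop=True)

sample = [
    {"created_at": "2021-01-01T00:00:05Z", "commenter": "ann", "message": "PogChamp"},
    {"created_at": "2021-01-01T00:00:09Z", "commenter": "bob", "message": "gg"},
]
df = organize_twitch_chat(sample)
```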

Current Goal

To create one overarching algorithm that will find the most "interesting" clips in a Twitch VOD. This will be created through the following steps:

  1. Creation of various algorithms that isolate min_-minute chunks (2 minutes by default). The basic workflow:
    1. Create a variable (ex: num_words, the number of words in the body of a chat message)
    2. Group the df into min_-minute chunks, then average/sum/etc. num_words for each chunk
    3. Sort the new df by num_words, from highest value to lowest
    4. Return this new df as json (example)
  2. Users rate clips provided by each algorithm
  3. Useless algorithms thrown away
  4. Rest of the algorithms merged into one overarching algorithm, with weights distributed based on user ratings
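The basic per-algorithm workflow in step 1 can be sketched end to end as below. The helper name top_chunks and the column names are illustrative; the actual algoXX modules may differ:

```python
import pandas as pd

def top_chunks(df, min_=2):
    # 1. Create the variable: number of words in each message body
    df = df.assign(num_words=df["body"].str.split().str.len())
    # 2. Group into min_-minute chunks and sum num_words per chunk
    chunked = df.resample(f"{min_}min", on="timestamp")["num_words"].sum()
    # 3. Sort from highest value to lowest
    ranked = chunked.sort_values(ascending=False)
    # 4. Return the sorted result as json
    return ranked.to_json(date_format="iso")

chat = pd.DataFrame({
    "timestamp": pd.to_datetime([
        "2021-01-01 00:00:10", "2021-01-01 00:01:00", "2021-01-01 00:02:30",
    ]),
    "body": ["hello there", "one two three", "hi"],
})
result = top_chunks(chat)  # first chunk totals 5 words, second totals 1
```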

Long Term Goal

  • New objective measure: community created clips (ccc) for a given VOD id with start/end timestamps for each clip
  • Assumption: ccc are interesting and can be used to create a narrative for each VOD. We can test this by cross-referencing them with /r/livestreamfails posts (upvotes/comments)
  • Hypothesis: if we can predict where ccc would be created, those are potentially good clips to show the user
    • Short term test: Create a model to predict where ccc would be created using variables such as word count, chat rate, emoji usage, chat semantic analysis. We can do this by finding timestamps of ccc and correlating them with chat stats
    • Medium term test: Use top 100 streamers as training data. What similarities do their ccc and reddit most upvoted of that VOD share? (chat rate etc)
      1. Get the transcript for these top 100
      2. Get the top 100's YT posted 15-30min story content for the 8 hour VOD
      3. Get the transcript for that story content
      4. Semantic analysis and correlations, etc.
    • Long term test: what percentage of clips do our streamers actually end up using
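The short-term test (correlating ccc timestamps with chat stats) could start as simply as comparing chat rate inside vs. outside ccc windows. Everything below — the function name, the window alignment assumption, the toy data — is a hypothetical sketch, not repo code:

```python
import pandas as pd

def ccc_windows_chattier(msg_times, ccc_starts, window="2min"):
    # Count messages per window, then compare the mean chat rate in
    # windows where a ccc starts vs. all other windows.
    # Assumes ccc_starts are aligned to window boundaries.
    rate = pd.Series(1, index=pd.to_datetime(msg_times)).resample(window).sum()
    ccc_idx = pd.to_datetime(ccc_starts)
    inside = rate[rate.index.isin(ccc_idx)]
    outside = rate[~rate.index.isin(ccc_idx)]
    return inside.mean() > outside.mean()

# Toy data: a burst of chat in the first window, where a ccc starts
msgs = [
    "2021-01-01 00:00:10", "2021-01-01 00:00:40",
    "2021-01-01 00:01:10", "2021-01-01 00:01:50",
    "2021-01-01 00:02:30", "2021-01-01 00:04:10",
]
chattier = ccc_windows_chattier(msgs, ["2021-01-01 00:00:00"])
```

A real test would replace the boolean comparison with a proper correlation or significance test over many VODs, and fold in the other variables mentioned above (word count, emoji usage, semantics).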
